What the vitamin story tells us about reproducibility, discovery, and human nature.
This is Part 6 (of 7!) of a blogged essay “Steampunk Data Science.” A table of contents is here.
Having followed the tumultuous thirty-year journey from Eijkman’s chickens to Davis’ rats, let’s return to where we started: the question of reproducible, rigorous research. If there was an era of gold standard reproducible research, the early 20th century wasn’t it. I described multiple examples of important work that not only weren’t reproducible but were flat-out wrong.
McCollum’s first paper on nutrition couldn’t have been more wrong. Led astray by the work of Pavlov, McCollum was convinced that “the psychic influence of palatability is one of the most important factors in nutrition.” McCollum considered the possibility of vitamins, which he described as “certain organic complexes in the food given, which the body was not able to supply through its synthetic power from the materials at hand,” to be completely ruled out by his experiments. He would completely change his mind in the course of only a couple of years.
Was this paper of McCollum’s bad for the scientific discovery of vitamins? The reality is quite the opposite. The fact that McCollum was dead wrong inspired further investigations, and the breakthroughs occurred in figuring out why he was wrong. Mendel’s team at Yale was inspired by McCollum’s synthetic diets and intrigued by his findings on palatability. In their replication attempts, they not only disproved McCollum’s hypothesis but also strengthened the case for the existence of essential amino acids. This work by Mendel’s team subsequently inspired McCollum and Davis’ investigations into the extracts of milk and egg yolks, resulting in their discovery of Vitamin A.
At the opposite end of the spectrum was German physician Wilhelm Stepp, who had conducted experiments removing the ether-soluble contents of bread and feeding the depleted bread to mice. He claimed that without the ether-soluble contents, the mice quickly perished. When the ether-soluble materials were added back as a supplement, the mice thrived. This sounds a lot like McCollum and Davis’ experimental setup, but Stepp’s results were deemed “far from conclusive” by Mendel and Osborne. His data was fishy. George Wolf and Kenneth Carpenter reanalyzed Stepp’s experiments in light of our contemporary understanding and found Stepp’s mice died far too quickly for the cause to be Vitamin A deficiency. What exactly Stepp had done remained unclear, and his work was not reproducible. But he had the right answer! He was clearly on the right track to finding Vitamin A.
We can and do learn from a lack of reproduction. Failure to reproduce tells us something about why our earlier assumptions were wrong, and digging into reproduction failures leads us to new discoveries. Nothing in the evidence points to malicious fraud or scientific misbehavior by those involved in the search for vitamins. It’s not clear how many of these errors would have been corrected by better statistical or scientific methodology. We should not expect science to be perfect and should be open to learning from mess.
And what about rigorous tools and research practices? Might these have accelerated our understanding of nutrition? Here, the evidence again points to no. The discovery of vitamins required a remarkably diverse set of investigative tools. As is always the case, well-controlled experiments designed to deliberately refute hypotheses were only one of many methods used to generate evidence. Natural experiments, such as Eijkman’s work with chickens and Vorderman’s prison observations, provided the initial clues that brown rice contained essential vitamins. The case studies initiated by Mendel, Osborne, and Ferry were experiments on single animals. They applied varied interventions over the course of a rat’s life, probing different inputs into its diet, scouring for clues as they compared against a baseline. The individual case series of McCollum and Davis provided the definitive evidence that a simple organic compound could start and stop growth. Each of these methods provided a piece of the puzzle, but the researchers were learning how to do nutrition research as they went.
And though it was clear to Christiaan Eijkman that white and brown rice were different from each other, it took thirty years for that difference to be given a name. Sure, we can point back to single experiments that are as clear as day in hindsight, but the vitamins weren’t “discovered” until they were named. It took Funk’s bold survey article to name the problem (deficiency diseases) and the cure (vitamines). Only after Funk did everyone converge on the answer. The clean articulation of the problem and solution, of the cause and the effect, marked the actual discovery.
Perhaps the only pattern I can extract from the scientific processes here is that everyone involved was driven by a definitive purpose. The discovery of vitamins arose out of a deliberate, interventionist mindset. The researchers in Wisconsin wanted to identify the best diet for raising cattle. The researchers in the Dutch colonies sought a cure for beriberi. Nutrition research wasn’t aimed at breaking down the world for understanding, but rather at identifying interventions. They were trying to figure out cause and effect so that they could do something. The entire purpose was intervening, whether to save farmers money or cure terrible diseases. In finding what worked for their problems, they also discovered new chemistry and biology.
Would vitamins have been discovered sooner if the nutrition scientists had a more rigorous set of scientific tools? Could we imagine a counterfactual acceleration had they had access to computers loaded with spreadsheets and statistical software? We could also ask this question differently without assuming the present was wiser than the past. Was this discovery made possible by a rigorous practice that we can learn from?
The vitamin saga suggests the answer to all of these questions is probably no! If anything, the confines of scientific rigor of the day, like Koch’s rigid postulates for determining disease etiology, would have left us stuck with germ theory. I worry that contemporary quests for standardization and formalization of research practice lose sight of the value of creative experimentation and investigation. We can’t let reproducibility checklists stifle creative exploration.
What I also love about this story is how there’s no single hero. I understand the motivational power of the simple great-man science stories, like John Snow and his Broad Street Pump, or Alexander Fleming and his Petri dishes. But historians of science have been scolding us for decades that most of these scientific beatifications are far oversimplified. The muddled mess of vitamin discovery is more the rule than the exception. Though there were multiple Nobel Prizes, it’s really hard to extract a single hero.
On the other hand, it’s easy to find lots of petty fights. Babcock chided Atwater about his protein theories. McCollum was humiliated by Mendel’s work, which proved his palatability hypothesis erroneous. In his autobiography, he admits embarrassment and revenge were among his motivations for studying milk extracts. Harry Steenbock felt he should have been a coauthor on McCollum and Davis’ Vitamin A paper and held a grudge for years. He even wrote a letter to Science magazine in 1918, accusing McCollum of academic misconduct when McCollum moved his lab from Wisconsin to Johns Hopkins.
The process of searching for vitamins was a mess. But we learned from the mess. And when we did, we found undeniable effects that completely transformed our understanding of food and our ability to treat disease.
Quadratic functions are the workhorse of the analysis of iterative algorithms such as gradient-based optimization. They lead, in discrete and continuous time, to closed-form dynamics that treat all eigensubspaces of the Hessian matrix independently. This leads to simple math for understanding convergence behaviors (maximal step-size, condition number, acceleration, scaling laws for gradient descent or its stochastic version, etc.).
An important success of the optimization field is to have shown that the same behaviors hold for convex functions, where the analysis now relies on Lyapunov functions rather than closed-form computations (see, e.g., this post on automated proofs), with some extensions to non-convex functions, but only regarding convergence to stationary points.
Going (globally) non-convex is of major interest in different fields, including machine learning for neural networks or matrix factorizations. In this post, I try to answer the following question: which dynamics admit a closed form, or enough of one to extract relevant insights?
Over the years, several such dynamics have emerged, allowing us to give some nice, explicit, and intuitive results on the global behavior of gradient descent in certain non-convex cases. This blog post is about them. Note that this complements the analysis of diagonal networks from [1, 2] that we described in an earlier post.
One-hidden-layer linear network
We focus on linear neural networks, with a prediction function \(x \mapsto A_2^\top A_1^\top x\), where \(x \in \mathbb{R}^d\) is the input, \(A_1 \in \mathbb{R}^{d \times m}\) is the matrix of input weights, and \(A_2 \in \mathbb{R}^{m \times k}\) is the matrix of output weights, with an overall weight matrix \(\Theta = A_1 A_2 \in \mathbb{R}^{d \times k}\). This is a linear model corresponding to multi-task / multi-category classification / multi-label supervised problems, or an auto-encoder for unsupervised learning. Clearly, using the parameterization \(\Theta = A_1 A_2\) is not the most efficient for optimization (as we could simply use convex optimization directly on \(\Theta\)), but we use it for understanding non-convex dynamics.
Given data matrices \(Y \in \mathbb{R}^{n \times k}\) and \(X \in \mathbb{R}^{n \times d}\), the empirical risk is $$ L(A_1,A_2) = \frac{1}{2n} \big\| Y - X A_1 A_2 \big\|_{\rm F}^2,$$ where \(\| M \|_{\rm F} = \sqrt{ {\rm tr} (M^\top M)}\) is the Frobenius norm of \(M\).
The gradient flow dynamics are then $$ \left\{ \begin{array}{lcrcr} \dot{A}_1 & = & - \nabla_{A_1} L(A_1,A_2) & = & \frac{1}{n} X^\top ( Y - X A_1 A_2) A_2^\top \\ \dot{A}_2 & = & - \nabla_{A_2} L(A_1,A_2) & = & \frac{1}{n} A_1^\top X^\top ( Y - X A_1 A_2). \\ \end{array} \right.$$
We denote by \(B_1 \in \mathbb{R}^{d \times m}, B_2 \in \mathbb{R}^{m \times k}\) the initialization of the dynamics. As first outlined by [3] and used in most subsequent analyses (e.g., [4, 6, 7]), there is an invariant of the dynamics, as $$\frac{d}{dt} ( A_1^\top A_1 ) = 2 {\rm Sym} \big( A_1^\top \frac{1}{n} X^\top ( Y - X A_1 A_2) A_2^\top \big) = \frac{d}{dt} ( A_2 A_2^\top ), $$ where \({\rm Sym}(M) = \frac{1}{2} ( M + M^\top)\) is the symmetrization of the square matrix \(M\). This implies that \(A_1^\top A_1 - A_2 A_2^\top \in \mathbb{R}^{m \times m}\) is constant along the dynamics. This will allow considerable simplifications (the presence of such invariants is not limited to linear networks, see [16] for a nice detailed account on these conservation laws).
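To make this invariant concrete, here is a small numerical sanity check (my own sketch, not from the references): integrating the gradient flow with small explicit-Euler steps, the quantity \(A_1^\top A_1 - A_2 A_2^\top\) should stay essentially constant. All dimensions and the step size below are arbitrary illustrative choices.

```python
import numpy as np

# Sanity check of the conservation law A1^T A1 - A2 A2^T = const along the
# gradient flow, using explicit-Euler steps as a proxy for continuous time.
rng = np.random.default_rng(0)
n, d, m, k = 50, 6, 4, 3
X = rng.standard_normal((n, d))
Y = rng.standard_normal((n, k))
A1 = rng.standard_normal((d, m))
A2 = rng.standard_normal((m, k))

invariant0 = A1.T @ A1 - A2 @ A2.T
dt = 1e-3
for _ in range(2000):
    E = (Y - X @ A1 @ A2) / n        # scaled residual
    G1 = X.T @ E @ A2.T              # -grad_{A1} L
    G2 = A1.T @ X.T @ E              # -grad_{A2} L
    A1, A2 = A1 + dt * G1, A2 + dt * G2

drift = np.abs(A1.T @ A1 - A2 @ A2.T - invariant0).max()
print(drift)  # small: only the Euler discretization breaks the invariant
```

The first-order (in the step size) contribution to the drift cancels exactly, so what remains is a small second-order discretization effect.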
There are several lines of work to go further. I will use a non-chronological presentation and start with the simplest case where singular vectors are aligned, before moving to the real magic first detailed by Kenji Fukumizu in [3].
Aligned initialization and commuting moments
The first set of assumptions that leads to a closed form comes from [4] (with \(X^\top X \) proportional to the identity) and from [5] for the present version:
Commuting moments: \(X^\top X\) and \(X^\top YY^\top X\) commute, which is equivalent to aligned singular value decompositions \(C = \frac{1}{n} X^\top Y = U {\rm Diag}(c) V^\top\) and \(\Sigma = \frac{1}{n} X^\top X = U {\rm Diag}(s) U^\top\) for matrices \(U \in \mathbb{R}^{d \times r},V \in \mathbb{R}^{k \times r}\) with orthonormal columns (that is, \(U^\top U = V^\top V = I\)), and \(c,s \in \mathbb{R}_+^r\) non-negative vectors of singular values (this is thus the “economy-size” version of the SVD where \(U\) and \(V\) have the same number of columns). We only need to take \(r\) as big as \(\max \{ {\rm rank}(C), {\rm rank}(\Sigma) \}\). This occurs, for example, when \(X^\top X \propto I\) (whitened data), or for auto-encoders, where \(Y = X\).
Aligned initialization: \(B_1 = U {\rm Diag}(b_1) W^\top\) and \(B_2 = W {\rm Diag}(b_2) V^\top\) are the singular value decompositions, with \(W \in \mathbb{R}^{m \times r}\) with orthonormal columns (which implies that we need to assume \(m \geqslant r\)), with singular vectors of \(B_1 B_2 = U {\rm Diag}(b_1 b_2) V^\top\) aligned with those of \(X^\top Y\).
Note that throughout the blog post, as done just above, multiplications and divisions of (singular value) vectors of the same size are taken component-wise.
With these assumptions, throughout the dynamics, it turns out that we keep aligned singular vectors, and thus, if \(a_1, a_2\) are the vectors of singular values of \(A_1, A_2\), we get the coupled ODEs: $$ \left\{ \begin{array}{lcr} \dot{a}_1 & = & ( c \, - s a_1 a_2 ) a_2 \\ \dot{a}_2 & = & ( c \, - s a_1 a_2 ) a_1 . \\ \end{array} \right.$$ The invariant mentioned above leads to $$ a_1 a_1 - a_2 a_2 = {\rm cst}.$$ Overall we get the dynamics for \(\theta = a_1 a_2\), $$ \dot{\theta}= ( c \, - s a_1 a_2 ) ( a_1 a_1 + a_2 a_2).$$
If \(b_1 = b_2 = \alpha \) at initialization (for all singular values) with \(\alpha\) a positive scalar, then this remains true throughout the dynamics, and we get $$ \dot{\theta} = 2 ( c \, - s \theta ) \theta, $$ with a closed form dynamics by integration $$\theta = \frac{c}{s} \Big( \frac{e^{2ct} {\alpha^2 }}{\frac{c}{s} - \alpha^2+e^{2ct} {\alpha^2}}\Big) = \frac{1}{ \frac{1}{\alpha^2} e^{-2ct} + \frac{s}{c} ( 1 - e^{-2ct} )}.$$
Insights. We see the sequential learning of each singular value when the initialization scale \(\alpha\) tends to zero, as then we have $$ \theta \sim \frac{1}{ \frac{1}{\alpha^2} e^{-2ct} + \frac{s}{c}},$$ with a sharp jump from \(0\) to \(\frac{c}{s}\), around the time \(\frac{1}{c} \log \frac{1}{\alpha}\).
To prove it, we rescale time as \(t= u \log \frac{1}{\alpha}\) and we get $$\theta = \frac{1}{ \alpha^{2(cu-1)} + \frac{s}{c}},$$ which, with \(u\) fixed, has its \(i\)-th component that converges to \(0\) for \(u < \frac{1}{c_i}\) and to \(c_i / s_i\) for \(u > \frac{1}{c_i}\).
Linear networks with aligned initialization. Plotting the singular values \(\theta_t\) as a function of (rescaled) time \(t\), for various initialization scales \(\alpha\).
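To check the closed form above, we can compare it numerically (my own sketch, not from the post) against an explicit-Euler integration of the coupled system for \((a_1, a_2)\); the values of \(c\), \(s\), and \(\alpha\) below are arbitrary illustrative choices.

```python
import numpy as np

# Compare the closed-form logistic solution
#   theta(t) = 1 / ((1/alpha^2) e^{-2ct} + (s/c)(1 - e^{-2ct}))
# against Euler integration of
#   a1' = (c - s a1 a2) a2,  a2' = (c - s a1 a2) a1,  a1(0) = a2(0) = alpha.
c, s, alpha = 2.0, 1.0, 0.1
dt, steps = 1e-4, 50_000
T = dt * steps
a1 = a2 = alpha
for _ in range(steps):
    r = c - s * a1 * a2
    a1, a2 = a1 + dt * r * a2, a2 + dt * r * a1

theta_num = a1 * a2
theta_closed = 1.0 / ((1.0 / alpha**2) * np.exp(-2 * c * T)
                      + (s / c) * (1.0 - np.exp(-2 * c * T)))
print(theta_num, theta_closed)  # both close to the limit c/s
```

Since \(a_1 = a_2\) at initialization, the symmetric updates keep them equal throughout, as claimed in the text.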
Extension to deeper models. Following [4], we can consider more layers and minimize $$ \frac{1}{2n} \| Y – X A_1 \cdots A_L\|_{\rm F}^2.$$ If at initialization, all singular vectors of all \(A_i\)’s align, then we get dynamics directly on the singular values, and starting from equal values, we get after a few lines $$\dot{\theta} = L (c-s\theta) \theta^{2 – 2/L},$$ which can also be integrated and analyzed. See [4] for details, and [18] for deep linear networks beyond aligned initialization.
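The same numerical game can be played in the deep case (again my own sketch, not from [4]): integrating \(\dot{\theta} = L (c-s\theta) \theta^{2 - 2/L}\) for several depths from the same small initialization illustrates that deeper networks escape later but more abruptly. The values of \(c\), \(s\), \(\theta_0\), and the grid are arbitrary illustrative choices.

```python
# Euler integration of the depth-L scalar ODE theta' = L (c - s theta) theta^(2 - 2/L),
# recording the first time theta crosses 90% of its target c/s for each depth.
c, s, theta0 = 1.0, 1.0, 1e-4
dt, steps = 1e-3, 100_000
cross = {}
for L in (2, 3, 4):
    theta, t90 = theta0, None
    for i in range(steps):
        theta += dt * L * (c - s * theta) * theta ** (2 - 2 / L)
        if t90 is None and theta >= 0.9 * c / s:
            t90 = (i + 1) * dt
    cross[L] = t90
print(cross)  # crossing times increase with the depth L
```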
Other dynamics. In this simple case, we can easily compare the dynamics of learning for three common algorithms, which all lead to dynamics on the singular values \(\theta\) in this aligned case:
Gradient descent on \((A_1,A_2)\): As seen above, \(\displaystyle \theta = \frac{1}{ \frac{1}{\alpha^2} e^{-2ct} + \frac{s}{c} ( 1 - e^{-2ct} )}\). The singular values \(c\) of \(\frac{1}{n} X^\top Y\) characterize the dynamics, with each singular value appearing one by one when \(\alpha\) is small.
Gradient descent on \(\Theta = A_1 A_2\) starting from zero (e.g., \(\alpha=0\)): \(\theta = \frac{c}{s} ( 1 - e^{-st})\). The singular values \(s\) of \(\frac{1}{n} X^\top X\) characterize the dynamics.
Muon: this is the new cool algorithm (“stylé,” as my kids would say). It corresponds to orthogonalizing the matrix updates, and thus to sign descent with vanishing step-size on the singular values, that is, \(\dot{a}_1 = \dot{a}_2 = 1\) until reaching the solution. This leads to \(\theta = \min \{ t^2, c/s\}\), with a conditioning that does not depend on \(s\) or \(c\). No singular value impacts the dynamics, hence the “automatic” reconditioning, which extends what can be seen for sign descent (see previous post). Note that if we implement Muon directly on \(\Theta\), we would obtain instead \(\theta = \min \{ t, c/s\}\).
Comparison of three dynamics. Top: singular values \(\theta_t\), bottom: loss functions \(R(\theta_t)\) as a function of time \(t\).
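To illustrate the difference in conditioning (my own sketch, with hypothetical values of \(c\) and \(s\)), we can evaluate the three closed-form trajectories on two singular directions with very different scales but identical targets \(c_i/s_i\), and record when each direction reaches 90% of its target. Only the Muon-like dynamics is insensitive to the individual \(c_i\) and \(s_i\):

```python
import numpy as np

# Two hypothetical "singular directions" with c/s = 1 in both, but scales
# differing by a factor of 10; we compare the three closed-form trajectories.
c = np.array([1.0, 0.1])
s = np.array([1.0, 0.1])
alpha = 1e-2
t = np.linspace(0.0, 70.0, 7001)[:, None]   # time grid, step 0.01

theta_factored = 1.0 / ((1.0 / alpha**2) * np.exp(-2 * c * t)
                        + (s / c) * (1.0 - np.exp(-2 * c * t)))
theta_direct = (c / s) * (1.0 - np.exp(-s * t))   # gradient flow on Theta from 0
theta_muon = np.minimum(t**2, c / s)              # Muon-like: a1' = a2' = 1

def hit_index(theta):
    """First index of the time grid where each direction exceeds 90% of c/s."""
    return (theta >= 0.9 * (c / s)).argmax(axis=0)

h_f, h_d, h_m = hit_index(theta_factored), hit_index(theta_direct), hit_index(theta_muon)
print(h_f, h_d, h_m)  # h_m has equal entries; h_f and h_d entries are far apart
```

The factored dynamics is slowed down by small \(c_i\), the direct dynamics by small \(s_i\), while the Muon-like dynamics reaches both directions at the same time.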
Matrix Riccati equation
The second set of assumptions is:
White moment matrix: \(\Sigma = \frac{1}{n} X^\top X = I\). Thus, the only data matrix left is \(C = \frac{1}{n} X^\top Y\).
Balanced initialization: \(B_1^\top B_1 = B_2 B_2^\top\), which implies that throughout the dynamics, we have \(A_1^\top A_1 = A_2 A_2^\top\) (because of the invariant).
Note that there are no assumptions regarding the alignments of singular vectors at initialization. We also assume that \(m \leqslant r = {\rm rank}(C) \leqslant \min\{d,k\}\). See [17] for closed-form expressions with larger width \(m\).
Then something magical happens, first outlined by Kenji Fukumizu [3], and then further used in several works (e.g., [7, 17, 18]). Defining \(R = { A_1 \choose A_2^\top} \in \mathbb{R}^{(d+k) \times m}\), we have $$R^\top R = A_1^\top A_1 + A_2 A_2^\top = 2A_1^\top A_1 = 2 A_2 A_2^\top,$$ which leads to the rewriting of the dynamics as: $$\dot{R} = \Delta R \, - \frac{1}{2} RR^\top R \mbox{ with } \Delta = \left( \begin{array}{cc} 0 & \!\frac{1}{n}X^\top Y \!\\ \! \frac{1}{n} Y^\top X \! & 0 \end{array} \right)= \left( \begin{array}{cc} 0 & \! C \!\\ \! C^\top \! & 0 \end{array} \right),$$ initialized with \(R_0 = { B_1 \choose B_2^\top} \in \mathbb{R}^{(d+k) \times m}\).
This is now a matrix Riccati equation [8, 9], with a closed form and nice links with several branches of mathematics, such as control (linear-quadratic regulator), probabilistic modelling (Kalman filtering), and machine learning (Oja flow for principal component analysis [10]). In this present post, we will only scratch the surface of this fascinating topic, but a future post will most likely do it justice.
Denoting \(P = RR^\top\), we get $$\dot{P} = \Delta P + P \Delta\, - P^2,$$ and we can then write $$P = NM^{-1}$$ with \(M\) initialized at \(I\) and \(N\) initialized at \(R_0 R_0^\top = { B_1 \choose B_2^\top}{ B_1 \choose B_2^\top}^\top\), and the linear dynamics $$\frac{d}{dt} { M \choose N} = \left( \begin{array}{cc} \! - \Delta \! & \! I \! \\ \! 0 \! & \! \Delta \! \end{array} \right){ M \choose N}.$$ The magical proof (see [10]) is really simple: $$\dot{P} = \dot{N} M^{-1} - N M^{-1} \dot{M} M^{-1} = \Delta N M^{-1} - N M^{-1} ( -\Delta M + N) M^{-1} = \Delta P + P \Delta\, - P^2.$$ This dynamics can be rewritten as $$ \left\{ \begin{array}{rcl} \frac{d}{dt} N & = & \Delta N \\ \frac{d}{dt} ( N\, - 2 \Delta M) & = & - \Delta ( N \, - 2 \Delta M), \end{array} \right.$$ and thus leads to the closed form $$ P = e^{\Delta t} R_0 R_0^\top \Big( e^{-\Delta t} + \frac{e^{\Delta t} - e^{-\Delta t}}{2\Delta } R_0 R_0^\top \Big)^{-1} = e^{\Delta t} R_0 \Big( I + R_0^\top \frac{e^{2\Delta t}-I}{2\Delta } R_0 \Big)^{-1} R_0^\top e^{\Delta t}.$$
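As a sanity check of this closed form (my own sketch, not from the post), we can compare it to a direct numerical integration (Heun's method) of \(\dot{R} = \Delta R - \frac{1}{2} R R^\top R\). The sizes, horizon, and step size are arbitrary; the spectral function \((e^{2\Delta t}-I)/(2\Delta)\) is evaluated eigenvalue by eigenvalue, with limiting value \(t\) at zero eigenvalues.

```python
import numpy as np

# Closed-form P(T) for the matrix Riccati flow vs. numerical integration of
# R' = Delta R - (1/2) R R^T R, with P = R R^T.
rng = np.random.default_rng(1)
d, k, m = 4, 3, 2
C = rng.standard_normal((d, k))
Delta = np.zeros((d + k, d + k))
Delta[:d, d:] = C
Delta[d:, :d] = C.T
R0 = 0.1 * rng.standard_normal((d + k, m))

T = 1.0
lam, Q = np.linalg.eigh(Delta)               # Delta is symmetric (Jordan-Wielandt)
expDT = (Q * np.exp(lam * T)) @ Q.T          # e^{Delta T}
safe = np.where(np.abs(lam) > 1e-12, 2 * lam, 1.0)
phi = np.where(np.abs(lam) > 1e-12, np.expm1(2 * lam * T) / safe, T)
F = (Q * phi) @ Q.T                          # (e^{2 Delta T} - I) / (2 Delta)
P_closed = expDT @ R0 @ np.linalg.inv(np.eye(m) + R0.T @ F @ R0) @ R0.T @ expDT

def rhs(R):
    return Delta @ R - 0.5 * R @ (R.T @ R)

steps = 2000
dt = T / steps
R = R0.copy()
for _ in range(steps):                       # Heun (explicit trapezoid) steps
    F1 = rhs(R)
    F2 = rhs(R + dt * F1)
    R = R + 0.5 * dt * (F1 + F2)
P_num = R @ R.T

err = np.abs(P_closed - P_num).max()
print(err)  # small discretization error
```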
We can now analyze this closed-form expression precisely with the following eigenvalue decomposition of \(\Delta\) $$\Delta = \frac{1}{2} { U \choose V} {\rm Diag}(c) { U \choose V}^\top +\frac{1}{2} { U \choose -V} {\rm Diag}(-c) { U \choose -V}^\top,$$ classically obtained for the Jordan-Wielandt matrix \(\Delta\) associated with \(C\) (see an earlier post) with singular value decomposition \(C = U {\rm Diag}(c) V^\top\).
Recovering aligned initialization. We start with aligned initialization (as before, and with the same notations), that is, \(b_1= b_2 =a\). By taking \(B_1 = U {\rm Diag}(a) W^\top\) and \(B_2 = W {\rm Diag}(a) V^\top\), we get \(R_0 = { U \choose V} {\rm Diag}(a) W^\top\), and for any function \(f: \mathbb{R} \to \mathbb{R}\), we have \(f(\Delta) R_0 = { U \choose V} {\rm Diag}(a f(c)) W^\top\), where \(c\) is the vector of singular values of \(C =\frac{1}{n} X^\top Y\), leading to $$P = { U \choose V} {\rm Diag} \Big( \frac{a^2 e^{2 c t}}{ 1 + a^2 \frac{e^{2c t}-1}{c}} \Big){ U \choose V}^\top,$$ which leads to $$ A_1 A_2 = U {\rm Diag} \Big( \frac{a^2 e^{2 c t}}{ 1 + a^2 \frac{e^{2c t}-1}{c}} \Big) V^\top,$$ which is exactly the same as the “aligned” result (OUF!). Note that the aligned initialization only requires that \(C\) and \(\Sigma\) commute, which is weaker than the assumption \(\Sigma =I\) that is made here (but we can go much further below).
Beyond aligned initialization. When \(R_0\) is initialized randomly (and thus with full rank on all subspaces) with balanced initialization and norm proportional to \(\alpha^2\), then, with \(t = u \log \frac{1}{\alpha}\), we can also prove sequential learning of all singular subspaces. More precisely, if \(c_1 > \cdots > c_r\), then, when \(\alpha\) tends to zero (see the proof below the reference section):
If \(u<\frac{1}{c_1}\), \(P \sim 0\). This is the initial phase with no apparent movements, but an alignment of singular vectors, nicely cast as silent alignment by [7].
When \(u \in ( \frac{1}{c_i}, \frac{1}{c_{i+1}})\) for \(i \in \{1,\dots,m-1\}\), we get \(P \sim C_i\), the optimal rank \(i\) approximation of \(C\).
When \(u > \frac{1}{c_m}\), \(P \sim C_m\), which is the global minimizer of the cost function.
Note that when \(m = r\), we get \(C_m = C\). Overall, this is exactly sequential learning, with both the singular values and the singular vectors learned from scratch. In the plot below, we see how singular vectors align early in the dynamics when \(\alpha\) is small, while singular values are learned step-by-step.
Dynamics of linear network learning as a function of rescaled time: components of \(\Theta\) (left, one curve per component) with their limit as crosses, singular values of \(\Theta\) (middle), absolute projections of singular vectors of \(\Theta\) onto the ones of \(C\) (right). \(\alpha\) is going from \(10^{-1}\) to \(10^{-9}\).
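This sequential behavior can be reproduced numerically from the closed form (my own sketch; the sizes, the singular values \((1.5, 1, 0.5)\), and \(\alpha\) are illustrative choices): at an intermediate rescaled time, the off-diagonal block of \(P\) is close to the best rank-one approximation of \(C\), and later it is close to the best rank-two approximation (the limit for width \(m=2\)).

```python
import numpy as np

rng = np.random.default_rng(3)
d, k, m = 4, 3, 2
# A target C with well-separated singular values c = (1.5, 1.0, 0.5).
U, _ = np.linalg.qr(rng.standard_normal((d, 3)))
V, _ = np.linalg.qr(rng.standard_normal((k, 3)))
cvals = np.array([1.5, 1.0, 0.5])
C = U @ np.diag(cvals) @ V.T

Delta = np.zeros((d + k, d + k))
Delta[:d, d:] = C
Delta[d:, :d] = C.T
lam, Q = np.linalg.eigh(Delta)

alpha = 1e-8
R0 = alpha * rng.standard_normal((d + k, m))   # small random initialization

def P_closed(t):
    """Closed-form P(t) from the Riccati solution, evaluated spectrally."""
    expDt = (Q * np.exp(lam * t)) @ Q.T
    safe = np.where(np.abs(lam) > 1e-12, 2 * lam, 1.0)
    phi = np.where(np.abs(lam) > 1e-12, np.expm1(2 * lam * t) / safe, t)
    F = (Q * phi) @ Q.T
    return expDt @ R0 @ np.linalg.inv(np.eye(m) + R0.T @ F @ R0) @ R0.T @ expDt

tau = np.log(1 / alpha)
C1 = cvals[0] * np.outer(U[:, 0], V[:, 0])       # best rank-1 approximation of C
C2 = C1 + cvals[1] * np.outer(U[:, 1], V[:, 1])  # best rank-2 approximation of C

Theta_mid = P_closed(0.85 * tau)[:d, d:]   # u in (1/c_1, 1/c_2): rank-1 plateau
Theta_late = P_closed(1.3 * tau)[:d, d:]   # u > 1/c_2 = 1/c_m: the limit
print(np.linalg.norm(Theta_mid - C1), np.linalg.norm(Theta_late - C2))
```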
Conclusion
Overall, we obtain a precise behavior for data with an identity covariance matrix, showing that features (singular vector directions) are learned early on, while their weights (singular values) are learned sequentially. This was done using a balanced initialization (see [11] for an extension beyond balanced initialization and large layer width).
However, this study only applies to linear neural networks (without any non-linear activation function). Extensions to the simplest architectures remain an active area of research (see, e.g., [12] and references therein), with in particular similar early alignment behaviors [13, 14]. In addition, matrix product parameterizations also arise in attention architectures through the product of the key and query matrices. As a result, sequential learning phenomena can also be observed in this setting; see [15] for an analysis of sequential learning in linear attention.
Acknowledgements. I would like to thank Raphaël Berthier, Pierre Marion, Loucas Pillaud-Vivien, and Aditya Varre for proofreading this blog post and making good, clarifying suggestions.
Proof of the sequential learning result. We can rewrite the dynamics as (note that \(\Delta\) is not positive semidefinite) $$ P = \Big( \frac{ 2 \Delta}{1-e^{-2\Delta t}}\Big)^{1/2} H ( H + I)^{-1} \Big( \frac{ 2 \Delta}{1-e^{-2\Delta t}}\Big)^{1/2} ,$$ with $$H = \Big( \frac{e^{2\Delta t}-I}{2\Delta } \Big)^{1/2} R_0 R_0^\top \Big( \frac{e^{2\Delta t}-I}{2\Delta } \Big)^{1/2}.$$ We note that the eigenvalues of \(\Delta\) are singular values of \(C\) as well as their opposite, together with zeros.
The term \(\big( \frac{ 2 \Delta}{1-e^{-2\Delta t}}\big)^{1/2}\) is a spectral function of \(\Delta\) which asymptotically sets to zero all negative eigenvalues and maps each positive eigenvalue \(c\) to \(( 2 c)^{1/2}\). We can therefore consider only the positive ones and treat these terms as the identity. Zero eigenvalues of \(\Delta\) also lead to decay, this time in \(1/t\), and can be discarded as well in the asymptotic analysis.
The term \(\big( \frac{e^{2\Delta t}-I}{2\Delta } \big)^{1/2}\), with strictly positive eigenvalues, is equivalent to \(e^{\Delta t} ( 2 \Delta)^{-1/2}\), and we are faced with studying the dynamics of \(H(I+H)^{-1}\) for \(H = e^{\Delta t} \alpha^2 M e^{\Delta t}\), with \(M = \Delta^{-1} R_0 R_0^\top \Delta^{-1}\) a rank-\(m\) \(r \times r\) matrix, when \(\alpha\) tends to zero. We rewrite \(t = u \log \frac{1}{\alpha}\), and we then have, once in the basis of singular vectors, $$H = {\rm Diag}(\alpha^{1-cu}) M {\rm Diag}(\alpha^{1-cu}).$$ We assume \(c_1 > \cdots > c_r\) for simplicity and consider three cases.
Case 1. If \(1- c_i u > 0\) for all \(i\), that is, \(u < \frac{1}{c_1}\), then the matrix \({\rm Diag}(\alpha^{1-cu})\) goes to zero, and thus \(P\) goes to zero.
Case 2. For \(j \in \{1,\dots,m\}\), and \(u \in [\frac{1}{c_j}, \frac{1}{c_{j+1}}]\), then we have $$H = \left( \begin{array}{cc} D_\infty A D_\infty & D_\infty B D_0 \\ D_0 B^\top D_\infty & D_0 C D_0 \end{array} \right),$$ for a diagonal matrix \(D_\infty\) (of size \(j\)) that goes to infinity and a diagonal matrix \(D_0\) that goes to zero. Using \(H (H+I)^{-1} = I - (H+I)^{-1}\) and Schur complements, we have $$ (H+I)^{-1} = \left( \begin{array}{cc} \bar{A} & \bar{B} \\ \bar{B}^\top & \bar{C} \end{array} \right),$$ with $$\bar{A}^{-1} = I + D_\infty A D_\infty - D_\infty B D_0 ( D_0 C D_0 +I )^{-1}D_0 B^\top D_\infty \sim I + D_\infty A D_\infty,$$ so that \(\bar{A}\) tends to zero (since \(A\) is invertible). We also have $$\bar{C}^{-1} = I + D_0 C D_0 - D_0 B^\top D_\infty ( D_\infty A D_\infty +I )^{-1}D_\infty B D_0 \sim I,$$ with a similar limit for the off-diagonal block. Therefore \((H+I)^{-1}\) tends to \(\left( \begin{array}{cc} 0 & 0 \\ 0 &I \end{array} \right)\), which leads to the desired result.
Case 3. For \(u > 1/c_m\), using the same formula for \((H+I)^{-1}\), with now \(D_0\) not tending to zero, we can use the fact that \(C = B^\top A^{-1} B\) to obtain the desired limit.
A symbolic determinant under rank-one restriction computes a polynomial of the form $\det(A_0+A_1y_1+\ldots+A_ny_n)$, where $A_0,A_1,\ldots,A_n$ are square matrices over a field $\mathbb{F}$ and $rank(A_i)=1$ for each $i\in[n]$. This class of polynomials has been studied extensively, since the work of Edmonds (1967), in the context of linear matroids, matching, matrix completion and polynomial identity testing. We study the following learning problem for this class: Given black-box access to an $n$-variate polynomial $f=\det(A_0+A_1y_1+ \ldots+A_ny_n)$, where $A_0,A_1,\ldots,A_n$ are unknown square matrices over $\mathbb{F}$ and rank$(A_i)=1$ for each $i\in[n]$, find a square matrix $B_0$ and rank-one square matrices $B_1,\ldots,B_n$ over $\mathbb{F}$ such that $f=\det(B_0+B_1y_1+\ldots+B_ny_n)$. In this work, we give a randomized poly(n) time algorithm to solve this problem. As the above-mentioned class is known to be equivalent to the class of read-once determinants (RODs), we will refer to the problem as learning RODs. The algorithm for learning RODs is obtained by connecting with a well-known open problem in linear algebra, namely the Principal Minor Assignment Problem (PMAP), which asks to find (if possible) a matrix having prescribed principal minors. PMAP has also been studied in machine learning to learn the kernel matrix of a determinantal point process. Here, we study a natural black-box version of PMAP: Given black-box access to an $n$-variate polynomial $f = \det(A + Y)$, where $A \in \mathbb{F}^{n \times n}$ is unknown and $Y = diag(y_1,\ldots,y_n)$, find a $B\in\mathbb{F}^{n\times n}$ such that $f=det(B+Y)$. We show that black-box PMAP can be solved in randomized poly(n) time, and further, it is randomized polynomial-time equivalent to learning RODs. We resolve black-box PMAP by investigating a property of dense matrices that we call the rank-one extension property.
Formal reasoning about inductively defined relations and structures is widely recognized not only for its mathematical interest but also for its importance in computer science, and has applications in verifying properties of programs and algorithms. Recently, several proof systems of inductively defined predicates based on sequent calculus including the cyclic proof system CLKID-omega and the infinite-descent proof system LKID-omega have attracted much attention. Although the relation among their provabilities has been clarified so far, the logical complexity of these systems has not been much studied. The infinite-descent proof system LKID-omega is an infinite proof system for inductive definitions and allows infinite paths in proof figures. It serves as a basis for the cyclic proof system. This paper shows that the logical complexity of the provability in LKID-omega is (Pi-1-1)-complete. To show this, first it is shown that the validity for inductive definitions in standard models is equivalent to the validity for inductive definitions in standard term models. Next, using this equivalence, this paper extends the truth predicate of omega-languages, as given in Girard's textbook, to inductive definitions by employing arithmetical coding of inductive definitions. This shows that the validity of inductive definitions in standard models is a (Pi-1-1) relation. Then, using the completeness of LKID-omega for standard models, it is shown that the logical complexity of the provability in LKID-omega is (Pi-1-1)-complete.
Error-correcting codes are a method for representing data so that one can recover the original information even if some parts of it are corrupted. The basic idea, which dates back to the revolutionary work of Shannon and Hamming about a century ago, is to encode the data into a redundant form, so that the original information can be decoded from the redundant encoding even in the presence of some noise or corruption. One prominent family of error-correcting codes is Reed-Solomon Codes, which encode the data using evaluations of low-degree polynomials. Nearly six decades after they were introduced, Reed-Solomon Codes, as well as some related families of polynomial-based codes, continue to be widely studied, both from a theoretical perspective and from the point of view of applications.
Besides their obvious use in communication, error-correcting codes such as Reed-Solomon Codes are also useful for various applications in theoretical computer science. These applications often require the ability to cope with many errors, far more than unique decoding permits. List-decodable codes are a special class of error-correcting codes that enable correction from more errors than is traditionally possible by allowing a small list of candidate decodings. These codes have turned out to be extremely useful in various applications across theoretical computer science and coding theory.
In recent years, there have been significant advances in list decoding of Reed-Solomon Codes and related families of polynomial-based codes. This includes efficient list decoding of such codes up to the information-theoretic capacity, with optimal list-size, and using fast nearly-linear time, and even sublinear-time, algorithms. In this book, we survey these developments.
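For readers new to the area, the following minimal sketch (a background illustration of mine, not taken from the survey) shows the core Reed-Solomon idea: a message becomes the coefficients of a degree-$<k$ polynomial over a prime field, the codeword is its evaluations at $n$ points, and any $k$ uncorrupted evaluations suffice to recover the whole codeword via Lagrange interpolation:

```python
# Reed-Solomon over GF(p): message = coefficients of a degree < k polynomial,
# codeword = its evaluations at n distinct points.
p = 97

def poly_eval(coeffs, x):
    # Horner evaluation mod p; coeffs in ascending order of degree
    r = 0
    for c in reversed(coeffs):
        r = (r * x + c) % p
    return r

k, n = 3, 7
msg = [42, 7, 13]                     # 42 + 7x + 13x^2
code = [poly_eval(msg, x) for x in range(n)]

def lagrange_at(points, x):
    # evaluate the unique degree < len(points) interpolant at x, mod p
    total = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if j != i:
                num = num * (x - xj) % p
                den = den * (xi - xj) % p
        total = (total + yi * num * pow(den, -1, p)) % p
    return total

# erase all but k of the n symbols, then rebuild the full codeword
keep = [1, 4, 6]
points = [(x, code[x]) for x in keep]
recovered = [lagrange_at(points, x) for x in range(n)]
assert recovered == code
```

List decoding, the subject of the survey, goes further: it recovers a short list of candidate polynomials even when so many evaluations are corrupted that no single answer is determined.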
Locally decodable codes (LDCs) are error-correcting codes that allow recovery of any single message symbol by probing only a small number of positions from the (possibly corrupted) codeword. Relaxed locally decodable codes (RLDCs) further allow the decoder to output a special failure symbol $\bot$ on a corrupted codeword. While known constructions of RLDCs achieve much better parameters than standard LDCs, it is intriguing to understand the relationship between LDCs and RLDCs. Separation results (i.e., the existence of $q$-query RLDCs that are not $q$-query LDCs) are known for $q=3$ (Gur, Minzer, Weissenberg, and Zheng, arXiv:2512.12960, 2025) and $q \geq 15$ (Grigorescu, Kumar, Manohar, and Mon, arXiv:2511.02633, 2025), while any $2$-query RLDC also gives a $2$-query LDC (Block, Blocki, Cheng, Grigorescu, Li, Zheng, and Zhu, CCC 2023).
In this work, we generalize and strengthen the main result in Grigorescu, Kumar, Manohar, and Mon (arXiv:2511.02633, 2025), by removing the requirement of linear codes. Specifically, we show that any $q$-query RLDC with soundness error below some threshold $s(q)$ also yields a $q$-query LDC with comparable parameters. This holds even if the RLDC has imperfect completeness but with a non-adaptive decoder. Our results also extend to the setting of locally correctable codes (LCCs) and relaxed locally correctable codes (RLCCs). Using our results, we further derive improved lower bounds for arbitrary RLDCs and RLCCs, as well as probabilistically checkable proofs of proximity (PCPPs).
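As background on what a $q$-query LDC is, the classic 2-query example is the Hadamard code: bit $i$ of the message is recovered by XORing the codeword at two random positions differing in coordinate $i$. A toy sketch (illustrative only; the parameter choices are mine, not from the paper):

```python
import random
random.seed(1)

k = 4
m = [1, 0, 1, 1]

def int_to_vec(a):
    return [(a >> i) & 1 for i in range(k)]

def dot(u, v):
    return sum(x * y for x, y in zip(u, v)) % 2

# Hadamard codeword: one bit <m, a> per vector a in {0,1}^k (length 2^k)
code = [dot(m, int_to_vec(a)) for a in range(2 ** k)]

# corrupt one position of the codeword
corrupted = code[:]
for pos in random.sample(range(2 ** k), 1):
    corrupted[pos] ^= 1

def decode_bit(word, i, trials=25):
    # 2-query local decoder: word[a] xor word[a + e_i] equals m_i
    # whenever both queried positions are uncorrupted; take a majority vote
    votes = 0
    for _ in range(trials):
        a = random.randrange(2 ** k)
        votes += word[a] ^ word[a ^ (1 << i)]
    return 1 if 2 * votes > trials else 0

assert [decode_bit(corrupted, i) for i in range(k)] == m
```

Each trial queries only 2 of the 16 codeword positions, so a small fraction of corruptions rarely touches both queries; that locality, at the cost of exponential codeword length, is the trade-off the LDC/RLDC literature studies.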
Authors: William Merrill, Hongjian Jiang, Yanhong Li, Anthony Lin, Ashish Sabharwal
The community is increasingly exploring linear RNNs (LRNNs) as language models, motivated by their expressive power and parallelizability. While prior work establishes the expressivity benefits of LRNNs over transformers, it is unclear what makes LRNNs -- but not traditional, nonlinear RNNs -- as easy to parallelize in practice as transformers. We answer this question by providing a tight connection between types of RNNs and standard complexity classes. We show that LRNNs can be viewed as log-depth (bounded fan-in) arithmetic circuits, which represents only a slight depth overhead relative to log-depth boolean circuits that transformers admit. Furthermore, we show that nonlinear RNNs can solve $\mathsf{L}$-complete problems (and even $\mathsf{P}$-complete ones, under polynomial precision), revealing a fundamental barrier to parallelizing them as efficiently as transformers. Our theory also identifies fine-grained expressivity differences between recent popular LRNN variants: permutation-diagonal LRNNs are $\mathsf{NC}^1$-complete whereas diagonal-plus-low-rank LRNNs are more expressive ($\mathsf{PNC}^1$-complete). We provide further insight by associating each type of RNN with a corresponding automata-theoretic model that it can simulate. Together, our results reveal fundamental tradeoffs between nonlinear RNNs and different variants of LRNNs, providing a foundation for designing LLM architectures that achieve an optimal balance between expressivity and parallelism.
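To illustrate why linear state updates parallelize, note that the diagonal-LRNN recurrence $h_t = a_t h_{t-1} + b_t$ composes affine maps, and affine composition is associative, so all prefix states can be computed by a logarithmic-depth scan instead of a length-$T$ sequential loop. A toy sketch with scalar state (illustrative only, not code from the paper):

```python
def combine(f, g):
    # compose affine maps, f applied first: g(f(h)) = a_g*(a_f*h + b_f) + b_g
    a_f, b_f = f
    a_g, b_g = g
    return (a_f * a_g, a_g * b_f + b_g)

def prefix_scan(ops):
    # Recursive prefix scan: O(log T) depth given enough processors.
    # It runs sequentially here, but the combine tree is what a GPU parallelizes.
    if len(ops) == 1:
        return ops
    mid = len(ops) // 2
    left = prefix_scan(ops[:mid])
    right = prefix_scan(ops[mid:])
    carry = left[-1]          # composition of the entire left half
    return left + [combine(carry, r) for r in right]

# diagonal LRNN step: h_t = a_t * h_{t-1} + b_t
a = [2, 1, 3, 1, 2, 1, 1, 2]
b = [1, 0, 2, 1, 0, 3, 1, 1]
h0 = 1

# sequential reference
hs_seq, h = [], h0
for a_t, b_t in zip(a, b):
    h = a_t * h + b_t
    hs_seq.append(h)

# parallel-scan version: apply each prefix composition to h0
prefixes = prefix_scan(list(zip(a, b)))
hs_scan = [A * h0 + B for A, B in prefixes]
assert hs_scan == hs_seq
```

A nonlinear RNN step such as $h_t = \tanh(W h_{t-1} + x_t)$ does not compose associatively into a compact summary, which is the intuitive root of the parallelization barrier the paper formalizes via complexity classes.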
Authors: MIT Hardness Group, Zachary Abel, Erik D. Demaine, Jenny Diomidova, Jeffery Li, Zixiang Zhou
Given a graph, when can we orient the edges to satisfy local constraints at the vertices, where each vertex specifies which local orientations of its incident edges are allowed? This family of graph orientation problems is a special kind of SAT problem, where each variable (edge orientation) appears in exactly two clauses (vertex constraints) -- once positively and once negatively. We analyze the complexity of many natural vertex types (patterns of allowed vertex neighborhoods), most notably all sets of symmetric vertex types which depend on only the number of incoming edges. In many scenarios, including Planar and Non-Planar Symmetric Graph Orientation with constants, we give a full dichotomy characterizing P vs. NP-complete problem classes. We apply our results to obtain new polynomial-time algorithms, resolving a 20-year-old open problem about KPlumber; to simplify existing NP-hardness proofs for tiling with trominoes; and to prove new NP-completeness results for tiling with tetrominoes.
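To make the problem statement concrete, here is a hypothetical brute-force sketch of mine (exponential, for toy instances only, and not one of the paper's algorithms) that searches for an orientation satisfying symmetric vertex types, given as sets of allowed in-degrees per vertex:

```python
from itertools import product

def orient(edges, allowed_indeg):
    """Brute-force search for an edge orientation in which every vertex's
    in-degree lies in its allowed set (a 'symmetric vertex type')."""
    n_vertices = len(allowed_indeg)
    for bits in product([0, 1], repeat=len(edges)):
        indeg = [0] * n_vertices
        for (u, v), flip in zip(edges, bits):
            indeg[v if flip == 0 else u] += 1
        if all(indeg[v] in allowed_indeg[v] for v in range(n_vertices)):
            # return directed edges as (tail, head) pairs
            return [(u, v) if flip == 0 else (v, u)
                    for (u, v), flip in zip(edges, bits)]
    return None

# a 4-cycle where every vertex demands in-degree exactly 1:
# the two rotations of the cycle are the only solutions
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
result = orient(edges, [{1}] * 4)
assert result is not None

indeg = [0] * 4
for _, head in result:
    indeg[head] += 1
assert indeg == [1, 1, 1, 1]
```

The dichotomy results in the paper classify exactly which families of allowed in-degree sets admit polynomial-time algorithms and which make this search NP-complete.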
Pangenomics uses graph-based models to represent and study the genetic variation between individuals of the same species or between different species. In such variation graphs, a path through the graph represents one individual genome. Subgraphs that encode locally distinct paths are therefore genomic regions with distinct genetic variation, and detecting such subgraphs is integral for studying genetic variation. Biedged graphs are a type of variation graph that uses two types of edges, black and grey, to represent genomic sequences and adjacencies between sequences, respectively. Ultrabubbles in biedged graphs are minimal subgraphs that represent a finite set of sequence variants that all start and end with two distinct sequences; that is, ultrabubbles are acyclic and all paths in an ultrabubble enter and exit through two distinct black edges. Ultrabubbles are therefore a special case of snarls, which are minimal subgraphs that are connected with two black edges to the rest of the graph. Here, we show that any bidirected graph can be transformed to a bipartite biedged graph in which lowest common ancestor queries can determine whether a snarl is an ultrabubble. This leads to an O(Kn) algorithm for finding all ultrabubbles in a set of K snarls, improving on the prior naive approach of O(K(n + m)) in a biedged graph with n nodes and m edges. Accordingly, our benchmark experiments on real and synthetic variation graphs show improved run times on graphs with few cycles and dead-end paths, and on dense graphs with many edges.
We propose a novel computational framework for analyzing electroencephalography (EEG) time series using methods from stringology, the study of efficient algorithms for string processing, to systematically identify and characterize recurrent temporal patterns in neural signals. The primary aim is to introduce quantitative measures to understand neural signal dynamics, with the present findings serving as a proof-of-concept. The framework adapts order-preserving matching (OPM) and Cartesian tree matching (CTM) to detect temporal motifs that preserve relative ordering and hierarchical structure while remaining invariant to amplitude scaling. This approach provides a temporally precise representation of EEG dynamics that complements traditional spectral and global complexity analyses. To evaluate its utility, we applied the framework to multichannel EEG recordings from individuals with attention-deficit/hyperactivity disorder (ADHD) and matched controls using a publicly available dataset. Highly recurrent, group-specific motifs were extracted and quantified using both OPM and CTM. The ADHD group exhibited significantly higher motif frequencies, suggesting increased repetitiveness in neural activity. OPM analysis revealed shorter motif lengths and greater gradient instability in ADHD, reflected in larger mean and maximal inter-sample amplitude changes. CTM analysis further demonstrated reduced hierarchical complexity in ADHD, characterized by shallower tree structures and fewer hierarchical levels despite comparable motif lengths. These findings suggest that ADHD-related EEG alterations involve systematic differences in the structure, stability, and hierarchical organization of recurrent temporal patterns. The proposed stringology-based motif framework provides a complementary computational tool with potential applications for objective biomarker development in neurodevelopmental disorders.
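To illustrate the order-preserving matching idea concretely, the following toy sketch (my own illustration, not the paper's implementation) maps each window of a signal to its rank pattern, which is invariant to amplitude scaling and offset, and counts how often each pattern recurs:

```python
def shape(window):
    # rank pattern of a window: rank of each sample among the window's values;
    # two windows match under OPM iff they have the same rank pattern
    order = sorted(range(len(window)), key=lambda i: window[i])
    ranks = [0] * len(window)
    for r, i in enumerate(order):
        ranks[i] = r
    return tuple(ranks)

def count_motifs(signal, m):
    # count occurrences of every length-m order pattern in the signal
    counts = {}
    for t in range(len(signal) - m + 1):
        s = shape(signal[t:t + m])
        counts[s] = counts.get(s, 0) + 1
    return counts

# toy "signal": two differently scaled copies of the same up-down-up motif
sig = [1, 5, 2, 8, 10, 50, 20, 80, 3]
motifs = count_motifs(sig, 4)

# [1,5,2,8] and [10,50,20,80] share the rank pattern (0,2,1,3),
# so OPM counts them as the same recurring motif
assert shape([1, 5, 2, 8]) == shape([10, 50, 20, 80])
assert motifs[shape([1, 5, 2, 8])] >= 2
```

Cartesian tree matching is analogous but compares the hierarchical (tree) structure induced by local minima rather than the full rank order, which is what the paper's shallower-tree findings for ADHD refer to.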
I have just completed a draft of the book. It is the March 4 version on my homepage. CLICK HERE TO DOWNLOAD. (If you downloaded already, check you don’t have the earlier version cached.) I would very much appreciate any feedback on the book, including suggestions on things to include, especially if they fit well and are not overly technical. Please feel completely free to suggest your own results.
After collecting feedback, I plan to do one more pass and then it’s off to the press, so speak now ;-)
Here is the table of contents in case you want to know what the book is about before downloading:
0 Introduction
0.1 Teasers
0.1.1 Computing with three bits of memory
0.1.2 Randomness and derandomization
0.1.3 Proofs and delegating computation
0.1.4 Changing base losing no time and no space
0.2 Notation
0.3 Contents and comparisons
0.4 How to use this book
0.5 Acknowledgments
1 Time
1.1 Word programs
1.2 Complexity classes
1.3 You don’t need much to have it all
1.4 Composing programs
1.5 Universal programs
1.6 The fastest algorithm for Factoring
1.7 On the word size
1.7.1 Factoring with large words
1.8 The grand challenge
1.8.1 Undecidability and diagonalization
1.8.2 The hierarchy of Time
1.9 Problems
1.10 Notes
2 Circuits
2.1 The grand challenge for circuits
2.2 Circuits vs. Time
2.3 Cosmological, non-asymptotic impossibility
2.4 Problems
2.5 Notes
3 Randomness
3.1 Error reduction for one-sided algorithms
3.2 Error reduction for BPTime
3.3 The power of randomness
3.3.1 Verifying matrix multiplication
3.3.2 Checking if a circuit represents zero
3.4 Does randomness really buy time?
3.5 The hierarchy of BPTime
3.6 Problems
3.7 Notes
4 Reductions
4.1 Types of reductions
4.2 Multiplication
4.3 3Sum
4.4 Satisfiability
4.4.1 3Sat to Clique
4.4.2 3Sat to Subset-Sum
4.4.3 3Sat to 3Color
4.4.4 More
4.5 Power hardness from SETH
4.6 Search problems
4.7 Gap-Sat: The PCP theorem
4.8 Problems
4.9 Notes
5 Nondeterminism
5.1 Nondeterministic computation
5.2 Completeness
5.3 From programs to 3Sat in quasi-linear time
5.3.1 Efficient sorting circuits
5.4 Power from completeness
5.4.1 Max-3Sat
5.4.2 NP is as easy as detecting unique solutions
5.5 Alternation
5.5.1 Does the hierarchy collapse?
5.6 Problems
5.7 Notes
6 Space
6.1 Branching programs
6.2 The power of L
6.2.1 Arithmetic
6.2.2 Graphs
6.2.3 Linear algebra
6.3 Checkpoints
6.4 The grand challenge for space
6.5 Randomness
6.6 Reductions
6.6.1 P vs. PSpace
6.6.2 L vs. P
6.7 Nondeterministic space
6.8 An impossibility result for 3Sat
6.9 TiSp
6.10 Computing with a full memory: Catalytic space
6.11 Problems
6.12 Notes
7 Depth
7.1 Depth vs space
7.2 The power of NC2: Linear algebra
7.3 Formulae
7.3.1 The grand challenge for formulae
7.4 The power of NC1: Arithmetic
7.5 Computing with 3 bits of memory
7.6 Group programs
7.7 The power of NC0: Cryptography
7.8 Word circuits
7.8.1 Simulating circuits with square-root space
7.9 Uniformity
7.10 Problems
7.11 Notes
8 Majority
8.1 The power of TC0: Arithmetic
8.2 Neural networks
8.3 Amplifying lower bounds by means of self-reducibility
8.4 The power of Majority: Boosting correlation
8.5 Uniformity
8.6 Problems
8.7 Notes
9 Alternation
9.1 The polynomial method over F2
9.1.1 AC0 correlates with low-degree polynomials modulo 2
9.1.2 Using the correlation to show that Majority is hard
9.2 The polynomial method over R
9.2.1 AC0 correlates with low-degree real polynomials
9.2.2 Sign-Approximating Maj-AC0
9.2.3 Using the correlation to show impossibility for Maj-AC0
9.2.4 AC0 has small correlation with parity
9.3 Switching lemmas
9.3.1 Switching I
9.3.2 Switching II
9.3.3 Proof of switching II
9.4 AC0 vs L, NC1, and TC0
9.4.1 L
9.4.2 Linear-size log-depth
9.4.3 TC0
9.5 The power of AC0: Gap majority
9.5.1 Back to the PH
9.6 Mod 6
9.6.1 The power of ACC0
9.7 Impossibility results for ACC0
9.8 The power of AC0: sampling
9.9 Problems
9.10 Notes
11 Pseudorandomness
11.1 Basic PRGs
11.1.1 Local tests
11.1.2 Low-degree polynomials
11.1.3 Local tests, II
11.1.4 Local small bias
11.2 PH is a random low-degree polynomial
11.3 Pseudorandom generators from hard functions
11.3.1 Stretching correlation bounds: The bounded-intersection generator
11.3.2 Turning hardness into correlation bounds
11.3.3 Hard-core sets
11.3.4 Derandomizing the XOR lemma
11.3.5 Encoding the whole truth-table
11.3.6 Monotone amplification within NP
11.4 Finding the hard core
11.5 Cryptographic pseudorandom generators
11.5.1 AC0
11.5.2 Circuits
11.6 Problems
11.7 Notes
12 Expansion
12.1 Edge expansion
12.2 Spectral expansion
12.3 Undirected reachability in L
12.4 What do expanders fool?
12.5 On the proof of the PCP theorem
12.6 Problems
12.7 Notes
13 Communication
13.1 Two parties
13.1.1 The rectangle method and the Equality function
13.1.2 Rounds: Pointer chasing
13.1.3 Randomness
13.1.4 Arbitrary partition
13.2 Number-on-forehead
13.2.1 Generalized inner product
13.2.2 The power of logarithmic parties
13.2.3 Pointer chasing
13.3 Problems
13.4 Notes
14 Arithmetic
14.1 Linear transformations
14.1.1 The power of depth-2 xor circuits
14.2 Integers
14.3 Univariate polynomials
14.4 Multivariate polynomials
14.5 Depth reduction and completeness
14.6 Alternation
14.6.1 The power of depth 3
14.6.2 Impossibility results
14.7 Problems
14.8 Notes
16 Tapes
16.1 On the alphabet
16.1.1 The universal TM
16.2 Multi-tape machines
16.2.1 Time vs. TM-Time
16.3 TMs vs circuits
16.4 The grand challenge for TMs
16.4.1 Communication bottleneck: crossing sequences
16.5 TM-Time hierarchy
16.6 Randomness
16.7 Sub-logarithmic space
16.7.1 The power of sub-logarithmic space
16.8 Problems
16.9 Notes
17 Barriers
17.1 Black-box
17.2 Natural proofs
17.2.1 Examples of proofs that are natural
17.2.2 Ruling out natural proofs under popular conjectures
17.3 Problems
17.4 Notes
18 Speculation
18.1 Critique of common arguments for P ≠ NP
A Miscellanea
A.1 Logic
A.2 Integers
A.3 Sums
A.4 Inequalities
A.5 Probability theory
A.5.1 Deviation bounds for the sum of random variables
A.6 Squaring tricks
A.7 Duality
A.8 Groups
A.9 Fields
A.10 Linear algebra
A.11 Polynomials
A.12 Notes
Assembling the consensus on the necessity of vitamins.
This is Part 5 of 7 of a blogged essay “Steampunk Data Science.” A table of contents is here.
At the same time that Stephen Babcock was moving from New York to Wisconsin, Christian Eijkman departed the Netherlands for a new position in the Dutch Colony of Batavia. Eijkman would be credited with discovering Vitamin B, but his path to discovery was markedly different from the one taken in Wisconsin.
Eijkman was sent to Batavia, which we now know as Jakarta, to investigate the disease beriberi. Plaguing much of Asia, beriberi caused nervous system disorders, including loss of the ability to walk and loss of reflexes. As was the style at the time, Eijkman was convinced beriberi was an infectious disease, and he set out upon a series of experiments to isolate the responsible germ.
Eijkman’s first attempts involved transfusing human blood into animals in the hope of inducing beriberi. He first tried to infect monkeys, to no avail. He next experimented with rabbits but also failed to induce beriberi. His third animal model was chickens. Even in the 1800s, a non-mammalian test subject was outlandish, but chickens could succumb to a beriberi-like illness called polyneuritis, which caused a similar pattern of neural dysfunction. Moreover, Eijkman had grown frustrated with his string of failures.
To his surprise and delight, Eijkman found that chickens often developed polyneuritis after receiving his beriberi transfusions. However, there was one perplexing glitch with his experiment: He observed an equal incidence of polyneuritis in the treatment and control groups. How was this possible? Eijkman tried and failed to cultivate the infecting substance from the afflicted chickens. None of the germ theory he had learned studying with Robert Koch could explain the outbreak of polyneuritis in his experimental fowl. Miraculously, in one of those mythologized moments of scientific luck, Eijkman happened upon an impossible answer by accident.
To save precious research funds, the lab assistant assigned to care for the chickens had been feeding them the leftover rice from the military mess hall adjacent to Eijkman’s facility. But the food supply was cut off after a new barracks chef, in Eijkman’s words, “refused to allow military rice to be taken for civilian chickens.” The assistant had to resort to purchasing bargain-basement brown rice for the chickens. Once the rice was switched, the chickens were cured of polyneuritis.
What was different about the barracks rice and the new civilian rice? The military rice was white. White rice is polished brown rice. Eijkman was convinced that something about the polishing turned the rice into a beriberi vector.
Unfortunately, he had no idea what that something was. He had a difficult time shaking the germ theory instilled in him by Koch. Perhaps polishing the rice allowed some bug to flourish in the white rice. Perhaps the brown rice contained something that killed an unknown pathogen. Determining what precisely was special about the brown rice would take decades.
Multiple investigations confirmed Eijkman’s observations over the subsequent years. For example, in 1897, Adolphe Vorderman observed that prisons that fed prisoners white rice had a considerably higher incidence of beriberi than those that fed brown rice. You might say that Eijkman’s observation in chickens had been “replicated” in humans in this observational study. But researchers at the time were as wary of confounding in non-interventional studies as they are today.
It wasn’t until 1901 that an experiment seemed to definitively confirm that there was something valuable in the rice bran itself. Eijkman’s successor in Batavia, Gerrit Grijns, demonstrated that chickens would not develop polyneuritis on a diet of meat and rice bran. Grijns confidently (and correctly) declared there was something essential to the diet in the rice bran. It took another decade for scientific consensus to accept Grijns was right.
The case for vitamins would finally be closed by Casimir Funk, who wrote an audacious meta-analysis in 1912. In his report, he boldly claimed that beriberi, scurvy, and pellagra were all caused by the lack of some food substance in a person’s diet. Part of Funk’s evidence was epidemiological. He noted these afflictions only occurred in countries where people ate unvarying, monotonous diets, like those based on polished rice. However, not all monotonous diets were necessarily perilous. Russians who lived on a diet of cabbage, potatoes, and bacon seemed to avoid this cluster of illnesses. Some diets were missing yet-to-be-identified nutrients, and people couldn’t live without them. He called the missing components “vital amines” or “vitamines” for short.
Funk, a biochemist at the Lister Institute of Preventive Medicine in London, had been investigating the properties of rice polishings, his interest piqued by the findings from Indonesia. He devised multiple chemical mechanisms to extract the curative component from the discarded rice shavings. He found that it would take tremendous quantities of rice to yield tiny amounts of this substance, but only a small amount was required to cure pigeons of polyneuritis.
These experiments and the evidence from Asia convinced Funk. Germ theory, toxic theory, and hormonal theory were each insufficient to explain the evidence. Instead, Funk invented a new classification: deficiency diseases. He proposed that small changes in diet could reliably cure these diseases.
Wisconsin investigators McCollum and Davis soon confirmed Funk’s vitamin theory. Their fat-soluble compound alone could not sustain rats fed polished rice. However, adding the water-soluble compound found in the rice polishings allowed the rats to thrive.
There are necessary for normal nutrition during growth two classes of unknown accessory substances, one soluble in fats and accompanying these in the process of isolation of fats from certain foodstuffs, and the other soluble in water, but apparently not in fats.
Thus, McCollum and Davis showed that multiple compounds were necessary to sustain the life of their rats, and these compounds were not proteins, fats, or carbohydrates. Having grown more adept with rat experiments, Davis now included a much more extensive collection of rodent growth curves, demonstrating the existence of at least two vitamins.
The fat-soluble vitamin was A. The water-soluble vitamin was B.1
Funk also called out scurvy as a deficiency disease. Though the British Navy knew that scurvy could be treated with lemons, they did not understand its etiology at all. Axel Holst and Theodor Frolich had induced scurvy in guinea pigs with a deficient diet and then cured the condition using lemon juice or cabbage. Funk concluded that the vitamin associated with scurvy was distinct from that of beriberi, as it was much more sensitive to boiling. Though most were convinced that this evidence showed scurvy was a deficiency disease, Vitamin C would not be chemically isolated until 1928.
Funk went even further out on a limb. He asserted that pellagra, a disease endemic to northern Italy, was also a deficiency disease. Pellagra was associated with starch-heavy diets based on corn. It wouldn’t be until two decades later, in 1937, that the associated vitamin would be discovered. Pellagra is caused by a deficiency of niacin, also known as Vitamin B3.
In passing in the penultimate paragraph of his manuscript, Funk hypothesized that rickets, characterized by weakened bones and deformed legs, was also likely a deficiency disease. This hypothesis would also prove correct. Rickets is cured by Vitamin D, a vitamin discovered in 1922 by McCollum. A year later, Harry Steenbock, another collaborator on the Single-Grain experiment at Wisconsin, found that Vitamin D could be synthesized by irradiating food with UV light and that this irradiated food cured rickets in mice.2
Though many now point to the serendipitous diet change in chickens as the “discovery” of vitamins, Funk’s assembly of 30 years of evidence and naming of a cause had more immediate impact. Funk’s theorizing flew in the face of germ theory and asserted something new. He revolutionized how we think about illness as thoroughly as germ theory itself had. The evidence had been piling up. People knew that treatments for these diseases involved dietary changes. The rodent studies assuredly pointed to particular necessary factors in food, but Funk was the first to put it all together and state it outright.
Once Funk formulated the concepts of vitamins and deficiency diseases, the rest of the scientific community jumped on board. They had a whole new way to think about what causes and cures illness. With sufficient dietary diversity, humankind was empowered to cure deficiency diseases plaguing large parts of the world.
This history of the discovery of Vitamin B in the Dutch colonies synthesizes Eijkman’s Nobel Prize address and the accounts of Carpenter, Combs and McClung, and Vandenbroucke. Combs and McClung’s text on vitamins also discusses Funk’s legacy in vitamins and his connection to McCollum and Davis. Griminger provides additional details about Funk’s biography.
Specifically, the vitamin whose absence causes beriberi is B1. Though it wasn’t apparent at the time, nutritionists soon realized there were many water-soluble vitamins. We now have 12 named B vitamins, and all of them are water-soluble.
Steenbock would subsequently patent the method of irradiating milk to fortify it with Vitamin D. This patent would generate untold sums of money for the University of Wisconsin. It’s one of the earliest and most successful examples of universities funding themselves through licensing their intellectual property.
Discussions of AI and mathematics often turn to mathematical proofs, such as the First Proof challenge. So let's look at the role of proofs in mathematics.
Without a proof, you don't even know whether a theorem is true or false. It's not even a theorem until you have a proof, just a conjecture or hypothesis. You might have some intuition, but you don't know the hardness of a proof until you find it. Even then, that only gives you an upper bound on hardness, as someone might find a simpler alternative proof in the future.
Proofs give an understanding of why a theorem is true. A proof can look like a piece of art, especially if it works in a new or clever way, maybe even a proof from the book like Kelly's proof of the Sylvester–Gallai theorem. A mathematician often has their most satisfying experiences finding a proof of a theorem they or others had trouble proving earlier.
Conjectures drive the need for proofs, but often it goes the other direction: a failed proof leads to a new conjecture and maybe a different theorem altogether. My time-space tradeoffs came out of a failed attempt to show NP different from L. Too often I see papers whose theorems are really just whatever the best proof the authors could find achieves.
Proofs are supposedly objective: either it is a valid proof or it isn't. That's the biggest reason AI tackles proofs, since we can grade them as right or wrong. Most academic papers have high-level proofs, however, and a referee needs to make a subjective decision on whether a proof has enough details to count as legitimate. If a proof has minor fixable errors, then strictly it isn't a proof, but we usually count it as one.
Now we have proof systems like Lean where we can make proofs truly objective. There's a large mathlib project to formalize much of mathematics in Lean, and talk of a cslib as well.
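To make "truly objective" concrete: a Lean proof is a term the kernel checks mechanically, with no referee judgment involved. Here is a toy theorem in Lean 4 syntax (my own example, not from mathlib; it assumes only `Nat.mul_add` from Lean's core library):

```lean
-- A machine-checked proof: the sum of two even numbers is even.
theorem even_add_even (a b : Nat)
    (ha : ∃ k, a = 2 * k) (hb : ∃ k, b = 2 * k) :
    ∃ k, a + b = 2 * k := by
  cases ha with
  | intro m hm =>
    cases hb with
    | intro n hn =>
      -- a + b = 2*m + 2*n = 2*(m + n), so m + n witnesses the goal.
      exact ⟨m + n, by rw [hm, hn, Nat.mul_add]⟩
```

If the kernel accepts this term, the theorem holds; if any step is wrong, compilation fails. There is no notion of a proof with "minor fixable errors."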
Lean excites me less; I'm not sure what we gain by formalizing proofs besides certainty. Will Fermat's Last Theorem be any more true once we formalize it? One could argue Lean also helps AI reason about mathematics in a grounded way, though could AI just get lost in the logical weeds?
I worry that the heavy task of converting proofs to Lean, even with AI help, takes away from time spent finding and proving new theorems, and that we lose the beauty of the proofs. If there were a fully Lean-verified proof of P ≠ NP that you couldn't understand, would you find that satisfying?
By Lance Fortnow
[These are my own opinions, and not representing OpenAI. Crossposted on Lesswrong]
AI has so many applications, and AI companies have limited resources and attention span. Hence if it were up to me, I’d prefer we focus on applications that are purely beneficial — science, healthcare, education — or even commercial, before working on anything related to weapons or spying. If someone has to do it, I’d prefer it not to be my own company. Alas, we can’t always get what we want.
This is a long-ish post, but the TL;DR is:
I believe that harm to democracy is one of the most important risks of AI, and one that is not sufficiently highlighted.
While I wish it would have proceeded under different circumstances, the conditions in the deal signed between OpenAI and the DoW, as well as the publicity that accompanied it, provide a chance to move forward on this issue, and make using AI to collect, analyze, or de-anonymize people’s data at mass scale a risk that we track in a similar manner to other risks such as cybersecurity and bioweapons.
It is also too soon to “declare victory.” The true test of this deal is not the contract language. It will be in the safeguards we build, and in the way we monitor and ensure that the model is safe for our troops and does not enable mass surveillance of our people.
[Also; The possibility of Anthropic’s designation as a supply-chain risk is terrible. I hope it will be resolved asap.]
Country of IRS agents in a datacenter
How can AI destroy democracy? Throughout history, authoritarian regimes required a large obedient bureaucracy to spy and control their citizens. In East Germany, in addition to the full-time Stasi staff, one percent of the population served as informants. The KGB famously had multiple “purges” to ensure loyalty.
AI can potentially lead to a government bureaucracy loyal to whoever controls the models’ training or prompting: an army of agents that will not leak, whistleblow, or disobey an illegal order. Moreover, since the government has the monopoly on violence, we don’t need advances in the “world of atoms” to implement that, nor do we need a “Nobel laureate” level of intelligence.
As an example, imagine that the IRS was replaced with an AI workforce. Arguably, current models are already at or near the capability of automating much of its functions. In such a case, the leaders of the agency could commence large-scale tax investigations of their political enemies. Furthermore, even if each AI agent were individually aligned, it might not be possible for it to know that the person it received an order to audit was selected for political reasons. A human being goes home, reads the news, and can understand the broader context. A language model is born at the beginning of a task and dies at its end.
Historically, mass surveillance of a country’s own citizens was key for authoritarian governments. This is why so much of U.S. history is about preventing it, including the Fourth Amendment. AI opens new possibilities for analysis and de-anonymization of people’s data at a larger scale than ever before. For example, just recently, Lermen et al. showed that LLMs can be used to perform large-scale automated de-anonymization on unstructured data.
While all surveillance is problematic, given the unique power that governments have over their own citizens and residents, restricting domestic surveillance by governments is of particular importance. This is why I personally view it as even more crucial to prevent than privacy violations by foreign governments or corporations. But the latter is important too, especially since governments sometimes “launder” surveillance by purchasing commercially available information.
It is not a lost cause – we can implement and regulate approaches for preventing this. AI can scale oversight and monitoring just as it can scale surveillance. We can also build in privacy and cryptographic protections to AI to empower individuals. But we urgently need to do this work.
Just like with the encryption debates, there will always be people who propose trading our freedoms for protection against our adversaries. But I hope we have learned our lesson from the PATRIOT Act and the Snowden revelations. While I don’t agree with its most expansive interpretations, I think the Second Amendment is also a good illustration that we Americans have always been willing to trade some safety to protect our freedom. Even in the world of advanced AI, we still have two oceans, thousands of nukes, and a military with a budget larger than China’s and Russia’s combined. We don’t need to give up our freedoms and privacy to protect ourselves.
OpenAI’s deal with the Department of War
While the potential for AI abuse in government is always present, it is amplified in classified settings, since by their nature they can make abuse much harder to detect. (E.g., we might never have heard of the NSA overreach if it weren’t for Snowden.) For this reason, I am glad for the heightened scrutiny our deal with the DoW received (even if that scrutiny has not been so easy for me personally).
I feel that too much of the focus has been on the “legalese”, with people parsing every word of the contract excerpts we posted. I do not dispute the importance of the contract, but as Thomas Jefferson said, “The execution of the laws is more important than the making of them.” What matters most about a contract is the shared understanding between OpenAI and the DoW on what the models will and will not be used to do. I am happy that we are explicit in our understanding that our models will not be used for domestic mass surveillance, including via analysis of commercially available information of U.S. people. I am even happier that for the time being we will not be working with the intelligence agencies of the DoW, such as NSA, DIA, etc. Our leadership committed to announcing publicly if this changes, and of course this contract has nothing to do with domestic agencies such as DHS, ICE, or FBI. The intelligence agencies have the most sensitive workloads, and so I completely agree it is best to start in the easier cases. This also somewhat mitigates my worry about not ruling out mass surveillance of international citizens. (In addition to the fact that spying on one’s own people is inherently more problematic.)
I am also happy the contract prohibits using our models to direct lethal autonomous weapons, though realistically I do not think powering a killer drone via a cloud-based large model was ever a real possibility. A general purpose frontier model is an extremely poor fit for autonomously directing a weapon; also, the main selling point of autonomous drones is to evade jamming, which requires an on-device model. Given our current state of safety and alignment, lethal autonomous weapons are a very bad idea. But regardless, it would not have happened through this deal.
That said, there is a possibility that eventually our models will be used to help humans in target selection, as is reportedly happening in Iran right now. This is a very heavy burden, and it is up to us to ensure that we do not scale to this use case without very extensive testing of safety and reliability.
The contract enables the necessary conditions for success but it is too soon to know if they are sufficient. It allows us to build in our safety stack to ensure the safe operation of the model and our red lines, as well as have our own forward deployed engineers (FDEs) in place. No safety stack can be perfect, but given the “mass” nature of mass surveillance, it does not need to be perfect to prevent it. That said, this is going to be a challenging enterprise: building safety for applications we are less familiar with, with the added complexities of clearance. Sam has said that we will deploy gradually, starting in the least risky and most familiar domains first. I think this is essential.
Can we make lemonade out of this lemon?
The previous defense contract of the DoW and Anthropic attracted relatively little attention. I hope that the increased salience of this issue can be used to elevate our standards as an industry. Just like we do with other risks such as bioweapons and cybersecurity, we need to build best practices for avoiding the risk of AI-enabled takeover of democracy, including mass domestic surveillance and high-stakes automated decisions (for example, selective prosecution or “social credit”). These risks are no less catastrophic than bioweapons, and should be tracked and reported as such. While, due to the classified nature of the domain, not everything can be reported, we can and should at least be public about the process.
If there is one thing that AI researchers are good at, it is measuring and optimizing quantities. If we can build the evaluations and turn tracking these risks into a science, we have a much better chance at combatting them. I am confident that it can be done given sufficient time. I am less confident that time will be sufficient.
We study the hardness of the $γ$-approximate decisional Covering Radius Problem on lattices in the $\ell_p$ norm ($γ$-$\text{GapCRP}_p$). Specifically, we prove that there is an explicit function $γ(p)$, with $γ(p) > 1$ for $p > p_0 \approx 35.31$ and $\lim_{p \to \infty} γ(p) = 9/8$, such that for any constant $\varepsilon > 0$, $(γ(p) - \varepsilon)$-$\text{GapCRP}_p$ is $\mathsf{NP}$-hard. This shows the first hardness of $\text{GapCRP}_p$ for explicit $p < \infty$. Work of Haviv and Regev (CCC, 2006 and CJTCS, 2012) previously showed $Π_2$-hardness of approximation for $\text{GapCRP}_p$ for all sufficiently large (but non-explicit) finite $p$ and for $p = \infty$.
In fact, our hardness results hold for a variant of $\text{GapCRP}$ called the Binary Covering Radius Problem ($\text{BinGapCRP}$), which trivially reduces to both $\text{GapCRP}$ and the decisional Linear Discrepancy Problem ($\text{LinDisc}$) in any norm in an approximation-preserving way. We also show $Π_2$-hardness of $(9/8 - \varepsilon)$-$\text{BinGapCRP}$ in the $\ell_{\infty}$ norm for any constant $\varepsilon > 0$.
Our work extends and heavily uses the work of Manurangsi (IPL, 2021), which showed $Π_2$-hardness of $(9/8 - \varepsilon)$-$\text{LinDisc}$ in the $\ell_{\infty}$ norm.
Proving the NP-completeness of pencil-and-paper puzzles typically relies on reductions from combinatorial problems such as the satisfiability problem (SAT).
Although the properties of these problems are well studied, their purely combinatorial nature often does not align well with the geometric constraints of puzzles.
In this paper, we introduce the Required-edge Cycle Cover Problem (RCCP) -- a variant of the Cycle Cover Problem (CCP) on mixed graphs.
CCP on mixed graphs was studied by Seta (2002) to establish the ASP-completeness (i.e., NP-completeness under parsimonious reductions) of the puzzle Kakuro (a.k.a.~Cross Sum), and is known to be ASP-complete under certain conditions.
We prove the ASP-completeness of RCCP under certain conditions, and strengthen known ASP-completeness results of CCP on mixed graphs as a corollary.
Using these results, we resolve the ASP-completeness of Constraint Graph Satisfiability (CGS) in a certain case, addressing an open problem posed by the MIT Hardness Group (2024).
We also show that Kakuro remains ASP-complete even when the available digit set is $\{1, 2, 3\}$, consequently completing its complexity classification regarding the maximum available digit and the maximum lengths of contiguous blank cells.
It strengthens previously known results of Seta (2002) and Ruepp and Holzer (2010).
Furthermore, we introduce a flow model equivalent to the constrained RCCP; this model allows gadgets to be tiled densely on a rectangular grid, which enables us to reduce RCCP to various pencil-and-paper puzzles in a parsimonious manner.
By applying this framework, we prove the ASP-completeness of several puzzles, including Chocona, Four Cells, Hinge, and Shimaguni, and strengthen existing NP-completeness results for Choco Banana and Five Cells to ASP-completeness.
Authors: Orazio Nicolosi, Federico Pisciotta, Lorenzo Bresolin
Motivated by the results for Magic: The Gathering presented in [CBH20] and [Bid20], we study a computability problem related to optimal play in Yu-Gi-Oh! Trading Card Game, a popular card game developed and published by Konami. We show that the problem of establishing whether, from a given game state, a computable strategy is winning is undecidable. In particular, not only do we prove that the Halting Problem can be reduced to this problem, but also that the same holds for Kleene's $\mathcal{O}$, thereby demonstrating that this problem is actually $Π^1_1$-complete. We extend this last result to all strategies with a reduction on the set of countable well orders, a classic $\boldsymbol{Π}^1_1$-complete set. For these reductions we present two legal decks (according to the current Forbidden & Limited List of Yu-Gi-Oh! Trading Card Game) that can be used by the player who goes first to perform them.
The graph isomorphism problem asks whether two graphs are identical up to vertex relabeling. While the exact problem admits quasi-polynomial-time classical algorithms, many applications in molecular comparison, noisy network analysis, and pattern recognition require a flexible notion of structural similarity. We study the quantum query complexity of approximate graph isomorphism testing, where two graphs on $n$ vertices drawn from the Erdős--Rényi distribution $\mathcal{G} (n,1/2)$ are considered approximately isomorphic if they can be made isomorphic by at most $k$ edge edits.
We present a quantum algorithm based on MNRS quantum walk search over the product graph $Γ(G,H)$ of the two input graphs. When the graphs are approximately isomorphic, the quantum walk search detects vertex pairs belonging to a dense near isomorphic matching set; candidate pairings are then reconstructed via local consistency propagation and verified via a Grover-accelerated consistency check. We prove that this approach achieves query complexity $\mathcal{O}(n^{3/2} \log n/\varepsilon)$, where $\varepsilon$ parameterizes the approximation threshold. We complement this with an $Ω(n^2)$ classical lower bound for constant approximation, establishing a genuine polynomial quantum speedup in the query model.
We extend the framework to spectral similarity measures based on graph Laplacian eigenvalues, as well as weighted and attributed graphs. Small-scale simulation results on quantum simulators for graphs with up to twenty vertices demonstrate compatibility with near-term quantum devices.
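To make the edit-distance notion of approximate isomorphism concrete, here is a brute-force sketch in Python (my own illustration of the definition only, not the paper's quantum algorithm; the graph encoding and function names are assumptions):

```python
from itertools import permutations

# Graphs as adjacency dicts over vertices 0..n-1 (undirected, no self-loops).
def edges(adj):
    return {frozenset((u, v)) for u in adj for v in adj[u]}

def min_edit_distance(G, H):
    """Smallest number of edge insertions/deletions needed to make G
    isomorphic to H, minimized over all vertex relabelings of H.
    Brute force over all n! permutations -- feasible only for tiny n."""
    n = len(G)
    eG = edges(G)
    best = None
    for perm in permutations(range(n)):
        # Relabel H's vertices by perm and compare edge sets.
        eH = {frozenset((perm[u], perm[v]))
              for e in edges(H) for u, v in [tuple(e)]}
        d = len(eG ^ eH)  # symmetric difference = edits needed
        best = d if best is None else min(best, d)
    return best

def approx_isomorphic(G, H, k):
    """G and H are k-approximately isomorphic if <= k edge edits suffice."""
    return min_edit_distance(G, H) <= k

# Triangle vs. path on 3 vertices: exactly one edge edit apart.
tri  = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}}
path = {0: {1}, 1: {0, 2}, 2: {1}}
print(min_edit_distance(tri, path))     # 1
print(approx_isomorphic(tri, path, 1))  # True
```

The paper's contribution is precisely avoiding this factorial blow-up in the query model via quantum walk search; the sketch above only pins down what "isomorphic up to k edge edits" means.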
The fastest known algorithms for dealing with structured matrices, in the sense of the displacement rank measure, are randomized. For handling classical displacement structures, they achieve the complexity bounds $\tilde{O}(α^{ω-1} n)$ for solving linear systems and $\tilde{O}(α^2 n)$ for computing the nullspace. Here $n \times n$ is the size of the square matrix, $α$ is its displacement rank, $ω> 2$ is a feasible exponent for matrix multiplication, and the notation $\tilde{O}(\cdot)$ counts arithmetic operations in the base field while hiding logarithmic factors. These algorithms rely on an adaptation of Strassen's divide and conquer Gaussian elimination to the context of structured matrices. This approach requires the input matrix to have generic rank profile; this constraint is lifted via pre- and post-multiplications by special matrices generated from random coefficients chosen in a sufficiently large subset of the base field.
This work introduces a fast and deterministic approach, which solves both problems within $\tilde{O}(α^{ω-1} (m+n))$ operations in the base field for an arbitrary rectangular $m \times n$ input matrix. We provide explicit algorithms that instantiate this approach for Toeplitz-like, Vandermonde-like, and Cauchy-like structures. The starting point of the approach is to reformulate a structured linear system as a modular equation on univariate polynomials. Then, a description of all solutions to this equation is found in three steps, all using fast and deterministic operations on polynomial matrices. Specifically, one first computes a basis of solutions to a vector M-Padé approximation problem; then one performs linear system solving over the polynomials to isolate away unwanted unknowns and restrict to those that are actually sought; and finally the latter are found by simultaneous M-Padé approximation.
Authors: Sabine Cornelsen, Jan Kratochvíl, Miriam Münch, Giacomo Ortali, Alexandra Weinberger, Alexander Wolff
In a {\em grounded string representation} of a graph there is a horizontal line $\ell$ and each vertex is represented as a simple curve below $\ell$ with one end point on $\ell$ such that two curves intersect if and only if the respective vertices are adjacent. A grounded string representation is a {\em grounded L-reverseL-representation} if each vertex is represented by a 1-bend orthogonal polyline. It is a {\em grounded L-representation} if in addition all curves are L-shaped. We show that every biconnected series-parallel graph without edges between the two vertices of a separation pair (i.e., {\em transitive edges}) admits a grounded L-reverseL-representation if and only if it admits a grounded string representation. Moreover, we can test in linear time whether such a representation exists. We also construct a biconnected series-parallel graph without transitive edges that admits a grounded L-reverseL-representation, but no grounded L-representation.
Authors: Sándor P. Fekete, Jonas Friemel, Peter Kramer, Jan-Marc Reinhardt, Christian Rieck, Christian Scheffer
Motivated by targeted drug delivery, we investigate the gathering of particles in the full tilt model of externally controlled motion planning: A set of particles is located at the tiles of a polyomino with all particles reacting uniformly to an external force by moving as far as possible in one of the four axis-parallel directions until they hit the boundary. The goal is to choose a sequence of directions that moves all particles to a common position. Our results include a polynomial-time algorithm for gathering in a completely filled polyomino as well as hardness reductions for approximating shortest gathering sequences and for determining whether the particles in a partially filled polyomino can be gathered. We pay special attention to the impact of restricted geometry, particularly polyominoes without holes. As corollaries, we make progress on an open question from [Balanza-Martinez et al., SODA 2020] by showing that deciding whether a given position can be occupied remains NP-hard in polyominoes without holes and provide initial results on the parameterized complexity of tilt problems. Our results build on a connection we establish between tilt models and the theory of synchronizing automata.
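The core mechanic above — all particles sliding maximally in one axis-parallel direction inside a polyomino — is easy to simulate. The sketch below is not the authors' algorithm; it is a minimal model of a single tilt move, under the assumption that particles which end up in the same cell merge (as gathering to a common position requires). The representation of the polyomino as a set of cells is my own choice.

```python
def tilt(polyomino, particles, direction):
    """Apply one full-tilt move.

    polyomino: set of (row, col) cells that form the board.
    particles: set of occupied cells (a subset of polyomino).
    direction: one of 'N', 'S', 'E', 'W'.

    Each particle slides as far as possible in the given direction;
    particles landing on the same cell merge (sticky/merging model,
    an assumption of this sketch).
    """
    steps = {'N': (-1, 0), 'S': (1, 0), 'W': (0, -1), 'E': (0, 1)}
    dr, dc = steps[direction]
    result = set()
    for (r, c) in particles:
        # Slide until the next cell would leave the polyomino.
        while (r + dr, c + dc) in polyomino:
            r, c = r + dr, c + dc
        result.add((r, c))  # duplicates merge automatically in a set
    return result
```

On a 1x3 strip, tilting east gathers every particle at the right wall in a single move; a gathering sequence is just a composition of such moves.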
Symmetric positive-definite (SPD) matrix datasets play a central role across numerous scientific disciplines, including signal processing, statistics, finance, computer vision, information theory, and machine learning among others. The set of SPD matrices forms a cone which can be viewed as a global coordinate chart of the underlying SPD manifold. Rich differential-geometric structures may be defined on the SPD cone manifold. Among the most widely used geometric frameworks on this manifold are the affine-invariant Riemannian structure and the dual information-geometric log-determinant barrier structure, each associated with dissimilarity measures (distance and divergence, respectively). In this work, we introduce two new structures, a Finslerian structure and a dual information-geometric structure, both derived from James' bicone reparameterization of the SPD domain. Those structures ensure that geodesics correspond to straight lines in appropriate coordinate systems. The closed bicone domain includes the spectraplex (the set of positive semi-definite diagonal matrices with unit trace) as an affine subspace, and the Hilbert VPM distance is proven to generalize the Hilbert simplex distance which found many applications in machine learning. Finally, we discuss several applications of these Finsler/dual Hessian structures and provide various inequalities between the new and traditional dissimilarities.
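The abstract notes that the Hilbert VPM distance generalizes the Hilbert simplex distance. For readers unfamiliar with the latter, here is a minimal sketch of the classical Hilbert distance on the open probability simplex, which for points $p, q$ reduces to $\log\big(\max_i (p_i/q_i) \,/\, \min_i (p_i/q_i)\big)$; this is the simplex special case only, not the new bicone construction from the paper.

```python
import math

def hilbert_simplex_distance(p, q):
    """Hilbert metric between two points of the open probability simplex.

    p, q: sequences of strictly positive coordinates summing to 1.
    Returns log of the ratio between the largest and smallest
    coordinate-wise ratios p_i / q_i.
    """
    ratios = [pi / qi for pi, qi in zip(p, q)]
    return math.log(max(ratios) / min(ratios))
```

Note the distance is projective: it is invariant to rescaling either argument by a positive constant, which is what makes Hilbert-type geometries attractive for normalized data.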
Authors: Jean-Marie Favreau, Yan Gerard, Pascal Lafourcade, Léo Robert
The Calisson puzzle is a tiling puzzle in which one must tile a triangular grid inside a hexagon with lozenges, under the constraint that certain prescribed edges remain tile boundaries and that adjacent lozenges along these edges have different orientations. We present the first polynomial-time algorithm for this problem, with cubic running time. This algorithm, called the advancing surface algorithm, can be executed in a simple and intuitive way, even by hand with a pencil and an eraser. Its apparent simplicity conceals a deeper algorithmic reinterpretation of the classical ideas of John Conway and William Thurston, revisited here from a theoretical computer science perspective.
We introduce a graph-theoretic overlay based on directed cuts and systems of difference constraints that complements Thurston's theory of lozenge tilings and makes its algorithmic structure explicit. In Thurston's approach, lozenge tilings are lifted to monotone stepped surfaces in the three-dimensional cubic lattice and projected back to the plane using height functions, reducing tilability to the computation of heights. We show that selecting a monotone surface corresponds to selecting a directed cut in a periodic directed graph, while height functions arise as solutions of a system of difference constraints. In this formulation, a region is tilable if and only if the associated weighted directed graph contains no cycle of strictly negative weight. This additional graph layer shows that the Bellman-Ford algorithm suffices to decide feasibility and compute solutions. In particular, our framework allows one to decide whether the infinite triangular grid can be tiled while respecting a finite set of prescribed local constraints, even in the absence of boundary conditions.
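The reduction above ends at a standard primitive: a system of difference constraints $x_j - x_i \le w_{ij}$ is feasible iff the corresponding constraint graph has no negative-weight cycle, and Bellman-Ford both decides this and produces a solution. The sketch below shows that generic primitive (not the paper's advancing surface algorithm or its specific constraint graph), using the textbook virtual-source construction.

```python
def solve_difference_constraints(n, constraints):
    """Solve a system of difference constraints via Bellman-Ford.

    n: number of variables x_0 .. x_{n-1}.
    constraints: list of (i, j, w) triples meaning x_j - x_i <= w,
                 i.e. an edge i -> j of weight w in the constraint graph.
    Returns a feasible assignment as a list, or None if a
    negative-weight cycle makes the system infeasible.
    """
    # Virtual source node n with zero-weight edges to every variable.
    edges = [(n, v, 0) for v in range(n)]
    edges += [(i, j, w) for (i, j, w) in constraints]
    dist = [float('inf')] * (n + 1)
    dist[n] = 0
    # |V| - 1 = n relaxation rounds.
    for _ in range(n):
        for (u, v, w) in edges:
            if dist[u] + w < dist[v]:
                dist[v] = dist[u] + w
    # One extra pass: any further improvement witnesses a negative cycle.
    for (u, v, w) in edges:
        if dist[u] + w < dist[v]:
            return None
    return dist[:n]
```

The shortest-path distances from the virtual source are themselves a feasible assignment, which is exactly the "height function as solution of difference constraints" viewpoint the abstract describes.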
The low-degree polynomial framework has been highly successful in predicting computational versus statistical gaps for high-dimensional problems in average-case analysis and machine learning. This success has led to the low-degree conjecture, which posits that this method captures the power and limitations of efficient algorithms for a wide class of high-dimensional statistical problems. We identify a natural and basic hypothesis testing problem in $\mathbb{R}^n$ which is polynomial time solvable, but for which the low-degree polynomial method fails to predict its computational tractability even up to degree $k=n^{Ω(1)}$. Moreover, the low-degree moments match exactly up to degree $k=O(\sqrt{\log n/\log\log n})$. Our problem is a special case of the well-studied robust subspace recovery problem. The lower bounds suggest that there is no polynomial time algorithm for this problem. In contrast, we give a simple and robust polynomial time algorithm that solves the problem (and noisy variants of it), leveraging anti-concentration properties of the distribution. Our results suggest that the low-degree method and low-degree moments fail to capture algorithms based on anti-concentration, challenging their universality as a predictor of computational barriers.
Many complex systems and datasets are characterized by multiway interactions of different categories, and can be modeled as edge-colored hypergraphs. We focus on clustering such datasets using the NP-hard edge-colored clustering problem, where the goal is to assign colors to nodes in such a way that node colors tend to match edge colors. A key focus in prior work has been to develop approximation algorithms for the problem that are combinatorial and easier to scale. In this paper, we present the first combinatorial approximation algorithm with an approximation factor better than 2.
We introduce and study the problem of detecting short races in an observed trace. Specifically, for a race type $R$, given a trace $σ$ and window size $w$, the task is to determine whether there exists an $R$-race $(e_1, e_2)$ in $σ$ such that the subtrace starting with $e_1$ and ending with $e_2$ contains at most $w$ events. We present a monitoring framework for short-race prediction and instantiate the framework for happens-before and sync-preserving races, yielding efficient detection algorithms. Our happens-before algorithm runs in the same time as FastTrack but uses space that scales with $\log w$ as opposed to $\log |σ|$. For sync-preserving races, our algorithm runs faster and consumes significantly less space than SyncP. Our experiments validate the effectiveness of these short-race detection algorithms: they run more efficiently, use less memory, and detect significantly more races under the same budget, offering a reasonable balance between resource usage and predictive power.
Authors: Kanstantsin Pashkovich, Marta Pozzi, Laura Sanità
We study the Directed Steiner Tree (DST) problem in layered graphs through a simple path-based linear programming relaxation. This relaxation achieves an integrality gap of O(l log k), where k is the number of terminals and l is the number of layers, which matches the best known bounds for DST previously obtained via lift-and-project hierarchies. Our formulation bypasses hierarchy machinery, offering a more transparent route to the state-of-the-art bound, and it can be exploited to provide an alternative simpler proof that O(l) rounds of the Sherali-Adams hierarchy suffice for reducing the integrality gap on layered instances of DST.
Authors: Joakim Blikstad, Yannic Maus, Tijn de Vos
As the main contribution of this work we present deterministic edge coloring algorithms in the CONGEST model. In particular, we present an algorithm that edge colors any $n$-node graph with maximum degree $Δ$ with $(1+\varepsilon)Δ+O(\sqrt{\log n})$ colors in $\tilde{O}(\log^{2.5} n+\log^2 Δ\log n)$ rounds. This brings the upper bound polynomially close to the lower bound of $Ω(\log n/\log\log n)$ rounds that also holds in the more powerful LOCAL model [Chang, He, Li, Pettie, Uitto; SODA'18]. As long as $Δ\geq c\sqrt{\log n}$ our algorithm uses fewer than $2Δ-1$ colors and to the best of our knowledge is the first polylogarithmic-round CONGEST algorithm achieving this for any range of $Δ$.
As a corollary we also improve the complexity of edge coloring with $2Δ-1$ colors for all ranges of $Δ$ to $\tilde{O}(\log^{2.5} n+\log^2 Δ\log n)$. This improves upon the previous $O(\log^8 n)$-round algorithm from [Fischer, Ghaffari, Kuhn; FOCS'17].
Our approach builds on a refined analysis and extension of the online edge-coloring algorithm of Blikstad, Svensson, Vintan, and Wajc [FOCS'25], and more broadly on new connections between online and distributed graph algorithms. We show that their algorithm exhibits very low locality and, if it can additionally have limited local access to future edges (as distributed algorithms can), it can be derandomized for smaller degrees. Under this additional power, we are able to bypass classical online lower bounds and translate the results to efficient distributed algorithms. This leads to our CONGEST algorithm for $(1+\varepsilon)Δ+O(\sqrt{\log n})$-edge coloring. Since the modified online algorithm can be implemented more efficiently in the LOCAL model, we also obtain (marginally) improved complexity bounds in that model.
Authors: Syamantak Kumar, Purnamrita Sarkar, Kevin Tian, Peiyuan Zhang
Sparse PCA is one of the most well-studied problems in high-dimensional statistics. In this problem, we are given samples from a distribution with covariance $Σ$, whose top eigenvector $v \in R^d$ is $s$-sparse. Existing sparse PCA algorithms can be broadly categorized into (1) combinatorial algorithms (e.g., diagonal or elementwise covariance thresholding) and (2) SDP-based algorithms. While combinatorial algorithms are much simpler, they are typically only analyzed under the spiked identity model (where $Σ= I_d + γvv^\top$ for some $γ> 0$), whereas SDP-based algorithms require no additional assumptions on $Σ$.
We demonstrate explicit counterexample covariances $Σ$ against the success of standard combinatorial algorithms for sparse PCA, when moving beyond the spiked identity model. In light of this discrepancy, we give the first combinatorial method for sparse PCA that provably succeeds for general $Σ$ using $s^2 \cdot \mathrm{polylog}(d)$ samples and $d^2 \cdot \mathrm{poly}(s, \log(d))$ time, by providing a global convergence guarantee on a variant of the truncated power method of Yuan and Zhang (2013). We provide a natural generalization of our method to recovering a vector in a sparse leading eigenspace. Finally, we evaluate our method on synthetic and real-world sparse PCA datasets.
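The truncated power method referenced above is short enough to sketch: alternate a power-iteration step with hard thresholding to the $s$ largest-magnitude coordinates. This is a generic version of the Yuan-Zhang iteration, not the paper's provably convergent variant; the diagonal-based initialization is a common warm start I chose for the sketch, not necessarily theirs.

```python
import numpy as np

def truncated_power_method(Sigma, s, iters=100):
    """Sparse leading eigenvector of Sigma via truncated power iteration.

    Sigma: (d, d) symmetric PSD matrix (e.g. a sample covariance).
    s: target sparsity of the returned unit vector.
    """
    d = Sigma.shape[0]
    # Warm start at the coordinate with the largest diagonal entry
    # (an assumption of this sketch, in the spirit of diagonal thresholding).
    v = np.zeros(d)
    v[int(np.argmax(np.diag(Sigma)))] = 1.0
    for _ in range(iters):
        w = Sigma @ v
        # Hard-threshold: keep the s largest-magnitude coordinates.
        w[np.argsort(np.abs(w))[:-s]] = 0.0
        v = w / np.linalg.norm(w)
    return v
```

Under the spiked identity model $\Sigma = I_d + \gamma v^* {v^*}^\top$ the iteration recovers $v^*$ quickly; the point of the paper is precisely that naive variants like this one can fail for general $\Sigma$, which their modified analysis addresses.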
Authors: Soham Nagawanshi, Shalini Panthangi, Chen Wang, David P. Woodruff, Samson Zhou
Motivated by the prevalence and success of machine learning, a line of recent work has studied learning-augmented algorithms in the streaming model. These results have shown that for natural and practical oracles implemented with machine learning models, we can obtain streaming algorithms with improved space efficiency that are otherwise provably impossible. On the other hand, our understanding is much more limited when items are weighted unequally, for example, in the sliding-window model, where older data must be expunged from the dataset, e.g., by privacy regulation laws. In this paper, we utilize an oracle for the heavy-hitters of datasets to give learning-augmented algorithms for a number of fundamental problems, such as norm/moment estimation, frequency estimation, cascaded norms, and rectangular moment estimation, in the time-decay setting. We complement our theoretical results with a number of empirical evaluations that demonstrate the practical efficiency of our algorithms on real and synthetic datasets.
We extend the study of the 2-Solo Chess problem which was first introduced by Aravind, Misra, and Mittal in 2022. 2-Solo Chess is a single-player variant of chess in which the player must clear the board via captures such that only one piece remains, with each piece capturing at most twice. It is known that the problem is solvable in polynomial time for instances containing only pawns, while it becomes NP-complete for instances restricted to rooks, bishops, or queens. In this work, we complete the complexity classification by proving that 2-Solo Chess is NP-complete if the instance contains only knights or only kings.
Authors: Taisei Otsuji, Peter Fulla, Takuro Fukunaga
Hotaru Beam is a logic puzzle whose objective is to connect circles placed on a grid by drawing only lines with specified starting points and numbers of bends. A zero-knowledge proof is a communication protocol that allows one player to persuade the other that they are in possession of a certain piece of information without actually revealing it. We show that Hotaru Beam is NP-complete and present a physical zero-knowledge proof (i.e. one implementable using physical items) for proving that one knows a solution to the puzzle.
Many popular puzzle and matching games have been analyzed through the lens of computational complexity. Prominent examples include Sudoku, Candy Crush, and Flood-It. A common theme among these widely played games is that their generalized decision versions are NP-hard, which is often thought of as a source of their inherent difficulty and addictive appeal to human players. In this paper, we study a popular single-player stacking game commonly known as Hexasort. The game can be modelled as placing colored stacks onto the vertices of a graph, where adjacent stacks of the same color merge and vanish according to deterministic rules. We prove that Hexasort is NP-hard, even when restricted to single-color stacks and progressively more constrained classes of graphs, culminating in strong NP-hardness on trees of either bounded height or degree. Towards fixed-parameter tractable algorithms, we identify settings in which the problem becomes polynomial-time solvable and present dynamic programming algorithms.
Determining a winner among a set of items using active pairwise comparisons under a limited budget is a challenging problem in preference-based learning. The goal of this study is to implement and evaluate the PARWiS algorithm, which uses spectral ranking and disruptive pair selection to identify the best item under shoestring budgets. This work extends PARWiS with a contextual variant (Contextual PARWiS) and a reinforcement learning-based variant (RL PARWiS), comparing them against baselines, including Double Thompson Sampling and a random selection strategy. The evaluation spans synthetic and real-world datasets (Jester and MovieLens), using budgets of 40, 60, and 80 comparisons for 20 items. Performance is measured through recovery fraction, true rank of the reported winner, reported rank of the true winner, and cumulative regret, alongside the separation metric \(Δ_{1,2}\). Results show that PARWiS and RL PARWiS outperform the baselines across all datasets, particularly on the Jester dataset with its higher \(Δ_{1,2}\), while performance gaps narrow on the more challenging MovieLens dataset with its smaller \(Δ_{1,2}\). Contextual PARWiS shows performance comparable to PARWiS, indicating that contextual features may require further tuning to provide significant benefits.
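Since the abstract names spectral ranking as PARWiS's core primitive, here is a minimal Rank-Centrality-style sketch of that idea. The function name, the `1/n` hop normalization, and the fixed iteration count are my choices; this illustrates spectral ranking in general, not the PARWiS pair-selection logic:

```python
import numpy as np

def spectral_scores(wins):
    """Score items from a pairwise win-count matrix: build a Markov
    chain that hops from loser toward winner, and read item quality
    off the (approximate) stationary distribution."""
    n = len(wins)
    P = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            total = wins[i][j] + wins[j][i]
            if i != j and total > 0:
                # Move from i to j in proportion to how often j beat i.
                P[i][j] = wins[j][i] / (total * n)
        P[i][i] = 1.0 - P[i].sum()  # remaining mass stays put
    pi = np.full(n, 1.0 / n)
    for _ in range(50):  # crude power iteration toward stationarity
        pi = pi @ P
    return pi
```

An item that wins most of its comparisons absorbs probability mass, so the top of the ranking is simply the argmax of the returned scores.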
In our previous papers we sketched proofs of the equality NP = coNP = PSPACE. These results have been obtained by proof theoretic tree-to-dag compressing techniques adapted to Prawitz's Natural Deduction (ND) for implicational minimal logic, with references to Hudelmaier's cutfree sequent calculus. In this note we comment on Jeřábek's approach that claimed to refute our results by providing exponential lower bounds for implicational minimal logic. This claim is wrong and misleading, as we briefly demonstrate with a basic example below.
Authors: Peyman Afshani, Boris Aronov, Kevin Buchin, Maike Buchin, Otfried Cheong, Katharina Klost, Carolin Rehs, Günter Rote
Let $P$ and $Q$ be simple polygons with $n$ vertices each. We wish to compute triangulations of $P$ and $Q$ that are combinatorially equivalent, if they exist. We consider two versions of the problem: if a triangulation of $P$ is given, we can decide in $O(n\log n + nr)$ time if $Q$ has a compatible triangulation, where $r$ is the number of reflex vertices of $Q$. If we are already given the correspondence between vertices of $P$ and $Q$ (but no triangulation), we can find compatible triangulations of $P$ and $Q$ in time $O(M(n))$, where $M(n)$ is the running time for multiplying two $n\times n$ matrices.
Authors: Elena Farahbakhsh Touli, Ingrid Hotz, Talha Bin Masood
The interleaving distance is a key tool for comparing merge trees, which provide topological summaries of scalar functions. In this work, we define an average merge tree for a pair of merge trees using the interleaving distance. Since such an average is not unique, we propose a method to construct a representative average merge tree. We further prove that the resulting merge tree indeed satisfies a natural notion of averaging for the two given merge trees. To demonstrate the structure of the average merge tree, we include illustrative examples.
Authors: Yi Zhou, Haocheng Fu, Yiping Liu, Jian Mao, Zhang-Hua Fu, Yuyi Wang
The two-dimensional irregular bin packing problem (2DIBPP) aims to pack a given set of irregular polygons, referred to as pieces, into fixed-size rectangular bins without overlap, while maximizing bin utilization. Although numerous metaheuristic algorithms have been proposed for the 2DIBPP, many industrial applications favor simpler constructive heuristics due to their deterministic behavior and low computational overhead. Among such methods, the DJD algorithm proposed by López-Camacho et al. is one of the most competitive constructive heuristics for the 2DIBPP. However, DJD is less effective for cutting instances, in which many pieces can be seamlessly combined into larger polygons. To address this issue, we propose MergeDJD, a novel constructive algorithm that integrates and extends the DJD framework. MergeDJD first preprocesses the instance by iteratively identifying groups of pieces that can be combined into larger and more regular pieces. It then employs an improved version of DJD, in which the placement strategy is enhanced to better handle non-convex and combined shapes, to pack all resulting pieces into bins. Computational experiments on 1,089 well-known benchmark instances show that MergeDJD consistently outperforms DJD on 1,083 instances while maintaining short runtimes. Notably, MergeDJD attains new best-known values on 515 instances. Ablation studies further confirm the effectiveness of the proposed components. To facilitate reproducibility and future research, we have open-sourced the complete implementation and provided interfaces for visualizing packing results.
We present a unified framework for proving memory lower bounds for multi-pass streaming algorithms that detect planted structures. Planted structures -- such as cliques or bicliques in graphs, and sparse signals in high-dimensional data -- arise in numerous applications, and our framework yields multi-pass memory lower bounds for many such fundamental settings. We show memory lower bounds for the planted $k$-biclique detection problem in random bipartite graphs and for detecting sparse Gaussian means. We also show the first memory-sample tradeoffs for the sparse principal component analysis (PCA) problem in the spiked covariance model. For all these problems to which we apply our unified framework, we obtain bounds which are nearly tight in the low, $O(\log n)$ memory regime. We also leverage our bounds to establish new multi-pass streaming lower bounds, in the vertex arrival model, for two well-studied graph streaming problems: approximating the size of the largest biclique and approximating the maximum density of bounded-size subgraphs.
To show these bounds, we study a general distinguishing problem over matrices, where the goal is to distinguish a null distribution from one that plants an outlier distribution over a random submatrix. Our analysis builds on a new distributed data processing inequality that provides sufficient conditions for memory hardness in terms of the likelihood ratio between the averaged planted and null distributions. This result generalizes the inequality of [Braverman et al., STOC 2016] and may be of independent interest. The inequality enables us to measure information cost under the null distribution -- a key step for applying subsequent direct-sum-type arguments and incorporating the multi-pass information cost framework of [Braverman et al., STOC 2024].
We formulate and analyze a heterogeneous random hypergraph model, and we provide an achievability result for recovery of hyperedges from the observed projected graph. We observe a projected graph which combines random hyperedges across all degrees, where a projected edge appears if and only if both vertices appear in at least one hyperedge. Our goal is to reconstruct the original set of hyperedges of degree $d_j$ for some $j$. Our achievability result is based on the idea of selecting maximal cliques of size $d_j$ in the projected graph, and we show that this algorithm succeeds under a natural condition on the densities. This achievability condition generalizes a known threshold for $d$-uniform hypergraphs with noiseless and noisy projections. We conjecture the threshold to be optimal for recovering hyperedges with the largest degree.
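A toy version of the clique-based recovery idea fits in a few lines. This is brute force over all $d$-sets rather than real maximal-clique enumeration, so it only suits tiny instances, and the helper names are mine:

```python
from itertools import combinations

def project(hyperedges):
    """Projected graph: edge {u, v} appears iff u and v co-occur
    in at least one hyperedge."""
    edges = set()
    for h in hyperedges:
        edges |= {frozenset(p) for p in combinations(sorted(h), 2)}
    return edges

def recover_degree_d(vertices, proj, d):
    """Candidate degree-d hyperedges: every d-set whose pairs are all
    projected edges, i.e., every d-clique (maximal or not)."""
    return {frozenset(c) for c in combinations(vertices, d)
            if all(frozenset(p) in proj for p in combinations(c, 2))}
```

When the projection is sparse enough, the only $d$-cliques are the planted hyperedges themselves, which is the flavor of density condition the abstract refers to.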
Authors: Gerth Stølting Brodal, John Iacono, Casper Moldrup Rysgaard, Sebastian Wild
We introduce a new family of priority-queue data structures: partition-based simple heaps. The structures consist of $O(\log n)$ doubly-linked lists; order is enforced among data in different lists, but the individual lists are unordered. Our structures have amortized $O(\log n)$ time extract-min and amortized $O(\log \log n)$ time insert and decrease-key.
The structures require nothing beyond binary search over $O(\log n)$ elements, as well as binary partitions and concatenations of linked lists in natural ways as the linked lists get too big or small. We present three different ways that these lists can be maintained in order to obtain the stated amortized running times.
Randomized parallel algorithms for many fundamental problems achieve optimal linear work in expectation, but upgrading this guarantee to hold with high probability (whp) remains a recurring theoretical challenge. In this paper, we address this gap for several core parallel primitives.
First, we present the first parallel semisort algorithm achieving $O(n)$ work and $O(\text{polylog } n)$ depth whp, improving upon the $O(n)$ expected work bound of Gu et al. [SPAA 2015]. Our analysis introduces new concentration arguments based on simple tabulation hashing and tail bounds for weighted sums of geometric random variables. As a corollary, we obtain an integer sorting algorithm for keys in $[n]$ matching the same bounds.
Second, we introduce a framework for boosting randomized parallel graph algorithms from expected to high probability linear work. The framework applies to \emph{locally extendable} problems -- those admitting a deterministic procedure that extends a solution across a graph cut in work proportional to the cut size. We combine this with a \emph{culled balanced partition} scheme: an iterative culling phase removes a polylogarithmic number of high-degree vertices, after which the remaining graph admits a balanced random vertex partition whp via a bounded-differences argument. Applying work-inefficient whp subroutines to the small pieces and deterministic extension across cuts yields overall linear work whp. We instantiate this framework to obtain $O(m)$ work and polylogarithmic depth whp algorithms for $(Δ+1)$-vertex coloring and maximal independent set.
The burning number $b(G)$ of a graph $G$ is the minimum number of rounds required to burn all vertices when, at each discrete step, existing fires spread to neighboring vertices and one new fire may be ignited at an unburned vertex. This parameter measures the speed of influence propagation in a network and has been studied as a model for information diffusion and resource allocation in distributed systems. A central open problem, the Burning Number Conjecture (BNC), asserts that every graph on $n$ vertices can be burned in at most $\lceil \sqrt n\rceil$ rounds, a bound known to be sharp for paths and verified for several structured families of trees. We investigate rooted graph products, focusing on comb graphs obtained by attaching a path (a ``tooth'') to each vertex of a path (the ``spine''). Unlike classical symmetric graph products, rooted products introduce hierarchical bottlenecks: communication between local subnetworks must pass through designated root vertices, providing a natural model for hub-and-spoke or chain-of-command architectures. We prove that the BNC holds for all comb graphs and determine the precise asymptotic order of their burning number in every parameter regime, including exact formulas in the spine-dominant case that generalize the known formula for paths. Our approach is constructive, based on an explicit greedy algorithm that is optimal or near-optimal depending on the regime.
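The burning process defined in the first sentence is easy to simulate. Here is a small sketch (helper names are mine), together with the path $P_9$, where the $\lceil\sqrt{9}\rceil = 3$ bound is met by igniting vertices 2, 6, and 8 in that order:

```python
def burn(adj, schedule):
    """One round per ignition source: existing fires spread to all
    neighbors first, then the round's new source is ignited.
    Returns the set of burned vertices after len(schedule) rounds."""
    burned = set()
    for source in schedule:
        burned |= {v for u in burned for v in adj[u]}  # spread
        burned.add(source)                             # ignite
    return burned

# Path P_9 as an adjacency dict over vertices 0..8.
path9 = {i: [j for j in (i - 1, i + 1) if 0 <= j < 9] for i in range(9)}
```

After three rounds the first source has burned a radius-2 ball (5 vertices), the second a radius-1 ball (3 vertices), and the third a single vertex, covering all 9; a shorter or badly placed schedule leaves vertices unburned.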
We introduce and study the problem of consistent low-rank approximation, in which rows of an input matrix $\mathbf{A}\in\mathbb{R}^{n\times d}$ arrive sequentially and the goal is to provide a sequence of subspaces that well-approximate the optimal rank-$k$ approximation to the submatrix $\mathbf{A}^{(t)}$ that has arrived at each time $t$, while minimizing the recourse, i.e., the overall change in the sequence of solutions. We first show that when the goal is to achieve a low-rank cost within an additive $\varepsilon\cdot||\mathbf{A}^{(t)}||_F^2$ factor of the optimal cost, roughly $\mathcal{O}\left(\frac{k}{\varepsilon}\log(nd)\right)$ recourse is feasible. For the more challenging goal of achieving a relative $(1+\varepsilon)$-multiplicative approximation of the optimal rank-$k$ cost, we show that a simple upper bound in this setting is $\frac{k^2}{\varepsilon^2}\cdot\text{poly}\log(nd)$ recourse, which we further improve to $\frac{k^{3/2}}{\varepsilon^2}\cdot\text{poly}\log(nd)$ for integer-bounded matrices and $\frac{k}{\varepsilon^2}\cdot\text{poly}\log(nd)$ for data streams with polynomial online condition number. We also show that $Ω\left(\frac{k}{\varepsilon}\log\frac{n}{k}\right)$ recourse is necessary for any algorithm that maintains a multiplicative $(1+\varepsilon)$-approximation to the optimal low-rank cost, even if the full input is known in advance. Finally, we perform a number of empirical evaluations to complement our theoretical guarantees, demonstrating the efficacy of our algorithms in practice.
We introduce \textbf{Kruskal-EDS} (\emph{Edge Dynamic Stratification}), a distribution-adaptive variant of Kruskal's minimum spanning tree (MST) algorithm that replaces the mandatory $Θ(m\log m)$ global sort with a three-phase procedure inspired by Birkhoff's ergodic theorem. In Phase 1, a sample of $\sqrt{m}$ edges estimates the weight distribution in $Θ(\sqrt{m}\log m)$ time. In Phase 2, all $m$ edges are assigned to $k$ strata in $Θ(m\log k)$ time via binary search on quantile boundaries -- no global sort. In Phase 3, strata are sorted and processed in order; execution halts as soon as $n{-}1$ MST edges are accepted. We prove an effective complexity of $Θ(m + p\cdot(m/k)\log(m/k))$, where $p \leq k$ is the number of strata actually processed. On sparse graphs or heavy-tailed weight distributions, $p \ll k$ and the algorithm achieves near-$Θ(m)$ behaviour. We further derive the optimal strata count $k^* = \lceil\sqrt{m/\ln(m+1)}\,\rceil$, balancing partition overhead against intra-stratum sort cost. An extensive benchmark on 14 graph families demonstrates correctness on 12 test cases and practical speedups reaching $\mathbf{10\times}$ in wall-clock time and $\mathbf{33\times}$ in sort operations over standard Kruskal. A 3-dimensional TikZ visualisation of the complexity landscape illustrates the algorithm's adaptive behaviour as a function of graph density and weight distribution skewness.
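The three phases are concrete enough to sketch in code. The version below is my reading of the abstract, not the authors' implementation; the default strata count and the union-find details are illustrative:

```python
import bisect
import math
import random

def find(parent, x):
    """Union-find root lookup with path halving."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def kruskal_eds(n, edges, k=None):
    """Stratified Kruskal sketch: bucket edges by sampled weight
    quantiles, sort strata lazily, stop once the MST is complete.
    edges is a list of (u, v, weight) tuples."""
    m = len(edges)
    if k is None:  # heuristic strata count, roughly sqrt(m / ln m)
        k = max(1, int((m / max(1.0, math.log(m + 1))) ** 0.5))
    # Phase 1: sample ~sqrt(m) edges to estimate quantile boundaries.
    sample = sorted(w for _, _, w in random.sample(edges, max(1, int(m ** 0.5))))
    bounds = [sample[(i * len(sample)) // k] for i in range(1, k)]
    # Phase 2: assign each edge to a stratum by binary search (no global sort).
    strata = [[] for _ in range(k)]
    for e in edges:
        strata[bisect.bisect_left(bounds, e[2])].append(e)
    # Phase 3: sort and scan strata in order; halt at n-1 accepted edges.
    parent = list(range(n))
    accepted, total = 0, 0.0
    for stratum in strata:
        for u, v, w in sorted(stratum, key=lambda e: e[2]):
            ru, rv = find(parent, u), find(parent, v)
            if ru != rv:
                parent[ru] = rv
                accepted += 1
                total += w
                if accepted == n - 1:
                    return total
    return total
```

Because strata cover disjoint, increasing weight ranges, scanning them in order processes edges in globally non-decreasing weight, so the result matches classical Kruskal; the saving is that strata beyond the $(n{-}1)$-th accepted edge are never sorted.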
The limits of optimal control, from the maximalist and minimalist perspectives
This is a live blog of Lecture 4 of my graduate seminar “Feedback, Learning, and Adaptation.” A table of contents is here. The dramatic conclusion of Steampunk Data Science will appear later this week.
Though ostensibly chasing similar questions, the fields of artificial intelligence and control theory couldn’t be more temperamentally different. In AI, much like the rest of computer science, there’s a mindset to “just do stuff” and patch bugs as they come. There’s a spirit of tinkering and benchmaxxing, and the expectation that systems will get more robust as they get more capable. I mean, the AI Safety people seem to think that the way to get safe AI is to spend billions of dollars building a superintelligent AI. They say this out loud to anyone willing to listen.
Control theory, by contrast, wants you to learn non-self-adjoint operator theory before you think about building anything. If you’re designing an airplane or any other system where failure can lead to catastrophe, safety has to come first. And the path to safety is always more math.
Finding a nice middle ground between these two mindsets has been much harder than I’d like. And yet, the two fields connect at optimal control. The foundation of reinforcement learning (and its successor, recursive self-improvement) is optimal control. Modern control theory emerged in the Cold War planning of the 50s, where aerospace engineers developed optimal control to plot rocket trajectories. In class today, we’ll look at both perspectives and find why they not only can’t meet in the middle, but both leave us with rather unsatisfactory views of the role of optimization in sequential decision-making.
In its most grandiose formulation, one often adopted by rationalist AI maximalists, optimal control claims a universal model for making decisions in the face of uncertainty. You just need the right cost function and proper Bayesian updating.
This view falls apart once you move beyond the toy problems people nerd out about in online rationalist forums. Most problems have more stages than the Monty Hall problem. And once you have a modestly long horizon, exact solutions of optimal control problems are beyond impossible. We all quickly learn that with any reasonable complexity, you have to move from dynamic programming to approximate dynamic programming, meaning you are never sure if you are actually solving your original problem. Moreover, as soon as you have to incorporate measurement uncertainty and those clever Bayesian updates, the associated optimization problems are at best PSPACE-complete or at worst undecidable. You could try to argue to me that your POMDP problem isn’t PSPACE-complete, and enough iterations of GFYPO will find the answer. But this means that your superintelligent optimizer isn’t actually solving optimization problems. What it’s actually doing is anyone’s guess.
Perhaps we can retreat to the more modest view promoted in contemporary control classes. Optimal control is a nice scaffolding for engineering design. You get a framework to locally make sense of a huge parameter space. If you’re trying to tune multiple PID loops at once, you’d probably rather link things together with a Q and an R than a few dozen controller gains without a clear relation between them. But it’s a local framework (a small world framework, if you will), not a global system.
And the optimization framework often gives you some robustness. In the LQR problem, we know that if we find a control policy, it is stabilizing. We know that LQR gives us a way to bound the error we'll accumulate if unmodeled noise perturbs the system. And we know that near-optimal solutions are often pretty good. Some early exciting work showed that LQR was robust to misspecification of the system model.
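The stabilizing-gain fact is easy to see numerically. Here is a minimal sketch for a discrete-time double integrator; the fixed-point Riccati iteration is for illustration only, not how production solvers work:

```python
import numpy as np

def dlqr(A, B, Q, R, iters=500):
    """Approximate the discrete-time LQR gain K (u = -K x) by
    fixed-point iteration on the Riccati equation."""
    P = Q.copy()
    for _ in range(iters):
        BtP = B.T @ P
        K = np.linalg.solve(R + BtP @ B, BtP @ A)
        P = Q + A.T @ P @ (A - B @ K)  # Riccati update
    return K

# Double integrator dynamics with a simple quadratic cost.
A = np.array([[1.0, 1.0], [0.0, 1.0]])
B = np.array([[0.0], [1.0]])
K = dlqr(A, B, np.eye(2), np.eye(1))
rho = max(abs(np.linalg.eigvals(A - B @ K)))  # closed-loop spectral radius
```

The closed-loop spectral radius comes out strictly below 1, which is exactly the stabilization guarantee in this setting.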
The gain margins, sadly, turned out to be a bit of a mathematical illusion. As soon as you incorporate the Bayesian reasoning that rationally summarizes the uncertainty in measurement, the policy becomes exceptionally fragile to model misspecification. This is Doyle’s famous “There Are None” example, and we’ll work through the details today.
You can incorporate robustness directly into the formulation, but this comes at its own computational and conceptual costs. Robust control is not easy and is seldom taught. My colleagues can correct me, but I'm not sure we ever teach it at Berkeley. Why that is the case is a topic for a future blog post.
So from both the maximalist RL perspective and the minimalist control perspective, we’re stuck…with LQR. Any other optimal control problem of reasonable complexity is intractable and inherently fragile to model mismatch and measurement uncertainty. This feels like a funny place to lay the foundation of an engineering philosophy!
So why would we build an entire system around optimization? I suppose I understand the promising allure of just writing down cost functions and having robots come out. But even people who don’t work on optimization know that there are always tradeoffs in designing systems and making decisions.1 If the goal is literally just “minimize cost,” we get cheap, fragile garbage out. We have to account for all the things we might care about, and write these explicitly into the cost, the constraints, or the solution heuristics.
Ultimately, real robustness gets added in with engineering expertise rather than mathematical rigor. There’s no getting around the years of hard work in simulation and pilot programs before you can convince everyone your system is actually ready for prime time. Optimal control gives you a false sense of interpretability and rigor, and it’s important to be periodically reminded of its illusoriness.
Goodhart's law: When a measure becomes a target, it stops being a measure.
I was watching the show Master Minds, where Ken Jennings is one of the Master Minds. Here is what happened:
Brooke Burns (the host): The only vice president in the 20th century with initials H.H. was Hubert Humphrey. Who was the only vice president in the 19th century with initials H.H.?
Bill Gasarch shouting at the screen: Hannibal Hamlin, Lincoln's VP for his first term! This is really obscure so this may be a rare case where I get a question right that Ken Jennings does not know!
Ken Jennings: Hannibal Hamlin.
Bill Gasarch: Darn! However, I suspect he studied lists of presidents and vice presidents for the purpose of doing well on Jeopardy and now other shows. My knowledge is more natural in that I read books on presidents and vice presidents (best book, maybe the only book, on Vice Presidents: Bland Ambition). My knowledge of it is different from his. I tend to think my knowledge is more legitimate, though it would be hard to make that statement rigorous. If on quiz shows they asked follow-up questions, that might help alleviate this problem, if it is indeed a problem.
Misc:
Ken Jennings is a Mormon so he does not drink or know about drinks. However, before going on Jeopardy he studied drinks for the show.
Ken Jennings does know novelty songs, and that is legit, since novelty songs come up rarely on Jeopardy, so I doubt he studied them in his preparation for going on Jeopardy.
a) He knew that Shel Silverstein wrote A Boy Named Sue, which was sung by Johnny Cash.
b) He knew that Johnny Cash did a cover of Shel Silverstein's I'm being swallowed by a boa constrictor.
c) During his streak there was a category on novelty songs. He got 4 out of the 5 correct, and the one he got wrong I could tell he really knew. I got them all right, and faster, but Ken Jennings gets my respect for his legit knowledge of novelty songs. See here for the questions and answers.
Goodhart's law (maybe): Jeopardy is supposed to be about what people know. Is it instead about what people study? Does studying for it help you with things other than quiz shows?
When I watch a quiz show and get a question right, there are levels of legitimacy:
1) I know the area naturally.
2) I saw the question and answer on a different show, and I then looked it up and know more about it. So this is now legit.
3) I saw the question and answer on a different show and just know it as stimulus-response with very little understanding. Example: Kelly Clarkson was the first winner on American Idol, but I have never seen the show and only vaguely know how it works.
4) I saw the question and answer on the show I am watching, the exact episode, and I am watching a repeat. Sometimes I know stuff I don't normally know and then say OH, I've seen this episode before.
Back to my point:
Is Ken Jennings memorizing lists a less-legit form of knowledge? If so, how would we make that statement rigorous?
Is this what AI does? See also the Chinese Room, since that's also about a device that gets the right answers but perhaps for the "wrong" reason.
By gasarch
Authors: Adam R. Klivans, Konstantinos Stavropoulos, Arsen Vasilyan
Recent work has shown the surprising power of low-degree sandwiching polynomial approximators in the context of challenging learning settings such as learning with distribution shift, testable learning, and learning with contamination. A pair of sandwiching polynomials approximate a target function in expectation while also providing pointwise upper and lower bounds on the function's values. In this paper, we give a new method for constructing low-degree sandwiching polynomials that yield greatly improved degree bounds for several fundamental function classes and marginal distributions. In particular, we obtain degree $\mathrm{poly}(k)$ sandwiching polynomials for functions of $k$ halfspaces under the Gaussian distribution, improving exponentially over the prior $2^{O(k)}$ bound. More broadly, our approach applies to function classes that are low-dimensional and have smooth boundary.
In contrast to prior work, our proof is relatively simple and directly uses the smoothness of the target function's boundary to construct sandwiching Lipschitz functions, which are amenable to results from high-dimensional approximation theory. For low-dimensional polynomial threshold functions (PTFs) with respect to Gaussians, we obtain doubly exponential improvements without applying the FT-mollification method of Kane used in the best previous result.
We establish that adaptive collision-finding queries are strictly more powerful than non-adaptive ones by proving that the complexity class PWPP (Polynomial Weak Pigeonhole Principle) is not closed under adaptive Turing reductions relative to a random oracle. Previously, PWPP was known to be closed under non-adaptive Turing reductions (Jeřábek 2016). We demonstrate this black-box separation by introducing the NESTED-COLLISION problem, a natural collision-finding problem defined on a pair of shrinking functions. We show that while this problem is solvable via two adaptive calls to a PWPP oracle, its random instances cannot be solved via a black-box non-adaptive reduction to the canonical PWPP-complete problem COLLISION.