Common Knowledge and Aumann’s Agreement Theorem

August 16th, 2015

The following is the prepared version of a talk that I gave at SPARC: a high-school summer program about applied rationality held in Berkeley, CA for the past two weeks.  I had a wonderful time in Berkeley, meeting new friends and old, but I’m now leaving to visit the CQT in Singapore, and then to attend the AQIS conference in Seoul.

Common Knowledge and Aumann’s Agreement Theorem

August 14, 2015

Thank you so much for inviting me here!  I honestly don’t know whether it’s possible to teach applied rationality, the way this camp is trying to do.  What I know is that, if it is possible, then the people running SPARC are some of the awesomest people on earth to figure out how.  I’m incredibly proud that Chelsea Voss and Paul Christiano are both former students of mine, and I’m amazed by the program they and the others have put together here.  I hope you’re all having fun—or maximizing your utility functions, or whatever.

My research is mostly about quantum computing, and more broadly, computation and physics.  But I was asked to talk about something you can actually use in your lives, so I want to tell a different story, involving common knowledge.

I’ll start with the “Muddy Children Puzzle,” which is one of the greatest logic puzzles ever invented.  How many of you have seen this one?

OK, so the way it goes is, there are a hundred children playing in the mud.  Naturally, they all have muddy foreheads.  At some point their teacher comes along and says to them, as they all sit around in a circle: “stand up if you know your forehead is muddy.”  No one stands up.  For how could they know?  Each kid can see all the other 99 kids’ foreheads, so knows that they’re muddy, but can’t see his or her own forehead.  (We’ll assume that there are no mirrors or camera phones nearby, and also that this is mud that you don’t feel when it’s on your forehead.)

So the teacher tries again.  “Knowing that no one stood up the last time, now stand up if you know your forehead is muddy.”  Still no one stands up.  Why would they?  No matter how many times the teacher repeats the request, still no one stands up.

Then the teacher tries something new.  “Look, I hereby announce that at least one of you has a muddy forehead.”  After that announcement, the teacher again says, “stand up if you know your forehead is muddy”—and again no one stands up.  And again and again; it continues 99 times.  But then the hundredth time, all the children suddenly stand up.

(There’s a variant of the puzzle involving blue-eyed islanders who all suddenly commit suicide on the hundredth day, when they all learn that their eyes are blue—but as a blue-eyed person myself, that’s always struck me as needlessly macabre.)

What’s going on here?  Somehow, the teacher’s announcing to the children that at least one of them had a muddy forehead set something dramatic in motion, which would eventually make them all stand up—but how could that announcement possibly have made any difference?  After all, each child already knew that at least 99 children had muddy foreheads!

Like with many puzzles, the way to get intuition is to change the numbers.  So suppose there were two children with muddy foreheads, and the teacher announced to them that at least one had a muddy forehead, and then asked both of them whether their own forehead was muddy.  Neither would know.  But each child could reason as follows: “if my forehead weren’t muddy, then the other child would’ve seen that, and would also have known that at least one of us has a muddy forehead.  Therefore she would’ve known, when asked, that her own forehead was muddy.  Since she didn’t know, that means my forehead is muddy.”  So then both children know their foreheads are muddy, when the teacher asks a second time.

Now, this argument can be generalized to any (finite) number of children.  The crucial concept here is common knowledge.  We call a fact “common knowledge” if, not only does everyone know it, but everyone knows everyone knows it, and everyone knows everyone knows everyone knows it, and so on.  It’s true that in the beginning, each child knew that all the other children had muddy foreheads, but it wasn’t common knowledge that even one of them had a muddy forehead.  For example, if your forehead and mine are both muddy, then I know that at least one of us has a muddy forehead, and you know that too, but you don’t know that I know it (for what if your forehead were clean?), and I don’t know that you know it (for what if my forehead were clean?).

What the teacher’s announcement did, was to make it common knowledge that at least one child has a muddy forehead (since not only did everyone hear the announcement, but everyone witnessed everyone else hearing it, etc.).  And once you understand that point, it’s easy to argue by induction: after the teacher asks and no child stands up (and everyone sees that no one stood up), it becomes common knowledge that at least two children have muddy foreheads (since if only one child had had a muddy forehead, that child would’ve known it and stood up).  Next it becomes common knowledge that at least three children have muddy foreheads, and so on, until after a hundred rounds it’s common knowledge that everyone’s forehead is muddy, so everyone stands up.

The moral is that the mere act of saying something publicly can change the world—even if everything you said was already obvious to every last one of your listeners.  For it’s possible that, until your announcement, not everyone knew that everyone knew the thing, or knew everyone knew everyone knew it, etc., and that could have prevented them from acting.

This idea turns out to have huge real-life consequences, to situations way beyond children with muddy foreheads.  I mean, it also applies to children with dots on their foreheads, or “kick me” signs on their backs…

But seriously, let me give you an example I stole from Steven Pinker, from his wonderful book The Stuff of Thought.  Two people of indeterminate gender—let’s not make any assumptions here—go on a date.  Afterward, one of them says to the other: “Would you like to come up to my apartment to see my etchings?”  The other says, “Sure, I’d love to see them.”

This is such a cliché that we might not even notice the deep paradox here.  It’s like with life itself: people knew for thousands of years that every bird has the right kind of beak for its environment, but not until Darwin and Wallace could anyone articulate why (and only a few people before them even recognized there was a question there that called for a non-circular answer).

In our case, the puzzle is this: both people on the date know perfectly well that the reason they’re going up to the apartment has nothing to do with etchings.  They probably even both know the other knows that.  But if that’s the case, then why don’t they just blurt it out: “would you like to come up for some intercourse?”  (Or “fluid transfer,” as the John Nash character put it in the Beautiful Mind movie?)

So here’s Pinker’s answer.  Yes, both people know why they’re going to the apartment, but they also want to avoid their knowledge becoming common knowledge.  They want plausible deniability.  There are several possible reasons: to preserve the romantic fantasy of being “swept off one’s feet.”  To provide a face-saving way to back out later, should one of them change their mind: since nothing was ever openly said, there’s no agreement to abrogate.  In fact, even if only one of the people (say A) might care about such things, if the other person (say B) thinks there’s any chance A cares, B will also have an interest in avoiding common knowledge, for A’s sake.

Put differently, the issue is that, as soon as you say X out loud, the other person doesn’t merely learn X: they learn that you know X, that you know that they know that you know X, that you want them to know you know X, and an infinity of other things that might upset the delicate epistemic balance.  Contrast that with the situation where X is left unstated: yeah, both people are pretty sure that “etchings” are just a pretext, and can even plausibly guess that the other person knows they’re pretty sure about it.  But once you start getting to 3, 4, 5, levels of indirection—who knows?  Maybe you do just want to show me some etchings.

Philosophers like to discuss Sherlock Holmes and Professor Moriarty meeting in a train station, and Moriarty declaring, “I knew you’d be here,” and Holmes replying, “well, I knew that you knew I’d be here,” and Moriarty saying, “I knew you knew I knew I’d be here,” etc.  But real humans tend to be unable to reason reliably past three or four levels in the knowledge hierarchy.  (Related to that, you might have heard of the game where everyone guesses a number between 0 and 100, and the winner is whoever’s number is the closest to 2/3 of the average of all the numbers.  If this game is played by perfectly rational people, who know they’re all perfectly rational, and know they know, etc., then they must all guess 0—exercise for you to see why.  Yet experiments show that, if you actually want to win this game against average people, you should guess about 20.  People seem to start with 50 or so, iterate the operation of multiplying by 2/3 a few times, and then stop.)

Incidentally, do you know what I would’ve given for someone to have explained this stuff to me back in high school?  I think that a large fraction of the infamous social difficulties that nerds have, is simply down to nerds spending so much time in domains (like math and science) where the point is to struggle with every last neuron to make everything common knowledge, to make all truths as clear and explicit as possible.  Whereas in social contexts, very often you’re managing a delicate epistemic balance where you need certain things to be known, but not known to be known, and so forth—where you need to prevent common knowledge from arising, at least temporarily.  “Normal” people have an intuitive feel for this; it doesn’t need to be explained to them.  For nerds, by contrast, explaining it—in terms of the muddy children puzzle and so forth—might be exactly what’s needed.  Once they’re told the rules of a game, nerds can try playing it too!  They might even turn out to be good at it.

OK, now for a darker example of common knowledge in action.  If you read accounts of Nazi Germany, or the USSR, or North Korea or other despotic regimes today, you can easily be overwhelmed by this sense of, “so why didn’t all the sane people just rise up and overthrow the totalitarian monsters?  Surely there were more sane people than crazy, evil ones.  And probably the sane people even knew, from experience, that many of their neighbors were sane—so why this cowardice?”  Once again, it could be argued that common knowledge is the key.  Even if everyone knows the emperor is naked; indeed, even if everyone knows everyone knows he’s naked, still, if it’s not common knowledge, then anyone who says the emperor’s naked is knowingly assuming a massive personal risk.  That’s why, in the story, it took a child to shift the equilibrium.  Likewise, even if you know that 90% of the populace will join your democratic revolt provided they themselves know 90% will join it, if you can’t make your revolt’s popularity common knowledge, everyone will be stuck second-guessing each other, worried that if they revolt they’ll be an easily-crushed minority.  And because of that very worry, they’ll be correct!

(My favorite Soviet joke involves a man standing in the Moscow train station, handing out leaflets to everyone who passes by.  Eventually, of course, the KGB arrests him—but they discover to their surprise that the leaflets are just blank pieces of paper.  “What’s the meaning of this?” they demand.  “What is there to write?” replies the man.  “It’s so obvious!”  Note that this is precisely a situation where the man is trying to make common knowledge something he assumes his “readers” already know.)

The kicker is that, to prevent something from becoming common knowledge, all you need to do is censor the common-knowledge-producing mechanisms: the press, the Internet, public meetings.  This nicely explains why despots throughout history have been so obsessed with controlling the press, and also explains how it’s possible for 10% of a population to murder and enslave the other 90% (as has happened again and again in our species’ sorry history), even though the 90% could easily overwhelm the 10% by acting in concert.  Finally, it explains why believers in the Enlightenment project tend to be such fanatical absolutists about free speech—why they refuse to “balance” it against cultural sensitivity or social harmony or any other value, as so many well-meaning people urge these days.

OK, but let me try to tell you something surprising about common knowledge.  Here at SPARC, you’ve learned all about Bayes’ rule—how, if you like, you can treat “probabilities” as just made-up numbers in your head, which are required obey the probability calculus, and then there’s a very definite rule for how to update those numbers when you gain new information.  And indeed, how an agent that wanders around constantly updating these numbers in its head, and taking whichever action maximizes its expected utility (as calculated using the numbers), is probably the leading modern conception of what it means to be “rational.”

Now imagine that you’ve got two agents, call them Alice and Bob, with common knowledge of each other’s honesty and rationality, and with the same prior probability distribution over some set of possible states of the world.  But now imagine they go out and live their lives, and have totally different experiences that lead to their learning different things, and having different posterior distributions.  But then they meet again, and they realize that their opinions about some topic (say, Hillary’s chances of winning the election) are common knowledge: they both know each other’s opinion, and they both know that they both know, and so on.  Then a striking 1976 result called Aumann’s Theorem states that their opinions must be equal.  Or, as it’s summarized: “rational agents with common priors can never agree to disagree about anything.”

Actually, before going further, let’s prove Aumann’s Theorem—since it’s one of those things that sounds like a mistake when you first hear it, and then becomes a triviality once you see the 3-line proof.  (Albeit, a “triviality” that won Aumann a Nobel in economics.)  The key idea is that knowledge induces a partition on the set of possible states of the world.  Huh?  OK, imagine someone is either an old man, an old woman, a young man, or a young woman.  You and I agree in giving each of these a 25% prior probability.  Now imagine that you find out whether they’re a man or a woman, and I find out whether they’re young or old.  This can be illustrated as follows:


The diagram tells us, for example, that if the ground truth is “old woman,” then your knowledge is described by the set {old woman, young woman}, while my knowledge is described by the set {old woman, old man}.  And this different information leads us to different beliefs: for example, if someone asks for the probability that the person is a woman, you’ll say 100% but I’ll say 50%.  OK, but what does it mean for information to be common knowledge?  It means that I know that you know that I know that you know, and so on.  Which means that, if you want to find out what’s common knowledge between us, you need to take the least common coarsening of our knowledge partitions.  I.e., if the ground truth is some given world w, then what do I consider it possible that you consider it possible that I consider possible that … etc.?  Iterate this growth process until it stops, by “zigzagging” between our knowledge partitions, and you get the set S of worlds such that, if we’re in world w, then what’s common knowledge between us is that the world belongs to S.  Repeat for all w’s, and you get the least common coarsening of our partitions.  In the above example, the least common coarsening is trivial, with all four worlds ending up in the same set S, but there are nontrivial examples as well:


Now, if Alice’s expectation of a random variable X is common knowledge between her and Bob, that means that everywhere in S, her expectation must be constant … and hence must equal whatever the expectation is, over all the worlds in S!  Likewise, if Bob’s expectation is common knowledge with Alice, then everywhere in S, it must equal the expectation of X over S.  But that means that Alice’s and Bob’s expectations are the same.

There are lots of related results.  For example, rational agents with common priors, and common knowledge of each other’s rationality, should never engage in speculative trade (e.g., buying and selling stocks, assuming that they don’t need cash, they’re not earning a commission, etc.).  Why?  Basically because, if I try to sell you a stock for (say) $50, then you should reason that the very fact that I’m offering it means I must have information you don’t that it’s worth less than $50, so then you update accordingly and you don’t want it either.

Or here’s another one: suppose again that we’re Bayesians with common priors, and we’re having a conversation, where I tell you my opinion (say, of the probability Hillary will win the election).  Not any of the reasons or evidence on which the opinion is based—just the opinion itself.  Then you, being Bayesian, update your probabilities to account for what my opinion is.  Then you tell me your opinion (which might have changed after learning mine), I update on that, I tell you my new opinion, then you tell me your new opinion, and so on.  You might think this could go on forever!  But, no, Geanakoplos and Polemarchakis observed that, as long as there are only finitely many possible states of the world in our shared prior, this process must converge after finitely many steps with you and me having the same opinion (and moreover, with it being common knowledge that we have that opinion).  Why?  Because as long as our opinions differ, your telling me your opinion or me telling you mine must induce a nontrivial refinement of one of our knowledge partitions, like so:


I.e., if you learn something new, then at least one of your knowledge sets must get split along the different possible values of the thing you learned.  But since there are only finitely many underlying states, there can only be finitely many such splittings (note that, since Bayesians never forget anything, knowledge sets that are split will never again rejoin).

And something else: suppose your friend tells you a liberal opinion, then you take it into account, but reply with a more conservative opinion.  The friend takes your opinion into account, and replies with a revised opinion.  Question: is your friend’s new opinion likelier to be more liberal than yours, or more conservative?

Obviously, more liberal!  Yes, maybe your friend now sees some of your points and vice versa, maybe you’ve now drawn a bit closer (ideally!), but you’re not going to suddenly switch sides because of one conversation.

Yet, if you and your friend are Bayesians with common priors, one can prove that that’s not what should happen at all.  Indeed, your expectation of your own future opinion should equal your current opinion, and your expectation of your friend’s next opinion should also equal your current opinion—meaning that you shouldn’t be able to predict in which direction your opinion will change next, nor in which direction your friend will next disagree with you.  Why not?  Formally, because all these expectations are just different ways of calculating an expectation over the same set, namely your current knowledge set (i.e., the set of states of the world that you currently consider possible)!  More intuitively, we could say: if you could predict that, all else equal, the next thing you heard would probably shift your opinion in a liberal direction, then as a Bayesian you should already shift your opinion in a liberal direction right now.  (This is related to what’s called the “martingale property”: sure, a random variable X could evolve in many ways in the future, but the average of all those ways must be its current expectation E[X], by the very definition of E[X]…)

So, putting all these results together, we get a clear picture of what rational disagreements should look like: they should follow unbiased random walks, until sooner or later they terminate in common knowledge of complete agreement.  We now face a bit of a puzzle, in that hardly any disagreements in the history of the world have ever looked like that.  So what gives?

There are a few ways out:

(1) You could say that the “failed prediction” of Aumann’s Theorem is no surprise, since virtually all human beings are irrational cretins, or liars (or at least, it’s not common knowledge that they aren’t). Except for you, of course: you’re perfectly rational and honest.  And if you ever met anyone else as rational and honest as you, maybe you and they could have an Aumannian conversation.  But since such a person probably doesn’t exist, you’re totally justified to stand your ground, discount all opinions that differ from yours, etc.  Notice that, even if you genuinely believed that was all there was to it, Aumann’s Theorem would still have an aspirational significance for you: you would still have to say this is the ideal that all rationalists should strive toward when they disagree.  And that would already conflict with a lot of standard rationalist wisdom.  For example, we all know that arguments from authority carry little weight: what should sway you is not the mere fact of some other person stating their opinion, but the actual arguments and evidence that they’re able to bring.  Except that as we’ve seen, for Bayesians with common priors this isn’t true at all!  Instead, merely hearing your friend’s opinion serves as a powerful summary of what your friend knows.  And if you learn that your rational friend disagrees with you, then even without knowing why, you should take that as seriously as if you discovered a contradiction in your own thought processes.  This is related to an even broader point: there’s a normative rule of rationality that you should judge ideas only on their merits—yet if you’re a Bayesian, of course you’re going to take into account where the ideas come from, and how many other people hold them!  Likewise, if you’re a Bayesian police officer or a Bayesian airport screener or a Bayesian job interviewer, of course you’re going to profile people by their superficial characteristics, however unfair that might be to individuals—so all those studies proving that people evaluate the same resume differently if you change the name at the top are no great surprise.  It seems to me that the tension between these two different views of rationality, the normative and the Bayesian, generates a lot of the most intractable debates of the modern world.

(2) Or—and this is an obvious one—you could reject the assumption of common priors. After all, isn’t a major selling point of Bayesianism supposed to be its subjective aspect, the fact that you pick “whichever prior feels right for you,” and are constrained only in how to update that prior?  If Alice’s and Bob’s priors can be different, then all the reasoning I went through earlier collapses.  So rejecting common priors might seem appealing.  But there’s a paper by Tyler Cowen and Robin Hanson called “Are Disagreements Honest?”—one of the most worldview-destabilizing papers I’ve ever read—that calls that strategy into question.  What it says, basically, is this: if you’re really a thoroughgoing Bayesian rationalist, then your prior ought to allow for the possibility that you are the other person.  Or to put it another way: “you being born as you,” rather than as someone else, should be treated as just one more contingent fact that you observe and then conditionalize on!  And likewise, the other person should condition on the observation that they’re them and not you.  In this way, absolutely everything that makes you different from someone else can be understood as “differing information,” so we’re right back to the situation covered by Aumann’s Theorem.  Imagine, if you like, that we all started out behind some Rawlsian veil of ignorance, as pure reasoning minds that had yet to be assigned specific bodies.  In that original state, there was nothing to differentiate any of us from any other—anything that did would just be information to condition on—so we all should’ve had the same prior.  That might sound fanciful, but in some sense all it’s saying is: what licenses you to privilege an observation just because it’s your eyes that made it, or a thought just because it happened to occur in your head?  Like, if you’re objectively smarter or more observant than everyone else around you, fine, but to whatever extent you agree that you aren’t, your opinion gets no special epistemic protection just because it’s yours.

(3) If you’re uncomfortable with this tendency of Bayesian reasoning to refuse to be confined anywhere, to want to expand to cosmic or metaphysical scope (“I need to condition on having been born as me and not someone else”)—well then, you could reject the entire framework of Bayesianism, as your notion of rationality. Lest I be cast out from this camp as a heretic, I hasten to say: I include this option only for the sake of completeness!

(4) When I first learned about this stuff 12 years ago, it seemed obvious to me that a lot of it could be dismissed as irrelevant to the real world for reasons of complexity. I.e., sure, it might apply to ideal reasoners with unlimited time and computational power, but as soon as you impose realistic constraints, this whole Aumannian house of cards should collapse.  As an example, if Alice and Bob have common priors, then sure they’ll agree about everything if they effectively share all their information with each other!  But in practice, we don’t have time to “mind-meld,” swapping our entire life experiences with anyone we meet.  So one could conjecture that agreement, in general, requires a lot of communication.  So then I sat down and tried to prove that as a theorem.  And you know what I found?  That my intuition here wasn’t even close to correct!

In more detail, I proved the following theorem.  Suppose Alice and Bob are Bayesians with shared priors, and suppose they’re arguing about (say) the probability of some future event—or more generally, about any random variable X bounded in [0,1].  So, they have a conversation where Alice first announces her expectation of X, then Bob announces his new expectation, and so on.  The theorem says that Alice’s and Bob’s estimates of X will necessarily agree to within ±ε, with probability at least 1-δ over their shared prior, after they’ve exchanged only O(1/(δε2)) messages.  Note that this bound is completely independent of how much knowledge they have; it depends only on the accuracy with which they want to agree!  Furthermore, the same bound holds even if Alice and Bob only send a few discrete bits about their real-valued expectations with each message, rather than the expectations themselves.

The proof involves the idea that Alice and Bob’s estimates of X, call them XA and XB respectively, follow “unbiased random walks” (or more formally, are martingales).  Very roughly, if |XA-XB|≥ε with high probability over Alice and Bob’s shared prior, then that fact implies that the next message has a high probability (again, over the shared prior) of causing either XA or XB to jump up or down by about ε.  But XA and XB, being estimates of X, are bounded between 0 and 1.  So a random walk with a step size of ε can only continue for about 1/ε2 steps before it hits one of the “absorbing barriers.”

The way to formalize this is to look at the variances, Var[XA] and Var[XB], with respect to the shared prior.  Because Alice and Bob’s partitions keep getting refined, the variances are monotonically non-decreasing.  They start out 0 and can never exceed 1 (in fact they can never exceed 1/4, but let’s not worry about constants).  Now, the key lemma is that, if Pr[|XA-XB|≥ε]≥δ, then Var[XB] must increase by at least δε2 if Alice sends XA to Bob, and Var[XA] must increase by at least δε2 if Bob sends XB to Alice.  You can see my paper for the proof, or just work it out for yourself.  At any rate, the lemma implies that, after O(1/(δε2)) rounds of communication, there must be at least a temporary break in the disagreement; there must be some round where Alice and Bob approximately agree with high probability.

There are lots of other results in my paper, including an upper bound on the number of calls that Alice and Bob need to make to a “sampling oracle” to carry out this sort of protocol approximately, assuming they’re not perfect Bayesians but agents with bounded computational power.  But let me step back and address the broader question: what should we make of all this?  How should we live with the gargantuan chasm between the prediction of Bayesian rationality for how we should disagree, and the actual facts of how we do disagree?

We could simply declare that human beings are not well-modeled as Bayesians with common priors—that we’ve failed in giving a descriptive account of human behavior—and leave it at that.   OK, but that would still leave the question: does this stuff have normative value?  Should it affect how we behave, if we want to consider ourselves honest and rational?  I would argue, possibly yes.

Yes, you should constantly ask yourself the question: “would I still be defending this opinion, if I had been born as someone else?”  (Though you might say this insight predates Aumann by quite a bit, going back at least to Spinoza.)

Yes, if someone you respect as honest and rational disagrees with you, you should take it as seriously as if the disagreement were between two different aspects of yourself.

Finally, yes, we can try to judge epistemic communities by how closely they approach the Aumannian ideal.  In math and science, in my experience, it’s common to see two people furiously arguing with each other at a blackboard.  Come back five minutes later, and they’re arguing even more furiously, but now their positions have switched.  As we’ve seen, that’s precisely what the math says a rational conversation should look like.  In social and political discussions, though, usually the very best you’ll see is that two people start out diametrically opposed, but eventually one of them says “fine, I’ll grant you this,” and the other says “fine, I’ll grant you that.”  We might say, that’s certainly better than the common alternative, of the two people walking away even more polarized than before!  Yet the math tells us that even the first case—even the two people gradually getting closer in their views—is nothing at all like a rational exchange, which would involve the two participants repeatedly leapfrogging each other, completely changing their opinion about the question under discussion (and then changing back, and back again) every time they learned something new.  The first case, you might say, is more like haggling—more like “I’ll grant you that X is true if you grant me that Y is true”—than like our ideal friendly mathematicians arguing at the blackboard, whose acceptance of new truths is never slow or grudging, never conditional on the other person first agreeing with them about something else.

Armed with this understanding, we could try to rank fields by how hard it is to have an Aumannian conversation in them.  At the bottom—the easiest!—is math (or, let’s say, chess, or debugging a program, or fact-heavy fields like lexicography or geography).  Crucially, here I only mean the parts of these subjects with agreed-on rules and definite answers: once the conversation turns to whose theorems are deeper, or whose fault the bug was, things can get arbitrarily non-Aumannian.  Then there’s the type of science that involves messy correlational studies (I just mean, talking about what’s a risk factor for what, not the political implications).  Then there’s politics and aesthetics, with the most radioactive topics like Israel/Palestine higher up.  And then, at the very peak, there’s gender and social justice debates, where everyone brings their formative experiences along, and absolutely no one is a disinterested truth-seeker, and possibly no Aumannian conversation has ever been had in the history of the world.

I would urge that even at the very top, it’s still incumbent on all of us to try to make the Aumannian move, of “what would I think about this issue if I were someone else and not me?  If I were a man, a woman, black, white, gay, straight, a nerd, a jock?  How much of my thinking about this represents pure Spinozist reason, which could be ported to any rational mind, and how much of it would get lost in translation?”

Anyway, I’m sure some people would argue that, in the end, the whole framework of Bayesian agents, common priors, common knowledge, etc. can be chucked from this discussion like so much scaffolding, and the moral lessons I want to draw boil down to trite advice (“try to see the other person’s point of view”) that you all knew already.  Then again, even if you all knew all this, maybe you didn’t know that you all knew it!  So I hope you gained some new information from this talk in any case.  Thanks.

Update: Coincidentally, there’s a moving NYT piece by Oliver Sacks, which (among other things) recounts his experiences with his cousin, the Aumann of Aumann’s theorem.

Another Update: If I ever did attempt an Aumannian conversation with someone, the other Scott A. would be a candidate! Here he is in 2011 making several of the same points I did above, using the same examples (I thank him for pointing me to his post).

Introducing some British people to P vs. NP

July 22nd, 2015

Here’s a 5-minute interview that I did with The Naked Scientists (a radio show syndicated by the BBC, and no, I’m not sure why it’s called that), explaining the P vs. NP problem.  For readers of this blog, there won’t be anything new here, but, well … you might enjoy the rest of the hour-long programme [sic], which also includes segments about a few other Clay Millennium problems (the Riemann Hypothesis, Navier-Stokes, and Poincaré), as well as a segment about fat fish that live in caves and gorge themselves on food on the rare occasions when it becomes available, and which turn out to share a gene variant with humans with similar tendencies.

Quantum query complexity: the other shoe drops

June 30th, 2015

Two weeks ago I blogged about a breakthrough in query complexity: namely, the refutation by Ambainis et al. of a whole slew of conjectures that had stood for decades (and that I mostly believed, and that had helped draw me into theoretical computer science as a teenager) about the largest possible gaps between various complexity measures for total Boolean functions. Specifically, Ambainis et al. built on a recent example of Göös, Pitassi, and Watson to construct bizarre Boolean functions f with, among other things, near-quadratic gaps between D(f) and R0(f) (where D is deterministic query complexity and R0 is zero-error randomized query complexity), near-1.5th-power gaps between R0(f) and R(f) (where R is bounded-error randomized query complexity), and near-4th-power gaps between D(f) and Q(f) (where Q is bounded-error quantum query complexity). See my previous post for more about the definitions of these concepts and the significance of the results (and note also that Mukhopadhyay and Sanyal independently obtained weaker results).

Because my mental world was in such upheaval, in that earlier post I took pains to point out one thing that Ambainis et al. hadn’t done: namely, they still hadn’t shown any super-quadratic separation between R(f) and Q(f), for any total Boolean function f. (Recall that a total Boolean function, f:{0,1}n→{0,1}, is one that’s defined for all 2n possible input strings x∈{0,1}n. Meanwhile, a partial Boolean function is one where there’s some promise on x: for example, that x encodes a periodic sequence. When you phrase them in the query complexity model, Shor’s algorithm and other quantum algorithms achieving exponential speedups work only for partial functions, not for total ones. Indeed, a famous result of Beals et al. from 1998 says that D(f)=O(Q(f)6) for all total functions f.)

So, clinging to a slender reed of sanity, I said it “remains at least a plausible conjecture” that, if you insist on a fair comparison—i.e., bounded-error quantum versus bounded-error randomized—then the biggest speedup quantum algorithms can ever give you over classical ones, for total Boolean functions, is the square-root speedup that Grover’s algorithm easily achieves for the n-bit OR function.

Today, I can proudly report that my PhD student, Shalev Ben-David, has refuted that conjecture as well.  Building on the Göös et al. and Ambainis et al. work, but adding a new twist to it, Shalev has constructed a total Boolean function f such that R(f) grows roughly like Q(f)2.5 (yes, that’s Q(f) to the 2.5th power). Furthermore, if a conjecture that Ambainis and I made in our recent “Forrelation” paper is correct—namely, that a problem called “k-fold Forrelation” has randomized query complexity roughly Ω(n1-1/k)—then one would get nearly a cubic gap between R(f) and Q(f).

The reason I found this question so interesting is that it seemed obvious to me that, to produce a super-quadratic separation between R and Q, one would need a fundamentally new kind of quantum algorithm: one that was unlike Simon’s and Shor’s algorithms in that it worked for total functions, but also unlike Grover’s algorithm in that it didn’t hit some impassable barrier at the square root of the classical running time.

Flummoxing my expectations once again, Shalev produced the super-quadratic separation, but not by designing any new quantum algorithm. Instead, he cleverly engineered a Boolean function for which you can use a combination of Grover’s algorithm and the Forrelation algorithm (or any other quantum algorithm that gives a huge speedup for some partial Boolean function—Forrelation is just the maximal example), to get an overall speedup that’s a little more than quadratic, while still keeping your Boolean function total. I’ll let you read Shalev’s short paper for the details, but briefly, it once again uses the Göös et al. / Ambainis et al. trick of defining a Boolean function that equals 1 if and only if the input string contains some hidden substructure, and the hidden substructure also contains a pointer to a “certificate” that lets you quickly verify that the hidden substructure was indeed there. You can use a super-fast algorithm—let’s say, a quantum algorithm designed for partial functions—to find the hidden substructure assuming it’s there. If you don’t find it, you can simply output 0. But if you do find it (or think you found it), then you can use the certificate, together with Grover’s algorithm, to confirm that you weren’t somehow misled, and that the substructure really was there. This checking step ensures that the function remains total.

Are there further separations to be found this way? Almost certainly! Indeed, Shalev, Robin Kothari, and I have already found some more things (as well as different/simpler proofs of known separations), though nothing quite as exciting as the above.

Update (July 1): Ronald de Wolf points out in the comments that this “trust-but-verify” trick, for designing total Boolean functions with unexpectedly low quantum query complexities, was also used in a recent paper by himself and Ambainis (while Ashley Montanaro points out that a similar trick was used even earlier, in a different context, by Le Gall).  What’s surprising, you might say, is that it took as long as it did for people to realize how many applications this trick has.

Update (July 2): In conversation with Robin Kothari and Cedric Lin, I realized that Shalev’s superquadratic separation between R and Q, combined with a recent result of Lin and Lin, resolves another open problem that had bothered me since 2001 or so. Given a Boolean function f, define the “projective quantum query complexity,” or P(f), to be the minimum number of queries made by a bounded-error quantum algorithm, in which the answer register gets immediately measured after each query. This is a model of quantum algorithms that’s powerful enough to capture (for example) Simon’s and Shor’s algorithms, but not Grover’s algorithm. Indeed, one might wonder whether there’s any total Boolean function for which P(f) is asymptotically smaller than R(f)—that’s the question I wondered about around 2001, and that I discussed with Elham Kashefi. Now, by using an argument based on the “Vaidman bomb,” Lin and Lin recently proved the fascinating result that P(f)=O(Q(f)2) for all functions f, partial or total. But, combining with Shalev’s result that there exists a total f for which R(f)=Ω(Q(f)2.5), we get that there’s a total f for which R(f)=Ω(P(f)1.25). In the other direction, the best I know is that P(f)=Ω(bs(f)) and therefore R(f)=O(P(f)3).

Celebrate gay marriage—and its 2065 equivalent

June 27th, 2015

Yesterday was a historic day for the United States, and I was as delighted as everyone else I know.  I’ve supported gay marriage since the mid-1990s, when as a teenager, I read Andrew Hodges’ classic biography of Alan Turing, and burned with white-hot rage at Turing’s treatment.  In the world he was born into—our world, until fairly recently—Turing was “free”: free to prove the unsolvability of the halting problem, free to help save civilization from the Nazis, just not free to pursue the sexual and romantic fulfillment that nearly everyone else took for granted.  I resolved then that, if I was against anything in life, I was against the worldview that had hounded Turing to his death, or anything that even vaguely resembled it.

So I’m proud for my country, and I’m thrilled for my gay friends and colleagues and relatives.  At the same time, seeing my Facebook page light up with an endless sea of rainbow flags and jeers at Antonin Scalia, there’s something that gnaws at me.  To stand up for Alan Turing in 1952 would’ve taken genuine courage.  To support gay rights in the 60s, 70s, 80s, even the 90s, took courage.  But celebrating a social change when you know all your friends will upvote you, more than a decade after the tide of history has made the change unstoppable?  It’s fun, it’s righteous, it’s justified, I’m doing it myself.  But let’s not kid ourselves by calling it courageous.

Do you want to impress me with your moral backbone?  Then go and find a group that almost all of your Facebook friends still consider it okay, even praiseworthy, to despise and mock, for moral failings that either aren’t failings at all or are no worse than the rest of humanity’s.  (I promise: once you start looking, it shouldn’t be hard to find.)  Then take a public stand for that group.

“Can Quantum Computing Reveal the True Meaning of Quantum Mechanics?”

June 25th, 2015

I now have a 3500-word post on that question up at NOVA’s “Nature of Reality” blog.  If you’ve been reading Shtetl-Optimized religiously for the past decade (why?), there won’t be much new to you there, but if not, well, I hope you like it!  Comments are welcome, either here or there.  Thanks so much to Kate Becker at NOVA for commissioning this piece, and for her help editing it.

FCRC Highlights

June 19th, 2015

By popular request, here are some highlights from this week’s FCRC conference in Portland, Oregon:

  • The edit distance between two strings means the minimum number of insertions, deletions, and replacements needed to convert one string to the other: for example, SHTETL and SHETLAND have an edit distance of 4.  Edit distance has major, obvious applications to DNA sequence comparison, as well as plagiarism detection and many other things.  There’s a clever dynamic programming algorithm to compute the edit distance between two n-bit strings, but it takes ~n2 time, which is already too slow for many applications.  Can you do better?  I remember wondering about that 15 years ago, as a beginning grad student taking Richard Karp’s computational biology course.  Now Arturs Backurs and Piotr Indyk have shown that, if you can compute edit distance in O(n2-ε) time for any ε>0, then you can also solve CNF-SAT in 2cn time for some c<1, thereby refuting the Strong Exponential Time Hypothesis.  For more about this important result, see this MIT News article.
  • Olivier Temam gave a superb keynote talk about hardware neural networks.  His main points were these: implementing neural nets with special-purpose hardware was a faddish idea a few decades ago, but was abandoned once people realized that (a) it didn’t work that great, and (b) more to the point, anything you could do with special-purpose hardware, you could do better and more easily with silicon chips, after waiting just a few years for Moore’s Law to catch up.  Today, however, two things have spurred a revival of the idea: firstly, neural nets (renamed “deep learning,” and done with bigger networks and way more training data) are delivering spectacular, state-of-the-art results; and second, transistors have stopped shrinking, so it now makes more sense to think about the few orders-of-magnitude speed improvement that you can get from special-purpose hardware.  This would mean organizing computers kind-of, sort-of like the brain is organized, with (for example) memory integrated into the connections between the “neurons” (processing elements), rather than on a separate chip that’s connected to the processor by a bus.  On the other hand, Temam also stressed that computer architects shouldn’t slavishly copy the brain: instead, they should simply build the fastest hardware they can to implement the best available machine-learning algorithms, and they should rely on the machine-learning theorists to incorporate whatever broad lessons are to be gleaned from neuroscience (as they’ve done several times in the past).
  • Three separate sets of authors (Koppula, Lewko, and Waters; Canetti, Holmgren, Jain, and Vaikuntanathan; and Bitansky, Garg, Lin, Pass, and Telang) independently wrote papers that showed how to achieve “indistinguishability obfuscation” (i.o.) for Turing machines rather than for circuits.  For those not in the world of theoretical crypto, i.o. is a hot concept that basically means: obfuscating a program in such a way that no adversary can figure out anything about which program you started with, among all the possible programs that compute the same function in roughly the same amount of time.  (On the other hand, the adversary might be able to learn more than she could if merely given a black box for the function.  And that’s why this kind of obfuscation falls short of the “gold standard,” which was shown to be impossible in general in seminal work by Barak et al.)  Recent papers have shown how to achieve the weaker notion of i.o., but they first require converting your program to a Boolean circuit—something that’s absurdly inefficient in practice, and also has the theoretical drawback of producing an obfuscated program whose size grows, not merely with the size of the original, unobfuscated program, but also with the amount of time the original program is supposed to run for. So, the new work gets around that drawback, by cleverly obfuscating a program whose purpose is to compute the “next step function” of the original program, on data that’s itself encrypted. The talk was delivered in “tag team” format, with one representative from each group of authors speaking for 6-7 minutes. Surprisingly, it worked extremely well.
  • Laci Babai gave a truly epic hour-long Knuth Prize lecture, basically trying to summarize all of his work over the past 35 years (and related work by others), in 170 or so slides.  The talk had not a single word of filler: it was just pure beef, result after result, some of them well-known and seminal (e.g., MIP=NEXP, AM[2]=AM[k], AlmostNP=MA, group membership in NP, group non-membership in AM…) and others obscure little gems.  Boaz Barak commented that an entire semester-long course could be taught from the PowerPoint slides. Laci ended the talk by defining the Babai point, and then saying “having made my point, I’m done.”
  • Ambainis (yes, the same Ambainis), Filmus and Le Gall had a paper about the limitations of the techniques used to achieve all matrix multiplication algorithms from Coppersmith-Winograd (O(n2.3755)) onward, including those of Stothers 2010 (O(n2.3730)), Vassilevska Williams 2012 (O(n2.3728642)), and Le Gall 2014 (O(n2.3728639)).  Their basic conclusion—not surprising, but still nice to establish formally—is that applying more and more massive computer search to the current ideas can’t possibly get you below O(n2.308); new ideas will be needed to push further.
  • At the STOC business meeting, there was a long discussion about the proposal to turn STOC into a weeklong “theory festival,” with more plenary talks (including from other fields), possibly more parallel sessions, etc. There were lots of interesting arguments, but alas, I was too tired and jetlagged to remember what they were. (Anyone who does remember is welcome to chime in.)

There are many great things that I haven’t written about—for example, I haven’t even said a word about any of the three best paper talks!—but I’m out of energy right now.  Others are more than welcome to share other FCRC highlights in the comments section.

A query complexity breakthrough

June 17th, 2015

Update (June 26): See this just-released paper, which independently obtains a couple of the same results as the Ambainis et al. paper, but in a different way (using the original Göös et al. function, rather than modifications of it).

Lots of people have accused me of overusing the word “breakthrough” on this blog. So I ask them: what word should I use when a paper comes out that solves not one, not two, but three of the open problems I’ve cared about most for literally half of my life, since I was 17 years old?

Yesterday morning, Andris Ambainis, Kaspars Balodis, Aleksandrs Belovs, Troy Lee, Miklos Santha, and Juris Smotrovs posted a preprint to ECCC in which they give:

(1) A total Boolean function f with roughly a fourth-power separation between its deterministic and bounded-error quantum query complexities (i.e., with D(f)~Q(f)4). This refutes the conjecture, which people have been making since Beals et al.’s seminal work in 1998, that the biggest possible gap is quadratic.

(2) A total Boolean function f with a quadratic separation between its deterministic and randomized query complexities (with D(f)~R0(f)2). This refutes a conjecture of Saks and Wigderson from 1986, that the best possible gap is R0(f)~D(f)0.753 (from the recursive AND/OR tree), and shows that the known relation D(f)=O(R0(f)2) is close to tight.

(3) The first total Boolean function f with any asymptotic gap between its zero-error and bounded-error randomized query complexities (in particular, with R0(f)~R(f)3/2).

(There are also other new separations—for example, involving exact quantum query complexity and approximate degree as a real polynomial. But the above three are the most spectacular to me.)

In updates to this post (coming soon), I’ll try my best to explain to general readers what D(f), R(f), and so forth are (see here for the classic survey of these measures), and I’ll also discuss how Ambainis et al. designed the strange functions f that achieve the separations (though their paper already does a good job of explaining it). For now, I’ll just write the stuff that’s easier to write.

I’m at the Federated Computing Research Conference in Portland, Oregon right now, where yesterday I gave my STOC talk (click here for the PowerPoint slides) about the largest possible separations between R(f) and Q(f) for partial Boolean functions f. (That paper is also joint work with Andris Ambainis, who has his fingers in many pies, or his queries in many oracles, or something.) Anyway, when I did a practice run of my talk on Monday night, I commented that, of course, for total Boolean functions f (those not involving a promise), the largest known gap between R(f) and Q(f) is quadratic, and is achieved when f is the OR function because of Grover’s algorithm.

Then, Tuesday morning, an hour before I was to give my talk, I saw the Ambainis et al. bombshell, which made that comment obsolete. So, being notoriously bad at keeping my mouth shut, I mentioned to my audience that, while it was great that they came all the way to Portland to learn what was new in theoretical computer science, if they wanted real news in the subfield I was talking about, they could stop listening to me and check their laptops.

(Having said that, I have had a wonderful time at FCRC, and have learned lots of other interesting things—I can do another addendum to the post about FCRC highlights if people want me to.)

Anyway, within the tiny world of query complexity—i.e., the world where I cut my teeth and spent much of my career—the Ambainis et al. paper is sufficiently revolutionary that I feel the need to say what it doesn’t do.

First, the paper does not give a better-than-quadratic gap between R(f) and Q(f) (i.e., between bounded-error randomized and quantum query complexities). The quantum algorithms that compute their functions f are still “just” variants of the old standbys, Grover’s algorithm and amplitude amplification. What’s new is that the authors have found functions where you can get the quadratic, Grover speedup between R(f) and Q(f), while also getting asymptotic gaps between D(f) and R(f), and between R0(f) and R(f). So, putting it together, you get superquadratic gaps between D(f) and Q(f), and between R0(f) and Q(f). But it remains at least a plausible conjecture that R(f)=O(Q(f)2) for all total Boolean functions f—i.e., if you insist on a “fair comparison,” then the largest known quantum speedup for total Boolean functions remains the Grover one.

Second, as far as I can tell (I might be mistaken) (I’m not), the paper doesn’t give new separations involving certificate complexity or block sensitivity (e.g., between D(f) and bs(f)). So for example, it remains open whether D(f)=O(bs(f)2), and whether C(f)=O(bs(f)α) for some α<2. (Update: Avishay Tal, in the comments, informs me that the latter conjecture was falsified by Gilmer, Saks, and Srinivasan in 2013. Wow, I’m really out of it!)

In the end, achieving these separations didn’t require any sophisticated new mathematical machinery—just finding the right functions, something that could’ve been done back in 1998, had anyone been clever enough. So, where did these bizarre functions f come from? Ambainis et al. directly adapted them from a great recent communication complexity paper by Mika Göös, Toniann Pitassi, and Thomas Watson. But the Göös et al. paper itself could’ve been written much earlier. It’s yet another example of something I’ve seen again and again in this business, how there’s no substitute for just playing around with a bunch of examples.

The highest compliment one researcher can pay another is, “I wish I’d found that myself.” And I do, of course, but having missed it, I’m thrilled that at least I get to be alive for it and blog about it. Huge congratulations to the authors!

Addendum: What’s this about?

OK, so let’s say you have a Boolean function f:{0,1}n→{0,1}, mapping n input bits to 1 output bit. Some examples are the OR function, which outputs 1 if any of the n input bits are 1, and the MAJORITY function, which outputs 1 if the majority of them are.

Query complexity is the study of how many input bits you need to read in order to learn the value of the output bit. So for example, in evaluating the OR function, if you found a single input bit that was 1, you could stop right there: you’d know that the output was 1, without even needing to look at the remaining bits. In the worst case, however, if the input consisted of all 0s, you’d have to look at all of them before you could be totally sure the output was 0. So we say that the OR function has a deterministic query complexity of n.

In this game, we don’t care about any other resources used by an algorithm, like memory or running time: just how many bits of the input it looks at! There are many reasons why, but the simplest is that, unlike with memory or running time, for many functions we can actually figure out how many input bits need to be looked at, without needing to solve anything like P vs. NP. (But note that this can already be nontrivial! For algorithms can often cleverly avoid looking at all the bits, for example by looking at some and then deciding which ones to look at next based on which values they see.)

In general, given a deterministic algorithm A and an n-bit input string x, let DA,x (an integer from 0 to n) be the number of bits of x that A examines when you run it. Then let DA be the maximum of DA,x over all n-bit strings x. Then D(f), or the deterministic query complexity of f, is the minimum of DA, over all algorithms A that correctly evaluate f(x) on every input x.

For example, D(OR) and D(MAJORITY) are both n: in the worst case, you need to read everything. For a more interesting example, consider the 3-bit Boolean function

f(x,y,z) = (not(x) and y) or (x and z).

This function has D(f)=2, even though it depends on all 3 of the input bits. (Do you see why?) In general, even if f depends on n input bits, D(f) could be as small as log2n.

The bounded-error randomized query complexity, or R(f), is like D(f), except that now we allow the algorithm to make random choices of which input bit to query, and for each input x, the algorithm only needs to compute f(x) with probability 2/3. (Here the choice of 2/3 is arbitrary; if you wanted the right answer with some larger constant probability, say 99.9%, you could just repeat the algorithm a constant number of times and take a majority vote.) The zero-error randomized query complexity, or R0(f), is the variant where the algorithm is allowed to make random choices, but at the end of the day, needs to output the correct f(x) with probability 1.

To illustrate these concepts, consider the three-bit majority function, MAJ(x,y,z). We have D(MAJ)=3, since if a deterministic algorithm queried one bit and got a 0 and queried a second bit and got a 1 (as can happen), it would have no choice but to query the third bit. But for any possible setting of x, y, and z, if we choose which bits to query randomly, there’s at least a 1/3 chance that the first two queries will return either two 0s or two 1s—at which point we can stop, with no need to query the third bit. Hence R0(MAJ)≤(1/3)2+(2/3)3=8/3 (in fact it equals 8/3, although we haven’t quite shown that). Meanwhile, R(MAJ), as we defined it, is only 1, since if you just need a 2/3 probability of being correct, you can simply pick x, y, or z at random and output it!

The bounded-error quantum query complexity, or Q(f), is the minimum number of queries made by a quantum algorithm for f, which, again, has to output the right answer with probability at least 2/3 for every input x. Here a quantum algorithm makes a “query” by feeding a superposition of basis states, each of the form |i,a,w〉, to a “black box,” which maps each basis state to |i, a XOR xi, w〉, where i is the index of the input bit xi to be queried, a is a 1-qubit “answer register” into which xi is reversibly written, and w is a “workspace” that doesn’t participate in the query. In between two queries, the algorithm can apply any unitary transformation it wants to the superposition of |i,a,w〉’s, as long as it doesn’t depend on x. Finally, some designated qubit is measured to decide whether the algorithm accepts or rejects.

As an example, consider the 2-bit XOR function, XOR(x,y). We have D(XOR)=R0(XOR)=R(XOR)=2, since until you’ve queried both bits, you’ve learned nothing about their XOR. By contrast, Q(XOR)=1, because of the famous Deutsch-Jozsa algorithm.

It’s clear that

0 ≤ Q(f) ≤ R(f) ≤ R0(f) ≤ D(f) ≤ n,

since a quantum algorithm can simulate a randomized one and a randomized one can simulate a deterministic one.

A central question for the field, since these measures were studied in the 1980s or so, has been how far apart these measures can get from each other. If you allow partial Boolean functions—meaning that only some n-bit strings, not all of them, are “valid inputs” for which the algorithm needs to return a definite answer—then it’s easy to get enormous separations between any two of the measures (indeed, even bigger than exponential), as for example in my recent paper with Andris.

For total functions, by contrast, it’s been known for a long time that these measures can differ by at most polynomial factors:

D(f) = O(R(f)3) (Nisan)

D(f) = O(R0(f)2) (folklore, I think)

R0(f) = O(R2 log(n)) (Midrijanis)

D(f) = O(Q(f)6) (Beals et al. 1998)

OK, so what were the largest known gaps? For D versus R0 (as well as D versus R), the largest known gap since 1986 has come from the “recursive AND/OR tree”: that is, an OR of two ANDs of two ORs of two ANDs of … forming a complete binary tree of depth d, with the n=2d input variables comprising the leaves. For this function, we have D(f)=n, whereas Saks and Wigderson showed that R0(f)=Θ(n0.753) (and later, Santha showed that R(f)=Θ(n0.753) as well).

For D versus Q, the largest gap has been for the OR function: we have D(OR)=n (as mentioned earlier), but Q(OR)=Θ(√n) because of Grover’s algorithm. Finally, for R0 versus R, no asymptotic gap has been known for any total function. (This is a problem that I clearly remember working on back in 2000, when I was an undergrad. I even wrote a computer program, the Boolean Function Wizard, partly to search for separations between R0 versus R. Alas, while I did find one or two functions with separations, I was unable to conclude anything from them about asymptotics.)

So, how did Ambainis et al. achieve bigger gaps for each of these? I’ll try to have an explanation written by the time my flight from Portland to Boston has landed tonight. But if you can’t wait for that, or you prefer it straight from the horse’s mouth, read their paper!

Addendum 2: The Actual Functions

As I mentioned before, the starting point for everything Ambainis et al. do is a certain Boolean function g recently constructed by Göös, Pitassi, and Watson (henceforth GPW), for different purposes than the ones that concern Ambainis et al. We think of the inputs to g as divided into nm “cells,” which are arranged in a rectangular grid with m columns and n rows. Each cell contains a bit that’s either 0 or 1 (its “label), as well as a pointer to another cell (consisting of ~log2(nm) bits). The pointer can also be “null” (i.e., can point nowhere). We’ll imagine that a query of a cell gives you everything: the label and all the bits of the pointer. This could increase the query complexity of an algorithm, but only by a log(n) factor, which we won’t worry about.

Let X be a setting of all the labels and pointers in the grid. Then the question we ask about X is the following:

    Does there exist a “marked column”: that is, a column where all n of the labels are 1, and which has exactly one non-null pointer, which begins a chain of pointers of length m-1, which visits exactly one “0” cell in each column other than the marked column, and then terminates at a null pointer?

If such a marked column exists, then we set g(X)=1; otherwise we set g(X)=0. Crucially, notice that if a marked column exists, then it’s unique, since the chain of pointers “zeroes out” all m-1 of the other columns, and prevents them from being marked.

This g already leads to a new query complexity separation, one that refutes a strengthened form of the Saks-Wigderson conjecture. For it’s not hard to see that D(g)=Ω(mn): indeed, any deterministic algorithm must query almost all of the cells. A variant of this is proved in the paper, but the basic idea is that an adversary can answer all queries with giant fields of ‘1’ labels and null pointers—until a given column is almost completed, at which point the adversary fills in the last cell with a ‘0’ label and a pointer to the last ‘0’ cell that it filled in. The algorithm just can’t catch a break; it will need to fill in m-1 columns before it knows where the marked one is (if a marked column exists at all).

By contrast, it’s possible to show that, if n=m, then R(g) is about O(n4/3). I had an argument for R(g)=O((n+m)log(m)) in an earlier version of this post, but the argument was wrong; I thank Alexander Belov for catching the error. I’ll post the R(g)=O(n4/3) argument once I understand it.

To get the other separations—for example, total Boolean functions for which D~R02, D~Q4, R0~Q3, R0~R3/2, and R~approxdeg4—Ambainis et al. need to add various “enhancements” to the basic GPW function g defined above. There are three enhancements, which can either be added individually or combined, depending on one’s needs.

1. Instead of just a single marked column, we can define g(X) to be 1 if and only if there are k marked columns, which point to each other in a cycle, and which also point to a trail of m-k ‘0’ cells, showing that none of the other columns contain all ‘1’ cells. This can help a bounded-error randomized algorithm—which can quickly find one of the all-1 columns using random sampling—while not much helping a zero-error randomized algorithm.

2. Instead of a linear chain of pointers showing that all the non-marked columns contain a ‘0’ cell, for g(X) to be 1 we can demand a complete binary tree of pointers, originating at a marked column and fanning out to all the unmarked columns in only log(m) layers. This can substantially help a quantum algorithm, which can’t follow a pointer trail any faster than a classical algorithm can; but which, given a complete binary tree, can “fan out” and run Grover’s algorithm on all the leaves in only the square root of the number of queries that would be needed classically. Meanwhile, however, putting the pointers in a tree doesn’t much help deterministic or randomized algorithms.

3. In addition to pointers “fanning out” from a marked column to all of the unmarked columns, we can demand that in every unmarked column, some ‘0’ cell contains a back-pointer, which leads back to a marked column. These back-pointers can help a randomized or quantum algorithm find a marked column faster, while not much helping a deterministic algorithm.

Unless I’m mistaken, the situation is this:

With no enhancements, you can get D~R2 and something like D~R03/2 (although I still don’t understand how you get the latter with no enhancements; the paper mentions it without proof Andris has kindly supplied a proof here).

With only the cycle enhancement, you can get R0~R3/2.

With only the binary tree enhancement, you can get R~approxdeg4.

With only the back-pointer enhancement, you can get D~R02.

With the cycle enhancement and the binary-tree enhancement, you can get R0~Q3.

With the back-pointer enhancement and the binary-tree enhancement, you can get D~Q4.

It’s an interesting question whether there are separations that require both the cycle enhancement and the back-pointer enhancement; Ambainis et al. don’t give any examples.

And here’s another interesting question not mentioned in the paper. Using the binary-tree enhancement, Ambainis et al. achieve a fourth-power separation between bounded-error randomized query complexity and approximate degree as a real polynomial—i.e., quadratically better than any separation that was known before. Their proof of this involves cleverly constructing a low-degree polynomial by summing a bunch of low-degree polynomials derived from quantum algorithms (one for each possible marked row). As a result, their final, summed polynomial does not itself correspond to a quantum algorithm, meaning that they don’t get a fourth-power separation between R and Q (which would’ve been even more spectacular than what they do get). On the other hand, purely from the existence of a function with R~approxdeg4, we can deduce that that function has either

(i) a super-quadratic gap between R and Q (refuting my conjecture that the Grover speedup is the best possible quantum speedup for total Boolean functions), or
(ii) a quadratic gap between quantum query complexity and approximate degree—substantially improving over the gap found by Ambainis in 2003.

I conjecture that the truth is (ii); it would be great to have a proof or disproof of this.

97% environmentalist

June 7th, 2015

I decided to add my name to a petition by, as of this writing, 81 MIT faculty, calling on MIT to divest its endowment from fossil fuel companies.  (My co-signatories include Noam Chomsky, so I guess there’s something we agree about!)  There’s also a wider petition signed by nearly 3500 MIT students, faculty, and staff, mirroring similar petitions all over the world.

When the organizers asked me for a brief statement about why I signed, I sent them the following:

Signing this petition wasn’t an obvious choice for me, since I’m sensitive to the charge that divestment petitions are just meaningless sanctimony, a way for activists to feel morally pure without either making serious sacrifices or engaging the real complexities of an issue.  In the end, though, that kind of meta-level judgment can’t absolve us of the need to consider each petition on its merits: if we think of a previous crisis for civilization (say, in the late 1930s), then it seems obvious that even symbolic divestment gestures were better than nothing.  What made up my mind was reading the arguments pro and con, and seeing that the organizers of this petition had a clear-eyed understanding of what they were trying to accomplish and why: they know that divestment can’t directly drive down oil companies’ stock prices, but it can powerfully signal to the world a scientific consensus that, if global catastrophe is to be averted, most of the known fossil-fuel reserves need to be left in the ground, and that current valuations of oil, gas, and coal companies fail to reflect that reality.

For some recent prognoses of the climate situation, see (for example) this or this from Vox.  My own sense is that the threat has been systematically understated even by environmentalists, because of the human impulse to shoehorn all news into a hopeful narrative (“but there’s still time!  if we just buy locally-grown produce, everything can be OK!”).  Logically, there’s an obvious tension between the statements:

(a) there was already an urgent need to act decades ago, and

(b) having failed to act then, we can still feasibly avert a disaster now.

And indeed, (b) appears false to me.  We’re probably well into the era where, regardless of what we do or don’t do, some of us will live to see a climate dramatically different from the one in which human civilization developed for the past 10,000 years, at least as different as the last Ice Ages were.

And yet that fact still doesn’t relieve us of moral responsibility.  We can buy more time to prepare, hoping for technological advances in the interim; we can try to bend the curve of CO2 concentration away from the worst futures and toward the merely terrible ones.  Alas, even those steps will require political will that’s unprecedented outside of major wars.  For the capitalist free market (which I’m a big fan of) to work its magic, actual costs first need to get reflected in prices—which probably means massively taxing fossil fuels, to the point where it’s generally cheaper to leave them in the ground and switch to alternatives.  (Lest anyone call me a doctrinaire treehugger, I also support way less regulation of the nuclear industry, to drive down the cost of building the hundreds of new nuclear plants that we’ll probably need.)

These realities have a counterintuitive practical implication that I wish both sides understood better.  Namely, if you share my desperation and terror about this crisis, the urgent desire to do something, then limiting your personal carbon footprint should be very far from your main concern.  Like, it’s great if you can bike to work, and you should keep it up (fresh air and exercise and all).  But I’d say the anti-environmentalists are right that such voluntary steps are luxuries of the privileged, and will accordingly never add up to a hill of beans.  Let me go further: even to conceptualize this problem in terms of personal virtue and blame seems to me like a tragic mistake, one on which the environmentalists and their opponents colluded.  Given the choice, I’d much rather that the readers of this blog flew to all the faraway conferences they wanted, drove gas-guzzling minivans, ate steaks every night, and had ten kids, but then also took some steps that made serious political action to leave most remaining fossil fuels in the ground even ε more likely, ε closer to the middle of our Overton window.  I signed the MIT divestment petition because it seemed to me like such a step, admittedly with an emphasis on the ε.

Can blog posts nourish the soul? Scott A. (alas, not me) as existence proof

June 3rd, 2015

Reading the essays and speculative fiction of Scott Alexander, as they’ve grown in awesomeness even just within the past half-year, has for me been like witnessing the birth of a new Asimov.  (For more Alexandery goodness, check out Universal Love, Said the Cactus Person.)  That this nerd-bard, this spinner of stupid Internet memes into reflections on eternity, came to my attention by way of his brilliantly defending me, is almost immaterial at this point; I don’t think it plays any role in my continuing admiration for his work.  Whatever you do, just keep writing, other Scott A.

The End of Suffering?

June 1st, 2015

A computer science undergrad who reads this blog recently emailed me about an anxiety he’s been feeling connected to the Singularity—not that it will destroy all human life, but rather that it will make life suffering-free and therefore no longer worth living (more Brave New World than Terminator, one might say).

As he puts it:

This probably sounds silly, but I’ve been existentially troubled by certain science fiction predictions for about a year or two, most of them coming from the Ray Kurzweil/Singularity Institute types … What really bothers me is the idea of the “abolition of suffering” as some put it. I just don’t see the point. Getting rid of cancer, premature death, etc., that all sounds great. But death itself? All suffering? At what point do we just sit down and ask ourselves, why not put our brains in a jar, and just activate our pleasure receptors for all eternity? That seems to be the logical conclusion of that line of thinking. If we want to reduce the conscious feeling of pleasure to the release of dopamine in the brain, well, why not?

I guess what I think I’m worried about is having to make the choice to become a cyborg, or to upload my mind to a computer, to live forever, or to never suffer again. I don’t know how I’d answer, given the choice. I enjoy being human, and that includes my suffering. I really don’t want to live forever. I see that as a hedonic treadmill more than anything else. Crazy bioethicists like David Pearce, who want to genetically re-engineer all species on planet Earth to be herbivores, and literally abolish all suffering, just add fuel to my anxiety.

… Do you think we’re any closer to what Kurzweil (or Pearce) predicted (and by that I mean, will we see it in our lifetimes)? I want to stop worrying about these things, but something is preventing me from doing so. Thoughts about the far flung (or near) future are just intrusive for me. And it seems like everywhere I go I’m reminded of my impending fate. Ernst Jünger would encourage me to take up an attitude of amor fati, but I can’t see myself doing that. My father says I’m too young to worry about these things, and that the answer will be clear when I’ve actually lived my life. But I just don’t know. I want to stop caring, more than anything else. It’s gotten to a point where the thoughts keep me up at night.

I don’t know how many readers might have had similar anxieties, but in any case, I thought my reply might be of some interest to others, so with the questioner’s kind permission, I’m reproducing it below.

1. An end to suffering removing the meaning from life? As my grandmother might say, “we should only have such problems”! I believe, alas, that suffering will always be with us, even after a hypothetical technological singularity, because of basic Malthusian logic. I.e., no matter how many resources there are, population will expand exponentially to exploit them and make the resources scarce again, thereby causing fighting, deprivation, and suffering. What’s terrifying about Malthus’s logic is how fully general it is: it applies equally to tenure-track faculty positions, to any extraterrestrial life that might exist in our universe or in any other bounded universe, and to the distant post-Singularity future.

But if, by some miracle, we were able to overcome Malthus and eliminate all suffering, my own inclination would be to say “go for it”! I can easily imagine a life that was well worth living—filled with beauty, humor, play, love, sex, and mathematical and scientific discovery—even though it was devoid of any serious suffering. (We could debate whether the “ideal life” would include occasional setbacks, frustrations, etc., even while agreeing that at any rate, it should certainly be devoid of cancer, poverty, bullying, suicidal depression, and one’s Internet connection going down.)

2. If you want to worry about something, then rather than an end to suffering, I might humbly suggest worrying about a large increase in human suffering within our lifetimes. A few possible culprits: climate change, resurgent religious fundamentalism, large parts of the world running out of fresh water.

3. It’s fun to think about these questions from time to time, to use them to hone our moral intuitions—and I even agree with Scott Alexander that it’s worthwhile to have a small number of smart people think about them full-time for a living.  But I should tell you that, as I wrote in my post The Singularity Is Far, I don’t expect a Singularity in my lifetime or my grandchildrens’ lifetimes. Yes, technically, if there’s ever going to be a Singularity, then we’re 10 years closer to it now than we were 10 years ago, but it could still be one hell of a long way away! And yes, I expect that technology will continue to change in my lifetime in amazing ways—not as much as it changed in my grandparents’ lifetimes, probably, but still by a lot—but how to put this? I’m willing to bet any amount of money that when I die, people’s shit will still stink.