## Pipin’-hot learnin’ theorems

I’ve decided to “launch” my latest paper on the blogosphere even before posting it to quant-ph — so that you, my loyal readers, can be the very first to lay eyes on it. (True fact: as I was writing, I didn’t look once at the screen.) Comments more than welcome.

The Learnability of Quantum States [PS] [PDF]
Scott Aaronson

Abstract: Let me warn you up-front, this is one big-ass mother of a paper. It’s got learning. It’s got quantum. It’s got philosophy. It’s got weird complexity classes (naturally). It’s even got experimental physics applications (don’t worry, I showered afterward). And dude. When I say “experimental,” I’m not talking wormholes or anthropic postselection. I’m talking stuff that you, the quantum state tomographer, can try in your lab today. And no, this is not the real abstract.

Update (8/20): I’ve posted a slightly revised version, mostly in response to the comments I received here.

### 37 Responses to “Pipin’-hot learnin’ theorems”

1. Anonymous Says:

It’s so hot that it made my browser crash!

rrtucci

2. Scott Says:

I warned you, man…

Seriously, were you launching Adobe Acrobat? That memory hog has crashed my browser more times than I can remember. I recommend Ghostview.

3. Anonymous Says:

I think your result bolsters my contention that quantum computers can do classical Bayesian net calculations very effectively. (see http://arxiv.org/abs/quant-ph/0004028 if interested)

rrtucci

4. Scott Says:

I don’t understand the connection. I’m showing that you can do something with few classical data points (thereby placing a limit on quantum protocols).

5. Andy D Says:

On first skim, I like the modularity of the paper and the way quantum results are packaged to fit nicely into the existing learning machinery.

To gloss–it looks like this paper is based on a perspectival switch with respect to Nayak’s result–Nayak said ‘shucks, only n effective dimensions to use for coding’, you say ‘huzzah, only n effective dimensions of unpredictability’. Is this a fair statement?

An observation–most complexity theory results start from an algorithm and lead to an impossibility result; e.g., ‘we can implement public-key crypto using circuit class C, ergo we (probably) can’t learn C as a concept class’. But your paper gives a (info-theoretic) learning algorithm as a corollary to Nayak’s impossibility result for coding.

6. Greg Kuperberg Says:

I won’t dispute your choice of notation BQP/poly in your own paper. Of course you mean it to be Merlin-style advice that need only fulfill the “B” promise when it is good advice. However, since I am told (by both Lance Fortnow and Eric Allender) that the Karp-Lipton interpretation of /poly is standard, you might put in a comment that you mean it to be Merlin-style advice.

(With the wrong definition I am not sure
that BQP/qpoly should be in QMA/poly.)

7. Scott Says:

To gloss–it looks like this paper is based on a perspectival switch with respect to Nayak’s result–Nayak said ‘shucks, only n effective dimensions to use for coding’, you say ‘huzzah, only n effective dimensions of unpredictability’. Is this a fair statement?

Fair? It’s friggin’ insightful, is what it is. I could not have written a better one-sentence summary.

8. Scott Says:

An observation–most complexity theory results start from an algorithm and lead to an impossibility result; e.g., ‘we can implement public-key crypto using circuit class C, ergo we (probably) can’t learn C as a concept class’. But your paper gives a (info-theoretic) learning algorithm as a corollary to Nayak’s impossibility result for coding.

Yeah, I was struck by that too. But there have been other examples of algorithms from impossibility results — most famously, the Linial-Mansour-Nisan algorithm for learning AC0.

PS. Andy, I added you to my linklog. Like it or not, the pressure’s now on you to write more of your heartbreakingly-clear complexity theory exegeses… 🙂

9. John SIdles Says:

That was a very enjoyable paper. I would enjoy seeing more discussion of what the results imply for the simulation of quantum systems.

E.g., doesn’t the result formally imply that simulation of quantum dynamical systems is possible with polynomial space resources?

This is a problem of great practical importance. See, e.g., the well-known Heidelberg quantum chemistry web site

in particular their recent Physics Report “The multiconfiguration
time-dependent Hartree (MCTDH) method …
“.

Specifically, your result seems to imply that any quantum chemistry calculation can be carried out in polynomial space and time resources, if one is smart enuf to choose the right basis states for a Dirac-Frenkel variational calculation (eq. 4 of the above).

If nothing else, this is a chance to cite Dirac’s 1930 article and Frenkel’s 1934 article, on this topic. This would be cool.

From this point-of-view, you’ve proved a strong result about the accuracy achievable by Dirac-Frenkel variational calculations.

10. Anonymous Says:

“I don’t understand the connection. I’m showing that you can do something with few classical data points (thereby placing a limit on quantum protocols).”

One wants to apply your algorithm to an interesting quantum state. What I propose in my paper is a way to tailor make a quantum state |Bayes> that contains within it the solution (a probability) of a classical bayesian net calculation. Your result sheds light on how to extract that solution from |Bayes> in an efficient way.

(Note also that the probabilities derived from a classical Bayesian net calculation need not be perfectly precise; they are used to make decisions, to do AI, and most of the time one can make decisions based on probabilities that are not perfectly precise. Your method would give probabilities about |Bayes> that are not perfectly precise. So there is a match there.)

rrtucci

11. Dave Bacon Says:

It’s even got experimental physics applications (don’t worry, I showered afterward).

Complexity theorists are so cute when they do experimental physics.

Actually, there is something which bothers me about using this, as a physicist. You select your measurements from a finite set of measurements according to any distribution. Cool. But in the real world(TM) this isn’t possible. You may have a set of finite possible measurements you may want to measure, but hell will freeze over before you are able to actually make these precise measurements. In the real world, each little measurement will be smeared over a continuous set of noisy measurements. And it seems that this might play havok with the theorem. ??? I don’t see this as one of your objections.

12. Scott Says:

In the real world, each little measurement will be smeared over a continuous set of noisy measurements. And it seems that this might play havok with the theorem. ???

Calm down, my physicist friend. A noisy measurement is still just a POVM, no?

13. Scott Says:

E.g., doesn’t the result formally imply that simulation of quantum dynamical systems is possible with polynomial space resources?

Specifically, your result seems to imply that any quantum chemistry calculation can be carried out in polynomial space and time resources, if one is smart enuf to choose the right basis states for a Dirac-Frenkel variational calculation

1. We’ve known since Bernstein-Vazirani that “quantum dynamical systems can be simulated in polynomial space” — i.e., BQP is contained in PSPACE.

2. My result says nothing about learning with polynomial computation time, only about learning with a polynomial number of measurements.

3. It also says nothing about simulating dynamical systems, only about learning a static quantum state.

So while it’s conceivable that there’s a connection here, I’m not seeing it.

14. John Sidles Says:

The problem with thought-provoking papers is, they’re so darn thought-provoking.

Scott, it seems to me (on physical grounds) that in a well-ordered universe, your phrase “learning with a polynomial number of [well-chosen] measurements” would be rigorously equivalent to “simulating with a polynomial number of [well-chosen] time steps”.

As you say, in both cases, “well-chosen” is the tough part! So maybe they are the same problem?

In any case, your article has got me thinking about physical models that might rigorously embody a “learning==simulation” equivalence, using reasoning similar to physical models that embody an “amplifier==measuring device” equivalence.

E.g., Carlton Caves linear amplifier noise limit is readily shown to be equivalent to the standard quantum limit (SQL) for test masses.

Remark: no-one talks about beating Caves’ limit, and yet everyone talks about beating the SQL … go figure?

In any case, such equivalences are powerful cognitive tools (in the rare cases that they can be proved).

15. John Sidles Says:

Aha! After a second pass through Scott’s article, here’s a suggested answer to Scott’s question (the answer was inspired by Gene Wilder’s Fundamental Principle of Science: Change the plus to minus, and the minus to plus!)

Question: (from page 3) “[An unanswered question is] are there are nontrivial special cases of the quantum learning problem that can be solved, not only with a linear number of measurements, but also with a polynomial amount of computation.”

Suggested Answer: Recently developed algorithms for efficient quantum simulation via model order reduction are readily adapted to solve nontrivial special cases of the quantum learning problem. Conversely, the quantum learning theorem provides mathematical insight as to why these efficient algorithms exist.

Example: in a system of $n$ Pauli spins, let the set of all one-spin and two-spin operators comprise the training set ${E_i}$, with ${p_i}$ their specified expectation values. Construct a `spin-dust’ Hamiltonian, with spin-spin couplings adjusted to yield the specified ${p_i}$. Run a quantum trajectory simulation, yielding a state trajectory $|psi(t)>$. Then the required hypothesis state is the time average of $sigma = |psi(t)>Refinement: In general, computing the wave function$|psi(t)>$requires exponential time and space resources. Therefore, introduce noise into the system, then convert the noise into an equivalent measurement process that quenches higher-order correlations, and finally, use the Dirac-Frenkel variational equations to project$|psi(t)>$onto a lower-dimensional product-sum manifold having a Kahler geometry. The resulting simulation-based computation of$sigma$require only polynomial space and time resources to compute and store$|psi(t)>\$ (in a representation compressed by the geometric projection).

Scott, the above algorithm is simply what we are already doing in our quantum microscopy simulations. We use these simulations to design our samples and measuring devices. And we discovered in numerical experiments that the quantum fidelity of the projection onto a lower-dimensional (Kahler) manifold is remarkably high.

What greatly puzzled us–until the Quantum Learning Theorem renormalized our expectations–was why this simulation algorithm worked so well. And of course, the quantum chemistry people have been similarly puzzled about why their (very similar) algorithms work so well.

So for us quantum simulationists, the physical implication of the Quantum Learning Theorem is very important (if this implication can be established rigorously): quantum simulation with polynomial-resources is generically feasible for any quantum system whose physics can be captured by a polynomial number of training measurements.

Scott, I apologise if I have misunderstood your theorem! And does it follow from your theorem, that a quantum computation process cannot be described by a polynomial number of training measurements? This might almost be used as the definition of a quantum computation process. Just as a random number can be defined as a number that has no algorithmic compression.

16. Dave Bacon Says:

Calm down, my physicist friend. A noisy measurement is still just a POVM, no?

Ah, I am calm now.

Any guesses (or conjectures as you might say) on how this scales for multioutcome measurements (I’m pretty certain I asked you this when you visited and, since I don’t have a glowing feeling when I revisit this question in my head, I have a feeling that you dodged the question. 🙂 )

17. Scott Says:

Any guesses (or conjectures as you might say) on how this scales for multioutcome measurements

Specially for you, my friend, I’ve just spent five minutes reducing the k-outcome case to the 2-outcome one.

Let’s say you want to apply the k-outcome POVM {E_{i1},…,E_{ik}}. Then here’s what you can do: first choose j uniformly at random from {1,…,k}. Then pretend you’re only applying {E_{ij},I-E_{ij}} (i.e., ignore the other k-1 outcomes).

How many training measurements you need will depend on what your success criterion is. With high probability over i, do you want to be able to estimate Tr(E_{ij}*rho) for all k E_{ij}’s simultaneously? In that case, it will suffice to replace every occurrence of epsilon in my upper bounds by epsilon/k.

I’m almost certain you could improve this bound by analyzing the k-outcome case directly — any ideas?

Another question is what happens if the measurements are noisy, but the noise is adversarial rather than probabilistic.

I might have to add some objections to the list, and you to the acknowledgments…

18. Andy D Says:

To show my gratitude (and worthiness), a new post awaits.

19. Johan Richter Says:

It is nice to see that you have time do some real work despite your blogging. It is also nice that I almost sort of semi-understand the paper.

20. Greg Kuperberg Says:

Here is a question motivated by some of the introduction to the paper. The paper asks how much advice is worth; it points out that it is not known whether BQP/qpoly is different from BQP/poly. (Or I would call it BQP/mpoly but never mind that at the moment.) But can it be settled that quantum polynomial advice is not better than deterministic advice if you have a lot of computational power? For example, does BQR/qpoly = R/poly? Here BQR is what a QTM can compute without limits on resources.

21. Scott Says:

For example, does BQR/qpoly = R/poly?

Yes. My BQP/qpoly in PP/poly result immediately implies that if you scale it up to R.

22. Greg Kuperberg Says:

Oh, right, I forgot.

23. Greg Kuperberg Says:

Right, what I was going to say was, if you proved that quantum advice is better than classical advice, it would follow that BQP != PP. You would be the Perelman of complexity theory. (Or maybe almost, since PP is bigger than NP.)

So your statement on page 6, “Perhaps not surprisingly, the answer is unknown” is misleading. It is demonstrably unsurprising :-).

Unless you meant that no one knows a classical oracle separation. Then it may be fair to say that it is merely “perhaps not surprising”. But it should be explained that, unlike the result about /qlog, the question concerning /qpoly is 1000 times easier in the presence of an oracle.

24. chris Says:

I think I may need to lose some weight, as I was rather shattered by Definition 2.1.

25. Scott Says:

🙂

26. John Sidles Says:

On Lance Fortnow’s weblog, specifically the August 17 problem challenge by Claire Kenyon, I point out that there is seemingly a Bayesian informatic parallel between Claire’s “typical” random number distributions and Aaron’s “interesting” quantum states.

Namely, knowing that a quantum state is “interesting” seems (to me) to represent the same (large) amount of exploitable Bayesian information as knowing that a random number has been selected from a “typical” distribution.

27. anonymous lurker Says:

“I’ll hold it down to one comment per week … with links instead of prose!”

John Sidles, didn’t you say this not too long ago?

28. chris Says:

For anyone else as clueless and curious about learnability as I am, I found http://www.autonlab.org/tutorials/vcdim.html

29. Maxim Says:

Scott, you may want to take a look at Michael Keyl’s paper on a large-deviations approach to quantum state estimation:

http://arxiv.org/abs/quant-ph/0412053
Quantum state estimation and large deviations
Authors: M. Keyl
Reviews in Mathematical Physics, Vol. 18, No. 1 (2006) 19-60

Abstract: In this paper we propose a method to estimate the density matrix rho of a d-level quantum system by measurements on the N-fold system. The scheme is based on covariant observables and representation theory of unitary groups and it extends previous results concerning the estimation of the spectrum of rho. We show that it is consistent (i.e. the original input state rho is recovered with certainty if N to infty), analyze its large deviation behavior, and calculate explicitly the corresponding rate function which describes the exponential decrease of error probabilities in the limit N to infty. Finally we discuss the question whether the proposed scheme provides the fastest possible decay of error probabilities.

30. Drew Arrowood Says:

Scott,

I think that you mean to say that computational learning theory is an answer to Nelson Goodman’s riddle of induction, not Hume’s problem of induction, see this for some indication of what isn’t addressed. Also, philosophers tend to talk about ravens, rather than crows.

31. Scott Says:

Scott, you may want to take a look at Michael Keyl’s paper on a large-deviations approach to quantum state estimation:

Thanks for the reference! I might cite this.

A quick glance shows that, as in almost every tomography paper I’ve seen, Keyl takes trace distance as the metric on states, which immediately implies that he’s going to need exponentially many measurements to get a decent approximation. (He also considers the case where you just want to learn the eigenvalue spectrum — but there, again, you’re going to need exponentially many measurements.)

32. Scott Says:

Hi Drew,

I think that you mean to say that computational learning theory is an answer to Nelson Goodman’s riddle of induction, not Hume’s problem of induction, see this for some indication of what isn’t addressed.

I looked at that web page, and Hume’s problem as described there is essentially the problem that I believe computational learning theory addresses.

I also read more about Goodman’s “grue” problem, and to be honest, I don’t understand why it’s not just a dramatic restatement of Hume’s original problem. Maybe you can explain it to me?

Also, philosophers tend to talk about ravens, rather than crows.

“All crows are black” gets 1,000 Google hits, while “All ravens are black” gets 12,000. So it seems that while either is acceptable, the latter is indeed more standard. Thanks! 🙂

33. Drew Arrowood Says:

Hume was writing in the Eighteenth Century, before there was a clear distinction between psychology (the description of how we reason) and logic (normative claims about how rational agents ought to reason), a distinction that probably isn’t well understood until Frege, and still isn’t well understood by “Continental” philosophers.

“Induction” goes back to Aristotle, in the _Prior Analytics_. The problem for Hume consists of two parts, to (1) describe when in fact we do it (in order to show that certain generalizations aren’t already present in the mind) and (2) justify the status of the judgments that come into the mind by this process — are they true? further, are they adequate?

(1) and (2) are going to have the same solution, iff the agents in question are themselves very special and they are in a very special kind of world — namely, they have enough cognitive resources to compute the generalizations, and the world that they are in requires of them no more than the generalizations that they do compute from it. We are almost certain that both things are false about our world.

I think that you are talking about part of Hume’s second problem (the truth part).

What Goodman (along with Carnap and the rest of the logical empiricists) did was talk about propositions, rather than ideas and impressions in the mind, with regard to problems of confirmation. This allows us to begin to speak of a grammar relating propositions in a truth functional way. This is a distinction you probably take for granted every day, and most of the time is a good one to make.

As to crows versus ravens — the reason why anyone would care is this — if you are going to write down a matrix of eigenvalues along the diagonal as part of the factorization for some other matrix, you are going to use a capital lambda, aren’t you? If you don’t, people will wonder what special point you are trying to make. For philosophers, words are often notation.

34. Scott Says:

Hi Drew,

I changed crows to ravens and acknowledged you.

As for the distinction between how we reason and how rational beings ought to reason: granted there will always be people who don’t understand it (along with plenty of other elementary distinctions). But do you really think that Hume wouldn’t have understood such a thing as well as you and I do, even if it’s not what he chose to emphasize?

I guess Goodman deserves roughly as much credit for explicating Hume as Cao and Zhu deserve for explicating Perelman… 🙂

35. Bram Cohen Says:

My apologies if I ask about something which is already made clear in that paper – I didn’t read it thoroughly, and didn’t undestand the parts which I did all that well. Anyhow, here’s my question.

Is it correct that there’s a procedure which is done in the real world called ‘quantum tomography’, which involves doing a large number of measurements, and you’re essentially suggesting that it’s possible to reduce the number of measurements necessary from exponential to some smaller asymptotic? If so, has there to date been an experiment which was done which should have been able to use fewer measurements given your new techniques? If there hasn’t been, what would be the crossover point of number of measurements where your techniques would be a win?

It appears that you suggest that your numbers could be improved on a lot by future research, which would make any numbers given become obsolete fairly quickly, but it’s still good to get some real world grounding, especially in quantum computation, since the field has been plagued by struggling to suggest real experiments since day 1.

36. Scott Says:

Hi Bram,

Is it correct that there’s a procedure which is done in the real world called ‘quantum tomography’, which involves doing a large number of measurements, and you’re essentially suggesting that it’s possible to reduce the number of measurements necessary from exponential to some smaller asymptotic?

Yes — I’m saying you can reduce the number of measurements from exponential to linear, provided you change the goal from predicting the outcomes of all measurements to predicting the outcomes of most of them.

If so, has there to date been an experiment which was done which should have been able to use fewer measurements given your new techniques? If there hasn’t been, what would be the crossover point of number of measurements where your techniques would be a win?

That depends on what your goal is — e.g., what percent of measurements do you want to get right? In the open problem section, I conjecture that with “reasonable” error parameters, the learning approach might already be a win with 3 or 4 qubits — in which case, yes, there have been plenty of experiments that might have benefitted from this approach.

But as you say, it depends on how far my upper bounds can be improved, as well as on what the constant factors are. The most direct way to find out (and the only way to convince experimentalists! 🙂 ) will be to run a numerical simulation, which I’m hoping to do in the near future.

(Note: Complicating the issue further is that, in many experiments that have done full tomography, the goal was just to confirm the presence of entanglement. If that’s the goal, then what you ought to be doing is neither full tomography nor pretty-good tomography — you should just be looking for an “entanglement witness” like a Bell inequality violation.)

37. Drew Arrowood Says:

“Would Hume have used our distinction” is a hermeneutical question. Consider the following statement of Hume’s:

“The hearing of an articulate voice and rational discourse in the dark assures us of the presence of some person. Why? Because these are the effects of the human make and fabric, and closely connected with it. “

Can you imagine NOT being able to imagine talking robots? When it comes to making sense across cultures and times of modal concepts, caution should be indicated. Of course, what we are dealing with here IS a modal concept.

Let me try to demonstrate the distinction between the normative and the descriptive part of Hume’s problem — and why our attempts at a solution don’t begin to look like his.

Your result is conditional — if we have these observables, then we can determine the state of the world. That’s the Goodman side of the problem — the logical preconditions for the epistemic achievement of learning the quantum state.

David Bacon’s posting on this thread indicates that we may not “have” these observables. Though maximality principles (and, I suppose, Gleason’s Theorem) tell us we should have all of the observables of your (quadratic) form, we know that (1) physicists do exclude certain observables due to superselection rules and (2) computer scientists want to exclude certain observables, because those observables would let us solve the halting problem. The question of “what observables do exist” is somewhat like “what predicates are projectible” for Goodman — not a condition within his confirmation theory, but a precondition for that theory being able to do its work. The confirmation theory of Goodman (like Wittgenstein’s, as Kripke points out, in his book on Wittgenstein) answers this question by pointing back to human subjectivity — some predicates are made up by us, and others are natural kinds, and the first sort aren’t projectible. [We typically don’t answer our questions about observables that way, though — with one major exception.]

The achievement of Goodman’s is precisely what makes your enterprise possible — splitting off one question from the other, showing that we can prove a conditional without having to accept the truth of the protasis. In the intervening centuries up until the 20th, Humean sorts of solution were tried. The last one (and the one most recognizable to Americans) was that Charles Peirce, who gave an evolutionary explanation for both parts of the problem at the same time (which almost no one thinks succeeded) — and it is only an explanation of that form that Hume and his contemporaries would have considered satisfactory.

The final piece of the puzzle is that a solution to Hume’s problem of induction must explain how we come to know each and every generalization we know. Now, I don’t mean simply psychology in the sense of what goes on in psychology departments. Any claim of the form: “All cats are either dead or alive”, if it is true, and I know it through confirmatory instances in the world, must have determinate cats (in an eigenstate of the eigenfunction I’m measuring) in my story — must tell me the necessary and sufficient conditions of moving from a state description to a claim about a property of a cat. I know you don’t want to claim that you have done that — you are better at dancing around “The Beast” than anyone.

Hume was a diplomat and a historian and an economist and a philosopher … the world was a lot smaller in those days. His problems were too big for him, and he couldn’t solve them. Presumably, at some point in history, we might be able to add up all of the piecemeal solutions in some way and say that then we had solved Hume’s problem.