## Quantum Computing Since Democritus Lecture 15: Learning

Lecture is here.

This week I explain Valiant’s PAC-learning model (previously covered in GITCS Lectures 19, 20, 21), and also — in response to a question from the floor — take a swipe at Bayesian fundamentalism. When you only know one formalism to describe some phenomenon (in this case, that of choosing hypotheses to fit data), it’s easy to talk yourself into believing that formalism is the Truth: to paraphrase Caliph Omar, “if it agrees with Bayesianism, it is superfluous; if it disagrees, it is heresy.” The antidote is to learn other formalisms. Enter computational learning theory: an account of learning that’s clear, mathematically rigorous, useful, nontrivial, and completely different from the Bayesian account (though of course they have points of contact). The key idea is to jettison the notoriously-troublesome notion of a *prior*, replacing it by a *concept class* (about which one makes no probabilistic assumptions), as well as a probability distribution over sample data rather than hypotheses.
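The PAC-learning setup described above can be made concrete with a toy example. The following sketch (my own illustration, with invented names and parameters, not code from the lecture) uses the concept class of thresholds on [0,1]: Nature fixes an unknown target concept and a distribution over sample data, the learner sees i.i.d. labeled samples, and outputs *any* hypothesis consistent with them; no prior over concepts is ever assumed.

```python
import random

# Toy PAC-learning sketch: the concept class is thresholds on [0,1],
# c_t(x) = 1 iff x >= t. Nature fixes an unknown target threshold and a
# sample distribution D; the learner sees m i.i.d. labeled samples and
# outputs ANY consistent hypothesis.

random.seed(0)
target = 0.37                      # unknown to the learner

def sample(m):
    xs = [random.random() for _ in range(m)]   # D = uniform on [0,1]
    return [(x, x >= target) for x in xs]

def learn(data):
    # any threshold between the largest negative example and the
    # smallest positive example is consistent with the data
    lo = max([x for x, y in data if not y], default=0.0)
    hi = min([x for x, y in data if y], default=1.0)
    return (lo + hi) / 2

def true_error(h, trials=100000):
    wrong = sum(1 for _ in range(trials)
                if ((x := random.random()) >= h) != (x >= target))
    return wrong / trials

h = learn(sample(1000))
print(true_error(h))   # small with high probability over the sample
```

With 1000 samples the consistent region around the target is tiny, so the learned threshold generalizes well, which is exactly the PAC guarantee for this (very easy) concept class.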

Incidentally, I’d say the same thing about complexity theory. If you think (for example) that Turing machines are the only way to reason about computational efficiency, then you’re overdue for a heaping helping of communication complexity, circuit complexity, query complexity, algebraic complexity…

Ah yes, complexity. This week I was at the Conference on Computational Complexity at the beautiful University of Maryland in College Park: home of the Terrapins, as one is reminded by signs placed roughly every three inches. I heard some great talks (ask in the comments section if you want details), gave two talks myself, and during the business meeting, was elected to the CCC Steering Committee. This being a complexity conference, my declared campaign motto was *“No We Can’t!”* It was inspiring to see how this simple yet hopeful motto united our community: from derandomization to circuit lower bounds, from quantum computing to proof complexity, we might have different backgrounds but we all worry about shrinking grant sizes and the rising costs of conference registration; we all face common challenges to which we want to prove that no solutions exist. Rest assured, I will treat my duties as a steering committee member (mostly helping to select PC chairs, who in turn select the program committees who select the conference papers) with the awesome gravity they deserve.

Comment #1 June 26th, 2008 at 4:09 pm

Sorry, my post will not be about the actual text, but I wonder about the truth behind the saying which you attributed to Caliph Omar, because I have heard the same sentence attributed to many different people (the Wikipedia entry about the Library of Alexandria, for example, tells a different story).

Comment #2 June 26th, 2008 at 4:27 pm

Ali, it’s probably an example of Stigler’s law.

Comment #3 June 26th, 2008 at 4:37 pm

Ali, I didn’t know the attribution for that quote either, so I googled something like “library of alexandria koran superfluous heresy,” and most of the hits (including from “reliable-seeming” sources) referenced Omar. Anyone who knows the history care to enlighten us?

Comment #4 June 26th, 2008 at 5:46 pm

“No We Can’t!” reminds me of this great piece of writing:

http://hem.bredband.net/arenamontanus/Mage/mathemagick.html

Comment #5 June 26th, 2008 at 8:11 pm

“Enter computational learning theory: an account of learning that’s clear, mathematically rigorous, useful, nontrivial, and completely different from the Bayesian account (though of course they have points of contact). The key idea is to jettison the notoriously-troublesome notion of a prior, replacing it by a concept class (about which one makes no probabilistic assumptions), as well as a probability distribution over sample data rather than hypotheses.”

I don’t think you are jettisoning anything, except the word Bayesian. You are just restricting yourself to a special case, it seems to me. Are you implying that Bayesian networks are not rigorous?

Comment #6 June 26th, 2008 at 8:41 pm

He’s jettisoning (iiuc) the requirement that one knows probability distributions of hypotheses. If you do know them, the problem becomes a special case.

Comment #7 June 27th, 2008 at 4:44 am

Koray: That’s exactly right. Instead of saying “until I know otherwise, I’ll assume the universe will do thing i to me with probability p_i,” CS people are wont to say: “within this space of possibilities, I’ll assume the universe will do the very worst thing to me that it possibly can, and I’d like to do OK regardless.” I.e., take Murphy’s Law as your epistemic axiom, and use the universal quantifier rather than the random quantifier over possible universes. Obviously you need to make further assumptions to get anywhere — but the existence of “assumption tradeoffs” between different learning approaches is exactly my point!

Comment #8 June 27th, 2008 at 12:06 pm

I deeply agree with Scott: “‘within this space of possibilities, I’ll assume the universe will do the very worst thing to me that it possibly can, and I’d like to do OK regardless.’ I.e., take Murphy’s Law as your epistemic axiom, and use the universal quantifier rather than the random quantifier over possible universes.”

That was essentially the basis of the philosophy of Descartes, which was not only based on doubt rather than faith, but assumed that an evil entity might be inserting false sense impressions and false memories into his mind.

This is also the basis of the Mathematical Disinformation Theory of Philip V. Fellman et al, which draws from Ulam’s analysis of “20 questions with N lies” (itself leading to Golay error-correcting codes for the optimal solution when N=3), and the literature of military and Intelligence strategies and tactics.

In the corporate world, you must assume that every Annual Report is a maximally misleading mix of verifiable truths and lies that are computationally expensive to falsify. There is an optimal level of disinformation in an annual report: too few lies and competitors can infer your cost structure and undercut you in the market; too many lies and you lose market share, perhaps completely (e.g., Enron).

This is a deeper form of Murphy’s Law: a quantum Murphy’s Law, enforced by a player who wants to exhaust your computational assets and force you into sub-optimal play.

Comment #9 June 27th, 2008 at 2:05 pm

Jonathan Vos Post’s analysis can be read as an informatic assault on the libertarian community’s moral preference for free-market economics.

Long version: In a market economy, realizing the twin moral objectives of equality of opportunity and equality of citizenship requires equality of access both to information *and* to information-processing resources.

Short version: Stallman is right … and modern information theory explains why! 🙂

Comment #10 June 27th, 2008 at 2:15 pm

Interesting. In statistical decision theory, there’s a notion of an admissible decision rule, which is one that is not dominated by another rule at every possible outcome. Roughly, admissible rules are always Bayesian (you have to allow improper priors), but you don’t have to use Bayesian reasoning to come up with one. The most popular alternatives are minimax rules, which you can find by assuming that you are playing a two-person zero-sum game with Nature, who is out to get you. I wonder if the two are related. The problem with minimax rules is that they are hard to compute exactly, so maybe statistical learning theory is a computational approximation?
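The minimax/Bayes connection the comment wonders about can be illustrated numerically. Below is a toy decision problem (my own sketch, with an invented loss matrix): the minimax randomized rule equalizes the risks across states of nature, and it coincides with a Bayes rule against a “least favorable” prior, which is the standard bridge between the two viewpoints.

```python
# Two states of nature, two actions, asymmetric loss. A randomized rule
# plays action 1 with probability p. The minimax rule equalizes the two
# risks; against the least-favorable prior, it is also a Bayes rule.

LOSS = [[0, 1],   # state 0: action 0 costs 0, action 1 costs 1
        [4, 0]]   # state 1: action 0 costs 4, action 1 costs 0

def risks(p):
    r0 = (1 - p) * LOSS[0][0] + p * LOSS[0][1]   # risk in state 0
    r1 = (1 - p) * LOSS[1][0] + p * LOSS[1][1]   # risk in state 1
    return r0, r1

# grid-search the minimax randomized rule
grid = [i / 10000 for i in range(10001)]
p_star = min(grid, key=lambda p: max(risks(p)))
print(p_star, max(risks(p_star)))   # p = 0.8, worst-case risk 0.8

# the least favorable prior puts weight 1/5 on state 1: against it,
# EVERY p has the same Bayes risk, so p_star is (one) Bayes rule for it
pi = 1 / 5
bayes_risk = lambda p: (1 - pi) * risks(p)[0] + pi * risks(p)[1]
print(bayes_risk(0.0), bayes_risk(1.0))   # equal
```

For larger problems this equalization is exactly why minimax rules are hard to compute exactly, as the comment notes: finding the least-favorable prior is itself an optimization over distributions.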

Comment #11 June 27th, 2008 at 5:44 pm

What if the drawings from D aren’t independent?

Comment #12 June 27th, 2008 at 6:36 pm

Johan: Excellent question! It’s easy to think of dependencies between the sample points for which you’re hosed (e.g., if the data consists of (x,x,x,…) for a randomly chosen x). Indeed, even k-wise independence is easily seen to be insufficient, for any fixed k. On the other hand, this paper of Aldous and my adviser shows that a kind of PAC-learning is still possible if the sample points are drawn according to a Markov chain. I’m sure there are other results on PAC-learning with various kinds of dependencies among the sample points; anyone want to point us to some?
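The failure mode in the (x,x,x,…) example above is easy to simulate. The sketch below (my own toy code, using an invented threshold concept class) shows that when the “sample” is one random point repeated many times, a consistent hypothesis can have terrible true error no matter how large m is, while i.i.d. samples of the same size work fine.

```python
import random

# PAC guarantees must hold for ANY consistent hypothesis; with the
# degenerate sample (x, x, x, ...) an adversarially chosen consistent
# hypothesis fails badly, while i.i.d. samples pin the target down.

random.seed(1)
target = 0.5                     # unknown threshold concept on [0,1]

def worst_consistent(data):
    # among all thresholds consistent with the data, return the most
    # extreme one the data allows
    lo = max([x for x, y in data if not y], default=0.0)
    hi = min([x for x, y in data if y], default=1.0)
    return hi

def true_error(h, trials=100000):
    wrong = sum(1 for _ in range(trials)
                if ((x := random.random()) >= h) != (x >= target))
    return wrong / trials

x = random.random()
degenerate = [(x, x >= target)] * 1000        # (x, x, x, ...) -- not i.i.d.
iid = [((z := random.random()), z >= target) for _ in range(1000)]

err_deg = true_error(worst_consistent(degenerate))
err_iid = true_error(worst_consistent(iid))
print(err_deg, err_iid)   # large vs. small
```

The point is that m = 1000 “samples” carry the information of one sample when they are perfectly correlated, so the usual m ≈ (1/ε) log(|C|/δ) guarantee evaporates.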

Comment #13 June 27th, 2008 at 8:34 pm

An off-topic post: was rrtucci your schoolmate? He ALWAYS gave thought-provoking responses to your posts. I like that!

Comment #14 June 27th, 2008 at 8:50 pm

was rrtucci your schoolmate?

Argh… trying to repress the memories! (Alright, no 🙂 )

Comment #15 June 27th, 2008 at 9:30 pm

How is (1/epsilon) log(|C|/delta) the same as log (delta/|C|)/ log (1-epsilon)?

(The latter is what I get when I solve delta = |C|(1-epsilon)^m for m.)

Comment #16 June 27th, 2008 at 9:57 pm

How is (1/epsilon) log(|C|/delta) the same as log (delta/|C|)/ log (1-epsilon)?

#2, though my math can’t be better than yours, I’d suggest that one way to check this is to substitute different numbers for the variables and see.

Comment #17 June 27th, 2008 at 10:36 pm

MI’s: log(1-ε)≈-ε for small ε.
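Comment #17’s point is easy to check numerically (my own arithmetic, with invented parameter values): the exact solution of δ = |C|(1-ε)^m for m is log(δ/|C|)/log(1-ε), and since log(1-ε) ≈ -ε (and in fact -log(1-ε) ≥ ε), the cleaner expression (1/ε) log(|C|/δ) is an upper bound on it that is tight for small ε.

```python
from math import log

# exact m from solving delta = |C| (1-eps)^m, versus the stated bound
C, eps, delta = 10**6, 0.01, 0.001

exact  = log(delta / C) / log(1 - eps)   # = log(|C|/delta) / (-log(1-eps))
approx = (1 / eps) * log(C / delta)      # the usual (1/eps) log(|C|/delta)

print(exact, approx)   # approx is slightly larger: it's an upper bound
```

So the two expressions aren’t literally equal; the one in the lecture is the standard simplification, valid because more samples never hurt.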

Comment #18 June 28th, 2008 at 12:30 am

The online learning framework can be thought of as a way of learning when the sample points are not independent. In this framework, data points arrive one after the other, the learner picks a hypothesis at each stage out of a fixed concept class, and is suitably penalized if the next point fails to agree with the hypothesis. Typically, the guarantees tell you that the decision rule you used over time does almost as well as the optimal hypothesis that you could have used.
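One concrete instance of such a guarantee (a standard textbook one, not claimed by the comment itself) is the halving algorithm in the realizable setting: keep every concept still consistent with the past, predict by majority vote, and each mistake at least halves the version space, so there are at most log2(|C|) mistakes with no independence assumption whatsoever on the data stream. A minimal sketch, with an invented threshold class and data stream:

```python
from math import log2

# Halving algorithm: majority-vote the version space; each mistake
# eliminates at least half of it, so mistakes <= log2(|C|).

N = 64
concepts = [lambda x, t=t: x >= t for t in range(N + 1)]  # thresholds
target = concepts[37]                                     # unknown to learner

version_space = list(concepts)
mistakes = 0
stream = [50, 10, 40, 30, 36, 38, 37, 36, 5, 63]  # any order, even adversarial

for x in stream:
    votes = sum(c(x) for c in version_space)
    guess = votes * 2 >= len(version_space)       # majority vote
    truth = target(x)
    if guess != truth:
        mistakes += 1
    version_space = [c for c in version_space if c(x) == truth]

print(mistakes, "<=", log2(len(concepts)))        # mistake bound holds
```

Since nothing here is probabilistic, this is one clean formal sense in which “learning without i.i.d. samples” is possible, at the price of only bounding mistakes rather than guaranteeing low future error.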

Comment #19 June 28th, 2008 at 8:44 am

was rrtucci your schoolmate?

That would have been an honor, but no. I am much older and dumber than Scott. My phd is in physics, and I am not an academic, so we have different perspectives.

Comment #20 June 28th, 2008 at 12:15 pm

Rrtucci Says:

“… I am much older and dumber than Scott. My phd is in physics, and I am not an academic, so we have different perspectives.”

Ditto … except I’m an academic medical researcher. So why would a *medical* researcher care about the learnability of quantum states? Well, in medicine we are always looking for (to use the nomenclature of Scott’s lecture) “learnable concepts” that predictively encapsulate languages that describe complex systems. IMHO, Scott’s research does a great service in demonstrating that these (classical) learning principles can be extended to quantum system description and (therefore) to quantum system engineering.

Combining these informatic principles with the NSF-funded SynBERC objective “make biology into an engineering discipline” … and then further recognizing that the most fundamental biological mechanisms are quantum dynamical … well, that’s why there is such a growing torrent of interest in quantum information theory among disparate disciplines like chemistry, condensed matter physics, biology and medicine.

The branch of system engineering that serves these disciplines at the quantum level is called (what else?) quantum system engineering.

Quantum system engineering embraces all the canonical elements of QIT, but tends to weight and combine these canonical elements somewhat differently from the theorem-proving QIT community … which is not surprising given its very different objectives … hopefully these communities have much to offer one another.

Comment #21 June 28th, 2008 at 8:00 pm

H. L. Mencken in the hills outside Dayton, Tennessee, in 1925:

From the *Baltimore Evening Sun*, 13 July 1925.

Comment #22 June 28th, 2008 at 8:03 pm

I agree with John Sidles and Walt. I was adding computational complexity spin to an existing revolution.

The assault on naive neoclassical techno-libertarianism is described for the layman in Paulina Borsook’s book on Silicon Valley geeks, who sweep under the carpet the massive Federal and state investment in National Defense that established their infrastructure.

In academe, it was launched by Daniel Kahneman and Amos Tversky, the former of whom won the Nobel Prize in economics in 2002. See:

Hedonic Man

The new economics and the pursuit of happiness.

Alan Wolfe, The New Republic Published: Wednesday, July 09, 2008

HAPPINESS: A REVOLUTION IN ECONOMICS (MUNICH LECTURES IN ECONOMICS)

By Bruno S. Frey

(MIT Press, 240 pp., $35)

PREDICTABLY IRRATIONAL: THE HIDDEN FORCES THAT SHAPE OUR DECISIONS

By Dan Ariely

(HarperCollins, 280 pp., $25.95)

“When I first began hearing about what Bruno S. Frey, professor of economics at the University of Zurich, calls the ‘revolution’ in his discipline, my reaction was one of delight. As far as I was concerned, it could not happen fast enough. Neoclassical economists had insisted upon the primacy of self-interest only in order to model human behavior, but the way rational choice theory developed (at the University of Chicago in particular) suggested that self-interest was not just a fact for these thinkers, but also an ideal: not just how people do act but also how they should act. Their relentless advocacy of market-based public policies was finally ideological–and, by my lights, ideologically wrong. Also the jargon grew impenetrable, and the mathematics ostentatious and obnoxious….”

Comment #23 June 28th, 2008 at 9:10 pm

That would have been an honor, but no. I am much older and dumber than Scott. My phd is in physics, and I am not an academic, so we have different perspectives.

Having different perspectives can be fruitful. Many years ago when I worked for my physics professor, most of the time we had different views on problems about the ongoing research. Sometimes, he was right and, sometimes, he wasn’t.

Comment #24 June 29th, 2008 at 12:55 am

I’m never sure what ‘Bayesianism’ is, but Bayes decision rules — like another poster pointed out — are optimal, meaning given whatever data the world throws at you in that context, it’s the best you could do. So that seems perfectly consistent with the epistemic view you attributed to CS people, of assuming the worst case and still trying to do OK. It’s true that this analysis assumes a probabilistic representation, but why is that a problem? Even if it is, I guess there could be an appeal to Cox’s theorem and similar results…

I personally think the term ‘Bayesian’ should be dropped. All it means is that you reason in accord with the rules of probability theory, and oh, that turns out to have nice properties. That’s very innocuous, in my mind. So most disagreements about it probably *are* trivial or meaningless… yet somehow the term, and/or Bayesianism itself, managed to acquire an air of controversy. It induces religious wars, which are all silly. Probability theory is good. Probability is good for induction and for dealing with uncertainty in a unified, coherent way. If you’re Bayesian, you do that. Why is that controversial?

In light of that, there’s no point in denying or arguing about a Bayesian approach; rather, the only interesting argument is whether probability is a convenient representation for the problem you’re trying to solve, or for the theorem about some class of structures that you’re trying to prove. That’s 100x more interesting than the “Bayesian(ism): for and against” nonsense.

Comment #25 June 29th, 2008 at 4:42 am

“In light of that, there’s no point in denying or arguing about a Bayesian approach; rather, the only interesting argument is whether probability is a convenient representation for the problem you’re trying to solve, or for the theorem about some class of structures that you’re trying to prove. That’s 100x more interesting than the ‘Bayesian(ism): for and against’ nonsense.” -zf

Amen to that, but try saying it to a Bayesian!

…

The source of the Bayesian religious fervor is that their claim isn’t that probabilistic methods are a good way to understand induction etc.; rather, the claim often seems to be that “probabilistic methods are ALWAYS superior to other methods.” It’s when people make that kind of statement that the Bayesian religious wars arise… that, and when they talk about the philosophy behind conditioning probabilities on the data rather than hypotheses, or vice versa.

Comment #26 June 29th, 2008 at 7:43 am

Scott’s lecture concludes: “… The question of how to find and represent the hypothesis comprises much of the rest of [learning] theory. As yet, almost nothing is ~~known~~ proved about this part of learning theory in the quantum world.”

Scott, the above quote was really great … and I have further improved it (hopefully) by striking out “known” and substituting “proved”. Because didn’t someone-or-other say “Much more is known than has been proved”? 🙂

Seriously … Lecture #15 was an outstanding lecture that IMHO suggests many interesting research topics.

If we focus on quantum states that can be efficiently learned *in practice* — or efficiently represented, or whose dynamics can be efficiently simulated — then the class of quantum states that we have “learned” (and thus understood) is very large indeed … and growing rapidly.

So this is one of those cases where pretty much every quantum researcher knows more than we can prove … and one thing that I have learned from your blog is that this is an exceedingly common situation in information theory (both classical and quantum).

Of all the (many) heuristics that people use to represent states efficiently, perhaps the most common is to exploit *a priori* knowledge of bases in which Hamiltonian operators are sparse. Hence the intuition comes that the issues your lecture raises must be very intimately related to recent work in the field of compressive sampling. It is likely that there are new mathematical theorems to be proved here … and it is *certain* that there are practical applications to be found here!

Gosh, now I even feel like attempting to prove some theorems myself … which is very far from my usual inclination to just start analyzing hardware systems.

Anyway, thanks for a fine lecture, which greatly assisted me in understanding your article on learnable quantum states.

Comment #27 June 29th, 2008 at 10:57 am

> Bayes decision rules … are optimal, meaning given whatever data the world throws at you in that context, it’s the best you could do.

So, when A.E. discovered the theory of relativity, all he did was update probabilities according to the (surprisingly little) data he had, following Bayes’ rule?

Comment #28 June 29th, 2008 at 12:57 pm

Bayes decision rules — like another poster pointed out — are optimal, meaning given whatever data the world throws at you in that context, it’s the best you could do.

I agree that, within the framework and assumptions of Bayesianism, Bayes’ Rule is optimal (and there are theorems to that effect). 🙂 It would be an interesting exercise to look at any particular such theorem — say, Cox’s Theorem — and see where it makes an assumption tantamount to a prior over hypotheses being meaningful. I.e., at what step does the PAC-learning approach, of *for all* states of the world belonging to some set and *for most* sample data, picking a hypothesis that agrees with most future data, get assumed away as irrational?

(I really don’t see any way to recast the PAC approach as a Bayes decision rule, though maybe I’m not being sufficiently imaginative. If we were *just* taking the worst-case over possible states of the world, then we could assume a uniform prior and an “infinitely risk-averse” utility function, which gave a utility of -infinity to bad guesses. But in the PAC approach, we only ask to succeed with high probability over the sample data, and it’s that combination of worst-case with average-case that’s the source of the difficulty.)

So, maybe the issue is just that the PAC-learning criterion violates the decision theory axiom that utilities satisfy U(pA+qB) = pU(A) + qU(B)? As it happens, I was never entirely comfortable with that axiom (there are situations where it badly violates intuition, like the Allais paradox).
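The Allais pattern mentioned above can be checked by pure arithmetic. The sketch below uses the standard textbook payoffs (my own worked example, not from the comment): most people report preferring A over B and D over C, but after cancelling the common 0.89·u($1M) term, A ≻ B forces 0.11·u($1M) > 0.10·u($5M) while D ≻ C forces the reverse inequality, a contradiction for every utility function u with u(0) = 0.

```python
# Lottery A: $1M for sure.       Lottery B: 10% $5M, 89% $1M, 1% $0.
# Lottery C: 11% $1M, 89% $0.    Lottery D: 10% $5M, 90% $0.
# No expected-utility maximizer can rank both A over B and D over C.

def eu(lottery, u):
    return sum(p * u(x) for p, x in lottery)

A = [(1.00, 1)]
B = [(0.10, 5), (0.89, 1), (0.01, 0)]
C = [(0.11, 1), (0.89, 0)]
D = [(0.10, 5), (0.90, 0)]

# scan several utility shapes (risk-neutral to very risk-averse):
# none of them ranks both A over B and D over C
for u in [lambda x: x, lambda x: x**0.5, lambda x: x**0.05,
          lambda x: 1 - 2**(-x)]:
    both = eu(A, u) > eu(B, u) and eu(D, u) > eu(C, u)
    print(both)   # always False
```

Very concave utilities (like x**0.05 here) flip A over B, but then they also flip C over D, which is exactly the linearity constraint the comment is questioning.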

Comment #29 June 29th, 2008 at 1:11 pm

The Allais paradox goes away when you remember that utility isn’t purely a function of money. Certainty feels warm and fuzzy.

Comment #30 June 29th, 2008 at 1:19 pm

Fuzzy money? Ewww… here I was thinking it was *having* money that made *you* feel warm and fuzzy.

All joking aside, I do think that having an alternative to Bayesian statistics is important, as is understanding how they can go wrong. While Bayesianism is a useful tool in the toolbox, the problem of priors, so to speak, can be seen underlying a lot of pseudoscience: by choosing a bizarre prior distribution, scammers and cons can create figures that sound quite reasonable. That’s one of Mark C. Chu-Carroll’s favorite topics over at Good Math/Bad Math.

Comment #31 June 29th, 2008 at 1:37 pm

Nick: No, it doesn’t. The whole point is that it works with any utility function that satisfies the linearity axiom.

Comment #32 June 29th, 2008 at 8:35 pm

Can someone explain whether PAC-learning solves the “grue” paradox? At first glance it seems to just assume the problem away, by saying that the data samples are IID. Is that a correct understanding?

Comment #33 June 29th, 2008 at 11:37 pm

Wei: It depends what you mean by “solve”! As discussed in the lecture, in PAC-learning you always consider a concept class which is smaller than the class of all logically conceivable concepts. In particular, you consider concepts that are *limited in complexity* somehow (say, that are computed by polynomial-size Boolean circuits), which will rule out “grue” concepts once they become sufficiently complicated, thereby enabling you to learn. Now, some would say that in restricting the concept class we’ve assumed the problem away. Others, though, would say that we’ve elucidated something important about learning: *of course* you need to restrict the concept class somehow before you can learn! Why would you ever have imagined otherwise? 🙂

Comment #34 June 30th, 2008 at 9:38 am

… (remarks stimulated by Scott’s post above)

Scott is pointing out (AFAICT) a natural and yet wonderfully interesting idea that is at the heart of the connexion between learning theory and complexity: useful concepts are *limited in complexity*.

For example, in public key distribution, Alice and Bob share a low-complexity concept class that describes the public key. It is hard for Eve to learn this concept class on her own (but how hard, exactly?) even though in principle Eve has all the information she needs to accomplish this.

Scott’s theorem about learnable quantum states suggests that novel forms of cryptography might be constructed in which the public key is a quantum state (described as a classical array of numbers), and Alice’s and Bob’s private key is their (shared) low-complexity quantum concept class that provides a high-fidelity explanation of it.

Then a natural trap-door problem is the (easy) task of verifying that a quantum concept class works, versus the (hard) task of conceiving that concept class.

Just for fun, our QSE Group has carried this program to completion, and yes, it turns out that pretty much any high-fidelity quantum simulation code can be adapted to serve as a means of public key distribution, based on the fundamental “trap-door” problem of computing high-fidelity quantum concept classes.

The public key is the quantum state to be described … the private key(s) is the basis/bases in which that state has a compact distribution … and it turns out that Alice’s and Bob’s bases need not be the same.

Scott’s lecture points to a very natural language—namely, quantum learning theory—for describing key distribution methods based on efficient (classical) algorithms for simulating quantum systems … we have put it on our list of results to write up! 🙂

Comment #35 June 30th, 2008 at 11:55 am

wolfgang:

So, when A.E. discovered the theory of relativity, all he did was update probabilities according to the (surprisingly little) data he had, following Bayes’ rule?

Yes.

Comment #36 June 30th, 2008 at 12:31 pm

Scott: D’oh, you’re right.

Comment #37 June 30th, 2008 at 12:33 pm

komponisto,

the page you linked to is really cute.

I liked especially this sentence

“It’s just the prior was over the possible characters of physical law, and observing other physical laws let Einstein update his model of the character of physical law, which he then used to predict a particular law of gravitation.”

Perhaps you can explain to me what it means that “the prior was over the possible characters of physical law”, because to my slow mind that is just meaningless nonsense.

Comment #38 June 30th, 2008 at 3:21 pm

I think the Bayesian approach (e.g. Solomonoff Induction) of assigning smaller priors to more complex hypotheses is more natural than the PAC-learning approach of having an unavoidably arbitrary cut-off point for complexity of concepts.

But in my question I was trying to get at a different issue. The “grue” hypothesis says there are two different distributions that we are drawing our samples from, one before 2050, and another one after 2050. But PAC-learning requires that the samples are drawn from independent and *identical* distributions, so it seems that we can’t even formulate “grue” as a concept in PAC-learning. Is this correct?

Comment #39 June 30th, 2008 at 4:23 pm

Ok, after thinking about it some more, I can formulate the “grue” concept as follows. Let the sample space be the set of emerald observation events (i.e. space-time coordinates where an emerald’s color has been or will be observed). The the “grue” concept maps every event with t = 2050AD to blue. The “green” concept simply maps every event to green.

But PAC-learning still doesn’t work for this example (because the samples are not IID). Either we admit “grue” into the concept class, in which case according to PAC-learning “green” and “grue” are equally valid based on current data. Or we don’t admit “grue” into the concept class, but how can we be completely sure that emeralds are not, and can not be grue? Neither choice seems satisfactory.

Now compare this with Solomonoff Induction, where we assign “grue” a smaller prior than “green” because it has a higher Kolmogorov complexity. That seems a lot more appealing, at least to me.

Comment #40 June 30th, 2008 at 4:26 pm

Oops. Apparently I need to use HTML entity names for less than and greater than. The above should be:

The “grue” concept maps every event with t < 2050AD to green, and every event with t >= 2050AD to blue.

Comment #41 June 30th, 2008 at 5:49 pm

Wei: Yes, for learning to be possible you also need some assumption to the effect that the future resembles the past, which in the PAC setting we model by saying that the samples are i.i.d. (Note that such an assumption is as necessary in the Bayesian case as in the PAC one.)

Comment #42 June 30th, 2008 at 7:10 pm

But the future does not resemble the past, at least not in the i.i.d. sense, since the universe is not in thermal equilibrium! What if we tried to apply PAC-learning to the question “Will oil prices be less than $200 next year?” or “How fast will a $100 CPU be next month?”

Maybe PAC-learning works if we somehow knew in advance that in the specific case we’re considering, the samples are i.i.d. But how did we learn *that*, if we’re not allowed to apply Bayes’ Theorem?

Comment #43 June 30th, 2008 at 7:14 pm

Wei, even if we *were* allowed to apply Bayes’ Theorem, how would we know that? We’d have to build it into our prior!

To summarize my viewpoint: PAC-learning and Bayesianism are two different useful formalisms for reasoning about learning and prediction, but neither one magically gives you more than you put in.

Comment #44 June 30th, 2008 at 8:17 pm

Yes we do need to build it into our prior, but we already know what to build into our prior, and why, so I don’t see how that’s an objection to Bayesianism. I’ve mentioned Solomonoff Induction a couple of times (see arXiv:0709.1516v1 for a recent review). Why do you not consider it an adequate solution to the “prior” problem?

I’m not saying there aren’t any problems with Solomonoff Induction, but I don’t see any problems that PAC-learning handles better.

Comment #45 June 30th, 2008 at 9:17 pm

Wei, for me one of the biggest problems is that the Solomonoff prior is uncomputable. Of course you can fix that by using time-bounded Kolmogorov complexity — but in that case, you’ve already taken a big step toward the PAC-learning framework (where you can consider, e.g., all polynomial-size circuits that might explain your data, thereby getting a problem that’s “merely” exponential as far as anyone knows — or better yet, can consider restricted hypothesis classes for which polynomial-time learning algorithms are available).

Comment #46 July 1st, 2008 at 12:15 pm

I see the same problem addressed from a different angle in today’s revision:

arXiv:0708.0654 (replaced)

Title: Structure or Noise?

Authors: Susanne Still, James P. Crutchfield

Comments: 6 pages, 2 figures;

Subjects: Data Analysis, Statistics and Probability (physics.data-an); Statistical Mechanics (cond-mat.stat-mech); Information Theory (cs.IT); Learning (cs.LG); Mathematical Physics (math-ph); Statistics (math.ST); Chaotic Dynamics (nlin.CD)

(Submitted on 5 Aug 2007 (v1), last revised 29 Jun 2008 (this version, v2))

Abstract: We show how rate-distortion theory provides a mechanism for automated theory building by naturally distinguishing between regularity and randomness. We start from the simple principle that model variables should, as much as possible, render the future and past conditionally independent. From this, we construct an objective function for model making whose extrema embody the trade-off between a model’s structural complexity and its predictive power. The solutions correspond to a hierarchy of models that, at each level of complexity, achieve optimal predictive power at minimal cost. In the limit of maximal prediction the resulting optimal model identifies a process’s intrinsic organization by extracting the underlying causal states. In this limit, the model’s complexity is given by the statistical complexity, which is known to be minimal for achieving maximum prediction. Examples show how theory building can profit from analyzing a process’s causal compressibility, which is reflected in the optimal models’ rate-distortion curve–the process’s characteristic for optimally balancing structure and noise at different levels of representation.

Comment #47 July 1st, 2008 at 2:26 pm

Instead of saying “PAC-learning and Bayesianism are two different useful formalisms for reasoning about learning and prediction” I think we can keep just Bayesianism and reinterpret PAC-learning results as Bayesian-learning results which say that in some special circumstances, it doesn’t matter exactly what prior one uses. In those circumstances, Bayesianism will work regardless.

For example, in the basic PAC-learning theorem, we just pick *any* hypothesis that’s consistent with the past and use that to predict the future, and Valiant proved that this works assuming samples are i.i.d. Well, in the Bayesian setting we’d use the set of hypotheses that are consistent with the past, weighted by their prior probabilities, to predict the future. Clearly this can’t be *worse* than just picking any hypothesis, no matter what the prior is, right?

As for Solomonoff induction being uncomputable, we should expect learning to be hard in general, otherwise why would we have such big brains and still do so badly at it? I mean, we don’t similarly use the undecidability of PA as an argument against it being a good formalism for number theory.
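The reinterpretation above is easy to simulate on a finite class. The sketch below (my own toy code, with an invented threshold class and parameters) compares (a) an arbitrary consistent hypothesis with (b) the Bayes predictor that majority-votes the consistent hypotheses under a uniform prior; with i.i.d. data both satisfy the PAC guarantee, which is the comment’s point that the prior barely matters here.

```python
import random

# Finite concept class: thresholds t in {0..N} over points {0..N-1}.
# Draw i.i.d. labeled data, then compare an arbitrary consistent
# hypothesis against a uniform-prior posterior majority vote.

random.seed(2)
N = 1000
target_t = 371
def label(x, t): return x >= t

data = [((x := random.randrange(N)), label(x, target_t)) for _ in range(50)]
consistent = [t for t in range(N + 1)
              if all(label(x, t) == y for x, y in data)]

def any_hyp(x):                    # (a): first consistent threshold
    return label(x, consistent[0])

def bayes_vote(x):                 # (b): uniform-prior posterior vote
    return sum(label(x, t) for t in consistent) * 2 >= len(consistent)

err_a = sum(any_hyp(x) != label(x, target_t) for x in range(N)) / N
err_b = sum(bayes_vote(x) != label(x, target_t) for x in range(N)) / N
print(err_a, err_b)   # both small; the vote is typically no worse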

Comment #48 July 1st, 2008 at 3:10 pm

Wei, I think the difference is that, despite the undecidability of PA, mathematicians are able to prove or disprove a large proportion of the statements they actually care about. By contrast, the uncomputability of the Solomonoff prior would seem to rear its head in pretty much *every* case where one would actually want to use it!

*I think we can keep just Bayesianism and reinterpret PAC-learning results as Bayesian-learning results which say that in some special circumstances, it doesn’t matter exactly what prior one uses.*

I would say it differently from you: in some special circumstances, we can do learning even if we’re not comfortable picking *any* prior, or with the entire notion of a prior.

You raise some interesting issues; thanks! But now that the basic difference has been clarified, there might be little to do except agree to disagree (Aumann notwithstanding). 🙂

Comment #49 July 2nd, 2008 at 1:08 am

wolfgang:

> Perhaps you can explain to me what it means that "the prior was over the possible characters of physical law", because to my slow mind that is just meaningless nonsense

Look a couple of paragraphs before that:

“Rather than observe the planets, and infer what laws might cover their gravitation, Einstein was observing the other laws of physics, and inferring what new law might follow the same pattern. Einstein wasn’t finding an equation that covered the motion of gravitational bodies. Einstein was finding a character-of-physical-law that covered previously observed equations, and that he could crank to predict the next equation that would be observed”

Comment #50 July 2nd, 2008 at 7:48 am

komponisto,

I am able to read – that is not the problem. My problem is that I cannot make any sense of this. (An additional problem is that I cannot match this description with other descriptions of what A.E. was actually doing – see e.g. the biography of Abraham Pais.)

Does the author suggest some sort of probability distribution over the “space” of all possible laws of physics?

How would one construct such a “space” without the benefit of hindsight?

Perhaps I could make progress by parsing the sentence you quote?

“Einstein was finding a character-of-physical-law”

What is a ‘character-of-physical-law’ ?

“that covered previously observed equations”

How does one "observe" equations?

“that he could crank to predict the next equation”

Does the author suggest that careful observation of non-relativistic equations (which was the ‘character of physics’ prior to A.E.) allows us to predict relativistic physics?

I think that “crank” is perhaps the key word in this text.

I am sorry to be frank…

Comment #51 July 2nd, 2008 at 12:53 pm

The “’space’ of all possible laws of physics” is indeed an interesting concept, and one that considerably pre-dates the anthropic argument.

I have written about the topology of that space. For example: if one defines a distance between two points in that space, i.e. between two laws of physics, is that a semi-metric space, or does it also obey a triangle inequality and define a metric space?

Is there always a law of physics in between any two laws of physics?

See also: “ideocosm” by the great astronomer Fritz Zwicky.

Is the evolution of our universe a trajectory through that space, i.e. with parameters such as C slowly changing? The theory of inflation seems to require this. And on what basis (Bayes?) do we believe that the universe is not going to suddenly phase change, the way adiabatic cooling once "froze out" forces one at a time, so that symmetry suddenly breaks and, presto, a fifth force (or sixth, modulo some theories of "quintessence" or dark energy) becomes part of physical law?

This is by no means “crank” or crackpot science, but a reasonable attempt to understand the meta-laws by which physical laws are inter-related or established in the first place.

I would hope that quantum computers could search both Zwicky’s ideocosm and the aforementioned “space of all possible laws of physics” much faster than classical computers can. To that extent, the topology of the space is of interest in estimating the speed of such search.

Comment #52 July 2nd, 2008 at 3:44 pm

An interesting question is how do people manage to learn at all, given that learning seems so hard (i.e. uncomputable) in general. Part of the answer has to be some kind of anthropic selection effect, which places us in an environment where learning isn’t too hard, so that for example when we disregard the hypotheses that take exponential time or more to compute, things still work out well enough. Solomonoff Induction doesn’t take this into account.

I think another part of the answer is that we have some way of doing mathematics in a “probabilistic” or “intuitive” way that I haven’t seen captured in any formalism yet. For example, while reading some physics papers recently, I was struck by how many times the authors say things like “the math is too technically difficult, but we conjecture/speculate that …” Where do these conjectures/speculations and the degrees of uncertainty we attach to them come from?

The PAC-learning model helps a bit to answer the question of “how can we learn?” but not as much as I had hoped. Sorry if I’m belaboring this point, but the universe just isn’t naturally composed of i.i.d. samples. In order to apply PAC-learning we have to either somehow intelligently select a subset of inputs that are i.i.d., or actively cause our samples to be i.i.d. Both of these are still hard problems in general, as far as I can tell.

As for Aumann, I don’t think we have any real disagreement here. It’s probably just that we are not interested in exactly the same questions/issues, and our respective ways of thinking about PAC-learning are a reflection of that.

Comment #53 July 2nd, 2008 at 3:56 pm

> Does the author suggest some sort of probability distribution over the "space" of all possible laws of physics? How would one construct such a "space" without the benefit of hindsight?

See here and here (and pages linked therein).

> "Einstein was finding a character-of-physical-law"
>
> What is a 'character-of-physical-law'?

A physical law might be simple, it might be complex, it might have a Kolmogorov complexity of 102, it might be in three variables, it might involve second-order differential equations, or Riemannian manifolds, or Hilbert spaces…all these are “character(istic)s” that a physical law may or may not have.

> "that covered previously observed equations"
>
> How does one "observe" equations?

By opening a history book or a physics text, which is how Einstein would have observed such equations as F=Gm_1m_2/r^2, K= (1/2)mv^2, V=IR,…

> Does the author suggest that careful observation of non-relativistic equations (which was the 'character of physics' prior to A.E.) allows us to predict relativistic physics?

Yes!

And this is exactly what A.E. did: remember he apparently wasn't even aware of the Michelson-Morley experiment (though others such as Lorentz were). As I'm sure you know, there were already problems with the equations as they stood in 1900 that in hindsight we can see as pointing toward relativity. Einstein's achievement consists in not needing hindsight to see this.

> I think that "crank" is perhaps the key word in this text. I am sorry to be frank…

If there is a problem, it is not your frankness, but rather a premature willingness to dismiss Yudkowsky before (apparently) understanding what he’s talking about.

Comment #54 July 2nd, 2008 at 4:21 pm

komponisto,

this is my last comment on this topic.

> which is how Einstein would have observed such equations as F=Gm_1m_2/r^2, K= (1/2)mv^2, V=IR,…

And how does one get from this to that (= relativistic physics)?

Comment #55 July 2nd, 2008 at 4:58 pm

> this is my last comment on this topic
>
> And how does one get from this to that (= relativistic physics)?

Do you expect the answer to be short and simple? If it were, relativity would have been invented a lot sooner. There's a reason Einstein is considered a genius.

If you want to end the discussion here, then I suppose the best I can do (besides referring you to the totality of Yudkowsky’s posts on Overcoming Bias), is quote a paragraph from here:

“In the extreme case, a Bayesian superintelligence could use enormously less sensory information than a human scientist to come to correct conclusions. First time you ever see an apple fall down, you observe the position goes as the square of time, invent calculus, generalize Newton’s Laws… and see that Newton’s Laws involve action at a distance, look for alternative explanations with increased locality, invent relativistic covariance around a hypothetical speed limit, and consider that General Relativity might be worth testing. Humans do not process evidence efficiently – our minds are so noisy that it requires orders of magnitude more extra evidence to set us back on track after we derail. Our collective, academia, is even slower.”

Comment #56 July 2nd, 2008 at 6:23 pm

As a science fiction author/editor with some Math and Computer degrees, I’m more than dubious about a “Bayesian superintelligence” with essentially infinite computational capability, or at least unphysical in our specific (finite) cosmos.

This is only one hair less absurd than Vernor Vinge's (intentionally provocative) statement at conventions that he's playing with fiction about extraterrestrials for whom ALL questions about integers are immediately and intuitively obvious to answer. "Diophantine equations," I commented, "Gödel numbers, Hilbert's tenth problem, Yuri Matiyasevich?"

Professor Vinge smiled knowingly.

In that context, wolfgang is right, and komponisto needs to study some well-known classical complexity and undecidability results.

Scott is also correct, in a linked blog post, that we know quite well how Einstein progressed, and one need not ideologically rewrite (and falsify) history. See also the Einstein Papers Project, whose editors I speak with now and then at Caltech.

Comment #57 July 2nd, 2008 at 7:03 pm

Wei Dai, since you are so interested in Bayesianism and quantum mechanics, my latest paper might interest you. I would certainly appreciate hearing your thoughts about it. Anyone else reading this—your comments about my paper would also be most welcomed. Send them to: tucciATar-tiste.com

Comment #58 July 2nd, 2008 at 7:09 pm

PS: the hyphen between the r and t is part of my email address

Comment #59 July 3rd, 2008 at 12:25 am

> In that context, wolfgang is right, and komponisto needs to study some well-known classical complexity and undecidability results

1. The ideas to which I was referring belong to Eliezer Yudkowsky, not me, and are (continually) expounded at length on *Overcoming Bias*. So I hope that, before deciding that I need to "study some well-known classical complexity and undecidability results", you made sure to study the entire train of thought represented by the OB posts I linked to and are prepared to offer a detailed technical critique.

2. Exactly what is wolfgang "right" about? All he did was indicate a lack of comprehension, a situation which I (perhaps vainly) attempted to ameliorate.

3. It is mighty presumptuous of you to declare that I need to “study some well-known classical complexity and undecidability results”, while you proceed to make the ill-justified (and, frankly, implausible) assumption that the human level of intelligence is near the physical upper bound. There is no reason to assume that even a “Bayesian superintelligence” as described above would be unphysical, let alone anything like “infinite” in its computational capacity.

Comment #60 July 3rd, 2008 at 9:29 am

Since this thread is asymptoting to a rate of zero new posts per day, perhaps now is a good time to say that Lecture 15 (on learnability) was wonderful, and that Lecture 14 (on quantum noise) was wonderful, and to express the opinion that 14∪15 is more wonderful still, in the following sense.

Scott's "Learnability Lecture" prompts us to ask "Why are learnable systems ubiquitous in the physical world?" For example, why does my digital camera store compressed jpeg images with such high fidelity … this would be impossible if most objects in my visual field were not learnable!

The empirical ubiquity of classical learnability has no good classical explanation (as far as I know). But it has a *very* good quantum explanation, that is covered (somewhat implicitly) in Scott’s “Quantum Noise Lecture.”

Namely, noise evolves quantum systems to be learnable … and this is why quantum computations cannot tolerate excessive noise.

What can we do with this insight? Well, we can prove theorems with it … but theorem-proving is not my strong point … I will leave this to the QIT folks!

But we can also (less rigorously yet more practically) create design rules with it. For example, a design rule that if a quantum system is hard to simulate, then by introducing more noise, you can render the system *easier* to simulate.

In our QSE Group's experience, this design rule works exceedingly well … the effect is rather like the artificial viscosity that von Neumann invented to make the Navier-Stokes equations easier to simulate … the theorems of Scott's lecture help us understand that both artificial viscosity and quantum noise serve to make systems more "learnable", and hence more easily simulatable—which is a very useful principle to know!

The broader point is that Scott’s lectures constitute (effectively) an outreach program in which the theorem-provers and the design-rule inventors discuss these issues. My own opinion is that there are many more elegant theorems *and* useful quantum system design rules to be discovered at this interface.

This outreach may or may not have been the intended consequence of Scott's lecture, but it is a welcome consequence nonetheless … and in any case, how many of us can say that our actions always have the intended consequences?

The above is *mainly* an excuse to say "thank you" to Scott for two enjoyable lectures … and I also want to thank Gil Kalai for his thoughtful posts … and I thank everyone else for their comments too.

Comment #61 July 3rd, 2008 at 6:08 pm

In the rectangle example, what about 4 points in a straight line, where one of the interior points isn't included in the rectangle? In that case, there isn't any rectangle that includes the three other points but excludes that interior point.

Comment #62 July 3rd, 2008 at 7:31 pm

Travis, VCdim(C) is defined as the size of the largest subset shattered by C, so it doesn't matter that there is one subset of 4 points (all in a straight line) that C doesn't shatter, as long as there is another subset of 4 points (like in the diagram) that C does shatter.
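This shattering condition can be checked by brute force. The sketch below is my own illustration (the point coordinates are arbitrary choices): it enumerates all 2^n labelings of a point set and asks whether an axis-aligned rectangle can realize each one, using the fact that a labeling is realizable iff the bounding box of the positive points contains no negative point.

```python
from itertools import product

def realizable(points, labels):
    """Can some axis-aligned rectangle contain exactly the positive points?
    Equivalently: the bounding box of the positives captures no negative."""
    pos = [p for p, lab in zip(points, labels) if lab]
    neg = [p for p, lab in zip(points, labels) if not lab]
    if not pos:
        return True  # a tiny rectangle away from every point works
    x0, x1 = min(p[0] for p in pos), max(p[0] for p in pos)
    y0, y1 = min(p[1] for p in pos), max(p[1] for p in pos)
    return not any(x0 <= x <= x1 and y0 <= y <= y1 for x, y in neg)

def shattered(points):
    # the class shatters the set iff every one of the 2^n labelings is realizable
    return all(realizable(points, labels)
               for labels in product([False, True], repeat=len(points)))

diamond = [(0, 1), (1, 0), (0, -1), (-1, 0)]    # like the 4 points in the diagram
collinear = [(0, 0), (1, 0), (2, 0), (3, 0)]    # 4 points on a line
print(shattered(diamond), shattered(collinear))  # → True False
```

The diamond configuration is shattered while the collinear one is not, so one unshatterable 4-point set does not stop VCdim from being 4; adding any fifth point to the diamond breaks shattering, matching VCdim = 4 for axis-aligned rectangles.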

Comment #63 July 4th, 2008 at 12:53 am

Wei,

In the problem posed, the rectangle is not supposed to be rotated?

Comment #64 July 6th, 2008 at 8:13 am

No, the rectangle is not supposed to be rotated. (Actually I’m not sure what you are asking exactly, but I don’t think “yes” is the correct answer to any question you likely have in mind.)

Comment #65 July 6th, 2008 at 8:15 am

Scott, I have another question of my own. I've been trying to see if Bayes' Theorem can be applied to mathematical conjectures, but one obvious problem is what prior to use over them. Since no prior is necessary with PAC-learning, could it work instead?

Take Goldbach’s Conjecture (“Every even integer greater than 2 can be written as the sum of two primes.”) as an example. Superficially, Goldbach’s Conjecture looks a lot like the hypothesis that all ravens are black. And intuitively it seems like we should be able to gain some confidence in this conjecture if we test many even integers and find that they can all indeed be written as the sum of two primes. But how to formalize this intuition?

Goldbach's Conjecture can certainly be formulated as a concept in the PAC-learning model (i.e., as a function that maps every even integer to true). And we can also sample the even integers i.i.d. from some distribution and test whether each sample can be written as the sum of two primes or not. But what is the right concept class? If we include all Boolean functions over the even integers, the VC dimension is infinite, and PAC-learning fails. What else can we do? Would it be justifiable to place some kind of complexity limit on the concept class in this case, and can we "learn" Goldbach's Conjecture if we do that?
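For what it's worth, the sampling experiment described above is easy to run. The sketch below is my own (the range bound 50000 and sample size 200 are arbitrary choices, since no uniform distribution over *all* even integers exists): it draws even integers i.i.d. from a bounded range and searches each one for a Goldbach decomposition.

```python
import random

def is_prime(n):
    # trial division; fine for the small n sampled here
    if n < 2:
        return False
    return all(n % d for d in range(2, int(n ** 0.5) + 1))

def goldbach_witness(n):
    """Return a pair of primes summing to the even integer n, or None."""
    for p in range(2, n // 2 + 1):
        if is_prime(p) and is_prime(n - p):
            return (p, n - p)
    return None

random.seed(0)
# draw even integers i.i.d. from a bounded range
samples = [2 * random.randint(2, 50000) for _ in range(200)]
failures = [n for n in samples if goldbach_witness(n) is None]
print(len(failures), goldbach_witness(100))  # → 0 (3, 97)
```

Of course, finding no counterexample among bounded samples is exactly the situation the comment asks about: without a restricted concept class (or a prior), it is unclear how much confidence such an experiment should buy.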