FWIW — and I realize this doesn’t seem to relate to your particular claim, but I think it’s worth noting — it does appear that what Van Horn sets out in that paper is also not logical probability in the sense I discussed above.

]]>Deduction is a tool that works at the level of theory. It takes true statements and makes other true statements. By contrast, probability theory is a tool that works at the model level. We imagine a space of possible worlds, the worlds being mathematical objects, and we put a probability distribution on that space.

I used to think the same about probability theory, until I read Kevin van Horn’s From Propositional Logic to Plausible Reasoning: A Uniqueness Theorem. He puts (conditional) probability at the level of the logical consequence relation A |= B. For me, as someone who explicitly takes non-classical logic into account, this is a huge difference. I used to think that you must restrict yourself to classical logic if you want to use probability theory. However, A |= B is just true or false, even in non-classical logic. There is a slight difference: A, B, C |= D is equivalent to A∧B∧C |= D in classical logic, but the equivalence may fail in some non-classical logics. Even so, A, B, C |= D can still be interpreted as the conditional probability P(D | A, B, C).
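To make the probabilistic reading of the consequence relation concrete, here is a minimal toy sketch (my own construction, not Van Horn’s formalism): read “A, B |= D” as P(D | A, B) over a space of propositional valuations with a uniform prior. An entailed conclusion then gets conditional probability 1, while an unrelated one gets an intermediate value.

```python
from itertools import product

atoms = ["p", "q", "r"]

def cond_prob(conclusion, premises):
    """P(conclusion | premises) under a uniform prior on valuations."""
    worlds = [dict(zip(atoms, bits))
              for bits in product([False, True], repeat=len(atoms))]
    admissible = [w for w in worlds if all(prem(w) for prem in premises)]
    favorable = [w for w in admissible if conclusion(w)]
    return len(favorable) / len(admissible)

p = lambda w: w["p"]
q = lambda w: w["q"]
r = lambda w: w["r"]
p_implies_q = lambda w: (not w["p"]) or w["q"]

# An entailed conclusion gets conditional probability 1 ...
assert cond_prob(q, [p, p_implies_q]) == 1.0
# ... while an unrelated one gets an intermediate value.
assert cond_prob(r, [p]) == 0.5
```

Nothing here depends on the logic being classical beyond the toy setup; the point is only that the conditioning event is “all premises hold”, which is exactly the left-hand side of the consequence relation.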

]]>And now Adam has posted yet a third segment, in which I talk about small, lighthearted things like existential threats to civilization and the prospects for superintelligent AI.

Starting at 16:40, Scott says something like “… we don’t have anything in our past experience that is analogous to it … We are asking about a logical possibility that is sort of unlike anything that’s ever been seen. The closest analogy that I can think of is sometimes people ask things like what if these famous math problems like the Riemann hypothesis or P vs NP, what if they are independent of the axioms of set theory and you could neither prove nor disprove them… it is a logical possibility … until we have even one genuine example of it … we can’t really do statistics with a sample of size zero”

I would like to take that discussion as an opportunity to explain some of my recent thoughts on circularity which I find interesting (and which are more or less related to Sebastian Oberhoff’s rollicking essay). But before that …

There is a huge difference between a math problem “being independent of the axioms of set theory” and a math problem that “you could neither prove nor disprove”. I agree with Scott that we don’t have examples of math problems even remotely comparable to the Riemann hypothesis or P vs NP that “you could neither prove nor disprove”. On the other hand, I don’t agree with the claim that there are no nice, good, and natural examples of Gödel-undecidable math problems. Especially since Gödel explicitly mentions the formal system of Principia Mathematica, which is slightly weaker than ZFC, Martin’s theorem (see my comment #29 above) should already do the trick.

Quite a few famous math problems have been proved independent of some reasonable, natural axiom system in the past; prominent examples include the parallel postulate and the continuum hypothesis. (If I were the judge, I would say they have both been disproven in a suitable sense. I am less sure how to judge the axiom of choice in this context.)

Can we learn how to answer the “what if …” question from these examples? It seems like we somehow managed to “construct sufficiently nice and natural models” where those hypotheses failed. But how could we ever manage to construct natural models of the natural numbers different from the one true model, in a way that would give interesting insights into P vs NP? (My recent thoughts on circularity below give some intuition for how consistent (but unsound/dishonest) axiom systems give models of the natural numbers.)

Gentzen’s consistency proof of PA suggests a different way to answer the “what if …” question: it uses the additional assumption that epsilon_0 is well-founded. An analogy for P vs NP would be to prove P != NP using additional derandomization assumptions. The challenge here is to address the objection that “you explicitly put in (the additional assumptions) what you wanted to prove in the first place”, and that hence your proof is circular and worthless.

OK, here are my thoughts on circularity:

Of course Hume is right that justifying induction by its success in the past is circular. Of course Copenhagen is right that describing measurements in terms of unitary quantum mechanics is circular. Of course Poincaré is right that defining the natural numbers as finite strings of digits is circular. (Please forgive me for simplifying those subtle philosophical positions into such objectionably short statements. My comment is already too long, and subtle philosophical elaborations at this point would reduce the number of readers to basically zero.)

But this circularity is somehow trivial; it doesn’t really count. It does make sense to use induction, describing measurement in terms of unitary quantum mechanics does clarify things, and the natural numbers really are well defined. But why? Has anybody ever written a clear refutation explaining how to overcome those trivial circularities? Peter Smith’s essay Induction and predicativity at least addresses this question with respect to the natural numbers. But I still remember how Nik Weaver got angry when I linked to that essay, basically disagreeing with the starting point that this trivial circularity even exists. My most intuitive explanation of this circularity so far at least no longer causes angry replies, but perhaps that is because it stays so abstract (like a model of set theory where the continuum hypothesis is refuted) that it doesn’t hurt anyone. Nothing is said about where those infinite sequences of digits come from.

So I wondered how to make this more concrete, in a similar spirit to how one shows that equality testing for computable real numbers is undecidable. The idea there is to take some Turing machine for which we would like to decide whether it halts, and write out the real number 0.000…000… with an additional zero for every step of the machine, outputting one final 1 when (and only if) the machine finally halts. Deciding whether this real number is equal to zero is then equivalent to solving the halting problem for the given Turing machine.

How can one do something similarly concrete for natural numbers? Suppose that you have a consistent (but maybe unsound/dishonest) axiom system which claims that some given Turing machine halts. Then this Turing machine defines a natural number in any model of that axiom system. To expose the trivial circularity, we can now use this natural number as the length of a finite string of digits. How do we get the digits themselves? We can just use any Turing machine which outputs digits, independent of whether it halts or not. Well, if it doesn’t halt, it might still matter whether it outputs a finite or an infinite number of digits, but in the worst case we can ask the axiom system again for its opinion on this matter. The requirement here is that the construction should define a natural number in any model of that axiom system.
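The real-number reduction mentioned above is easy to sketch in code. This is my own minimal illustration (the “machine” is a toy countdown, not a real Turing machine): the digit stream is 0 at every simulated step, with a final 1 appended only if the machine halts, so deciding whether the resulting real equals 0 would decide halting.

```python
def digit_stream(machine_step, state, max_steps):
    """Yield one digit per simulated step: 0 while running, a final 1 on halt.

    machine_step: state -> next state, or None once the machine has halted.
    max_steps bounds the simulation here; a real run would be unbounded.
    """
    for _ in range(max_steps):
        if state is None:       # the machine has halted: emit the final 1
            yield 1
            return
        yield 0                 # one more step taken, one more zero
        state = machine_step(state)

# Toy "machine": counts down from 3 and halts at 0 (4 steps, then halt).
step = lambda n: None if n == 0 else n - 1
print(list(digit_stream(step, 3, 10)))   # [0, 0, 0, 0, 1]

# A non-halting toy machine yields only zeros, forever (here truncated).
print(list(digit_stream(lambda n: n + 1, 0, 6)))   # [0, 0, 0, 0, 0, 0]
```

The point of the truncation parameter is exactly the difficulty: no finite prefix of zeros distinguishes “equal to zero” from “not yet halted”.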

]]>No, I still think you’re mixing things together that don’t belong together. Different levels, really. Certainly deduction works just as well with contingent premises, but I don’t think one can infer the rest of what you say from that. The relation between necessary domains and contingent ones is asymmetric.

I’m going to abuse some logic terminology here — sorry for the abuse of terminology but I’m having trouble thinking of a better way to put this — and talk about the level of “theory” and the level of “model”. By “model” I mean, like, a description of the world as a mathematical object. By “theory” I mean the statements we make about the world, about this object. Like I said, I’m abusing logic terminology here, I don’t mean it in quite the same way, but like in logic, a “model” is a mathematical object and a “theory” is statements about that object.

Deduction is a tool that works at the level of theory. It takes true statements and makes other true statements. By contrast, probability theory is a tool that works at the model level. We imagine a space of possible worlds, the worlds being mathematical objects, and we put a probability distribution on that space. You can assign probabilities to statements, yes, but this happens by replacing the statement by its extension.

Which is to say, probability theory assumes you are logically omniscient. As you already know, attempting to apply probability theory to mathematics results in triviality — any given statement has probability either 1 (if it is true) or 0 (if it is false). There are not multiple possible models; your space of possible worlds is a single point.
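A toy illustration of this triviality (my own construction): once statements are identified with their extensions, i.e. sets of possible worlds, a necessary truth holds in every world, so any probability distribution whatsoever assigns it probability 1, and a necessary falsehood probability 0.

```python
# Worlds and prior are arbitrary; the conclusion doesn't depend on them.
worlds = ["w1", "w2", "w3"]
prior = {"w1": 0.5, "w2": 0.3, "w3": 0.2}

def prob(extension):
    """Probability of a statement, identified with its extension."""
    return sum(prior[w] for w in worlds if w in extension)

contingent = {"w1", "w3"}   # true in some worlds, false in others
necessary = set(worlds)     # e.g. a provable mathematical statement
impossible = set()          # e.g. a refutable one

print(prob(contingent))     # an intermediate value (0.7 here)
print(prob(necessary))      # always 1.0, whatever the prior
print(prob(impossible))     # always 0.0, whatever the prior
```

Only the contingent statement’s probability carries any information; the mathematical statements are forced to the endpoints before any evidence arrives.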

Yes, of course in actual logic there can be multiple models of ZFC, etc. Like I said, I’m not talking about that, I’m abusing the terminology a bit. I’m assuming a sort of mathematical platonism where the model comes first and the theory merely describes it. So, if you know the model, you also know the theory.

And so probability theory, as well as all our other tools based on it for dealing with contingent matters, being model-based, trivialize questions of theory. Sure, it’s true that deduction works for contingent matters, but that’s a theory-level concern; it just lets you navigate around a theory, gets you between different statements within it. And probability theory works on the level where we’re not dealing with those, where we already automatically know all true statements about any given model.

But if there’s one thing in your comment that makes me stop and say “Hold on, what?” it’s this:

A similar thing can be said about probability theory when we use something like Carnap’s logical probabilities as opposed to contingent probabilities like relative frequencies or propensities. This makes absolutely no difference for probability theory.

Carnap’s logical probabilities — as best I can tell from my quick reading, because I actually hadn’t encountered these before, so thank you for bringing them to my attention — are no good for these purposes. They’re “admissible”: they obey the laws of ordinary probability theory. Which means automatically that they’ll assign any necessarily true statement a probability of 1 and any necessarily false statement a probability of 0. Which is to say, they assume logical omniscience, once again, the very thing we’re trying to avoid here. (Carnap’s logical probabilities also seem to require adding various symbols one wouldn’t normally have, but I’ll ignore that for now.)

My understanding was that the entire idea of “logical probability” — in the sense of, logical probability that avoids logical omniscience, that allows us to assign probabilities other than 0 or 1 to mathematical statements that are nonetheless provable or disprovable — and how to formalize it was basically a gigantic open problem. OK, maybe less gigantic now thanks to the people at MIRI. But it’s pretty obvious why the problem is hard — because, as already mentioned, if your logical probabilities satisfy the laws of ordinary probability, you once again end up with decidable mathematical statements getting a probability in {0,1}, i.e., logical omniscience.

So, as I said, the relation between these tools — deduction and probability theory — is an asymmetric one. Deduction works at the theory level. It therefore also automatically applies at the model level, which incorporates the theory level. Probability theory works at the model level, which incorporates the theory level; but probability theory is also a tool that *trivializes* the theory level, so if you attempt to port it back to the theory level without adjustment you get something trivial. Logical probability is, if it is to be nontrivial, *not* probability; it is a new tool that resembles probability. There’s a real asymmetry here, and the automatic porting in one direction doesn’t mean that you can automatically (nontrivially) port things in the other direction.

Additionally, nobody would argue that deduction works totally differently when you use necessary premises, like “all integers have a successor”, as opposed to contingent premises like “all men are mortal”. It’s completely irrelevant for deduction whether the premises or conclusions are necessary or contingent. A similar thing can be said about probability theory when we use something like Carnap’s logical probabilities as opposed to contingent probabilities like relative frequencies or propensities: this makes absolutely no difference for probability theory. And both deductive logic and probability theory work necessarily, because otherwise they wouldn’t work with necessary subjects such as mathematics. If it isn’t justified to assume two totally different kinds of deduction or probability theory just because the necessary and contingent cases seem so similar, then it is also not justified to assume the same thing for induction or Ockham’s razor.

]]>By the same argument, this explanation should not require any contingent “inductive assumptions” (like effectiveness of the world), since as I said, induction (and by extension, Ockham’s Razor) apparently works necessarily because it apparently works in mathematics.

I don’t think this is correct; I think this is a bad generalization. Occam’s Razor seems to work in mathematics. It also seems to work in matters involving the external world. The former of these must hold necessarily. That doesn’t mean the latter must also. I don’t see any reason to assume that one explanation must cover both (well, other than… I don’t need to finish this joke 😛 ). And indeed an explanation based on, say, Solomonoff Induction necessarily only works towards explaining the latter, since Solomonoff Induction assumes you’re already logically omniscient (I’m not familiar with these other arguments but I wouldn’t be surprised if they did too, formalizing partial states of logical rather than empirical knowledge is not so easy).

]]>The justification you name is merely pragmatic. First of all, there already exists a much more direct pragmatic justification of Ockham’s Razor (optimal worst-case performance; see Kelly 2008 in the references) which doesn’t require Solomonoff prediction at all. There is also the familiar argument by Reichenbach 1935, and its “meta-induction” refinement by Schurz 2008.

Second, only an epistemic justification can be converted into an explanation, since any epistemic justification E can be used to explain a hypothesis H if H is true, because any epistemic justification E for H must entail H logically or probabilistically. This is not the case for a pragmatic justification, which entails only the usefulness of H. But an explanation is exactly what should exist by the argument in #45, and therefore also an epistemic justification. By the same argument, this explanation should not require any contingent “inductive assumptions” (like effectiveness of the world), since, as I said, induction (and by extension, Ockham’s Razor) apparently works necessarily because it apparently works in mathematics.

]]>The optimality argument has been refuted, see http://philsci-archive.pitt.edu/12429/1/soloccam.pdf

Thanks for the link! The paper actually does not agree with your conclusion, as per footnote 5 in the paper:

This phrasing suggests a weaker property than reliability (i.e., convergence to the truth), namely optimality (convergence to predictions that are at least as good as those of any other prediction method). However, the proof that is referred to is about reliability.

So the paper covers reliability not optimality. As I initially said, any prior would converge with Bayesian inference, or as the paper says, “We can prove that any Bayesian predictor, operating under the inductive assumption of S, is reliable under the assumption that the data is indeed generated by some source S* ∈ S.”
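The quoted reliability claim is easy to check numerically. This is my own toy sketch, not anything from the paper: the hypothesis class is a small grid of Bernoulli biases, the data really is generated by one of them, and the posterior concentrates on the true source regardless of the (non-dogmatic) prior weights chosen.

```python
import random

random.seed(0)
biases = [0.1, 0.3, 0.5, 0.7, 0.9]
posterior = {b: 1 / len(biases) for b in biases}   # any non-dogmatic prior
true_bias = 0.7                                    # the source S* is in S

for _ in range(500):
    x = 1 if random.random() < true_bias else 0    # sample from the source
    for b in biases:
        posterior[b] *= b if x == 1 else 1 - b     # Bayes update (unnormalized)
    total = sum(posterior.values())
    posterior = {b: posterior[b] / total for b in biases}

best = max(posterior, key=posterior.get)
print(best)   # the posterior concentrates on the true bias, 0.7
```

The choice of prior weights only affects how fast the concentration happens, which is the sense in which “any prior would converge with Bayesian inference” under the inductive assumption that the source is in the class.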

I don’t think the optimality of Solomonoff induction is in dispute, and I think optimality is sufficient justification for Occam’s razor as a general recommendation for prediction: it’s simple, it’s explicable/communicable, it’s effective, and it’s optimal.
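The optimality at issue rests on a simple dominance property: a Bayesian mixture with positive weights multiplicatively dominates every hypothesis in its class, so its predictions can be at most a constant factor worse than the best member’s. Here is a minimal sketch with my own toy Bernoulli class (not Solomonoff’s actual construction over programs):

```python
from itertools import product

def bernoulli_seq_prob(bias, seq):
    """Probability a Bernoulli(bias) source emits the bit sequence seq."""
    p = 1.0
    for x in seq:
        p *= bias if x == 1 else 1 - bias
    return p

# Hypothesis class: bias -> prior weight (any positive weights will do).
hypotheses = {0.2: 0.5, 0.8: 0.5}

def mixture(seq):
    """The Bayesian mixture predictor over the hypothesis class."""
    return sum(w * bernoulli_seq_prob(b, seq) for b, w in hypotheses.items())

# Dominance: mixture(seq) >= w_h * P_h(seq) for every hypothesis h and seq,
# simply because the mixture is a sum of non-negative terms including that one.
for seq in product([0, 1], repeat=4):
    for b, w in hypotheses.items():
        assert mixture(seq) >= w * bernoulli_seq_prob(b, seq)
```

In Solomonoff induction the same inequality holds with weights roughly 2^-K(h) over semicomputable hypotheses, which is where the log-loss regret bounds come from.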

However, the paper also concludes that, “If there exists a way to salvage the argument at all, then it would have to consist in demonstrating anew that effectiveness as an inductive assumption does lead to a fundamental simplicity notion”.

That is, the paper does not definitively establish that effectiveness does not require simplicity, as you seem to imply; it concludes only that effectiveness is sufficient for Solomonoff Induction, and that simplicity does not appear necessary unless it is itself required by effectiveness. That remains an open problem.

]]>Wow, thanks for that link! I wasn’t expecting much when I clicked on it but it makes a good argument. The key point isn’t stated clearly up front, though, so let me summarize it for everyone else. (Warning, I may be butchering some of the technical details regarding different notions of computability a little.)

So, the idea is that a universal prior formalizes the principle of Occam’s Razor, and a universal prior works well (for some appropriate notion of that), so this shows that Occam’s Razor works, right? The thing is, there’s a theorem that *any* semicomputable prior on the set of semicomputable sequences (one that assigns nonzero probability to every semicomputable hypothesis) can be represented as a universal prior. So the idea that universal priors are some special Occamian subset of these is not true — if you demand semicomputability (and non-dogmatism), there’s nothing to contrast them with! And the argument “Occam’s Razor does well in a semicomputable world” just becomes “Assuming a semicomputable world (and being semicomputable yourself) does well in a semicomputable world”, which is less interesting.

OTOH, one could just reply “Yes, assuming a semicomputable world does well in a semicomputable world, but that’s equivalent to Occam’s Razor doing well in a semicomputable world, so what’s the problem? It still remains the case that Occam’s Razor does well in a semicomputable world, so unless you think the world might not be semicomputable, it’s still justified; it’s OK if one only implicitly uses Occam’s Razor rather than invoking it explicitly.” But — as mentioned above — this kind of requires acknowledging priors that don’t assume a semicomputable world (or aren’t semicomputable themselves, or are dogmatic) as a live option to contrast against, so this reply seems to be on pretty shaky ground…
