In Defense of Kolmogorov Complexity

I got lots of useful and interesting feedback on my last post, though I also learned a valuable sociological lesson about the “two kinds of complexity theory”:

If you write about the kind of complexity theory that involves acronyms like NP, BQP/qpoly, and r.s.r., people will think the issues must be difficult and arcane, even if they’re not and can be understood with very little effort.  By contrast, if you write about the kind of complexity theory that can be illustrated using pictures of coffee cups, people will think the issues can be sorted out with 15 seconds of thought, and will happily propose ‘solutions’ that presuppose what needs to be explained, answer a different question, or fail in simple examples.

Seriously, a large number of commenters raised two important questions, which I’d like to address forthwith in this followup post.

The first question is why I omitted the notion of coarse-graining, which plays a central role in many accounts of entropy and complexity. The short answer is that I shouldn’t have omitted it.  In fact, as both Sean Carroll and Luca Trevisan (among others) quickly pointed out, one can tell a perfectly-reasonable story about the coffee cup by defining the “complextropy,” not in terms of sophistication, but in terms of the ordinary Kolmogorov complexity of a coarse-grained or “smeared-out” state.  If you define the complextropy that way, it should increase and then decrease as desired, and furthermore, it’s probably easier to prove that statement than using the sophistication-based definition (though both versions seem highly nontrivial to analyze).

So, the reason I turned to sophistication was basically just the mathematician’s instinct to situate every concept in the most general structure where that concept makes sense.  For example, why define “connectedness” for polygons in the Euclidean plane, if the concept makes sense for arbitrary topological spaces?  Or in our case, why define “complextropy” for dynamical systems that happen to have a spatial structure over which one can coarse-grain, if the concept also makes sense for arbitrary dynamical systems whose evolution is computable by an efficient algorithm?  Of course, [OPEN PROBLEM ALERT] it would be wonderful to know whether the two types of complextropy can be shown to be related for those dynamical systems for which they both make sense, or whether we can construct a convincing example that separates the two.

The second question is why I invoked Kolmogorov complexity in a discussion about thermodynamics: many people seemed to think that, by doing so, I was making some novel or controversial claim.  I wasn’t.  People like Charles Bennett, Seth Lloyd, and Wojciech Zurek have employed Kolmogorov complexity as a useful language for thermodynamics since the 1980s; I was simply following in their footsteps.  Basically, what Kolmogorov complexity lets you do is talk in a well-defined way about the “entropy” or “randomness” of an individual object, without reference to any ensemble from which the object was drawn.  And this is often extremely convenient: notice that Kolmogorov complexity snuck its way in even when we defined complextropy in terms of coarse-graining!

Of course, if our dynamical system is probabilistic, then we always can talk instead about the “actual” entropy; in that case Kolmogorov complexity basically just amounts to a shorthand.  On the other hand, if our system is deterministic, then talking about the (resource-bounded) Kolmogorov complexity seems essential—since in that case there’s no “true” randomness at all, only pseudorandomness.

But a few commenters went further, disparaging Kolmogorov complexity itself rather than just its application to a particular problem.  Here’s Shtetl-Optimized regular Raoul Ohio:

As usual, my DAH (Devil’s Advocate Hat) is on. This is convenient, because it allows you to comment on anything without doing the work to really understanding it. Thus I will proceed to disparage the notion of using Kolmogorov Complexity (KC) for anything but entertainment.

Math is a subject where a couple of interesting definitions and a few theorems can launch a subfield such as KC. I have never studied KC … but a brief reading of the subject suggests that it started as a joke, and today a lot of people are not in on it.

… the KC of things would change as knowledge in other fields progresses. For example, what is the KC of

δ = 4.66920160910299067185320382…, and

α = 2.502907875095892822283902873218… ?

These are Feigenbaum’s constants (http://en.wikipedia.org/wiki/Feigenbaum_constants). A couple of decades ago, no one knew anything about these numbers. With the concept of analyzing discrete dynamical systems by bifurcation diagrams in hand, these can be calculated with a short program. So, did KC(δ) and KC(α) drop dramatically 20 odd years ago?

…using KC reminds me of physics arguments that use the wave function for the universe. Sure, there must be such a thing, but it is hard to say much about it.

On the other side of the coin, the theorems and proofs in basic KC are rather similar to those in many fields of TCS, and many SO [Shtetl-Optimized] readers might not think of these as a joke…

My intuition is that the entire concept of KC is “ill-posed”, to borrow a term from PDE.

In the interest of “full disclosure”, I must mention that often in the past I have thought some topic was a bunch of hooey until I understood it, after which I thought is was profound, just like listening to Lenard [sic] Cohen.

I wrote a reply to Raoul, and then decided that it should go into a top-level post, for the edification of Kolmogorov-skeptics everywhere.  So without further ado:

Hi Raoul!

I think this is indeed one of those cases where if you understood more, you’d see why your dismissal was wrong. And unlike with (say) art, music, or religion, the reasons why your dismissal is wrong can be articulated in words!

Contrary to what you say, K(x) is not undefinable: I’ll define it right now, as the length of the shortest prefix-free program (in some fixed universal programming language) that prints x and then halts! K(x) is uncomputable, but that’s a very different issue, and something that’s been known since the 1960s.

Basically, what K(x) lets you do is give a clear, observer-independent meaning to the loose notion of there “not existing any patterns” in a string. Already from that statement, it’s obvious that K(x) is going to be hard to compute—for as you correctly point out, detecting the existence or nonexistence of patterns is hard!

(Though contrary to what you say, K(Feigenbaum’s constant) didn’t suddenly become small when Feigenbaum defined the constant, any more than 42038542390523059230 suddenly became composite when I wrote it down, probably for the first time in human history. Please don’t tell me that you make no distinction between mathematical truths and our knowledge of them!)

The key point is that, even without being able to compute K(x) for most x’s, you can still use the definition of K(x) to give meaning to hundreds of intuitions that otherwise would’ve remained forever at a handwaving level. For example:

“The overwhelming majority of strings are patternless.”

“If a short computer program outputs a patternless string, then it can only be doing so by generating the string randomly.”

And many, many less obvious statements—every one of which can be upgraded to a theorem once you have a mathematical definition of “patternlessness”!

Furthermore, the idea of Kolmogorov complexity has actually inspired some important experimental work! For example, if you could compute K, then you could compute the “similarity” between two DNA sequences D1 and D2 by comparing K(D1)+K(D2) to K(D1,D2).

Of course you can’t compute K, but you can compute useful upper bounds on it. For example, let G(x) be the number of bits in the gzip compression of the string x. Then comparing G(D1)+G(D2) to G(D1,D2) has turned out to be a very useful way to measure similarity between DNA sequences.

It’s really no different from how, even though we can never say whether a curve in the physical world is continuous or not (since that would require infinitely precise measurements), the mathematical theories dealing with continuity (e.g., calculus, topology) can still be applied in physics in all sorts of ways.

53 Responses to “In Defense of Kolmogorov Complexity”

  1. wolfgang Says:

    >> define “complextropy” for dynamical systems that happen to have a spatial structure over which one can coarse-grain

    I think the coarse-graining explanation should also work for arbitrary random strings. Averaging of n next neighbors will result in 1/2 +- a and if you define numbers between 1/2-a and 1/2+a as ‘brown’ you will in general get a very simple description for a random string.

  2. wolfgang Says:

    i forgot to add that a is approx. 1/sqrt(n) with n large enough

  3. Peter Morgan Says:

    In your definition, what constitutes “prints x”, particularly if the program has to “halt”? Is it enough to print “sqrt(2)”, giving an answer in terms of another relatively high level algorithm, in which case K(x | sqrt) is finite, or does it have to be printed in decimal, “1.414…”, in which case your definition makes almost all answers K(x | decimal representation)=infinity? Isn’t a concept of *relative* Kolmogorov complexity more useful, insofar as it can be engineered to be finite? What makes me think that there’s an acronym for this that makes it possible to talk about it in a compressed form?

  4. Jay Gischer Says:

    Your last paragraph inspires the following thought: K(x) can be seen as a limit to a well-defined, recursive process.

    With an appropriate enumeration of all programs, we can recursively enumerate all the programs that emit x and halt, and hence, we can recursively enumerate the lengths of all such programs. And this enumeration can be refined, recursively, to be monotonically decreasing. It has a lower bound, the values are integers, therfore it must converge.

  5. the paul Says:

    @Peter Morgan:

    Kolmogorov complexity is a quality of strings- sequences of symbols- so it’s not a question of printing any string which could be interpreted by a human to refer to the value X. The string “sqrt(2)” has its own Kolmogorov complexity, and it has little to do with the Kolmogorov complexity of any other arbitrary representation of the value of the square root of 2.

    “Printing x” means producing/outputting the symbols in the string x.

  6. Foster Boondoggle Says:

    Scott – Could you please explain the distinction between coarse-graining and the definition of the set S as the set with the shortest description such that x is effectively a random member of S? Why doesn’t that make S a coarse grained description of x?

  7. Scott Says:

    Peter Morgan #3: Yeah, what Paul said. The program has to print out the actual string x, not some other representation of it.

  8. Scott Says:

    wolfgang #1: OK, but the problem is that we could also get a value close to 0 for strings that are intuitively very “sophisticated” and “complextropic”!

    For example, suppose we take an n-bit string x that’s as sophisticated and non-random as you like, and then we encode x by a 2n-bit string y, in such a way that each ‘0’ in x gets mapped to ’01’ in y, and each ‘1’ in x gets mapped to ’10’ in y:

    x=00110 → y=0101101001

    Then if we “coarse-grain” y, we’re just going to see equal numbers of 0s and 1s, no matter which part of y we look at, and no matter how unbalanced x may have been! In other words, coarse-graining will completely miss the structure that y inherits from x.

    Of course, once we know how y was generated, we can choose to coarse-grain in a more careful way that doesn’t destroy its structure. But why should we have to choose a different definition of coarse-graining for each encoding scheme? Why not insist on one definition that works for all encoding schemes?

    Aha, welcome to my approach based on sophistication… :-)

  9. Scott Says:

    Foster Boondoggle #6: Yes, you could think of my definition as a more abstract form of coarse-graining. The main differences are the following:

    (1) My definition doesn’t presuppose any spatial structure on the string (but on the other hand, it doesn’t get to take advantage of any spatial structure that’s there).

    (2) My definition insists that the string x be a generic member of the set S. (If not for that requirement, we could just always choose S to be the set of all strings, and my definition would become trivial!) With the coarse-graining definition, by contrast, the way to go from x to S is fixed in advance, and there’s no need to require x to be a generic element of S.

  10. Foster Boondoggle Says:

    I think you’re imputing too much specificity to the stat-mech notion of coarse graining. I always took it to mean just the description of a system by a set of macroscopic state variables such that, conditional on their values, “the physics” is independent of the remaining microscopic details. That sounds an awful lot like the relationship between x and S. (If the interesting aspects of the system’s evolution aren’t independent of the unspecified dof’s, something got left out of the coarse graining – e.g., the air pressure in the room with the cup turned out to be important. In your terms, S wasn’t such that x was a generic member.)

    I don’t get what you mean by “With the coarse-grained definition the way to go from x to S is fixed in advance”. It’s going to be different for any particular problem – it’s not the same for a cup of unmixed coffee and cream as it is for a solar coronal plasma ejected into space.

  11. Scott Says:

    Foster #10: Let’s focus on the following question. Given a bitmap image, how do you actually compute its complextropy?

    According to the coarse-grained approach, you first smear out the bitmap, and then you compute (or rather, approximate as well as you can) the Kolmogorov complexity of the result. Here the definition of “smearing out” is not something that was mathematically supplied to us: rather, we had to make it up ourselves, given our knowledge of what are the relevant variables in this particular system. On the other hand, once we’ve made up the definition, we then apply it in a uniform way to all bitmaps. So, given a bitmap x with smeared version f(x), let S={x':f(x’)=f(x)}. Then it’s entirely possible that x is far from a random or generic element of S, but rather has all sorts of additional non-random structure, which we “forgot” when we summarized x by saying that it was a member of the set S.

    According to the sophistication approach, by contrast, you completely ignore the bitmap structure, and just look at the string x. You don’t have recourse to any “smearing function” that someone with knowledge of this particular dynamical system made up. Instead, you minimize the resource-bounded Kolmogorov complexity KRB(S) over all sets S such that x is an algorithmically-random element of S. Thus, S=S(x) might be a large set for some values of x, and a small set for other values. It need not be the case that S(y)=S(x) for all y∈S(x). S might coarse-grain over a completely different set of variables than the ones that a human, looking at the system, would consider “natural” ones to coarse-grain over. And, unlike with the previous definition, computing the complextropy now requires not one but two nested Kolmogorov complexity computations: one for K(S) and one for K(x|S).

    I hope that makes the differences clearer.

  12. Peter Morgan Says:

    So, given that we require a program to halt (“prints x and then halts”), I take you to be saying that “x” is a *finite* string (presumably from a finite alphabet). The better definition is appreciated, assuming that’s right.

    I was thrown, I suppose, because you talk about the KC of Feigenbaum’s constants, following Raoul’s introduction of that example, without qualification to finite substrings. A program that chooses a particular finite number of decimal digits from an infinite string of decimal digits (such as the 1st, 9th, and 25th digits of a decimal representation of sqrt(2), say) is a particular coarse-graining specification, right? Looks like there are a lot of coarse-grainings of any given infinite string.

  13. wolfgang Says:

    >> In other words, coarse-graining will completely miss the structure that y inherits from x.

    ok, but before I give up I would try to look at n-point correlations and suggest that low complexotropy is low correlations + uniformity after coarse-graining

  14. Scott Says:

    Peter: Normally, one talks about the Kolmogorov complexity of finite strings, but the definition makes sense as well for an infinite string x, with the one proviso that we might need to say K(x) is infinite (if there’s no finite Turing machine to output x). Or one can talk about the Kolmogorov complexity of the first n bits of x, as a function of n.

    I don’t know what it would mean to “coarse-grain” a countably-infinite string. In any case, I was only talking about finite strings in my posts.

  15. Gil Kalai Says:

    It still looks to me that using Kolmogorov complexity is un-natural (and should be controversial even if its not), in the sense that it is much too complex and even intractable notion to describe the phenomena you want to describe. A notion (which, in a sense, allow to talk about “randomness” of deterministic systems) which might be relevant and more sutable to express this idea is topological entropy. In general, statistical notions may suffice to describe the notion of “complexity” that is considered here, and may be more appropriate than complexity-theoretic notions.

  16. Peter Morgan Says:

    KC without halting makes sense, of course, for an infinite string, as it would for a finite length program that generates a decimal representation of sqrt(2). I suppose you’re saying that KC without halting is still KC.

    “one can talk about the Kolmogorov complexity of the first n bits of x, as a function of n”, or one could talk about the Kolmogorov complexity of any chosen sublist or increasing sequence of sublists. First n is very special, albeit it may be natural for some strings, depending on the meaning of the string (“meaning” being any additional information about the string, one example of meaning being “early items in a string are more significant than later items in the string”, but for an array of data we might say something like “every tenth item in the string is more significant than other items”).

    Doesn’t coarse-graining choose some substring of a total string (or perhaps a string of functions of substrings)? Given the unmanageable total number of possibilities, one looks for any meaning that suggests a natural choice for generating a new, shorter string from a given string.

  17. Scott Says:

    Gil #15: Just to reiterate a few points made earlier,

    (1) I believe what we really want here is resource-bounded Kolmogorov complexity—something that’s “merely” exponentially hard to compute, not uncomputable!

    (2) If the dynamics are deterministic, then standard statistical notions such as entropy clearly aren’t going to work. Can you suggest an alternative to resource-bounded Kolmogorov complexity that works for deterministic systems?

    (3) Even for probabilistic systems, it might be more experimentally convenient to take an individual bitmap and feed it into gzip (thereby crudely upper-bounding its Kolmogorov complexity), than to try explicitly to estimate the entropy of a probability distribution over exponentially many bitmaps.

  18. Luke Says:

    I really like this idea of sophistication being a “generalization” of coarse-graining. Worth spending some time ruminating on…

  19. IThinkImClever Says:

    I am suspecting that an easy way to clear up alot of the confusion here regarding Kolmogorov Complexity (and now that ‘compression’ came into the picture), is to take note of the fact that:

    Incompressible bit-strings of *every length* exist.

    (And thus they must have relatively higher KC, to the ‘more compressible’ strings of the same length)

    So while the bit-string X = ‘01010101010101010101010101010101’ can be compressed by the program PX = 16 * ’01’, and thus have low KC, a more random bit-string of the same length, like say, Y = ‘01111001000100110011101101110111’, is *most efficiently* (wrt space complexity) reproduced by a program that explicitly prints it out *as is*; as constant, static data, instead of a much longer program, such as: PY = ‘0’ + (4 * ‘1’) + … + (3 * ‘1’).

    (Of course, this is more intuitive for larger bit-strings)

    The trivial proof of the statement has to do with the fact that the number of bit-strings of length n is greater than the number of descriptions of length less than n, so that *at least one* bit-string of length n is indeed incompressible, and has provably high KC, according to the definition.

    i.e. There are 2^n bit-strings of length n, but only 2^n – 1 possible ‘short descriptions’.

    I hope that helps someone. :)

  20. Gil Kalai Says:

    Dear Scott,
    Precisely! notions of compressability originated in information theory/statistics (like gzip) appear much more appropriate to express your idea than Kolmogorov complexity. Shana Tova to you and your family, –Gil

  21. Bruno Loff Says:

    On the subject, check out the paper “Similarity distance and phylogeny” including as an application the automatic construction of phylogeny trees (that group humans with chimpanzees, mice with rats, cows with whales and so on).

    link:

    http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.15.5884

  22. MattF Says:

    Uh… wait a minute. When I first read the post about this, it sorta made sense in a vague “Oh, that makes sense” kind of way, but now I’m not so sure. How are the initial stages of entropic mixing different from the final stages? At times near zero, a system’s state will start to leak out into new regions of phase space, but it does so in a maximally stupid way. I’m not seeing where structure can sneak into this.

  23. Scott Says:

    MattF #22: See The Blind Watchmaker by Richard Dawkins ;-)

  24. Jochen Says:

    I’ve been a huge fan of this recent series of post, starting with the work on computational complexity and philosophy; unfortunately, I don’t really have anything deep to contribute…

    But perhaps, if you’re not aware of it already, you might like to take a gander at Baez/Stay’s work on algorithmic thermodynamics as an addition to your list of works employing algorithmic complexity in a thermodynamic context.

  25. Daniel Franke Says:

    Scott,

    Although your use of KC appears to be sound, the sort of confusion that Raoul is railing against is quite real. Before KC can be meaningful, one must, as you observe, select “some fixed universal programming language”. All Turing-complete languages are not, for this purpose, created equal. The shortest Common Lisp program that outputs Feigenbaum’s constants is non-trivial, but it becomes much shorter if your fixed universal language is a Common Lisp extension that has a (print-feigenbaum-constants) in its standard library. There is no uniquely “natural” choice of universal language, and one person’s notion of a “good” choice will be wildly different from another’s depending on their respective programming backgrounds. This observations tends to get lost when Less Wrongers et. al. try to turn Solomonoff induction into a theory of epistemology.

  26. Scott Says:

    Daniel Franke #25: OK, the dependence on the choice of universal TM is a legitimate criticism of Kolmogorov complexity. But Raoul was not making that criticism, but several other criticisms that (as I said in the post) I think can be pretty straightforwardly answered.

  27. Daniel Franke Says:

    Scott,

    I predict that if Raoul chimes in on this thread, he’ll agree that my point encompasses what he was getting at. Then again, the error I’m describing has been a pet peeve of mine for years (due to seeing it so much on LW), so maybe I’m reading too much into others’ critiques.

  28. wolfgang Says:

    >> a legitimate criticism of Kolmogorov complexit

    another criticism would be that all short bitstrings
    are ‘random’ according to KC, e.g. 1111
    because the simplest programs are already too long.

  29. IThinkImClever Says:

    While I’d consider encoding actual data constants in the model as somehow ‘cheating’ yourself, I do still think that you would run into the exact same problems when choosing which primitive/total/partial functions you would ultimately encode into (or omit from) your underlying model.

    Another criticism of KC I’d have is: what alphabet of symbols would you choose to encode data, and what effect would that ultimately have on the KC?

    For example, powers of 3 in base 2 representation tend to look like random bit-strings, but of course, they have relatively low KC, since you just simply multiply by 3 a number of times to reproduce the data.

    But if you change the encoding alphabet to base 3, wouldn’t larger values of 3^n be most efficiently represented by trit-shifting n times to the left, thus ‘reducing’ its KC from the base 2 representation?

    I’m not really sure how major an issue that is, though.

  30. Yatima Says:

    Good stuff. I demand more! The last time I thought about KC was when I tried to explain to a sneering electronics engineer that the concept actually makes sense, back in the 90’s.

    >>But if you change the encoding alphabet to base 3, wouldn’t larger values of 3^n be most efficiently represented by trit-shifting n times to the left, thus ‘reducing’ its KC from the base 2 representation?

    One shouldn’t care about that. Just use a Turing Machine against which “a” (not “the”) KC can be computed and off you go. Sure, there may be some large differences in KC for finite number of strings (especially if you decide to add a “library” to compute some particular numbers, the KC would be high for one and 1 for the library-laden machine), but in the end KCs should be of similar order for given input string. In the same way as the actual value of a Chaitin’s constant depends on the machine chosen.

    This demands some proof, which is not forthcoming as my brain cells for this are in disrepair and have a meeting tomorrow.

  31. rrtucci Says:

    Betting-Scott, time to bet on the ponies again. (Did your wife know that you had a small gambling problem before she married you?)

    Adiabatic quantum algorithm for search engine ranking

    Silvano Garnerone, Paolo Zanardi, Daniel A. Lidar
    (Submitted on 29 Sep 2011)
    We propose an adiabatic quantum algorithm to evaluate the PageRank vector, the most widely used tool in ranking the relative importance of internet pages. We present extensive numerical simulations which provide evidence that this quantum algorithm outputs any component of the PageRank vector-and thus the ranking of the corresponding webpage-in a time which scales polylogarithmically in the number of webpages. This would constitute an exponential speed-up with respect to all known classical algorithms designed to evaluate the PageRank.

  32. Scott Says:

    IThinkImClever #29 (and Yatima #30): Encoding details like choice of base will change the Kolmogorov complexity by at most an additive constant factor, independent of the length of the string. This constant will simply be the size of a Turing machine needed to convert from one base to the other. For (much, much) more about these sorts of invariances, see any intro to Kolmogorov complexity, e.g. the Li-Vitanyi book.

  33. Scott Says:

    rrtucci: Thanks for the heads-up. I just read the paper, and it looks nice! However, just like with the closely-related quantum algorithm for solving linear systems (by Harrow, Hassidim, and Lloyd), there are two important caveats, which I’m absolutely certain are going to get glossed over in popular articles about this work.

    The first caveat, which the authors mention at the end, is that the algorithm doesn’t give you the PageRank of a particular webpage. Instead, it gives you a superposition over all the pages, such that the probability of observing page i is proportional to the PageRank of i. This means that learning the PageRank of a page of your choice will still generally require linear time, just like in the classical case.

    The second caveat, which I don’t think the authors mention, is that, depending on the precise model for the quantum computer’s architecture, loading the web data into the quantum computer in the first place might already require linear time!

    Actually, rereading the paper, I now see that they don’t cite the work by Harrow, Hassidim, and Lloyd, which surprises me a lot. One thing I’d like to understand better is the relationship between the two algorithms.

  34. David Hogarty Says:

    Daniel Franke/Scott:

    I encourage you to look at the following as a particularly useful fixed language for talking about Kolmogorov Complexity. Ideally, you want a language where few builtins exist, but where talking about several instances in a given fixed set causes an insignificant further expansion of size. This satisfies that requirement well.

    http://en.wikipedia.org/wiki/Binary_lambda_calculus

  35. Raoul Ohio Says:

    Daniel Franke is right that I share his concern, but I acknowledge that I am a duffer on this topic. I suspect my objections are “old hat”, which is why Scott had no trouble dealing with them.

    I have learned lots of esoteric stuff by getting a coffee buzz and working through the standard examples. This procedure likely needs modification for KC, where you can’t actually calculate anything. It reminds me of the programming concept of “pointers to pointers … to pointers”, where another layer of indirection sometimes provides an avenue to solving some issue. I have been able to do this often, particularly once when I translated the Java code in a book on “Design Patterns” into C++.

    Here are another couple questions:

    1. Is it obvious that there really is a well defined (whatever that means) function from the set of strings to the positive integers, that meets the definition of KC?

    The function would presumably be parameterized by the programming language chosen, and possibly some other things. I admit I am a little weak on understanding things that cannot be calculated, but it seems plausible that a clever diagonalization, or whatever, might show that no such function exists.

    2. I thought of the Feigenbaum constants (then a recent development, much in the math news) the first time I heard of KC. Scott shows why KC(δ) did not CHANGE when Feigenbaum identified it. Is it correct to say that the BEST ESTIMATE of KC(δ) changed?

  36. Scott Says:

    Raoul:

    I have learned lots of esoteric stuff by getting a coffee buzz and working through the standard examples. This procedure likely needs modification for KC, where you can’t actually calculate anything.

    No, not really! If you’re interested enough, get a coffee buzz and work through the exercises in the Li-Vitanyi book.

    Is it obvious that there really is a well defined (whatever that means) function from the set of strings to the positive integers, that meets the definition of KC?

    Yes. Again, “well-defined” is not the same thing as “computable.” The halting problem is uncomputable, but it’s perfectly well-defined. And to take an example from Sipser, the function

    f(x) = 1 if God exists, 0 if God doesn’t exist

    is ill-defined, but computable (since it’s a constant function not depending on x).

    Scott shows why KC(δ) did not CHANGE when Feigenbaum identified it. Is it correct to say that the BEST ESTIMATE of KC(δ) changed?

    Actually, I don’t think I’d say that either, for the simple reason that before Feigenbaum, people hadn’t even considered Feigenbaum’s constant! I.e., this is not a case where people looked and looked for a pattern in some apparently-random sequence of digits, until finally Feigenbaum discovered the pattern. Rather, it’s a case where the sequence would never have been written down at all, without prior knowledge of a pattern that made the sequence non-random and interesting.

  37. Jiav Says:

    Assume a thick chemical soup consisting of various nucleic acids, small bubbles of fat and a pinch of salts, then life appears, species evolve and ecosystems become complex.

    Should these increase or decrease Kolmogorov complexity, sophistication and logical depth? What’d be your best guess?

  38. Erik Wennstrom Says:

    So I’m coming late into this conversation class, but I was reading an old (1975) Gregory Chaitin paper (“A Theory of Program Size Formally Identical to Information Theory”) where he talks about some similar topics, and while reading it, I started to get annoyed by the issue of the choice of the universal programing language, and then I was reminded of this discussion.

    Chaitin defines a notion of entropy based on Kolmogorov complexity, and when he chooses his universal programming language (which he just calls an “abstract computer”), he requires that it be “optimal” in the sense that for any other programming language C, the program length overhead that the C -> U compiler introduces is bounded by a constant. To be clear, this constant depends on the particular language C, so changing languages might mean that the overhead gets out of control, but as long as you hold C constant, no matter how big the programs get, the compiling overhead is constant.

    Since such universal programming languages exist, it’s not completely ridiculous to try and use such an idea. But it does mean that there’s not much meaningful to be said about the entropy (and presumably the complextropy) of a single string, or even about comparing the entropy of a finite number of strings. But it does mean that you could certainly talk meaningfully about what happens to the entropy of a sequence of strings (or of a single infinite string) in the limit.

    And if you still want to talk about the entropy of small, finite numbers of strings, I guess if you can justify that the strings in question are “typical” members of an infinite set, then I feel like your conclusions would be at least as reasonable as that justification was.

    Here, it seems to me like we’re discussing middle-behavior in addition to end behavior, so I feel like maybe we should give a little lip service to justify that there would be a good choice of universal language.

  39. Scott Says:

    Erik #38: Not only does there exist a universal translation constant C for any pair of programming languages, but in practice, it’s usually not an unreasonably-large constant. E.g., much smaller than 1 mole = 6*1023, which is a good unit to keep in mind in discussions about entropy. That’s one of the empirical facts that connects Kolmogorov complexity (indeed, computability theory more generally) to the physical world and gives it its relevance. (It’s analogous to the fact that most polynomial-time algorithms have reasonable exponents and constants—if that weren’t the case, polynomial time would be useless as a criterion of efficiency.) Is that enough lip-service? :-D

  40. Daniel Quilp Says:

    I am absolutely stunned that you have not posted an encomium to Steve Jobs. You are a computer science professor. Jobs was the most important innovator in the field. You claim you want to reach out to the public but fail to take advantage of this opportunity. Very sad, very disappointing.

  41. Scott Says:

    Daniel Quilp: Given that this news is in essentially every media outlet everywhere on earth, I didn’t feel it was necessary for me to add something, as I didn’t know Jobs and have no more relation to him than anyone anywhere who owns a Mac or (like me) an iPhone. (Indeed, when I do write about major news stories, people usually accuse me of jumping on a bandwagon and tell me to stick to complexity. I can’t win, no matter what I do!)

    FWIW, yes, Jobs was one of great American innovators and I was extremely sorry to hear about his passing. I was riveted by the NYT obituary, from which I learned many facts about him that I didn’t know before. And I’ll honor Jobs’s memory by upgrading to an iPhone 4S at the Apple Store near my house when it comes out a week or so from now.

    I probably also should’ve blogged about the Physics and Chemistry Nobel Prizes, both of which went to extremely deserving scientists this year. Fundamentally, though, this isn’t a current events blog; it’s a procrastination blog.

  42. Erik Wennstrom Says:

    Indeed, that’s enough lip service for me. I felt like the constants shouldn’t be too bad, but I didn’t really have any good reasons for that feeling. (I’ve come at the whole complexity thing from the logic side of Implicit Computational Complexity, so I sometimes don’t feel like I’ve had enough exposure to complexity to develop the right intuitions.)

  43. Vadim Says:

    Ha ha, Scott, you have the best hecklers.

  44. IThinkImClever Says:

    “Computer science is as much about computers as Astronomy is about telescopes.”
    – Edsger Dijkstra

    And if it weren’t, what would Scott have to say about Jobs? Also,

    “If you have n bits of axioms, you can never prove that a program is the smallest possible if it is more than n bits long.”
    – Gregory Chaitin

    So in the limit, talking to the iPhone 4S is just as useful as talking to another piece of plastic.

  45. Hopefully Anonymous Says:

    “If you write about the kind of complexity theory that involves acronyms like NP, BQP/qpoly, and r.s.r., people will think the issues must be difficult and arcane, even if they’re not and can be understood with very little effort. By contrast, if you write about the kind of complexity theory that can be illustrated using pictures of coffee cups, people will think the issues can be sorted out with 15 seconds of thought, and will happily propose ‘solutions’ that presuppose what needs to be explained, answer a different question, or fail in simple examples.”

    Ha, I epically resemble that remark.

  46. Hopefully Anonymous Says:

    @ Comment #40: Don’t be a douche.

    @ Comment #41: Classy response.

    “Fundamentally, though, this isn’t a current events blog; it’s a procrastination blog.”

    Great closer!

  47. John Baez Says:

    Hi, Scott!

    I also learned a valuable sociological lesson about the “two kinds of complexity theory”…

    Yeah: if you explain things clearly more people will think about them – and the results are not always pretty. :-)

    Still worth the risk, I’d say.

    I’ve been thinking about Kolmogorov complexity a bit these days. I was going to mention my paper with Mike Stay on algorithmic thermodynamics, but I see Jochen beat me to it. I still have an excuse for doing so, though: a little correction. Our paper does not apply Kolmogorov complexity to thermodynamics. Rather, we take the relation between Kolmogorov complexity and entropy and extend it so we can apply more concepts from thermodynamics to the theory of computation! I think there’s room for more work in this reverse direction – I hope people try it.

    I also wrote up a pop explanation of how Kritchman and Raz proved Gödel’s second incompleteness theorem using Kolmogorov complexity and the idea behind the ‘surprise examination’ paradox, here. And Bruce Smith and I asked a question which you or your readers might be able to answer.

  48. Scott Says:

    Hi John,

    Thanks for the comment!

    Regarding your and Bruce Smith’s wonderful question: welcome to the club; I’ve also been asking people for explicit numerical bounds on where the “onset of undecidability” occurs! In fact, half a year ago, I asked a closely-related question on MathOverflow. I didn’t get a definite answer, but I did learn that there’s approximately one logician who’s actively worked on this problem: Harvey Friedman.

    Using the techniques of today, it looks like the best one can do is essentially to construct a Turing machine that explicitly enumerates the theorems of PA or ZFC—in which case the numerical bounds would be absolutely terrible, since one would need to code up the axiom schema, the inference rules of first-order logic, etc. (Probably the situation is somewhat better in a programming language like Lisp or Python.) Finding a short program whose behavior is provably independent of PA or ZFC seems closely-related to the notorious problem of finding “natural” Gödel-undecidable arithmetical statements. Thus, while finding the smallest N such that the Nth Busy Beaver number is independent of ZFC might be dismissed as a parlor game, my own intuition—and Friedman seems to agree—is that this problem actually provides an incredibly useful and underappreciated yardstick. I doubt much progress can be made on it without some genuine breakthroughs in mathematical foundations!

    Moving on to your paper with Mike Stay. You know, every time I read anything technical you’ve written, my head spins from notions that look at first glance like “compile-time errors”: a finite field with one element. Programs that never halt as “singularities to be patched up.” The log of a program’s runtime as analogous to the energy of a gas container. Then I wish my brain would expand enough that these notions would compile.

    Since I already knew (and accepted) that algorithmic entropy was closely related to thermodynamic entropy, I find the new quantities that you and Stay define—algorithmic temperature, algorithmic heat, algorithmic pressure—deliciously tantalizing. But I kept asking myself what these quantities are good for—i.e., what questions in complexity or computability could they help address? Indeed, until I knew a use to which these quantities were going to be put, I don’t think I could even understand the more basic matter of why we should define them in one way rather than another way. (As you freely admit, right now the choices of log(runtime), program output as a positive integer, etc. seem totally arbitrary: why not the space usage or the runtime mod 2?)

    On the other hand, the metaphor you construct is so intriguing that, even if neither you nor I can think of any use for it at present, I can readily imagine someone, someday finding a use!

    On the third hand, I suspect that the biggest application of your metaphor might be one that you don’t even mention in the paper. Namely, it could provide a perfect pedagogical tool for explaining thermodynamic concepts to people with math/CS backgrounds! When I took chemistry and physics as an undergrad, I was driven up a wall by the constant demands to manipulate the partial derivatives of magic quantities called “entropy,” “temperature,” and “pressure” using rules that one memorized, without ever seeing what those quantities were in a mathematically self-contained way. (An unrealistic toy system would’ve been fine, as long as I was told everything about it! Best of all would’ve been a discrete toy system.)

    Today, if I ever feel a need to relearn basic thermodynamics, your paper is going to be the first thing I consult—even though it almost certainly wasn’t intended for that purpose!

  49. John Baez Says:

    I find the new quantities that you and Stay define—algorithmic temperature, algorithmic heat, algorithmic pressure—deliciously tantalizing. But I kept asking myself what these quantities are good for—i.e., what questions in complexity or computability could they help address?

    I feel exactly the same way! As you say, it may take a while for someone to figure this out. It may take someone who knows a lot about thermodynamics and computation, and thinks about both at once for a long time.

    As you freely admit, right now the choices of log(runtime), program output as a positive integer, etc. seem totally arbitrary: why not the space usage or the runtime mod 2?

    Well, a lot of thermodynamics is about maximizing entropy subject to constraints on quantities you care about: from here we inevitably get the partition function, the fact that each of our quantities has a “conjugate”, and all those confusing relationships involving partial derivatives of our quantities and their conjugates.

    You get to pick the quantities you care about. If you’re a physicist, you care about energy, so you’ll be forced into thinking about its conjugate: roughly, temperature. If you’re a physicist studying a box of gas, you care about its volume, so you’ll also be forced into thinking about its conjugate: roughly, pressure. And so on.

    But if you’re a computer scientist, you’re interested in other things. We picked some examples of things you might be interested in, to illustrate this philosophy. But it’s really up to you. If the runtime mod 2 is what turns you on (not likely), go ahead and use that! All the basic framework will still apply.

    Our plan had been to study more properties of the partition function Z(beta,gamma,delta) which we begin to discuss in Section 4.4. That’s the “obvious thing to do” from a thermodynamics perspective. The rate at which it diverges as beta, gamma and delta approach certain values should be interesting. It could provide a new way of thinking about the average behavior of computer programs.

    Unfortunately, the partition function seems highly dependent on the choice of universal Turing machine. We say a bit about this. The problems we point out arise from “goofy” universal Turing machines where infinitely many programs halt immediately, or all programs take a long time to run. So, I wonder if there’s an interesting class of “non-goofy” universal Turing machines.

    It seems we can’t say much about “the average behavior of computer programs” if we can’t define and exclude goofy universal Turing machines, which are like goofy programming languages that nobody would actually want to use.

  50. Jiav Says:

    Scott,

    1) regarding complexity:

    Would you mind to answer my question post #37, or maybe explain why it’s not an interesting question?

    2) regarding “Simpler” statements equivalent to Con(PA) or Con(ZFC), and inspired by Stefan Geschke’s answer on mathoverflow:

    What about trying all diophatine equations below some threshold? The idea would be that coding “Y” as an integer and “try all solutions to all diophantine equations using less than Y variable” can requiere far less space than “try all solution to the particular diophantine equation Z”.

    Pseudocode:
    a-select a set of X integer (say 10000)
    b-try it as a tentative solution to any diophantine equation including less than X variables
    c-count the number of equation solved so far
    d-halt if this number is below a certain integer Y, or else redo (a-d) using the next set of integer

    For a well chosen couple of integer X an Y, it will halt iff counterexamples to Con(PA) or Con(ZF) can be found. X is “easy” to find (write the most obvious equivalent diophantine equation and look its size), wheras Y is absolutly non trivial (but garanteed to exist) as it is the number of solvable diphantine equations (below X variables) not equivalent to Con(PA) or Con(ZF).

    Two interesting (?) observations however:
    -this garantees the existence of a TM much shorter than the “obvious” ones
    -one can design a series of TM equivalent to the consistency of any interesting set of axioms, and the size of these TM will increase at most as the log of the number of variable of the most obvious reduction to a diophantine equation.

    Does that partially answer your question, or you were thinking at even more drastic shortening?

  51. John Baez Says:

    Scott wrote:

    Using the techniques of today, it looks like the best one can do is essentially to construct a Turing machine that explicitly enumerates the theorems of PA or ZFC—in which case the numerical bounds would be absolutely terrible, since one would need to code up the axiom schema, the inference rules of first-order logic, etc. (Probably the situation is somewhat better in a programming language like Lisp or Python.)

    I’d still be interested in knowing those bounds. While they’re big, they’re not that big: there are lots of things in the universe that seem to (and probably do) have a higher Kolmogorov complexity than that.

    But yes, it would be even better if someone could significantly chop down these ‘naive’ bounds.

    … my own intuition—and Friedman seems to agree—is that this problem actually provides an incredibly useful and underappreciated yardstick.

    I tend to agree with you. But if you haven’t yet, it’s worth reading Panu Raatikainen’s opposing view in On interpreting Chaitin’s incompleteness theorem, Journal of Philosophical Logic 27 (1998). At the very least it calls for fine-tuning any claims one wants to make.

  52. Shubhendu Says:

    I only got to read this post recently and I must say I thoroughly enjoyed reading it. You mention a few applications. I would also like to add that some of the _best_ applications of Kolmogorov Complexity have been in Machine Learning (and that can get as applied as it gets if one wants to). One might say – Well that’s not really Kolmogorov Complexity. Indeed the MDL idea is slightly different. But it was inspired by the idea of Kolmogorov Complexity and the fact that you can use it to reason about what algorithm to choose is beautiful!
    Infact it isn’t a random event IMHO that the first paper in Machine Learning was also amongst the first in Kolmogorov Complexity (Solomonoff 1956).

  53. Hector Zenil Says:

    For those concerned about the choice of universal Turing machine (or programming language) for evaluations of Kolmogorov complexity: see the “invariance theorem” in any textbook, the theorem is considered to be the founding theorem of the theory given that otherwise as you suspect, Kolmogorov complexity wouldn’t make sense, but this was taken care of in the 60’s when the measure was first proposed by Kolmogorov, Chaitin, Solomonoff and Levin (4 of the greatest mathematicians and computer scientists that came up with the same measure independently).

    Important to notice is that from the mathematical point of view there is no discussion on whether Kolmogorov complexity is suitable or serious, IT IS the mathematical measure of randomness in mathematics, period. But this is not only because of decree. Just as it happened to the theory of computability in the 30’s when Turing, Post, Church, Kleene and others were trying to define the concept of algorithm and they all came up with equivalent theories, the development of Kolmogorov complexity underwent a similar process of convergence of definitions, where the above mentioned (notably Chaitin and Levin) together with Schnorr and Martin-Loef, proved that all these theories dealing with the characterization of (both finite and infinite) randomness are equivalent in a strong way (even if there are some subtle difference when jumping between finite strings and infinite sequences). All these approaches characterize the same randomness using: uncompressibility (Kolmogorov, Chaitin), unpredictability (Schnorr) and statistical tipicality (Martin-Loef). For more see: http://www.scholarpedia.org/article/Algorithmic_randomness

    Concerning the uncomputability of Kolmogorov complexity, it is more precise to call K a lower semicomputable function, which is very different to “simply” uncomputable. It means that one can calculate upper bounds as Scott has pointed out, but also that compressibility is a sufficient test for non-randomness. That is, even if you cannot prove that a program is the shortest possible generating a string x, if you do find a short program generating x then you can safely declare x non-random. Again, if you are worried by the choice of universal Turing machine or complete programming language choice, take into consideration that measures calculated with different universal Turing machines or programming languages will only differ by a “small” constant (the size of a translation program between the chosen machine and any other). This is the rationale of the so-called invariance theorem in the theory of Kolmogorov complexity.

    Now, it is true that the invariance theorem doesn’t tell how “small” is that constant or, otherwise said, how fast evaluations of K on different universal Turing machines or programming languages will converge to the same values. However, experimental work has been done on this respect with interesting results (that are able to classify and characterize space-time diagrams, images and even graphs and networks), strongly suggesting that the rate of convergence and constant involved is actually small and controllable. Here available references online:
    http://arxiv.org/abs/1306.0322
    http://arxiv.org/abs/1212.6745
    http://arxiv.org/abs/1211.1302
    http://arxiv.org/abs/1211.4891
    (forthcoming in the journal Computability)

    There is even an online Kolmogorov complexity calculator that provides upper bounds using these approaches:
    http://complexitycalculator.com/
    With explanations that we hope will help understand how to evaluate K and other complexity measures.

    References on Kolmogorov complexity and algorithmic randomness:
    – Calude’s “Information and Randomness” (Springer)
    – Li and Vitanyi’s “Kolmogorov complexity and its applications” (Springer) (already mentioned in 2 occasions by Scott himself)
    – Downey and Hirschfeldt’s Algorithmic Randomness and Complexity (Springer)
    – Nies’ “Computability and Randomness” (Oxford)

    I also dare to recommend my own edited volume:
    – Zenil, Randomness Through Computation: Some Answers, More Questions (World Scientific and Imperial College Press). (with contributions from most of the pioneers and current Kolmogorov complexity and randomness researchers explaining their work, challenges and future)

    Myself and some colleagues are also preparing a book to be published by Springer next year under the title “Methods and applications of Kolmogorov complexity” which shows the method we have been using by applying the algorithmic Coding theorem (a beautiful theorem that connects Kolmogorov complexity and frequency of production of a string from a random program) to evaluate K as an alternative to the lossless compression approach hitherto used (c.f. recommended papers above).

    Last, but not least, consider that Kolmogorov complexity (computable variations in the form of similarity distances and information metrics) have shown to be very successful at spam filtering, authorship detection, phylogenetic classification, among other applications.

    To answer last comment (#52), of course Solomonoff’s paper is on Kolmogorov complexity, MDL is clearly a byproduct of Kolmogorov compelxity too. Solomonoff together with Chaitin and Kolmogorov are fully credited for Kolmogorov(-Chaitin-Solomonoff) complexity! The 3 of them proved the invariance theorem, but Solomonoff was more interested in the problem of induction which led to the theory of algorithmic probability and optimal learning.

Leave a Reply