## Congratulations!

Congrats to Geoffrey Hinton, Yann LeCun, and Yoshua Bengio, who won the 2018 Turing Award for their work on deep learning (i.e., what used to be called neural nets). This might be the first Turing Award ever given for something where no one really understands why it works … and it’s years overdue.

Congrats to Avi Wigderson for winning the Knuth Prize. When I was asked to write a supporting nomination letter, my first suggestion was to submit a blank sheet of paper—since for anyone in theoretical computer science, there’s nothing that needs to be said about why Avi should win any awards we have. I hope Avi remains a guiding light of our community for many years to come.

And congrats to Mark Braverman for winning the Alan T. Waterman Award, one that I have some personal fondness for, along with materials scientist Jennifer Dionne. As Sasha Razborov once put it, after he (Sasha), I, and others recoiled from the task of proving the Linial-Nisan Conjecture, that polylog-wise independent distributions are indistinguishable from uniform by AC^{0} circuits, a “braver man” stepped in to do the job.

Comment #1 April 5th, 2019 at 1:58 pm

Why is the Turing Award so strangely out of phase? We’re more than three months deep in 2019 and they’re only now giving the 2018 prize.

Comment #2 April 5th, 2019 at 2:49 pm

anon #1: I was wondering that myself! Can anyone who knows how these things work explain for us?

Comment #3 April 5th, 2019 at 7:08 pm

Oh no, zero-based numbering again!

Comment #4 April 6th, 2019 at 2:20 am

scott:

this isn’t so strange. the Academy Awards are given out each year in Ja. or Feb. for the best movies that came out in the previous year. the only question is should the date be the date of the award ceremony or the date of what’s winning the award? it doesn’t matter as long as the date choice is consistent from year to year so we know when to start looking to see if our name is on the list of awardees LOL (like scientist who read articles by first looking at the title, then the abstract, then the references – to see if they’re mentioned, then the discussion and conclusion sections and finally (maybe) at the article itself.

Comment #5 April 6th, 2019 at 9:17 am

Dear Scott,

>> “no one really understands why [deep learning] works”

The fact that neural networks can compute any function (re. http://neuralnetworksanddeeplearning.com/chap4.html) would be enough, wouldn’t it? … Did you have some considerations beyond this fact in mind? some basic features that still go unexplained despite this fact?

BTW, congrats to the winners!

Best,

–Ajit

Comment #6 April 6th, 2019 at 2:16 pm

Ajit #5: No, that’s clearly and unequivocally

notenough. C programs can also represent any (finite) function, yet gradient descent on the space of C programs fails whereas gradient descent on the space of neural network weights often works. It’s not enough for a hypothesis class to be able to represent your data—roughly speaking, you need it to be feasible to optimize over the class, and in a way that typically ends up withsuccinctrepresentations that therefore generalize to new data.Comment #7 April 6th, 2019 at 8:38 pm

If we go with CS=AI , then I think the path of logical generalization is:

Probability Theory >>> Stochastic Models >>> Bayes Nets >>> Neural Nets (Deep Learning)

So if probability theory is at the foundation of deep learning, then, presumably they work for the same reason that Induction does – it’s the philosophical problem of Induction (the future resembles the past enough to make effective prediction feasible).

Comment #8 April 6th, 2019 at 10:29 pm

mjgeddes #7: I mean, obviously it has something to do with induction, but that still doesn’t explain the question at hand, why one representation class works and another one doesn’t.

Comment #9 April 7th, 2019 at 12:16 am

OK Scott,

Time to bring my super-human intuition to bear on the question 😉

If we generalize probability theory when we move from prediction problems to control problems, probabilistic problems transform into problems in information theory… and perhaps rate-distortion theory?

It seems that neural networks work because they’re compressing data- removing noise and crystallizing the relevant features. So data compression. And the notion of the information bottleneck.

https://en.wikipedia.org/wiki/Information_bottleneck_method

So Theory of Information Bottleneck ( Shwartz-Ziv and Tishby) seems intuitively along the right lines to me. There was an excellent Quanta magazine article on this:

https://www.quantamagazine.org/new-theory-cracks-open-the-black-box-of-deep-learning-20170921/

“The idea is that a network rids noisy input data of extraneous details as if by squeezing the information through a bottleneck, retaining only the features most relevant to general concepts.”

Comment #10 April 7th, 2019 at 1:23 am

Scott # 6:

1. Comparison to the space of C programs seems a bit overstretched here, because this space doesn’t seem to carry several features relevant to the gradient descent algorithm (or any other kind of a constrained optimization algorithm, e.g. PSO). The space of C programs seems to be devoid of such features as the ideas of variations between input instances, of cost functions, and of mappings between them. Hence, the idea of optimization is difficult to imagine.

Further, by definition, the space of all possible C programs loses the very idea of having the same structure internal to algorithmic processing (in case of ANN, it’s the graph structure) but being applicable to a wide sub-variety of one type of problem: the classification problem.

So, overall, the surprising fact is not that given any arbitrary function, there should exist at least one C program to compute it. The surprising fact is that despite the severity of the imposition of the graph structure with weights and biases for the connections, the restriction in terms of the uniformity in the size of the inputs, a seeming arbitrariness allowed in erecting the dividing lines for the output bins, and with very simple operations of just one-step nonlinear mappings and algebraic additions, the ANN should still come to exhibit

universality. The surprising part is that so rigidly specified a data structure should also turn out to be a universal “compute-r”.But then, I would like to point out that, the reasons why or how such a universality arises, is also by now fairly well understood.

2. In addition, also well understood are the following two features related to ANNs: (i) the ease with which certain classification tasks can be translated into the input-output structure as used by the ANNs, and (ii) the existence of a certain correspondence between (a) the idea of

levels of abstractionsin (the more automatic aspects of) cognitive processing (as it occurs in the actual nervous systems), and (b) the systematic differences in the processing and outputs produced at various hidden layers in (the mechanistic) deep learning. This part is no longer mysterious.3. It is also understood that not all classification tasks map well onto the ANNs paradigm. Only those problems in which

qualitativedifferences (i.e., the differences ofclasses) can be mapped on to continuous,quantitativechanges in the input instances, can be. The paradigm works because it assigns smaller changes in the outputs for those inputs that belong to the same class, and it still triggers bigger changes of classes when the inputs vary widely, falling into different classes. It can signal changes of classes and yet be tolerant of smaller differences in instances of the same class. The simultaneous existence of epistemological differentiation and integration is tied to the many:many mappings across the multiply connected nodes in between the consecutive layers. This part too is understood. Thus, how levels of abstractions arise when multiple layers are used, is understood.4. Before closing, let me quickly mention that despite the universality theorem, there also are situations where ANNs fail, and the reasons for such failures

alsoare pretty well understood, too.ANNs are famously unable to arrive at a tentative initial guess, then perform a second take in the light of the knowledge of what the initial guess possibly implies, and then perhaps even end up changing the answer even in a radical way, if the need be. ANNs can’t perform such “second thought”s. (Recurrent learning is at a relatively primitive stage of development.) For the ubiquitous FF networks, their first conclusion

istheir final conclusion—even if it comes packaged only in the probabilistic terms. The reasons for this limitation are well understood.Given the nonlinear relationships involved in ANNs, and inability to recover from an initial mistake, they also end up showing great sensitivity to what otherwise are small or trivial changes in the input instances. An elephant in a room throws the entire objects-recognition scheme haywire, and so does adding visually minor-looking fringes to an otherwise nondescript background. We do understand that there are such limitations too, even if we don’t understand all the limitations or all the pathalogies well enough.

5. So my basic point is that despite the aspects which we don’t understood, it still would border on exaggeration to suggest that we don’t understand why deep learning works.

Yes, we have to keep our eyes and ears open for applications and the cases in which deep learning can and does fail. Yes, the issue is interesting. Yes, further research is needed. Yes,

fundingfor further research is even more badly needed. … But all that still doesn’t mean thatno oneunderstands deep learning.Ido! [Amen!]Best,

–Ajit

[Yes, I will try to avoid very long comments in future…]

Comment #11 April 7th, 2019 at 10:05 am

Ajit #5 etc.

> “no one really understands why [deep learning] works”

There is a rapidly growing literature on this, but no light at the end of the tunnel yet.

To see what this refers to in practice one needs to look using key-words such as

‘overfitting’ and ‘overparametrization’, to see that older explanations based on capacity or Vapnik-Chervonenkis dimension do not pass the experimental tests.

The most cited reference for pointing out the problem is:

Understanding deep learning requires rethinking generalization

Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, Oriol Vinyals

https://arxiv.org/abs/1611.03530

A more recent lecture putting this better in a historical context is

http://www.mit.edu/~rakhlin/papers/myths.pdf

Comment #12 April 7th, 2019 at 5:32 pm

Ajit R. Jadhav #5: The same argument holds for monkeys at typewriters turning out Shakespeare, or anything else. This is usually not considered a good strategy.

Comment #13 April 7th, 2019 at 6:26 pm

Scott #0,

Reading Raklin’s presentation (from #11) make me realize that I might not even understand what an expert in CT would count as “an understanding of why deep learning works”. Would you mind to elaborate on your own take of what would count as a solution?

Comment #14 April 7th, 2019 at 11:35 pm

sf # 11:

Thanks.

0. I liked the 5-man article because I could at least partly understand it. I have no opinion on Prof. Sakhlin’s lecture because I don’t have any part of the context to even attempt understanding it at the level of details at which it analyses things. But I guess it’s some frontline research in CS, too. (I am a mechanical engineer-cum-physicist.)

1. From what I gather (and please leave reading this reply as soon as my verbiage becomes too thick—I won’t blame you), a central issue here is the same as encountered in most any data science task:

There is a burning need to find/discover/invent criteria which would quantitatively express the similarities and differences existing in the data itself, independent of this classification technique or that.

These criteria ideally should be “objective,” i.e. not dependent on the particulars of an algorithm or technique being used. In other words, we should have one or several different means for measuring the “distances” between functions.

2. Unless you have such criteria, there is no objective way to evaluate whether one classification technique works better than the others or not, or whether some other classification scheme at all works or not, and even, which classification scheme would better survive a wrong labeling of data. In the absence of objective criteria, any one could possibly come in at any time, refer to any kind of objectively un-characterized data, and use it to draw any conclusion he or she wishes to reach.

3. So, my question to the skeptic of deep learning is this: Why does DL work as well as it did on the Kaggle competition data? Or would you rather I overlooked that success as a mere accident, and begin advertising unsupervised learning to everyone?

My question to the proponent of deep learning is this: Why does adding the elephant to the room throw the objects recognition haywire? For what technical reasons?

My question to the entire community is even simpler, and the issue involved is what I touched upon, above. The question is this: What are the difference between the bitmap pictures of an equilateral triangle, a square, and a circle, in quantitative terms?

4. But it does not surprise me at all that stochastic gradient should also go on to fit a random labeling of the training data. Just why not?… especially if the data are complex enough? Realize, the issue has nothing to do with the technique used for finding the weights and biases in training; any algorithm other than stochastic gradient would do, too.

The meat of the matter is this: ANN is just a multi-constraint optimization algorithm. If you specify the constraints (the targets—the labels) wrongly, then even the wrongly labeled data, seen from a purely mathematical point of view, still is a labeled data. It still comes to together specify a mathematically valid set of constraints, and the network would have to try to reach that combination of weights and biases which best honors such (deliberately useless) data. This is not a limitation of ANN; it in a way is precisely its strength. ANN must work also on wrongly labeled data—provided that the data (e.g. images) carry sufficiently complex features within themselves to be able to support the wrong labeling.

[Parenthetically, to any mechanical engineer who is reading this comment: This issue is a generalization of the fact that you can always expect the FEM technique to solve any arbitrary (but valid) stress-field problem; that the technique is general enough to find values for the elements of the stiffness matrix for any set of BCs. What the FEM involves can, in a way, be seen a constrained optimization problem, too.]

5. Coming back to the issue of measuring circle vs. triangle vs. square: If you say that the objective measures characterizing their representative bitmaps are, say, 2.3, 3.7 and 5.0 respectively, and then if you wrongly label some data and still get an ANN to optimize itself to the collective constraints specified via these wrong labels, then the next issue I would look into would be this:

If the data-labels are not mangled, does technique A (say ANNs) objectively perform better than technique B (say clustering) or not? If yes, is it feasible to incorporate in the procedures at my data center a step to run objective checks on a small sample, so as to ensure that the input data were correctly labeled, and still continue using technique A or not? … I would proceed that way.

5. Coming back to the issue of objective measures for distances in functions and how epistemology can help understand the classification problem better, sometime ago, I wrote a document. Interested folks may find the link to the document at my blog here: https://ajitjadhav.wordpress.com/2018/09/25/some-running-thoughts-on-anns-and-ai-1/ . Please note, it’s just a preliminary version, and I am no longer working on it. However, if someone from a good university wishes to collaborate with me on such ideas, I am open.

[Sorry folks, for the long reply, once again! … Guess I will force myself to just a one-liner next time, even if it be only for asking for the email ID, so that the discussion could be taken offline…]

Best,

–Ajit

Comment #15 April 8th, 2019 at 12:36 am

“This might be the first Turing Award ever given for something where no one really understands why it works.”

Well, haven’t some of the awards (e.g., the cryptography ones) been given for results where the explanation of “why it works” depends on unresolved complexity class separation conjectures? This seems almost identical to the neural net case of having a solid empirical understanding built on a flimsy theoretical foundation.

Comment #16 April 8th, 2019 at 9:48 am

mjgeddes #9

An interesting overarching theory that is “better” IMHO, is free energy:

https://www.sciencedirect.com/science/article/pii/S0022249617300962

https://en.wikipedia.org/wiki/Free_energy_principle

Comment #17 April 8th, 2019 at 12:13 pm

Randall #16,

Everyone likes having a Theory Of Everything. But, if we were to adopt this one, aren’t we at risk of ending up like particle physicists emerging from some stringy dreams?

Comment #18 April 8th, 2019 at 1:19 pm

Wait, the comments are starting for everyone else at #0? They’re starting for me at #1.

Comment #19 April 8th, 2019 at 1:27 pm

Scott #18,

#0 referred to the main post. Complaining about 0-based numbering was a joke (“let’s pretend it explains that the prize is out of phase”). And a very good one, once explained.

Comment #20 April 8th, 2019 at 2:40 pm

Scott #18 or maybe #17?

There have been previous cases of the comment numbering I see being 1 off of what others were stating.

Reminds me of the old CS saying: “There’s only two hard things in computer science: Cache Invalidation, Naming things, and Off by 1 errors”.

Comment #21 April 8th, 2019 at 3:49 pm

I was just reading a paper from Max Tegmark about this:

“Why does deep and cheap learning work so well?”

https://arxiv.org/abs/1608.08225?

Comment #22 April 8th, 2019 at 3:57 pm

Maybe the reason it works so well is really just based on particularities of current computng hardware (compared to other methods).

Deep learning is extremely well suited to take advantage of the massive parallelism of GPUs (a tech itself driven initially by computer graphics rendering and bitcoin).

Without the massive recent jumps in GPU power, deep learning just wasn’t practical at all.

Comment #23 April 9th, 2019 at 10:03 am

@ Antony #17

Good point, isn’t all of physics based on illusion? For instance, we are told (by physicists) that the physical world is made of atoms , and the atoms are made of smaller particles called quarks , leptons, ect., ect. Then we are told that these particles are actually strings (string theory) but what are the strings made out of ? Are they “solid string “? Are quarks “solid quark”, not made of anything smaller? Where does this end, how can you have “something” that is not made of something smaller, or you can ask the $64K question “what does it mean for something to be solid” ?

Comment #24 April 10th, 2019 at 8:44 am

Breaking news! First image of a black hole released!

Scott Aaronson comments

“This might be the first picture ever of something no one really understands how it works … and it’s 53 millions years overdue.”

Comment #25 April 11th, 2019 at 12:47 pm

Speaking of old good black holes, does anyone understand the processing of the 4 pictures recently taken?

https://youtu.be/P7n2rYt9wfU

Comment #26 April 15th, 2019 at 7:05 pm

“No, that’s clearly and unequivocally not enough. C programs can also represent any (finite) function, yet gradient descent on the space of C programs fails whereas gradient descent on the space of neural network weights often works. It’s not enough for a hypothesis class to be able to represent your data—roughly speaking, you need it to be feasible to optimize over the class, and in a way that typically ends up with succinct representations that therefore generalize to new data.”

For why deep functions can be more efficient than shallow functions, I feel like the theory is already pretty reasonable:

https://www.nada.kth.se/~johanh/thesis.pdf

I suppose an open question might be to concretely explain why deep neural networks can generalize well despite having enough capacity to memorize the training data. Something like rademacher complexity doesn’t seem like it would work well enough because generally models can easily memorize the training set.

Regarding why gradient descent works, one thing that I think is interesting and perhaps overlooked is that the proof for gradient descent converging for convex functions does not actually use the full property that the entire function is convex – rather the function just needs to be convex along the actual optimization trajectory. I wouldn’t be surprised if a property like this turned out to hold. There’s also some compelling evidence that when you average the parameters along a single training trajectory you still get a pretty reasonable loss.

Comment #27 April 30th, 2019 at 4:26 pm

Probably nobody will read this tag-end comment, but for posterity…the late Hal White, an econometrician, proved back in the 1990s that if you assume a stationary distribution then an ANN is a consistent estimator for the property being evaluated (i.e. it converges to the true value as the amount of data goes to infinity). Now that doesn’t explain the finite-sample properties, but it does place ANNs firmly in the framework of non-linear multiple regression models. There’s no magic, just statistical estimation.

https://www.sciencedirect.com/science/article/pii/0893608090900045

Comment #28 September 6th, 2019 at 11:00 am

Regarding the “we don’t understand why deep learning works”. I belive this is an overstatement at this point. I would say we understand it at a level similar to how we understand a lot of applied physics, like in materials science.

We now have papers offering a general theoretical grounding for why deep neural networks generalize, https://arxiv.org/abs/1805.08522

and why modern over-parametrized nets can be succesfully trained with gradient descent: https://arxiv.org/abs/1806.07572 , https://arxiv.org/abs/1902.06720 , https://arxiv.org/pdf/1811.03804.pdf , https://arxiv.org/pdf/1811.03962.pdf . Some work uses insights from these to make some rigorous theorems on generalization too ( https://arxiv.org/pdf/1811.04918.pdf ), but the overall picture here is that we have a good coarse-grained understanding of optimization and generalization in the basic deep learning models (a lot of it, interestingly, relies on asymptotic limits like large width, reminiscent of how in physics, systems become tractable at thermodynamic limits).

There are many many details and things which are still mysterious and need to be worked out. Specially given the biology-like diversity of architectures people are producing every day. But we now have increasingly solid theoretical foundations, I would say. These two is why I liken it to materials science, where we understand the basic physics, and basic models sytems, but there is a large research frontier understanding the large diversity of materials people come up with, and the interesting effects they exhibit.

Both are, I’d argue, very far from being “alchemy” today.