We now have papers offering a general theoretical grounding for why deep neural networks generalize ( https://arxiv.org/abs/1805.08522 ),

and why modern over-parametrized nets can be successfully trained with gradient descent: https://arxiv.org/abs/1806.07572 , https://arxiv.org/abs/1902.06720 , https://arxiv.org/pdf/1811.03804.pdf , https://arxiv.org/pdf/1811.03962.pdf . Some work uses insights from these to prove rigorous theorems about generalization as well ( https://arxiv.org/pdf/1811.04918.pdf ), but the overall picture is that we have a good coarse-grained understanding of optimization and generalization in the basic deep learning models (much of it, interestingly, relies on asymptotic limits like large width, reminiscent of how physical systems become tractable in the thermodynamic limit).
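The flavor of those convergence results can be seen in a toy setting of my own (not taken from the cited papers): for an over-parametrized least-squares problem, with more parameters than data points, plain gradient descent drives the training loss all the way to zero.

```python
# Toy sketch (assumptions mine, not from the cited papers): gradient descent
# on an over-parametrized least-squares problem converges to zero training
# loss, a linear analogue of the convergence results for wide networks.
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 200                        # 20 samples, 200 parameters
X = rng.standard_normal((n, d)) / np.sqrt(d)
y = rng.standard_normal(n)            # arbitrary targets

w = np.zeros(d)
lr = 1.0
for _ in range(2000):
    resid = X @ w - y
    w -= lr * X.T @ resid / n         # gradient of (1/2n)||Xw - y||^2

final_loss = 0.5 * np.mean((X @ w - y) ** 2)
print(final_loss)                     # essentially zero
```

Because d > n, the data matrix has full row rank almost surely, so an exact interpolating solution exists and gradient descent finds it.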

There are many, many details that are still mysterious and need to be worked out, especially given the biology-like diversity of architectures people are producing every day. But I would say we now have increasingly solid theoretical foundations. This is why I liken the field to materials science, where we understand the basic physics and the basic model systems, but there is a large research frontier in understanding the great diversity of materials people come up with, and the interesting effects they exhibit.

Both are, I’d argue, very far from being “alchemy” today.

https://www.sciencedirect.com/science/article/pii/0893608090900045

For why deep functions can be more efficient than shallow ones, I feel the theory is already pretty reasonable:

https://www.nada.kth.se/~johanh/thesis.pdf
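A toy illustration of the depth/size trade-off (my own example, in the spirit of the parity lower bounds studied in that thesis): parity of n bits is computed by a deep tree of n-1 XOR gates, while a flat depth-2 (DNF) formula needs one AND term per odd-weight input, i.e. 2^(n-1) terms.

```python
# Sketch (assumptions mine): count the size of a deep vs. a depth-2
# representation of the n-bit parity function.
from functools import reduce
from itertools import product

n = 10
deep_gate_count = n - 1                       # XOR tree: linear in n

# Depth-2 DNF: one AND term for each input with odd parity.
dnf_term_count = sum(
    1 for bits in product([0, 1], repeat=n) if sum(bits) % 2 == 1
)

def parity_deep(bits):
    return reduce(lambda a, b: a ^ b, bits)   # the deep circuit

print(deep_gate_count, dnf_term_count)        # 9 vs 512
```

The gap is linear versus exponential in n, which is the qualitative shape of the separation results for constant-depth circuits.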

I suppose an open question might be to explain concretely why deep neural networks generalize well despite having enough capacity to memorize the training data. Something like Rademacher complexity doesn’t seem like it would work well enough, because these models can generally memorize the training set with ease.

Regarding why gradient descent works, one thing that I find interesting and perhaps overlooked is that the proof that gradient descent converges for convex functions does not actually use convexity of the entire function: the function only needs to be convex along the actual optimization trajectory. I wouldn’t be surprised if a property like this turned out to hold. There is also compelling evidence that averaging the parameters along a single training trajectory still gives a pretty reasonable loss.
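The averaging observation can be sketched in a few lines (my own toy setup, in the spirit of Polyak–Ruppert averaging): with noisy gradients and a constant step size, the final iterate rattles around in a noise ball, but the average of the parameters along the same trajectory still lands at a reasonable loss.

```python
# Sketch (assumptions mine): averaging iterates along a noisy gradient
# descent trajectory on a simple quadratic still yields a low loss.
import numpy as np

rng = np.random.default_rng(2)
d = 10

def loss(w):
    return 0.5 * np.sum(w ** 2)          # quadratic bowl, optimum at 0

w = np.full(d, 5.0)
init_loss = loss(w)
iterates = []
for _ in range(1000):
    grad = w + rng.standard_normal(d)    # gradient corrupted by noise
    w -= 0.1 * grad
    iterates.append(w.copy())

w_avg = np.mean(iterates, axis=0)        # average along the trajectory
print(loss(w), loss(w_avg))              # both far below the initial loss
```

The final iterate stays at a loss set by the noise level, while the trajectory average suppresses much of that noise.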

Scott Aaronson comments:

“This might be the first picture ever of something no one really understands how it works … and it’s 53 million years overdue.”

Good point, isn’t all of physics based on illusion? For instance, we are told (by physicists) that the physical world is made of atoms, and that atoms are made of smaller particles called quarks, leptons, etc., etc. Then we are told that these particles are actually strings (string theory), but what are the strings made of? Are they “solid string”? Are quarks “solid quark,” not made of anything smaller? Where does this end? How can you have “something” that is not made of something smaller? Or you can ask the $64K question: what does it mean for something to be “solid”?

Deep learning is extremely well suited to take advantage of the massive parallelism of GPUs (a technology initially driven by computer graphics rendering, and later by Bitcoin mining).

Without the massive recent jumps in GPU power, deep learning just wasn’t practical at all.

“Why does deep and cheap learning work so well?”

https://arxiv.org/abs/1608.08225

There have been previous cases where the comment numbering I see was off by one from what others were stating.

Reminds me of the old CS saying: “There are only two hard things in computer science: cache invalidation, naming things, and off-by-one errors.”

#0 referred to the main post. Complaining about 0-based numbering was a joke (“let’s pretend it explains that the prize is out of phase”), and a very good one, once explained.
