There Will Be a Scientific Theory of Deep Learning

(arxiv.org)

90 points | by jamie-simon 5 hours ago

10 comments

RyanShook 1 hour ago
Here's where I'm missing understanding: for decades the idea of neural networks had existed with minimal attention. Then in 2017 Attention Is All You Need gets released and since then there is an exponential explosion in deep learning. I understand that deep learning is accelerated by GPUs but the concept of a transformer could have been used on much slower hardware much earlier.
[-]
- pash 1 hour ago
  The inflection point was 2012, when AlexNet [0], a deep convolutional neural net, achieved a step-change improvement in the ImageNet classification competition.
  After seeing AlexNet’s results, all of the major ML imaging labs switched to deep CNNs, and other approaches almost completely disappeared from SOTA imaging competitions. Over the next few years, deep neural networks took over in other ML domains as well.
  The conventional wisdom is that it was the combination of (1) exponentially more compute than in earlier eras with (2) exponentially larger, high-quality datasets (e.g., the curated and hand-labeled ImageNet set) that finally allowed deep neural networks to shine.
  The development of “attention” was particularly valuable in learning complex relationships among somewhat freely ordered sequential data like text, but I think most ML people now think of neural-network architectures as being, essentially, choices of tradeoffs that facilitate learning in one context or another when data and compute are in short supply, but not as being fundamental to learning. The “bitter lesson” [1] is that more compute and more data eventually beats better models that don’t scale.
  Consider this: humans have on the order of 10^11 neurons in their body, dogs have 10^9, and mice have 10^7. What jumps out at me about those numbers is that they’re all big. Even a mouse needs hundreds of millions of neurons to do what a mouse does.
  Intelligence, even of a limited sort, seems to emerge only after crossing a high threshold of compute capacity. Probably this has to do with the need for a lot of parameters to deal with the intrinsic complexity of a complex learning environment. (Mice and men both exist in the same physical reality.)
  On the other hand, we know many simple techniques with low parameter counts that work well (or are even proved to be optimal) on simple or stylized problems. “Learning” and “intelligence”, in the way we use the words, tends to imply a complex environment, and complexity by its nature requires a large number of parameters to model.
  0. https://en.wikipedia.org/wiki/AlexNet
  1. https://en.wikipedia.org/wiki/Bitter_lesson
  [-]
  - getnormality 15 minutes ago
    > I think most ML people now think of neural-network architectures as being, essentially, choices of tradeoffs that facilitate learning in one context or another when data and compute are in short supply, but not as being fundamental to learning.
    Is this a practical viewpoint? Can you remove any of the specific architectural tricks used in Transformers and expect them to work about equally well?
    [-]
    - etiam 0 minutes ago
      Did you see this one?
      https://news.ycombinator.com/item?id=41732853
    - musebox35 6 minutes ago
      I think this question is one of the more concrete and practical ways to attack the problem of understanding transformers. Empirically the current architecture is the best to converge training by gradient descent dynamics. Potentially, a different form might be possible and even beneficial once the core learning task is completed. Also the requirements of iterated and continuous learning might lead to a completely different approach.
  - musebox35 17 minutes ago
    Thanks for posting a through and accurate summary of the historical picture. I think it is important to know the past trajectory to extrapolate to the future correctly.
    For a bit more context: Before 2012 most approaches were based on hand crafted features + SVMs that achieved state of the art performance on academic competitions such as Pascal VOC and neural nets were not competitive on the surface. Around 2010 Fei Fei Li of Stanford University collected a comparatively large dataset and launched the ImageNet competition. AlexNet cut the error rate by half in 2012 leading to major labs to switch to deeper neural nets. The success seems to be a combination of large enough dataset + GPUs to make training time reasonable. The architecture is a scaled version of ConvNets of Yan Lecun tying to the bitter lesson that scaling is more important than complexity.
  - coppsilgold 42 minutes ago
    Comparing Deep Learning with neuroscience may turn out to be erroneous. They may be orthogonal.
    The brain likely has more in common with Reservoir Computing (sans the actual learning algorithm) than Deep Learning.
    Deep Learning relies on end to end loss optimization, something which is much more powerful than anything the brain can be doing. But the end-to-end limitation is restricting, credit assignment is a big problem.
    Consider how crazy the generative diffusion models are, we generate the output in its entirety with a fixed number of steps - the complexity of the output is irrelevant. If only we could train a model to just use Photoshop directly, but we can't.
    Interestingly, there are some attempts at a middle ground where a variable number of continuous tokens describe an image: <https://visual-gen.github.io/semanticist/>
    [-]
    - jvanderbot 19 minutes ago
      If you think a 2 year old is doing deep learning, you're probably wrong. But if you think natural selection was providing end to end loss optimization, you might be closer to right. An _awful lot_ of our brain structure and connectivity is born, vs learned, and that goes for Mice and Men.
  - tbrownaw 34 minutes ago
    > The conventional wisdom is that it was the combination of (1) exponentially more compute than in earlier eras with (2) exponentially larger, high-quality datasets (e.g., the curated and hand-labeled ImageNet set) that finally allowed deep neural networks to shine.
    I'd thought it was some issue with training where older math didn't play nice with having too many layers.
    [-]
    - etiam 5 minutes ago
      Sigmoid-type activation functions were popular, probably for the bounded activity and some measure of analogy to biological neuron responses. They work, but get problematic scaling of gradient feedback outside their most dynamic span.
      My understanding of the development is that persistent layer-wise pretraining with RBM or autoencoder created an initiation state where the optimization could cope even for more layers, and then when it was proven that it could work, analysis of why led to some changes such as new initiation heuristics, rectified linear activation, eventually normalizations ... so that the pretraining was usually not needed any more.
      One finding was that the supervised training with the old arrangement often does work on its own, if you let it run much longer than people reasonably could afford to wait around for just on speculation contrary to observations in CPU computations in the 80s--00s. It has to work its way to a reasonably optimizable state using a chain of poorly scaled gradients first though.
- cgearhart 1 hour ago
  A much earlier major win for deep learning was AlexNet for image recognition in 2012. It dominated the competition and within a couple years it was effectively the only way to do image tasks. I think it was Jeremy Howard who wrote a paper around 2017 wondering when we’d get a transfer learning approach that worked as well for NLP as convnets did for images. The attention paper that year didn’t immediately dominate. The hardware wasn’t good enough and there wasn’t consensus on belief that scale would solve everything. It took like five more years before GPT3 took off and started this current wave.
  I also think you might be discounting exactly how much compute is used to train these monsters. A single 1ghz processor would take about 100,000,000 years to train something in this class. Even with on the order of 25k GPUs training GPT3 size models takes a couple months. The anemic RAM on GPUs a decade ago (I think we had k80 GPUs with 12GB vs 100’s of GBs on H100/H200 today) and it was actually completely impossible to train a large transformer model prior to the early 2020s.
  I’m even reminded how much gamers complained in the late 2010s about GPU prices skyrocketing because of ML use.
- porcoda 1 hour ago
  As others pointed out, the explosion of interest started with the deep convolutional networks that were applied in image problems. What I always thought was interesting was that prior to that, NNs were largely dismissed as interesting. When I took a course on them around the year 2000 that was the attitude most people took. It seems like what it took to spark renewed interest was ImageNet and seeing what you get when you have a ton of training data to throw at the problem and fast processors to help. After that the ball kept rolling with the subsequent developments around specific network architectures. In the broader community AlexNet is viewed as the big inflection point, but in the academic community you saw interest simmering a couple years earlier - I began to see more talks at workshops about NNs that weren’t being dismissed anymore, probably starting around 2008/09.
  [-]
  - srean 35 minutes ago
    > NNs were largely dismissed
    I agree with your larger point but dismissed is rather too strong. They were considered fiddly to train, prone to local minima, long training time, no clear guidelines about what the number of hidden layers and number of nodes ought to be. But for homework (toy) exercises they were still ok.
    In comparison, kernel methods gave a better experience over all for large but not super large data sets. Most models had easily obtainable global minimum. Fewer moving parts and very good performance.
    It turns out, however, that if you have several orders of magnitude more data, the usual kernels are too simple -- (i) they cannot take advantage of more data after a point and start twiddling the 10th place of decimal of some parameters and (ii) are expensive to train for very large data sets. So bit of a double whammy. Well, there was a third, no hardware acceleration that can compare with GPUs.
    Kernels may make a comeback though, you never know. We need to find a way to compose kernels in a user friendly way to increase their modeling capacity. We had a few ways of doing just that but they weren't great. We need a breakthrough to scale them to GPT sized data sets.
    In a way DNNs are "design your own kernels using data" whereas kernels came in any color you liked provided it was black (yes there were many types, but it was still a fairly limited catalogue. The killer was that there was no good way of composing them to increase modeling capacity that yielded efficiently trainable kernel machines)
- whateverboat 1 hour ago
  The same thing happened with matrices. We had matrices for 400 years, but the field of linear algebra and especially numerical linear algebra exploded only with advent of computers.
  In olden days, the correct way to solve a linear system of equations was to use theory of minors. With advent of computers, you suddenly had a huge theory of gaussian elimination, or Krylov spaces and what not.
- slashdave 26 minutes ago
  Deep-learning hinges on highly redundant solution space (highly redundant weights), along with normalized weights (optimization methodology is commoditized). The original neural network work had no such concepts.
- embedding-shape 1 hour ago
  > I understand that deep learning is accelerated by GPUs but the concept of a transformer could have been used on much slower hardware much earlier
  But they don't give the same results at those smaller scales. People imagined, but no one could have put into practice because the hardware wasn't there yet. Simplified, LLMs is basically Transformers with the additional idea of "and a shitton of data to learn from", and for making training feasible with that amount of data, you do need some capable hardware.
- BigTTYGothGF 1 hour ago
  The modern neural net revival got kicked off long before 2017.
  [-]
  - noosphr 1 hour ago
    Alex net in 2012 is only 5 years earlier.
- quicklywilliam 1 hour ago
  Agreed, there is probably a theoretical world where we got enough money/compute together and had this explosion happen earlier.
  Or perhaps a world where it happened later. I think a big part of what enabled the AI boom was the concentration of money and compute around the crypto boom.
- teekert 1 hour ago
  If you are in the radiology field it started “exploding” much earlier, with CNNs.
- CamperBob2 1 hour ago
  the concept of a transformer could have been used on much slower hardware much earlier.
  It could have been done in the early 1970s -- see "Paper tape is all you need" at https://github.com/dbrll/ATTN-11 and the various C-64 projects that have been posted on HN -- but the problem was that Marvin Minsky "proved" that there was no way a perceptron-based network could do anything interesting. Funding dried up in a hurry after that.
- wslh 1 hour ago
  Don't understimate the massive data you need to make those networks tick. Also, impracticable in slow training algorithms, beyond if they were in GPUs or CPUs.
le-mark 49 minutes ago
> We argue complexity conceals underlying regularity, and that deep learning will indeed admit a scientific theory
That would be amazing, but personally I’m skeptical.
sweezyjeezy 1 hour ago
Deep learning works at a very high level because 'it can keep learning from more data' better than any other approaches. But without the 'stupid amount of data' that is available now, the architecture would be kind of irrelevant. Unless you are going some way to explain both sides of the model-data equation I don't feel you have a solid basis to build a scientific theory, e.g. 'why reasoning models can reason'. The model is the product of both the architecture and training data.
My fear is that this is as hopeless right now as explaining why humans or other animals can learn certain things from their huge amount of input data. We'll gain better empirical understanding, but it won't ever be fundamental computer science again, because the giga-datasets are the fundamental complexity not the architecture.
stared 18 minutes ago
Well, "There Will Be a Scientific Theory of Deep Learning" looks like flag planting - an academic variant of "I told you so!", but one that is a citation magnet.
UltraSane 1 hour ago
I think we need the equivalent of general relativity for latent spaces.
adzm 3 hours ago
I'm only partially through this paper, but it's written in a very engaging and thoughtful manner.
There is so much to digest here but it's fascinating seeing it all put together!
4b11b4 2 hours ago
wow.. this would be cool. Instead of just.. guessing "shapes"
[-]
- NitpickLawyer 1 hour ago
  tbf, we've learned (ha!) more from smashing teeny tiny particles and "looking" at what comes out than from say 40 years of string theory. Sometimes doing stuff works, and the theory (hopefully) follows.
amelius 1 hour ago
"A New Kind of Science" ...