We will get to the point where you can quickly bootstrap, i.e. an LLM can train a better LLM in a loop; leave it running and it can really learn. Like learn learn.
"Train yourself to solve this problem see OBJECTIVE.md"
> Data efficiency matters because compute grows much faster than data
[2] (referencing a paper from 2022)
I'm not convinced this is particularly true in today's world: if you have more compute, you can simply generate more, and higher-quality, artificial data. That's what all the labs have been doing since at least 2023.
Also, the post uses Chinchilla-optimal training as a comparison baseline, but everyone has moved far beyond Chinchilla scaling: small models are routinely trained on 10-400 times more data (1-40T tokens) than the Chinchilla-optimal number, so the entire industry went in the complete opposite direction of what they are proposing.
That doesn't mean the techniques presented here are useless or anything (I'm not qualified to judge) but you should take the introduction with a grain of salt.
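The Chinchilla comparison above can be sanity-checked with quick arithmetic. A minimal sketch, using the common rule of thumb of roughly 20 training tokens per parameter (the exact coefficient and the example figures below are illustrative assumptions, not numbers from the post):

```python
def chinchilla_optimal_tokens(params: float) -> float:
    """Rough Chinchilla-optimal token count: ~20 tokens per parameter."""
    return 20 * params

# Illustrative example: a 1B-parameter model.
params = 1e9
optimal = chinchilla_optimal_tokens(params)  # 2e10, i.e. 20B tokens

# If such a model were instead trained on 8T tokens (a made-up but
# plausible modern budget), the overtraining ratio would be:
actual = 8e12
ratio = actual / optimal  # 400x the Chinchilla-optimal budget
print(f"optimal: {optimal:.0e} tokens, overtrain ratio: {ratio:.0f}x")
```

This is roughly where the "10-400 times" figure in the comment comes from: modern small models are trained far past the compute-optimal token count because inference cost, not training cost, dominates.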
You seem to be making two points:
- synthetic data is a valuable direction to pursue when you have compute
- Chinchilla scaling laws have some flaws for small models
Both of these are side points to the core purpose of the Slowrun.
The main point is that the 100M-token budget we train on pushes people to come up with novel ideas to improve pretraining, beyond facile synthetic data generation. I think we should continue to push on synthetic data, but why not come up with some new ideas too? You cannot use synthetic data for everything (see sdpmas's point).
> you can simply generate more, and higher quality, artificial data
This is simply not true, and it's very clear if you look at continual learning, robotics, biology, etc. Each of these has enough economic incentive to spend 1000x the compute if that led to much better results, but we just don't know how to do that.
Good point on Chinchilla, but our models are still absurdly large no matter what standard you compare them to.
> This is simply not true, and it's very clear if you look at continual learning, robotics, biology, etc. Each of these has enough economic incentive to spend 1000x the compute if that led to much better results, but we just don't know how to do that
I'm talking about LLMs in particular (and so is the post itself), and this is indeed true for LLMs.