RE: I’d like to get a better understanding of orders of magnitude for compute in deep learning.
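A good starting point for orders of magnitude in deep learning compute is a handful of rough scaling rules: training cost grows roughly linearly with model size and shrinks roughly linearly with available hardware, while model quality improves much more slowly than either.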
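1. **Training time scales roughly linearly with the number of parameters**: With the dataset, number of epochs, and hardware held fixed, the compute needed per training example grows in proportion to the parameter count for dense models. A model with 10 million parameters will therefore generally take about 10 times as long to train as a model with 1 million parameters.
2. **Training time decreases roughly linearly with the amount of computing power**: With double the effective throughput, you can generally cut training time in half. In practice, scaling across many devices falls somewhat short of linear because of communication and synchronization overhead.
3. **Diminishing returns**: Model quality does not follow the same linear rules. If your model is already large and you double its size, you will not see a 2x improvement in performance. Empirically, loss tends to fall as a power law in model size, dataset size, and compute, so each doubling buys a smaller absolute gain. Contributing factors include overfitting when the dataset does not grow with the model, and hardware limitations such as memory and interconnect bandwidth.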
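To make the three rules above concrete, here is a minimal Python sketch. It assumes the widely used rule of thumb that training a dense model costs about 6 FLOPs per parameter per training token; the function names, the 40% utilization figure, the 100 TFLOP/s accelerator, and the power-law constants are all hypothetical values chosen for illustration, not measurements.

```python
def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate total training FLOPs using the 6 * N * D rule of thumb.

    Assumption: a dense model, roughly 2 FLOPs per parameter per token for
    the forward pass and about twice that for the backward pass.
    """
    return 6.0 * n_params * n_tokens


def training_seconds(n_params: float, n_tokens: float,
                     peak_flops_per_sec: float,
                     utilization: float = 0.4) -> float:
    """Estimated wall-clock seconds on hardware with the given peak throughput.

    `utilization` is the fraction of peak throughput actually achieved;
    0.4 is an illustrative guess, not a measured value.
    """
    return training_flops(n_params, n_tokens) / (peak_flops_per_sec * utilization)


def power_law_loss(n_params: float, scale: float = 10.0,
                   exponent: float = 0.1) -> float:
    """Illustrative loss curve L(N) = scale * N ** -exponent.

    Both constants are made up; real exponents are fit to experiments.
    """
    return scale * n_params ** -exponent


if __name__ == "__main__":
    tokens = 1e9   # hypothetical dataset size in tokens
    peak = 1e14    # hypothetical accelerator: 100 TFLOP/s peak

    # Rule 1: 10x the parameters gives about 10x the training time.
    small = training_seconds(1e6, tokens, peak)
    large = training_seconds(1e7, tokens, peak)
    print(f"1M params: {small:.0f} s, 10M params: {large:.0f} s")

    # Rule 2: doubling throughput halves the estimate.
    print(f"10M params on 2x hardware: {training_seconds(1e7, tokens, 2 * peak):.0f} s")

    # Rule 3: doubling model size shrinks the loss by far less than 2x.
    ratio = power_law_loss(2e6) / power_law_loss(1e6)
    print(f"loss after doubling size: {ratio:.3f} of the original")
```

The estimate is only good to within an order of magnitude or so, which is usually what you want from this kind of back-of-the-envelope calculation.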
Deep learning compute isn't just about "bigger is better" - it's about balancing the requirements of your model (its size, the size of your dataset, the complexity of the problem you're trying to solve) with the resources you have available.
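One way to think about that balance is to start from the resources and work backwards to the model size. The sketch below again assumes the rough rule that training a dense model costs about 6 FLOPs per parameter per token; `max_params_for_budget`, the utilization value, and the example numbers are hypothetical and only meant to show the arithmetic.

```python
def max_params_for_budget(device_hours: float, n_tokens: float,
                          peak_flops_per_sec: float,
                          utilization: float = 0.4) -> float:
    """Largest dense model (in parameters) trainable within a compute budget.

    Assumptions: total training FLOPs are about 6 * params * tokens, and the
    hardware sustains `utilization` of its peak throughput (0.4 is a guess).
    """
    budget_flops = device_hours * 3600.0 * peak_flops_per_sec * utilization
    return budget_flops / (6.0 * n_tokens)


if __name__ == "__main__":
    # Hypothetical budget: 100 device-hours on a 100 TFLOP/s accelerator,
    # with a dataset of 10 billion tokens.
    n = max_params_for_budget(device_hours=100, n_tokens=1e10,
                              peak_flops_per_sec=1e14)
    print(f"largest model within budget: about {n:.2e} parameters")
```

If the answer is much smaller than the model you had in mind, the options are the obvious ones: more hardware, less data, or a smaller model.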
Some great resources for a deeper understanding include 'Deep Learning Scaling is Predictable, Empirically' by Joel Hestness et al. (Baidu Research, 2017), which shows empirically that generalization error falls as a power law as training sets grow, and the 'Deep Learning' book by Ian Goodfellow, Yoshua Bengio, and Aaron Courville, which presents comprehensive coverage of the technical aspects of deep learning.