I’d like to get a better understanding of orders of magnitude for compute in deep learning.
A good starting point is a handful of rough scaling rules that describe how training cost grows with model size, data, and the hardware you have:
1. **Training compute scales roughly linearly with the number of parameters** (for a fixed dataset and training setup): a model with 10 million parameters will generally take on the order of 10 times as long to train as a model with 1 million parameters. A back-of-envelope sketch of this rule follows the list.
2. **Training time scales inversely with available computing power**: with double the effective compute (and a workload that parallelizes well), you can generally cut wall-clock training time roughly in half.
3. **Diminishing returns**: eventually, scaling stops paying off at the same rate. If your model is already large and you double its size, you might not see anything close to a 2x improvement in performance. These diminishing returns can be attributed to several factors, like overfitting on limited data or hardware limitations.
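To make the first two rules concrete, a commonly cited approximation from the scaling-law literature is that training cost is roughly 6 × N × D floating-point operations, where N is the parameter count and D is the number of training tokens. The sketch below combines that approximation with an assumed hardware throughput and utilization (both hypothetical numbers, not benchmarks) to estimate wall-clock training time; treat it as an order-of-magnitude calculator, nothing more.

```python
def estimate_training_time(
    n_params: float,          # model parameters, e.g. 1e9 for a 1B-parameter model
    n_tokens: float,          # training tokens processed (D)
    flops_per_sec: float,     # peak hardware throughput (assumed value, varies by device)
    utilization: float = 0.3, # fraction of peak FLOP/s actually achieved (assumed)
) -> float:
    """Rough wall-clock training time in hours, using the ~6*N*D FLOPs approximation."""
    total_flops = 6.0 * n_params * n_tokens               # cost is linear in N and in D
    effective_flops_per_sec = flops_per_sec * utilization
    return total_flops / effective_flops_per_sec / 3600.0

# Doubling parameters (or tokens) roughly doubles time; doubling hardware roughly halves it.
print(estimate_training_time(1e8, 2e9, 3e14))  # ~3.7 hours
print(estimate_training_time(1e9, 2e9, 3e14))  # ~37 hours (10x the parameters -> ~10x the time)
print(estimate_training_time(1e9, 2e9, 6e14))  # ~18.5 hours (2x the compute -> ~half the time)
```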
Deep learning compute isn't just about "bigger is better"; it's about balancing the demands of your problem (model size, dataset size, and the complexity of the task you're trying to solve) with the resources you have available.
Some great resources for a deeper understanding include 'Deep Learning Scaling is Predictable, Empirically' by Hestness et al. (Baidu Research), which provides empirical evidence for how and why these scaling rules hold, and the 'Deep Learning Book' by Ian Goodfellow, Yoshua Bengio, and Aaron Courville, which presents comprehensive coverage of the technical foundations of deep learning.
In practice, the compute a project needs is driven by three main factors:
1. **Data Size**: Deep learning often relies on large amounts of data, and larger datasets require more compute. For instance, training a model on 10,000 images will demand far less compute than training on 1 million images.
2. **Model Complexity**: More complex models with more layers and/or larger layer sizes demand more compute resources. For instance, small neural networks might be manageable on a personal computer, but a large transformer model like GPT-3 needs significant resources.
3. **Iterations**: Training for many epochs or iterations can require significant compute, and so can hyperparameter search: each additional configuration you try is effectively another full (or partial) training run. The sketch after this list puts these factors together.
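These three factors compose multiplicatively, which is why budgets grow by orders of magnitude so quickly. The toy sketch below (all numbers are made up for illustration) multiplies a per-example cost by dataset size, epochs, and hyperparameter trials to show how each factor shifts the exponent.

```python
import math

def training_flops(flops_per_example: float, n_examples: int,
                   epochs: int, hyperparam_trials: int = 1) -> float:
    """Total training FLOPs: per-example cost x dataset passes x number of runs."""
    return flops_per_example * n_examples * epochs * hyperparam_trials

# Hypothetical per-example cost for a small image classifier (assumed, not measured).
base = training_flops(1e9, n_examples=10_000, epochs=10)
big_data = training_flops(1e9, n_examples=1_000_000, epochs=10)  # 100x the data
with_sweep = training_flops(1e9, n_examples=1_000_000, epochs=10,
                            hyperparam_trials=20)                # 20 configurations

for name, flops in [("baseline", base), ("100x data", big_data), ("plus sweep", with_sweep)]:
    print(f"{name}: ~1e{math.log10(flops):.0f} FLOPs")
# baseline: ~1e14, 100x data: ~1e16, plus sweep: ~1e17 -- each factor shifts the exponent.
```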
To better understand the compute needs and constraints you'll be working with, I recommend profiling tools like TensorFlow's Profiler, which lets you visualise where the time and memory in your model actually go.
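As a starting point, here is a minimal sketch of capturing a profile during a Keras training run via the TensorBoard callback's profiling option; the model, synthetic data, and batch range are placeholder choices. The resulting trace can be inspected under TensorBoard's Profile tab.

```python
import tensorflow as tf

# Tiny placeholder model and synthetic data, just to have something to profile.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(32,)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10),
])
model.compile(optimizer="adam",
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))

x = tf.random.normal((1024, 32))
y = tf.random.uniform((1024,), maxval=10, dtype=tf.int32)

# profile_batch=(2, 6) captures a trace of batches 2 through 6; view it with:
#   tensorboard --logdir logs/profile
tb = tf.keras.callbacks.TensorBoard(log_dir="logs/profile", profile_batch=(2, 6))
model.fit(x, y, epochs=2, batch_size=64, callbacks=[tb])
```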
Remember, this is a vastly simplified explanation. The actual calculation can be more complicated depending on factors like the type of hardware you're using, other tasks the machine is performing, and optimizations you may be able to make on your model. So, always be ready to experiment, optimize and profile!