Transformer FLOPs


https://www.adamcasson.com/posts/transformer-flops

Counting the number of floating-point operations (FLOPs) in Transformers is a useful way to estimate compute requirements and measure efficiency. As training runs get larger and larger (and thus more expensive), it becomes more important to understand how many FLOPs we need to perform and how well we utilize our hardware.

Counting FLOPs in Transformers

One commonly used method for counting FLOPs comes from the OpenAI scaling law paper, which uses

$C_{\text{forward+backward}} \approx 6N$

for estimating the number of FLOPs per token during the training of a decoder-only Transformer, where N is the number of non-embedding parameters in the model.
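As a quick illustration, here is a minimal sketch of this approximation in Python. The function names and the example parameter/token counts (roughly GPT-2-small scale, 300B tokens) are assumptions for illustration, not values from the original post; the only rule encoded is the 6N-FLOPs-per-token estimate, which scales to roughly 6ND total FLOPs over D training tokens.

```python
def training_flops_per_token(n_params: float) -> float:
    """Approximate forward + backward FLOPs per token for a decoder-only
    Transformer with n_params non-embedding parameters (C ≈ 6N)."""
    return 6 * n_params


def total_training_flops(n_params: float, n_tokens: float) -> float:
    """Total training compute: roughly 6 * N * D FLOPs for D tokens."""
    return training_flops_per_token(n_params) * n_tokens


if __name__ == "__main__":
    n = 124e6  # illustrative: ~GPT-2-small-scale non-embedding parameter count
    d = 300e9  # illustrative: 300B training tokens
    print(f"FLOPs per token: {training_flops_per_token(n):.3e}")  # ~7.4e8
    print(f"Total training FLOPs: {total_training_flops(n, d):.3e}")  # ~2.2e20
```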
