The following estimates the number of attention multiplications for a "classic" GPT-2-style model.
For each layer (L):
- for each attention head (h), as illustrated in the sketch after this list:
    - K: d_model * d_head (takes the embedding of one token and converts it to a vector of length d_head)
    - Q: d_model * d_head (same)
    - K, Q dot products for the attention pattern: n_ctx * d_head (n_ctx dot products of vectors of size d_head: the new token's Q against every cached K; the other direction, new K against earlier Qs, is zeroed out by causality)
    - new value vector for the new token: d_model * d_model
    - new updates: n_ctx * d_model (multiply each of the n_ctx value vectors by its attention weight for the new token)
- fully connected: d_model * d_ff + d_ff * d_model (converts the embedding to the hidden layer size and then back)
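To make the shapes concrete, here is a rough NumPy sketch of the per-head work for one newly generated token, assuming K/V caching and treating the value/output projections as a single d_model x d_model map per head, as in the count above; the names W_K, W_Q, W_OV, K_cache and V_cache are illustrative, not taken from llm_count_mults.py:

```python
# Rough sketch (not from the original article) of the per-head work for one new token.
import numpy as np

d_model, d_head, n_ctx = 768, 64, 1024           # GPT-2 small-like dimensions (assumed)
rng = np.random.default_rng(0)

x_new = rng.standard_normal(d_model)             # embedding of the newly generated token
W_K = rng.standard_normal((d_model, d_head))     # key projection of this head
W_Q = rng.standard_normal((d_model, d_head))     # query projection of this head
W_OV = rng.standard_normal((d_model, d_model))   # combined value/output map, as counted above
K_cache = rng.standard_normal((n_ctx, d_head))   # keys of previous tokens (already computed)
V_cache = rng.standard_normal((n_ctx, d_model))  # value vectors of previous tokens, in d_model space

k_new = x_new @ W_K                              # d_model * d_head multiplications
q_new = x_new @ W_Q                              # d_model * d_head multiplications
scores = K_cache @ q_new / np.sqrt(d_head)       # n_ctx * d_head multiplications (new Q vs every cached K)
attn = np.exp(scores - scores.max())
attn /= attn.sum()                               # softmax over the new attention scores
v_new = x_new @ W_OV                             # d_model * d_model multiplications
head_update = attn @ V_cache                     # n_ctx * d_model multiplications (weighted sum of values)
# k_new and v_new would also join the caches (the new token attends to itself); omitted for brevity.
```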
So the total sum is:

L * (
    h * (
        2 * d_model * d_head +
        n_ctx * d_head +
        d_model * d_model +
        n_ctx * d_model
    ) +
    2 * d_model * d_ff
)
This is coded at: llm_count_mults.py.
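For a concrete number, here is a minimal sketch of the same formula (a re-derivation, not necessarily identical to llm_count_mults.py), plugged with GPT-2 small's published dimensions:

```python
# Minimal sketch of the multiplication count derived above.
def attention_mults(L, h, d_model, d_head, d_ff, n_ctx):
    per_head = (2 * d_model * d_head    # K and Q projections of the new token
                + n_ctx * d_head        # new Q dotted with every cached K
                + d_model * d_model     # value vector of the new token
                + n_ctx * d_model)      # attention-weighted sum of value vectors
    return L * (h * per_head + 2 * d_model * d_ff)

# GPT-2 small: L=12 layers, h=12 heads, d_model=768, d_head=64, d_ff=3072, n_ctx=1024.
print(attention_mults(L=12, h=12, d_model=768, d_head=64, d_ff=3072, n_ctx=1024))
# => 278396928, i.e. roughly 2.8e8 multiplications per generated token
```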