The following estimates the number of attention multiplications for a "classic" GPT-2-style model.
For each layer (L):
  • for each attention head (h):
    • K = d_model * d_head (takes the embedding of one token and converts it to a vector of length d_head)
    • Q = d_model * d_head (same)
    • Q K dot products for the attention pattern: n_ctx * d_head (n_ctx dot products of vectors of size d_head: the new token's Q against every K; the new token's K against earlier Qs is zeroed out by causality.)
    • new value vector for the new token: d_model * d_model
    • new updates: n_ctx * d_model (multiply each value vector by its scalar in the new attention row)
  • fully connected: d_model * d_ff + d_ff * d_model (converts the embedding to the hidden layer size and then back)
So the total sum is:
L * (
  h * (
    2 * d_model * d_head +
    n_ctx * d_head +
    d_model * d_model +
    n_ctx * d_model
  ) +
  2 * d_model * d_ff
)
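For a sense of scale, plugging in GPT-2 small (124M) hyperparameters as an example (L = 12, h = 12, d_model = 768, d_head = 64, n_ctx = 1024, d_ff = 3072) gives 12 * (12 * 1,540,096 + 4,718,592) ≈ 2.8e8 multiplications per newly generated token at full context length.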
This is coded at: llm_count_mults.py.
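A minimal sketch of such a counter, assuming the formula above (it is not necessarily what llm_count_mults.py does; the function name count_mults and the example hyperparameters are just illustrative):

def count_mults(L, h, d_model, d_head, n_ctx, d_ff):
    """Estimate the multiplications needed to generate one new token."""
    per_head = (
        2 * d_model * d_head  # K and Q projections of the new token
        + n_ctx * d_head      # dot products for the new attention row
        + d_model * d_model   # value vector for the new token
        + n_ctx * d_model     # weight each value vector by its attention scalar
    )
    per_layer = h * per_head + 2 * d_model * d_ff  # heads plus fully connected
    return L * per_layer

# Illustrative GPT-2 small (124M) hyperparameters at full context length.
print(count_mults(L=12, h=12, d_model=768, d_head=64, n_ctx=1024, d_ff=3072))
# prints 278396928, i.e. the ~2.8e8 figure above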