When inferencing just a single prompt, the workload appears to be clearly memory bound, i.e. limited by the speed at which model parameters can be transferred from VRAM into the GPU's caches and compute units, assuming the model fits in VRAM, which is the case for many popular models.
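A rough back-of-the-envelope sketch of why this is memory bound, using assumed, illustrative numbers (a 7B-parameter model in 16-bit precision and a GPU with about 1 TB/s of VRAM bandwidth, neither taken from the source):

```python
# Memory-bound decode latency for a single prompt: every generated token
# has to re-read essentially all model weights from VRAM.
params = 7e9                  # assumed model size in parameters
bytes_per_param = 2           # fp16/bf16 weights
vram_bandwidth = 1e12         # assumed VRAM bandwidth in bytes/second

model_bytes = params * bytes_per_param           # ~14 GB of weights
time_per_token = model_bytes / vram_bandwidth    # lower bound, bandwidth only

print(f"min time per token: {time_per_token * 1e3:.1f} ms")        # ~14 ms
print(f"max tokens/s (single prompt): {1 / time_per_token:.0f}")   # ~71
```

Even ignoring compute entirely, the weight transfer alone caps single-prompt generation speed, which is why the compute units sit mostly idle in this regime.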
It is however possible to make fuller use of the GPU's compute power by running multiple independent queries in parallel: each chunk of model weights is loaded from VRAM once and then reused to advance the inference of several input prompts at the same time. With a large enough batch it should be possible to reach full compute utilization.
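A roofline-style sketch of this effect, again with assumed, illustrative hardware numbers (peak FLOP/s and bandwidth are not from the source, and KV-cache traffic is ignored for simplicity):

```python
# Batching raises arithmetic intensity: a decode step does ~2 FLOPs per
# parameter per sequence, but the weights only need to be read from VRAM
# once per step, so FLOPs-per-byte grows linearly with batch size.
peak_flops = 300e12      # assumed GPU fp16 peak, FLOP/s
vram_bandwidth = 1e12    # assumed VRAM bandwidth, bytes/s
bytes_per_param = 2      # fp16/bf16 weights

# Intensity at which the workload flips from memory- to compute-bound:
ridge = peak_flops / vram_bandwidth   # FLOPs per byte, here 300

for batch in (1, 8, 64, 512):
    flops_per_byte = 2 * batch / bytes_per_param  # sequences share one weight read
    bound = "compute" if flops_per_byte >= ridge else "memory"
    print(f"batch={batch:4d}: {flops_per_byte:6.0f} FLOP/byte -> {bound}-bound")
```

Under these assumed numbers the crossover only happens at a batch of a few hundred sequences, which is consistent with single-prompt inference being deep in the memory-bound regime.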
Bibliography:

  * jax-ml.github.io/scaling-book/
