When inferencing just a single prompt, the workload appears to be clearly memory bound, i.e. limited by the speed at which model parameters can be transferred from VRAM into the GPU's caches and compute units, assuming the model fits in VRAM, which is the case for many popular models.
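A rough back-of-the-envelope sketch of why this is memory bound, using assumed, illustrative numbers (a 7B-parameter model in 16-bit precision and a GPU with about 1 TB/s of VRAM bandwidth, neither taken from the source):

```python
# Memory-bound decode latency for a single prompt: every generated token
# has to re-read essentially all model weights from VRAM.
params = 7e9                  # assumed model size in parameters
bytes_per_param = 2           # fp16/bf16 weights
vram_bandwidth = 1e12         # assumed VRAM bandwidth in bytes/second

model_bytes = params * bytes_per_param           # ~14 GB of weights
time_per_token = model_bytes / vram_bandwidth    # lower bound, bandwidth only

print(f"min time per token: {time_per_token * 1e3:.1f} ms")        # ~14 ms
print(f"max tokens/s (single prompt): {1 / time_per_token:.0f}")   # ~71
```

Even ignoring compute entirely, the weight transfer alone caps single-prompt generation speed, which is why the compute units sit mostly idle in this regime.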
It is however possible to make fuller use of the GPU's compute power by running multiple independent queries in parallel: each chunk of model weights is loaded from VRAM once and then reused to advance the inference of several input prompts at the same time. With a large enough batch it should be possible to reach full compute utilization.
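A roofline-style sketch of this effect, again with assumed, illustrative hardware numbers (peak FLOP/s and bandwidth are not from the source, and KV-cache traffic is ignored for simplicity):

```python
# Batching raises arithmetic intensity: a decode step does ~2 FLOPs per
# parameter per sequence, but the weights only need to be read from VRAM
# once per step, so FLOPs-per-byte grows linearly with batch size.
peak_flops = 300e12      # assumed GPU fp16 peak, FLOP/s
vram_bandwidth = 1e12    # assumed VRAM bandwidth, bytes/s
bytes_per_param = 2      # fp16/bf16 weights

# Intensity at which the workload flips from memory- to compute-bound:
ridge = peak_flops / vram_bandwidth   # FLOPs per byte, here 300

for batch in (1, 8, 64, 512):
    flops_per_byte = 2 * batch / bytes_per_param  # sequences share one weight read
    bound = "compute" if flops_per_byte >= ridge else "memory"
    print(f"batch={batch:4d}: {flops_per_byte:6.0f} FLOP/byte -> {bound}-bound")
```

Under these assumed numbers the crossover only happens at a batch of a few hundred sequences, which is consistent with single-prompt inference being deep in the memory-bound regime.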
Bibliography:

  * jax-ml.github.io/scaling-book/
