LLM benchmark

Benchmarking LLMs is an extremely difficult problem.

LLMs are the type of GenAI that most obviously comes close to AGI, depending on the question asked.

Therefore, there is a gap that is hard to pin down between what is easy, what a human can always do, and what AGI will do one day.

Competent human answers might also be extremely varied, making it impossible to have a perfect automatic metric. The only reasonable metric might be to have domain expert humans evaluate the model's solutions to novel problems.
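As a rough illustration of why varied but competent answers defeat simple automatic metrics, the sketch below scores three correct phrasings of the same fact with a naive exact-match metric. The function names, the reference string, and the candidate answers are all hypothetical, chosen only for this example; real benchmarks use more elaborate scoring, but the underlying difficulty is similar.

```python
def normalize(text: str) -> str:
    """Lowercase and collapse whitespace (hypothetical, minimal normalization)."""
    return " ".join(text.lower().split())


def exact_match(prediction: str, reference: str) -> bool:
    """Return True only if the normalized strings are identical."""
    return normalize(prediction) == normalize(reference)


# One reference answer and three answers a competent human might give.
# All three are correct, but only the first matches the reference verbatim.
reference = "Paris is the capital of France."
answers = [
    "Paris is the capital of France.",
    "The capital of France is Paris.",
    "France's capital city is Paris.",
]

scores = [exact_match(answer, reference) for answer in answers]
accuracy = sum(scores) / len(scores)

for answer, score in zip(answers, scores):
    print(f"{score!s:5} {answer}")
print(f"exact-match accuracy: {accuracy:.2f}")
```

Under this metric two of the three correct answers are marked wrong, which is the kind of failure that pushes evaluation toward human domain experts.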

Bibliography:

www.reddit.com/r/LocalLLaMA/comments/1b933of/llm_benchmarks_are_bullshit/

Table of contents
- Simplest questions that LLMs get wrong
  - Easy Problems That LLMs Get Wrong by Sean Williams and James Huckle
- List of LLM benchmarks
  - MMLU
  - Humanity's Last Exam
