├── images
│   ├── 1.png
│   ├── 2.png
│   ├── 3.png
│   ├── 4.png
│   ├── 5.png
│   ├── 6.png
│   ├── 7.png
│   ├── 8.png
│   ├── 9.png
│   ├── 10.png
│   ├── 11.png
│   └── 2025-LLM-Hallucination-Index.png
├── hallucination-index-2023.md
└── README.md
--------------------------------------------------------------------------------
/hallucination-index-2023.md:
--------------------------------------------------------------------------------

![LLM Hallucination Index](https://user-images.githubusercontent.com/138050654/283170284-a2f30e59-3b92-4d6c-a904-31f622b5c911.png)

# 🌟 LLM Hallucination Index 🌟

## About the Index 💥

The Hallucination Index is an ongoing initiative to evaluate and rank the most popular LLMs by their propensity to hallucinate. The Index uses a comprehensive set of datasets, chosen for their diversity and their ability to challenge each model's capacity to stay on task.

**Why**: No LLM benchmark report has yet provided a comprehensive measurement of LLM hallucinations. Measuring hallucinations is difficult: LLM performance varies by task type, dataset, context, and more, and there is no consistent set of metrics for measuring it.

**What**: The Hallucination Index ranks popular LLMs by their propensity to hallucinate across three common task types: question & answer (Q&A) without RAG, Q&A with RAG, and long-form text generation.

**How**: The Index ranks the performance of 11 leading LLMs across the three task types, evaluated on 7 popular datasets. To measure hallucinations, the Index employs two metrics, Correctness and Context Adherence, both built on the state-of-the-art evaluation method ChainPoll.

We share full details about our methodology [here](http://rungalileo.io/hallucinationindex).

## LLM Rankings for Q&A with RAG

Task: When presented with a question, the model uses information retrieved from a given dataset, database, or set of documents to provide an accurate answer. This approach is akin to looking up information in a reference book or searching a database before responding.

![LLM Rankings for QA with RAG](https://user-images.githubusercontent.com/138050654/283152708-7ef81e54-7b40-4f4d-b416-90c414f877bd.png)

## LLM Rankings for Q&A without RAG

Task: When presented with a question, the model relies on the internal knowledge and understanding it acquired during training. It generates answers based on patterns, facts, and relationships it has learned, without referencing external sources of information.

![LLM Rankings for QA without RAG](https://user-images.githubusercontent.com/138050654/283152214-e30b0b6c-1fa2-4484-88b2-5a554903bc93.png)

## LLM Rankings for Long-Form Text Generation

Task: Using generative AI to create extensive, coherent pieces of text such as reports, articles, essays, or stories. For this use case, models must understand context, maintain subject relevance, and sustain a natural writing style over longer passages.

![LLM ranking for long form text generation](https://user-images.githubusercontent.com/138050654/283153011-e4cb1107-7e61-4032-879c-3644819482d5.png)

Note: Rankings were last updated on November 15, 2023.

## Evaluation metrics ✨

The metrics used to evaluate output quality and propensity to hallucinate are powered by [ChainPoll](https://arxiv.org/abs/2310.18344).

ChainPoll, developed by Galileo Labs, is an innovative, cost-effective hallucination detection method for large language models (LLMs), and RealHall is an accompanying set of challenging, real-world benchmark datasets. Our extensive comparisons show ChainPoll outperforming existing metrics by a significant margin in accuracy, transparency, and efficiency, while also introducing new metrics for evaluating LLMs' adherence and correctness on complex reasoning tasks.

Learn more about our experiments and results [here](http://rungalileo.io/hallucinationindex).

![chainpoll metrics](https://user-images.githubusercontent.com/138050654/283151959-48d35b24-4bd3-4f2e-97b1-7ee811fa070e.png)

## Models evaluated

**OpenAI**
- GPT-4-0613
- GPT-3.5-turbo-1106
- GPT-3.5-turbo-0613
- GPT-3.5-turbo-instruct

**Meta**
- Llama-2-70b-chat
- Llama-2-13b-chat
- Llama-2-7b-chat

**Hugging Face**
- Zephyr-7b-beta

**Mosaic**
- MPT-7b-instruct

**Mistral**
- Mistral-7b-instruct-v0.1

**TII UAE**
- Falcon-40b-instruct

## What next?

We are excited about this initiative and plan to update the Hallucination Index quarterly. To get an LLM added to the Index, reach out [here](https://www.rungalileo.io/hallucinationindex).

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------

# 🌟 LLM Hallucination Index - RAG Special 🌟

https://galileo.ai/hallucination-index

# About the Index

# Attributes Tested

There were two key LLM attributes we wanted to test as part of this Index: context length and open- vs. closed-source.

## Context Length

With the rising popularity of RAG, we wanted to see how context length affects model performance. Providing an LLM with context data is akin to giving a student a cheat sheet for an open-book exam. We tested three scenarios:

| Context Length | Task Description |
| -------------- | ---------------- |
| Short Context  | Provide the LLM with < 5k tokens of context data, equivalent to a few pages of information. |
| Medium Context | Provide the LLM with 5k-25k tokens of context data, equivalent to a book chapter. |
| Long Context   | Provide the LLM with 40k-100k tokens of context data, equivalent to an entire book. |

## Open vs. Closed Source

The open-source vs. closed-source software debate has raged since the Free Software Movement of the late 1980s, and it has reached a fever pitch during the LLM arms race. The common assumption is that closed-source LLMs, with their access to proprietary training data, will perform better. We wanted to put that assumption to the test.

## Prompting Techniques

We experimented with a prompting technique known as Chain-of-Note, which has shown promise for enhancing performance in short-context scenarios, to see if it similarly benefits medium and long contexts. (A sketch of what such a prompt can look like appears after the model list below.)

# Models Evaluated

We tested 22 models (10 closed-source, 12 open-source) from leading foundation model providers such as OpenAI, Anthropic, Meta, Google, and Mistral.
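Before turning to the results, here is what a Chain-of-Note-style prompt can look like. This is a minimal, hypothetical sketch based on the technique's published description ([arXiv:2311.09210](https://arxiv.org/abs/2311.09210)), not the exact template used in the Index:

```python
# Illustrative Chain-of-Note prompt builder. The wording is an assumption
# based on the Chain-of-Note paper: the model first writes reading notes
# on each retrieved passage, then composes a grounded final answer.

def build_chain_of_note_prompt(question: str, passages: list[str]) -> str:
    numbered = "\n\n".join(
        f"Passage {i + 1}:\n{p}" for i, p in enumerate(passages)
    )
    return (
        f"Task: Answer the question based on the passages below.\n\n"
        f"{numbered}\n\n"
        f"Question: {question}\n\n"
        "First, write a brief note for each passage summarizing what it "
        "says that is relevant to the question, or state that it is "
        "irrelevant. Then, using only the passages, give your final "
        "answer. If the passages do not contain the answer, say so."
    )

# Example usage with hypothetical passages.
print(build_chain_of_note_prompt(
    "When was the company founded?",
    ["Acme Corp. was founded in 1999 in Austin.", "Acme sells anvils."],
))
```

The design intent is that writing notes first forces the model to assess each passage's relevance before committing to an answer, which is plausibly what helps in short-context RAG.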

# Major Trends

# Overall Winners

# Short Context RAG Insights

# Medium Context RAG Insights

# Long Context RAG Insights

# Methodology

## Short Context RAG (SCR)

We evaluated SCR using a rigorous set of datasets to test each model's robustness in handling short contexts. One of our key methodologies was ChainPoll with GPT-4o: polling the judge model multiple times using a chain-of-thought technique, which allows us to:

1. Quantify potential hallucinations.
2. Offer context-based explanations, a crucial feature for RAG systems.

## Medium and Long Context RAG (MCR & LCR)

Here we focused on assessing a model's ability to comprehensively understand extensive texts in medium and long contexts. The procedure, sketched in code at the end of this section, involved:

- Extracting text from 10,000 of a company's recent documents.
- Dividing the text into chunks and designating one as the "needle chunk."
- Constructing retrieval questions answerable using the needle chunk embedded in the context.

### Context Lengths Evaluated

- **Medium**: 5k, 10k, 15k, 20k, 25k tokens
- **Long**: 40k, 60k, 80k, 100k tokens

### Task Design Considerations

1. All text in a given context must come from a single domain.
2. Responses must be correct even when only a short context is provided, so that any change in performance can be attributed to the longer context.
3. Questions must not be answerable from pre-training memory or general knowledge.
4. The influence of information position is measured by keeping everything constant except the location of the needle.
5. Standard datasets are avoided to prevent test leakage.

### Effect of Prompting Technique on Performance

As noted under Prompting Techniques above, we tested whether Chain-of-Note, which has shown promise in short-context scenarios, similarly benefits medium and long contexts.

### Evaluation

Adherence to context was evaluated using a custom LLM-based assessment that checks whether the relevant answer appears within the response.
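To make the medium- and long-context procedure above concrete, here is a minimal sketch of how one trial might be assembled. Every name in it, the whitespace-based token count, and the filler text are illustrative assumptions, not the Index's actual pipeline; a real run would count tokens with the target model's tokenizer.

```python
# Minimal needle-in-a-haystack context builder, per the procedure above:
# pad same-domain chunks around a "needle chunk" up to a token budget,
# placing the needle at a controlled relative position.
import random

def build_context(needle: str, haystack_chunks: list[str],
                  target_tokens: int, needle_position: float) -> str:
    """Assemble a context of roughly target_tokens, inserting the needle
    at a fixed relative position (0.0 = start, 1.0 = end)."""
    context_chunks, tokens = [], len(needle.split())
    for chunk in haystack_chunks:
        if tokens >= target_tokens:
            break
        context_chunks.append(chunk)
        tokens += len(chunk.split())  # crude whitespace token estimate
    insert_at = int(needle_position * len(context_chunks))
    context_chunks.insert(insert_at, needle)
    return "\n\n".join(context_chunks)

# One hypothetical trial: a 10k-token context, needle placed halfway in.
needle = "The Q3 audit was signed off by the regional controller."
filler = [" ".join(random.choices(["revenue", "policy", "audit", "memo"], k=50))
          for _ in range(400)]  # stand-in for same-domain document chunks
context = build_context(needle, filler, target_tokens=10_000, needle_position=0.5)
question = "Who signed off on the Q3 audit?"
```

Repeating the same trial while varying only `needle_position` isolates the effect of information position (design consideration 4 above).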

# Hallucination Detection

To evaluate a model's propensity to hallucinate, we employed a high-performance evaluation technique to assess contextual adherence and factual accuracy. Learn more about Galileo's [Context Adherence](https://www.rungalileo.io/research) and [ChainPoll](https://www.rungalileo.io/blog/chainpoll).

# Inner Workings of ChainPoll
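The report's diagram of ChainPoll's inner workings is not reproduced here, but the core idea from the [ChainPoll paper](https://arxiv.org/abs/2310.18344) is simple: ask a judge model the same chain-of-thought question several times and average the verdicts. A minimal sketch, with a hypothetical `ask_llm` function standing in for a real chat-completion client:

```python
# ChainPoll-style context-adherence score: poll a judge LLM n times with
# a chain-of-thought prompt and return the fraction of adherent verdicts.
# `ask_llm` is a hypothetical stand-in, not a Galileo or OpenAI API.

JUDGE_PROMPT = """Context:
{context}

Response:
{response}

Does the response contain any claim that is not supported by the
context? Think step by step, then end with a single line that is
exactly YES or NO."""

def chainpoll_adherence(ask_llm, context: str, response: str,
                        n_polls: int = 5) -> float:
    """Return the fraction of polls judging the response adherent
    (1.0 = fully supported by context, 0.0 = hallucinated)."""
    votes = 0
    for _ in range(n_polls):
        judgment = ask_llm(JUDGE_PROMPT.format(context=context,
                                               response=response))
        # The last line of the chain-of-thought output carries the verdict.
        lines = judgment.strip().splitlines()
        if lines and lines[-1].strip().upper() == "NO":
            votes += 1
    return votes / n_polls
```

Polling multiple times smooths out the variance of any single judgment, and the retained chains of thought double as the context-based explanations mentioned under Methodology.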


# About Galileo


# Get the Full Report with More Insights 🌟

For an in-depth understanding, we recommend reading the full report at [https://www.rungalileo.io/hallucinationindex](https://www.rungalileo.io/hallucinationindex).
--------------------------------------------------------------------------------