├── images
│   ├── 1.png
│   ├── 2.png
│   ├── 3.png
│   ├── 4.png
│   ├── 5.png
│   ├── 6.png
│   ├── 7.png
│   ├── 8.png
│   ├── 9.png
│   ├── 10.png
│   ├── 11.png
│   └── 2025-LLM-Hallucination-Index.png
├── hallucination-index-2023.md
└── README.md
--------------------------------------------------------------------------------
/hallucination-index-2023.md:
--------------------------------------------------------------------------------

![LLM Hallucination Index](https://user-images.githubusercontent.com/138050654/283170284-a2f30e59-3b92-4d6c-a904-31f622b5c911.png)

# 🌟 LLM Hallucination Index 🌟

## About the Index 💥

The Hallucination Index is an ongoing initiative to evaluate and rank the most popular LLMs by their propensity to hallucinate. The Index uses a comprehensive set of datasets, chosen for their diversity and their ability to challenge each model's capacity to stay on task.

**Why**: No LLM benchmark report has yet provided a comprehensive measurement of LLM hallucinations. Measuring hallucinations is difficult: LLM performance varies by task type, dataset, context, and more, and there is no consistent set of metrics for measuring it.

**What**: The Hallucination Index ranks popular LLMs by their propensity to hallucinate across three common task types: question & answer (Q&A) without RAG, Q&A with RAG, and long-form text generation.

**How**: The Index ranks the performance of 11 leading LLMs across the three task types, evaluated on 7 popular datasets. To measure hallucinations, the Index employs two metrics, Correctness and Context Adherence, both built on the state-of-the-art evaluation method ChainPoll.

We share full details about our methodology [here](http://rungalileo.io/hallucinationindex).

## LLM Rankings for Q&A with RAG

Task: When presented with a question, the model uses information retrieved from a given dataset, database, or set of documents to provide an accurate answer. This approach is akin to looking up information in a reference book or searching a database before responding.

![LLM Rankings for QA with RAG](https://user-images.githubusercontent.com/138050654/283152708-7ef81e54-7b40-4f4d-b416-90c414f877bd.png)

## LLM Rankings for Q&A without RAG

Task: When presented with a question, the model relies on the internal knowledge and understanding it acquired during training. It generates answers based on patterns, facts, and relationships it has learned, without referencing external sources of information.

![LLM Rankings for QA without RAG](https://user-images.githubusercontent.com/138050654/283152214-e30b0b6c-1fa2-4484-88b2-5a554903bc93.png)

## LLM Rankings for Long-Form Text Generation

Task: Using generative AI to create extensive, coherent pieces of text such as reports, articles, essays, or stories. For this use case, models must understand context, maintain subject relevance, and sustain a natural writing style over longer passages.

![LLM ranking for long form text generation](https://user-images.githubusercontent.com/138050654/283153011-e4cb1107-7e61-4032-879c-3644819482d5.png)

Note: Rankings were last updated on November 15, 2023.

## Evaluation metrics ✨

The metrics used to evaluate output quality and propensity to hallucinate are powered by [ChainPoll](https://arxiv.org/abs/2310.18344).

ChainPoll, developed by Galileo Labs, is an innovative, cost-effective hallucination detection method for large language models (LLMs), and RealHall is an accompanying set of challenging, real-world benchmark datasets. Our extensive comparisons show ChainPoll outperforming existing metrics by a significant margin in accuracy, transparency, and efficiency, while also introducing new metrics for evaluating LLMs' adherence and correctness on complex reasoning tasks.

Learn more about our experiments and results [here](http://rungalileo.io/hallucinationindex).

![chainpoll metrics](https://user-images.githubusercontent.com/138050654/283151959-48d35b24-4bd3-4f2e-97b1-7ee811fa070e.png)

## Models evaluated

**OpenAI**
- GPT-4-0613
- GPT-3.5-turbo-1106
- GPT-3.5-turbo-0613
- GPT-3.5-turbo-instruct

**Meta**
- Llama-2-70b-chat
- Llama-2-13b-chat
- Llama-2-7b-chat

**Hugging Face**
- Zephyr-7b-beta

**Mosaic**
- MPT-7b-instruct

**Mistral**
- Mistral-7b-instruct-v0.1

**TII UAE**
- Falcon-40b-instruct

## What next?

We are excited about this initiative and plan to update the Hallucination Index quarterly. To get an LLM added to the Index, reach out [here](https://www.rungalileo.io/hallucinationindex).

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------

# 🌟 LLM Hallucination Index - RAG Special 🌟

https://galileo.ai/hallucination-index

# About the Index

# Attributes Tested

There were two key LLM attributes we wanted to test as part of this Index: context length and open- vs. closed-source.

## Context Length

With the rising popularity of RAG, we wanted to see how context length affects model performance. Providing an LLM with context data is akin to giving a student a cheat sheet for an open-book exam. We tested three scenarios:

| Context Length | Task Description |
| -------------- | ---------------- |
| Short Context  | Provide the LLM with < 5k tokens of context data, equivalent to a few pages of information. |
| Medium Context | Provide the LLM with 5k-25k tokens of context data, equivalent to a book chapter. |
| Long Context   | Provide the LLM with 40k-100k tokens of context data, equivalent to an entire book. |

## Open vs. Closed Source

The open-source vs. closed-source software debate has raged since the Free Software Movement of the late 1980s, and it has reached a fever pitch during the LLM arms race. The common assumption is that closed-source LLMs, with their access to proprietary training data, will perform better. We wanted to put that assumption to the test.

## Prompting Techniques

We experimented with a prompting technique known as Chain-of-Note, which has shown promise for enhancing performance in short-context scenarios, to see if it similarly benefits medium and long contexts. (A sketch of what such a prompt can look like appears after the model list below.)

# Models Evaluated

We tested 22 models (10 closed-source, 12 open-source) from leading foundation model providers such as OpenAI, Anthropic, Meta, Google, and Mistral.
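Before turning to the results, here is what a Chain-of-Note-style prompt can look like. This is a minimal, hypothetical sketch based on the technique's published description ([arXiv:2311.09210](https://arxiv.org/abs/2311.09210)), not the exact template used in the Index:

```python
# Illustrative Chain-of-Note prompt builder. The wording is an assumption
# based on the Chain-of-Note paper: the model first writes reading notes
# on each retrieved passage, then composes a grounded final answer.

def build_chain_of_note_prompt(question: str, passages: list[str]) -> str:
    numbered = "\n\n".join(
        f"Passage {i + 1}:\n{p}" for i, p in enumerate(passages)
    )
    return (
        f"Task: Answer the question based on the passages below.\n\n"
        f"{numbered}\n\n"
        f"Question: {question}\n\n"
        "First, write a brief note for each passage summarizing what it "
        "says that is relevant to the question, or state that it is "
        "irrelevant. Then, using only the passages, give your final "
        "answer. If the passages do not contain the answer, say so."
    )

# Example usage with hypothetical passages.
print(build_chain_of_note_prompt(
    "When was the company founded?",
    ["Acme Corp. was founded in 1999 in Austin.", "Acme sells anvils."],
))
```

The design intent is that writing notes first forces the model to assess each passage's relevance before committing to an answer, which is plausibly what helps in short-context RAG.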

# Major Trends

# Overall Winners

# Short Context RAG Insights

# Medium Context RAG Insights

# Long Context RAG Insights

# Methodology

## Short Context RAG (SCR)

We evaluated SCR using a rigorous set of datasets to test each model's robustness in handling short contexts. One of our key methodologies was ChainPoll with GPT-4o: polling the judge model multiple times using a chain-of-thought technique, which allows us to:

1. Quantify potential hallucinations.
2. Offer context-based explanations, a crucial feature for RAG systems.

## Medium and Long Context RAG (MCR & LCR)

Here we focused on assessing a model's ability to comprehensively understand extensive texts in medium and long contexts. The procedure, sketched in code at the end of this section, involved:

- Extracting text from 10,000 of a company's recent documents.
- Dividing the text into chunks and designating one as the "needle chunk."
- Constructing retrieval questions answerable using the needle chunk embedded in the context.

### Context Lengths Evaluated

- **Medium**: 5k, 10k, 15k, 20k, 25k tokens
- **Long**: 40k, 60k, 80k, 100k tokens

### Task Design Considerations

1. All text in a given context must come from a single domain.
2. Responses must be correct even when only a short context is provided, so that any change in performance can be attributed to the longer context.
3. Questions must not be answerable from pre-training memory or general knowledge.
4. The influence of information position is measured by keeping everything constant except the location of the needle.
5. Standard datasets are avoided to prevent test leakage.

### Effect of Prompting Technique on Performance

As noted under Prompting Techniques above, we tested whether Chain-of-Note, which has shown promise in short-context scenarios, similarly benefits medium and long contexts.

### Evaluation

Adherence to context was evaluated using a custom LLM-based assessment that checks whether the relevant answer appears within the response.
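To make the medium- and long-context procedure above concrete, here is a minimal sketch of how one trial might be assembled. Every name in it, the whitespace-based token count, and the filler text are illustrative assumptions, not the Index's actual pipeline; a real run would count tokens with the target model's tokenizer.

```python
# Minimal needle-in-a-haystack context builder, per the procedure above:
# pad same-domain chunks around a "needle chunk" up to a token budget,
# placing the needle at a controlled relative position.
import random

def build_context(needle: str, haystack_chunks: list[str],
                  target_tokens: int, needle_position: float) -> str:
    """Assemble a context of roughly target_tokens, inserting the needle
    at a fixed relative position (0.0 = start, 1.0 = end)."""
    context_chunks, tokens = [], len(needle.split())
    for chunk in haystack_chunks:
        if tokens >= target_tokens:
            break
        context_chunks.append(chunk)
        tokens += len(chunk.split())  # crude whitespace token estimate
    insert_at = int(needle_position * len(context_chunks))
    context_chunks.insert(insert_at, needle)
    return "\n\n".join(context_chunks)

# One hypothetical trial: a 10k-token context, needle placed halfway in.
needle = "The Q3 audit was signed off by the regional controller."
filler = [" ".join(random.choices(["revenue", "policy", "audit", "memo"], k=50))
          for _ in range(400)]  # stand-in for same-domain document chunks
context = build_context(needle, filler, target_tokens=10_000, needle_position=0.5)
question = "Who signed off on the Q3 audit?"
```

Repeating the same trial while varying only `needle_position` isolates the effect of information position (design consideration 4 above).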

# Hallucination Detection

To evaluate a model's propensity to hallucinate, we employed a high-performance evaluation technique to assess contextual adherence and factual accuracy. Learn more about Galileo's [Context Adherence](https://www.rungalileo.io/research) and [ChainPoll](https://www.rungalileo.io/blog/chainpoll).

# Inner Workings of ChainPoll
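The report's diagram of ChainPoll's inner workings is not reproduced here, but the core idea from the [ChainPoll paper](https://arxiv.org/abs/2310.18344) is simple: ask a judge model the same chain-of-thought question several times and average the verdicts. A minimal sketch, with a hypothetical `ask_llm` function standing in for a real chat-completion client:

```python
# ChainPoll-style context-adherence score: poll a judge LLM n times with
# a chain-of-thought prompt and return the fraction of adherent verdicts.
# `ask_llm` is a hypothetical stand-in, not a Galileo or OpenAI API.

JUDGE_PROMPT = """Context:
{context}

Response:
{response}

Does the response contain any claim that is not supported by the
context? Think step by step, then end with a single line that is
exactly YES or NO."""

def chainpoll_adherence(ask_llm, context: str, response: str,
                        n_polls: int = 5) -> float:
    """Return the fraction of polls judging the response adherent
    (1.0 = fully supported by context, 0.0 = hallucinated)."""
    votes = 0
    for _ in range(n_polls):
        judgment = ask_llm(JUDGE_PROMPT.format(context=context,
                                               response=response))
        # The last line of the chain-of-thought output carries the verdict.
        lines = judgment.strip().splitlines()
        if lines and lines[-1].strip().upper() == "NO":
            votes += 1
    return votes / n_polls
```

Polling multiple times smooths out the variance of any single judgment, and the retained chains of thought double as the context-based explanations mentioned under Methodology.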


# About Galileo


# Get the Full Report with More Insights 🌟

For an in-depth understanding, we recommend reading the full report at [https://www.rungalileo.io/hallucinationindex](https://www.rungalileo.io/hallucinationindex).
--------------------------------------------------------------------------------