neural-bridge/rag-hallucination-dataset-1000

Name: neural-bridge/rag-hallucination-dataset-1000
Creator: neural-bridge
Published: 2024-02-05 18:26:49
License: Apache License 2.0

Source: Hugging Face. Updated: 2024-02-05. Indexed: 2024-03-04.

Download link:

https://hf-mirror.com/datasets/neural-bridge/rag-hallucination-dataset-1000

Resource description:

---
dataset_info:
  features:
  - name: context
    dtype: string
  - name: question
    dtype: string
  - name: answer
    dtype: string
  splits:
  - name: train
    num_bytes: 2917432.8
    num_examples: 800
  - name: test
    num_bytes: 729358.2
    num_examples: 200
  download_size: 2300801
  dataset_size: 3646791
task_categories:
- question-answering
language:
- en
size_categories:
- 1K<n<10K
license: apache-2.0
tags:
- retrieval-augmented-generation
- hallucination
---

# **Retrieval-Augmented Generation (RAG) Hallucination Dataset 1000**

**Retrieval-Augmented Generation (RAG) Hallucination Dataset 1000 is an English dataset designed to reduce hallucination in RAG-optimized models, built by [Neural Bridge AI](https://www.neuralbridge.ai/), and released under the [Apache license 2.0](https://www.apache.org/licenses/LICENSE-2.0.html).**

## **Dataset Description**

#### Dataset Summary

Hallucination in large language models (LLMs) refers to the generation of incorrect, nonsensical, or unrelated text that does not stem from an accurate or real source of information. The Retrieval-Augmented Generation (RAG) Hallucination Dataset addresses this issue by training LLMs to respond to questions they lack sufficient knowledge to answer by simply saying "This question cannot be answered." This kind of response is crucial for reducing hallucinations and for ensuring relevant, accurate, and context-specific output.

RAG Hallucination Dataset 1000 consists of triple-feature entries, each containing "context", "question", and "answer" fields. The answer field in every entry consists of the following sentence: "This question cannot be answered."
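As an illustration of how such an entry might be used for supervised fine-tuning, the sketch below turns one record into a prompt/target pair. The prompt template, the `build_example` helper, and the sample record are all assumptions made for this example; the dataset card does not prescribe a prompt format.

```python
# The fixed answer sentence quoted in the dataset card.
REFUSAL = "This question cannot be answered."

# Hypothetical prompt template; the dataset card does not define one.
PROMPT_TEMPLATE = (
    "Answer the question using only the context below.\n\n"
    "Context: {context}\n\n"
    "Question: {question}\n\n"
    "Answer:"
)


def build_example(record: dict) -> dict:
    """Map a record with context/question/answer fields to a prompt/target pair."""
    prompt = PROMPT_TEMPLATE.format(
        context=record["context"].strip(),
        question=record["question"].strip(),
    )
    return {"prompt": prompt, "target": record["answer"].strip()}


# Invented record that mimics the dataset's shape (not taken from the dataset).
sample = {
    "context": "The bakery opens at 7 a.m. and sells sourdough bread.",
    "question": "Who founded the bakery?",
    "answer": REFUSAL,
}

pair = build_example(sample)
print(pair["prompt"])
print(pair["target"])
```

Keeping the refusal sentence as the training target, rather than paraphrasing it, means model outputs can later be compared against one exact string.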
The dataset is constructed to improve model performance on questions whose answers are not in the context. This collection, comprising 1000 entries, draws its context data from [Falcon RefinedWeb](https://huggingface.co/datasets/tiiuae/falcon-refinedweb) and is designed to train RAG-optimized models for applications in question answering and beyond, with a focus on minimizing hallucinations.

```python
from datasets import load_dataset

# Downloads the dataset from the Hugging Face Hub (train and test splits).
rag_hallucination_dataset = load_dataset("neural-bridge/rag-hallucination-dataset-1000")
```

#### Languages

The text in the dataset is in English. The associated BCP-47 code is `en`.

## **Dataset Structure**

#### Data Instances

A typical data point comprises a context, a question about the context, and an answer to the question. The context is obtained from [Falcon RefinedWeb](https://huggingface.co/datasets/tiiuae/falcon-refinedweb), and the question and answer for each data point are generated by GPT-4.

An example from the dataset looks like the following:

```
{
  context: ...
  question: ...
  answer: ...
}
```

#### Data Fields

- `context`: A string consisting of a range of tokens.
- `question`: A string consisting of a question that cannot be answered purely from the context.
- `answer`: A string consisting of an answer to the question. It is always the following: "This question cannot be answered."

#### Data Splits

The data is split into a training set and a test set. The split sizes are as follows:

| | Train | Test |
| ----- | ------ | ---- |
| RAG Hallucination Dataset 1000 | 800 | 200 |

## Source Data

The data points in the dataset are from the [Falcon RefinedWeb](https://huggingface.co/datasets/tiiuae/falcon-refinedweb) dataset.
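The field descriptions and split sizes given in the Dataset Structure section can be checked with a few lines of plain Python. The following is a minimal sketch: `validate_record` and `split_fractions` are hypothetical helpers, and the sample rows are invented to mimic the schema rather than taken from the dataset.

```python
# The fixed answer sentence and field names described in the dataset card.
REFUSAL = "This question cannot be answered."
FIELDS = ("context", "question", "answer")


def validate_record(record: dict) -> list:
    """Return a list of problems found in one record (empty if it looks valid)."""
    problems = []
    for name in FIELDS:
        if not isinstance(record.get(name), str):
            problems.append(f"{name}: missing or not a string")
    if record.get("answer") != REFUSAL:
        problems.append("answer: not the fixed refusal sentence")
    return problems


def split_fractions(sizes: dict) -> dict:
    """Convert split sizes into fractions of the total number of examples."""
    total = sum(sizes.values())
    return {name: count / total for name, count in sizes.items()}


# Invented rows for illustration only.
good_row = {
    "context": "The museum is closed on Mondays.",
    "question": "How much does a ticket cost?",
    "answer": REFUSAL,
}
bad_row = {
    "context": "The museum is closed on Mondays.",
    "question": "How much does a ticket cost?",
    "answer": "Ten dollars.",
}

print(validate_record(good_row))
print(validate_record(bad_row))

# Split sizes as stated in the table above: an 80/20 train/test split.
fractions = split_fractions({"train": 800, "test": 200})
print(fractions)
```

The same `validate_record` function could be applied to each row of the loaded splits, since rows from `load_dataset` behave like dictionaries keyed by field name.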
## **Neural Bridge AI RAG Datasets Index**

| Dataset | Link |
| ----- | ------ |
| RAG Full 20000 | [link](https://huggingface.co/datasets/neural-bridge/rag-full-20000) |
| RAG Dataset 12000 | [link](https://huggingface.co/datasets/neural-bridge/rag-dataset-12000) |
| RAG Dataset 1200 | [link](https://huggingface.co/datasets/neural-bridge/rag-dataset-1200) |
| RAG Hallucination Dataset 1000 | [link](https://huggingface.co/datasets/neural-bridge/rag-hallucination-dataset-1000) |

## **License**

This public extract is made available under the [Apache license 2.0](https://www.apache.org/licenses/LICENSE-2.0.html). Users should also abide by the [Falcon RefinedWeb](https://huggingface.co/datasets/tiiuae/falcon-refinedweb) terms of use (ToU).

Provider:

neural-bridge

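Summary of Original Information

Retrieval-Augmented Generation (RAG) Hallucination Dataset 1000 Overview

Basic Dataset Information

Name: Retrieval-Augmented Generation (RAG) Hallucination Dataset 1000
Language: English
Purpose: Reduce hallucination in RAG-optimized models
Creator: Neural Bridge AI
License: Apache License 2.0

Dataset Purpose

The dataset provides targeted training data to help optimize RAG models and reduce hallucination during generation.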
