neural-bridge/rag-hallucination-dataset-1000
Source: Hugging Face. Updated 2024-02-05; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/neural-bridge/rag-hallucination-dataset-1000
Resource description:
---
dataset_info:
  features:
  - name: context
    dtype: string
  - name: question
    dtype: string
  - name: answer
    dtype: string
  splits:
  - name: train
    num_bytes: 2917432.8
    num_examples: 800
  - name: test
    num_bytes: 729358.2
    num_examples: 200
  download_size: 2300801
  dataset_size: 3646791
task_categories:
- question-answering
language:
- en
size_categories:
- 1K<n<10K
license: apache-2.0
tags:
- retrieval-augmented-generation
- hallucination
---
# **Retrieval-Augmented Generation (RAG) Hallucination Dataset 1000**
**Retrieval-Augmented Generation (RAG) Hallucination Dataset 1000 is an English dataset designed to reduce hallucination in RAG-optimized models, built by [Neural Bridge AI](https://www.neuralbridge.ai/), and released under [Apache license 2.0](https://www.apache.org/licenses/LICENSE-2.0.html).**
## **Dataset Description**
#### Dataset Summary
Hallucination in large language models (LLMs) refers to the generation of incorrect, nonsensical, or unrelated text that does not stem from an accurate or real source of information. The Retrieval-Augmented Generation (RAG) Hallucination Dataset addresses this issue by training LLMs to respond to questions on topics they lack sufficient knowledge about by simply saying "This question cannot be answered." This kind of response is crucial for reducing hallucinations and helps ensure relevant, accurate, and context-specific output.
RAG Hallucination Dataset 1000 consists of triple-feature entries, each containing "context", "question", and "answer" fields. The answer field in every entry consists of the following sentence: "This question cannot be answered." The dataset is constructed to improve model performance on questions whose answers are not in the context. This collection of 1000 entries draws its context data from [Falcon RefinedWeb](https://huggingface.co/datasets/tiiuae/falcon-refinedweb) and is designed for training RAG-optimized models for question answering and related applications, with a focus on minimizing hallucinations.
```python
from datasets import load_dataset
rag_hallucination_dataset = load_dataset("neural-bridge/rag-hallucination-dataset-1000")
```
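Once the dataset is loaded, the statement that every `answer` is the same refusal sentence can be checked directly. The following is a minimal sketch, assuming each record behaves like a dictionary with the three fields described in this card. The names `REFUSAL` and `all_answers_are_refusals` are illustrative and not part of the dataset or the `datasets` library.

```python
from typing import Iterable, Mapping

# The fixed answer string stated in this dataset card.
REFUSAL = "This question cannot be answered."


def all_answers_are_refusals(records: Iterable[Mapping[str, str]]) -> bool:
    """Return True if every record's `answer` equals the refusal sentence."""
    return all(record["answer"] == REFUSAL for record in records)


# Hypothetical usage with the object loaded above (requires network access):
# all_answers_are_refusals(rag_hallucination_dataset["train"])
```

The helper accepts any iterable of records, so it can also be run on a small hand-built list before downloading anything.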
#### Languages
The text in the dataset is in English. The associated BCP-47 code is `en`.
## **Dataset Structure**
#### Data Instances
A typical data point comprises a context, a question about the context, and an answer to the question. The context is obtained from [Falcon RefinedWeb](https://huggingface.co/datasets/tiiuae/falcon-refinedweb), and the question and answer for each data point are generated by GPT-4.
An example from the dataset looks like the following:
```
{
context: ...
question: ...
answer: ...
}
```
#### Data Fields
- `context`: A string containing the context passage.
- `question`: A string containing a question that cannot be answered purely from the context.
- `answer`: A string containing the answer to the question. It is always the following: "This question cannot be answered."
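To make the role of the three fields concrete, the sketch below turns one record into a prompt and target pair of the kind used in supervised fine-tuning. The prompt template, the function name `to_prompt_and_target`, and the sample record are all assumptions made for illustration; the card does not prescribe a prompt format, and the sample record is not taken from the dataset.

```python
from typing import Mapping, Tuple


def to_prompt_and_target(record: Mapping[str, str]) -> Tuple[str, str]:
    """Turn one record into a (prompt, target) pair.

    The template is an assumed format for illustration only.
    """
    prompt = (
        "Answer the question using only the context.\n\n"
        f"Context: {record['context']}\n\n"
        f"Question: {record['question']}\n\n"
        "Answer:"
    )
    return prompt, record["answer"]


# Made-up record, for illustration only (not taken from the dataset).
example = {
    "context": "The library opens at 9 am on weekdays.",
    "question": "Who founded the library?",
    "answer": "This question cannot be answered.",
}
prompt, target = to_prompt_and_target(example)
```

Because the question cannot be answered from the context, the target is the fixed refusal sentence rather than a guess.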
#### Data Splits
The data is split into a training set and a test set. The split sizes are as follows:
| | Train | Test |
| ----- | ------ | ---- |
| RAG Hallucination Dataset 1000 | 800 | 200 |
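
The split sizes above correspond to an 80/20 train/test split, and the per-split byte counts in the card metadata add up to the reported `dataset_size`. A short arithmetic check, using only numbers stated in this card (the variable names are illustrative):

```python
# Split sizes as stated in the table above and in the card metadata.
split_examples = {"train": 800, "test": 200}
split_bytes = {"train": 2917432.8, "test": 729358.2}

total_examples = sum(split_examples.values())
train_fraction = split_examples["train"] / total_examples
test_fraction = split_examples["test"] / total_examples

# The per-split byte counts should add up to the reported dataset_size.
total_bytes = round(sum(split_bytes.values()))

print(total_examples, train_fraction, test_fraction, total_bytes)
```

The totals come to 1000 examples and 3646791 bytes, matching the dataset name and the `dataset_size` field.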
## Source Data
The data points in the dataset are from the [Falcon RefinedWeb](https://huggingface.co/datasets/tiiuae/falcon-refinedweb) dataset.
## **Neural Bridge AI RAG Datasets Index**
| Dataset | Link |
| ----- | ------ |
| RAG Full 20000 | [link](https://huggingface.co/datasets/neural-bridge/rag-full-20000) |
| RAG Dataset 12000 | [link](https://huggingface.co/datasets/neural-bridge/rag-dataset-12000) |
| RAG Dataset 1200 | [link](https://huggingface.co/datasets/neural-bridge/rag-dataset-1200) |
| RAG Hallucination Dataset 1000 | [link](https://huggingface.co/datasets/neural-bridge/rag-hallucination-dataset-1000) |
## **License**
This public extract is made available under [Apache license 2.0](https://www.apache.org/licenses/LICENSE-2.0.html). Users should also abide by the [Falcon RefinedWeb](https://huggingface.co/datasets/tiiuae/falcon-refinedweb) Terms of Use.
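Provider:
neural-bridge
Original information summary
Retrieval-Augmented Generation (RAG) Hallucination Dataset 1000 overview
Basic dataset information
- Name: Retrieval-Augmented Generation (RAG) Hallucination Dataset 1000
- Language: English
- Purpose: Reduce hallucination in RAG-optimized models
- Creator: Neural Bridge AI
- License: Apache License 2.0
Dataset usage
This dataset provides targeted training data to help optimize RAG models and reduce hallucination during generation.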



