neural-bridge/rag-full-20000

Name: neural-bridge/rag-full-20000
Creator: neural-bridge
Published: 2024-02-05 18:24:39
License: no description provided

Hugging Face · Updated 2024-02-05 · Indexed 2024-03-04

Download link:

https://hf-mirror.com/datasets/neural-bridge/rag-full-20000


Resource description:

---
dataset_info:
  features:
  - name: clear_prompt
    dtype: string
  splits:
  - name: train
    num_bytes: 43183498.53262665
    num_examples: 17433
  - name: test
    num_bytes: 10797732.467373349
    num_examples: 4359
  download_size: 32335855
  dataset_size: 53981231
task_categories:
- question-answering
language:
- en
size_categories:
- 10K<n<100K
license: apache-2.0
tags:
- retrieval-augmented-generation
---

# **Retrieval-Augmented Generation (RAG) Full 20000**

**Retrieval-Augmented Generation (RAG) Full 20000 is an English dataset designed for RAG-optimized models, built by [Neural Bridge AI](https://www.neuralbridge.ai/), and released under [Apache license 2.0](https://www.apache.org/licenses/LICENSE-2.0.html).**

## **Dataset Description**

#### Dataset Summary

Retrieval-Augmented Generation (RAG) enhances large language models (LLMs) by allowing them to consult an external authoritative knowledge base before generating responses. This approach improves the models' ability to produce relevant, accurate, and context-specific output by extending their capabilities to specialized domains or an organization's internal data, without the need for retraining. RAG offers a cost-effective way to apply LLMs to tasks such as question answering, language translation, and sentence completion while keeping the output grounded in current information.

RAG's importance lies in its potential to address the inherent challenges of LLMs, such as unpredictability in responses, reliance on static and potentially outdated training data, and the risk of disseminating incorrect or non-authoritative information. These issues can negatively affect user trust in AI-powered applications, which makes RAG's ability to guide LLMs toward authoritative sources valuable.

RAG has multiple benefits, including cost-effective implementation and maintenance, access to current information, improved user trust through accurate information and source attribution, and greater control for developers over the information retrieval process. The knowledge base can be updated with the latest research, statistics, or news without retraining the model, which addresses the challenge of staying relevant and accurate in rapidly changing fields. It also lets organizations deploy generative AI more confidently across a wider range of applications, improving both the user experience and the reliability of AI-driven interactions.

The Retrieval-Augmented Generation (RAG) Full 20000 dataset is a single-feature dataset, with each entry containing a "clear_prompt" field, designed to help build RAG-optimized models. This data consists of 20000 entries, and the data is from [Falcon RefinedWeb](https://huggingface.co/datasets/tiiuae/falcon-refinedweb), [gsm8k](https://huggingface.co/datasets/gsm8k), and [RAG Hallucination Dataset 1000](https://huggingface.co/datasets/neural-bridge/rag-hallucination-dataset-1000).

```python
from datasets import load_dataset

rag_full = load_dataset("neural-bridge/rag-full-20000")
```

#### Languages

The text in the dataset is in English. The associated BCP-47 code is `en`.

## **Dataset Structure**

#### Data Instances

A typical data point comprises the "clear_prompt" field, which is the concatenation of the "context" (optional), "question", and "answer" fields. The context is obtained from [Falcon RefinedWeb](https://huggingface.co/datasets/tiiuae/falcon-refinedweb) and [RAG Hallucination Dataset 1000](https://huggingface.co/datasets/neural-bridge/rag-hallucination-dataset-1000). The question and answer for each data point are either obtained from [gsm8k](https://huggingface.co/datasets/gsm8k) or generated by GPT-4.

An example from the dataset looks like the following:

```
{
  clear_prompt: ...
}
```

#### Data Fields

- `clear_prompt`: A string that holds the optional context, the question, and the answer, marked by the "##CONTEXT##", "##QUESTION##", and "##ANSWER##" tags respectively.

#### Data Splits

The data is split into a training and a test set. The split sizes are as follows:

| | Train | Test |
| ----- | ------ | ---- |
| RAG Full 20000 | 17433 | 4359 |

## Source Data

The data points in the dataset are from the [Falcon RefinedWeb](https://huggingface.co/datasets/tiiuae/falcon-refinedweb), [gsm8k](https://huggingface.co/datasets/gsm8k), and [RAG Hallucination Dataset 1000](https://huggingface.co/datasets/neural-bridge/rag-hallucination-dataset-1000) datasets.

## **Neural Bridge AI RAG Datasets Index**

| Dataset | Link |
| ----- | ------ |
| RAG Full 20000 | [link](https://huggingface.co/datasets/neural-bridge/rag-full-20000) |
| RAG Dataset 12000 | [link](https://huggingface.co/datasets/neural-bridge/rag-dataset-12000) |
| RAG Dataset 1200 | [link](https://huggingface.co/datasets/neural-bridge/rag-dataset-1200) |
| RAG Hallucination Dataset 1000 | [link](https://huggingface.co/datasets/neural-bridge/rag-hallucination-dataset-1000) |

## **License**

This public extract is made available under [Apache license 2.0](https://www.apache.org/licenses/LICENSE-2.0.html). Users should also abide by the terms of use (ToUs) of [Falcon RefinedWeb](https://huggingface.co/datasets/tiiuae/falcon-refinedweb), [gsm8k](https://huggingface.co/datasets/gsm8k), and [RAG Hallucination Dataset 1000](https://huggingface.co/datasets/neural-bridge/rag-hallucination-dataset-1000).

Provider:

neural-bridge

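Summary of original information

Dataset overview

Dataset name

Retrieval-Augmented Generation (RAG) Full 20000

Dataset description

Retrieval-Augmented Generation (RAG) Full 20000 is an English dataset designed for RAG-optimized models, built by Neural Bridge AI and released under the Apache 2.0 license. RAG lets a large language model consult an external authoritative knowledge base before generating a response, which improves its ability to produce relevant, accurate, and context-specific output.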
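The retrieve-then-generate flow described above can be sketched in a few lines. This is a toy illustration, not the method used to build the dataset: the word-overlap retriever, the function names (`tokenize`, `retrieve`, `build_prompt`), and the exact layout of the `##CONTEXT##`, `##QUESTION##`, and `##ANSWER##` tags (each tag placed before its section, separated by newlines) are all assumptions made for the example.

```python
import re
from typing import List, Set


def tokenize(text: str) -> Set[str]:
    """Lowercase the text and return its set of word tokens (crude on purpose)."""
    return set(re.findall(r"\w+", text.lower()))


def retrieve(question: str, passages: List[str]) -> str:
    """Return the passage sharing the most word tokens with the question.

    A real RAG system would use a vector index or a search engine here;
    word overlap is only a stand-in so the example stays self-contained.
    """
    q_tokens = tokenize(question)
    return max(passages, key=lambda p: len(q_tokens & tokenize(p)))


def build_prompt(question: str, passages: List[str]) -> str:
    """Assemble a prompt whose answer section is left open for the model.

    The tag layout is an assumption; check real rows before relying on it.
    """
    context = retrieve(question, passages)
    return f"##CONTEXT##\n{context}\n##QUESTION##\n{question}\n##ANSWER##\n"


knowledge_base = [
    "The Eiffel Tower is in Paris.",
    "Mount Everest is the highest mountain on Earth.",
]
prompt = build_prompt("Where is the Eiffel Tower?", knowledge_base)
print(prompt)
```

The model never needs retraining when `knowledge_base` changes, which is the property the description attributes to RAG.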

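Dataset features

Feature name: clear_prompt
Data type: string

Dataset structure

Data instances

Each data point contains a "clear_prompt" field, which combines the "context" (optional), "question", and "answer" fields.

Data fields

clear_prompt: A string containing the "context" (optional), "question", and "answer" fields, marked by the "##CONTEXT##", "##QUESTION##", and "##ANSWER##" tags respectively.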
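Because all three fields live in one string, most uses of the dataset start by splitting `clear_prompt` back into its parts. The following is a minimal sketch under one assumption that the card does not confirm: each tag appears immediately before its section. The names `RagExample` and `split_clear_prompt` are hypothetical helpers, not part of the dataset or of any library.

```python
import re
from typing import Optional, TypedDict


class RagExample(TypedDict):
    context: Optional[str]
    question: str
    answer: str


# Tag names come from the dataset card. Whitespace around them in real rows
# is unknown, so every section is stripped.
_TAG_PATTERN = re.compile(r"##(CONTEXT|QUESTION|ANSWER)##")


def split_clear_prompt(clear_prompt: str) -> RagExample:
    """Split a clear_prompt string into context, question, and answer."""
    # With one capture group, re.split returns
    # [text_before_first_tag, tag1, body1, tag2, body2, ...].
    parts = _TAG_PATTERN.split(clear_prompt)
    sections = {}
    for tag, body in zip(parts[1::2], parts[2::2]):
        sections[tag.lower()] = body.strip()
    if "question" not in sections or "answer" not in sections:
        raise ValueError("clear_prompt is missing a QUESTION or ANSWER tag")
    return RagExample(
        context=sections.get("context") or None,  # context is optional
        question=sections["question"],
        answer=sections["answer"],
    )


example = split_clear_prompt(
    "##CONTEXT##\nParis is the capital of France.\n"
    "##QUESTION##\nWhat is the capital of France?\n"
    "##ANSWER##\nParis"
)
print(example["question"], "->", example["answer"])
```

If real rows turn out to wrap each section in an opening and closing tag instead, the pattern needs adjusting; inspecting a few rows first is the safer route.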

Data splits

The dataset is split into a training set and a test set:

Training set: 17433 examples
Test set: 4359 examples
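A quick arithmetic check of these figures, using only numbers quoted in the dataset card's metadata, shows that the split is very close to 80/20 and that the per-split byte counts add up to the stated `dataset_size`. It also shows that the two splits total 21,792 examples, which is more than the 20000 in the dataset's name; the card does not explain the difference. The helper name `split_fractions` is hypothetical.

```python
from typing import Tuple

# Figures copied from the dataset card's metadata and split table.
TRAIN_EXAMPLES = 17433
TEST_EXAMPLES = 4359
TRAIN_BYTES = 43183498.53262665
TEST_BYTES = 10797732.467373349
DATASET_SIZE_BYTES = 53981231


def split_fractions(train: int, test: int) -> Tuple[float, float]:
    """Return the share of examples in the train and test splits."""
    total = train + test
    return train / total, test / total


total_examples = TRAIN_EXAMPLES + TEST_EXAMPLES  # 21792
train_frac, test_frac = split_fractions(TRAIN_EXAMPLES, TEST_EXAMPLES)

print(f"total examples: {total_examples}")
print(f"train share: {train_frac:.4f}, test share: {test_frac:.4f}")
print(f"bytes match dataset_size: "
      f"{abs(TRAIN_BYTES + TEST_BYTES - DATASET_SIZE_BYTES) < 1e-3}")
```

The fractional byte counts suggest that the per-split sizes were computed proportionally from the total, though that is an inference and not something the card states.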

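Data sources

The data points come from the following datasets:

Falcon RefinedWeb
gsm8k
RAG Hallucination Dataset 1000

License

The dataset is released under the Apache 2.0 license.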
