NomaDamas/split_search_qa
Hugging Face | Updated 2024-01-04 | Indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/NomaDamas/split_search_qa
Resource description:
---
license: unknown
dataset_info:
- config_name: corpus
  features:
  - name: query_id
    dtype: string
  - name: snippets
    dtype: string
  - name: air_date
    dtype: string
  - name: category
    dtype: string
  - name: value
    dtype: string
  - name: round
    dtype: string
  - name: show_number
    dtype: int32
  - name: doc_id
    dtype: string
  - name: __index_level_0__
    dtype: int64
  splits:
  - name: train
    num_bytes: 6252715344
    num_examples: 14120776
  download_size: 3271155810
  dataset_size: 6252715344
- config_name: qa_data
  features:
  - name: query_id
    dtype: string
  - name: question
    dtype: string
  - name: answer
    dtype: string
  - name: search_results
    struct:
    - name: related_links
      sequence: string
    - name: snippets
      sequence: string
    - name: titles
      sequence: string
    - name: urls
      sequence: string
  - name: doc_id
    sequence: string
  - name: __index_level_0__
    dtype: int64
  splits:
  - name: train
    num_bytes: 6503932619
    num_examples: 173397
  - name: test
    num_bytes: 1830028629
    num_examples: 43350
  download_size: 5008413626
  dataset_size: 8333961248
configs:
- config_name: corpus
  data_files:
  - split: train
    path: corpus/train-*
- config_name: qa_data
  data_files:
  - split: train
    path: qa_data/train-*
  - split: test
    path: qa_data/test-*
---
# preprocessed_SearchQA
The SearchQA question-answer pairs originate from J! Archive, which archives the question-answer pairs
from the television show Jeopardy! The passages are sourced from Google search web page snippets.
We provide passage metadata (`air_date`, `category`, `value`, `round`, and `show_number`)
that you can use to improve retrieval performance.
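
One way to use the passage metadata is to narrow the candidate passages before ranking them. The sketch below is only an illustration under stated assumptions: `TOY_CORPUS` and `filter_passages` are hypothetical names, and the rows are invented values that merely mirror the `corpus` field names.

```python
from typing import Iterable, Iterator, Optional

# Hypothetical toy rows that mirror the `corpus` field names from the card.
# The values are invented for illustration and are not taken from the dataset.
TOY_CORPUS = [
    {"doc_id": "d1", "snippets": "Paris is the capital of France.",
     "category": "WORLD CAPITALS", "round": "Jeopardy!", "show_number": 1},
    {"doc_id": "d2", "snippets": "Hamlet was written by Shakespeare.",
     "category": "LITERATURE", "round": "Double Jeopardy!", "show_number": 1},
    {"doc_id": "d3", "snippets": "Canberra is the capital of Australia.",
     "category": "WORLD CAPITALS", "round": "Double Jeopardy!", "show_number": 2},
]


def filter_passages(
    rows: Iterable[dict],
    category: Optional[str] = None,
    round_name: Optional[str] = None,
) -> Iterator[dict]:
    """Yield only the rows whose metadata matches every filter that is set."""
    for row in rows:
        if category is not None and row["category"] != category:
            continue
        if round_name is not None and row["round"] != round_name:
            continue
        yield row


# Keep only passages from one category before handing them to a retriever.
capitals = list(filter_passages(TOY_CORPUS, category="WORLD CAPITALS"))
```

Whether such a filter actually helps depends on the retriever and on how reliably the metadata of a query is known at search time.
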
For further details about SearchQA, refer to the links below.
[Github](https://github.com/nyu-dl/dl4ir-searchQA)<br>
[Paper](https://arxiv.org/abs/1704.05179)<br>
The dataset is derived from [SearchQA](https://huggingface.co/datasets/search_qa).<br>
This preprocessed dataset is for RAG. For more information about our task, visit our [repository](https://github.com/NomaDamas/RAGchain)!<br>
It is the output of code that preprocesses the SearchQA dataset for a RAG benchmark. <br>
For more information, refer to this link: [huggingface](https://huggingface.co/datasets/NomaDamas/search_qa_split)
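
For a RAG evaluation, the `doc_id` list of a `qa_data` row can be resolved to passage text in `corpus`. The sketch below assumes that the `doc_id` values in `qa_data` refer to `doc_id` values in `corpus`, which the card does not state explicitly. The helper names and the toy rows are hypothetical.

```python
from typing import Dict, Iterable, List


def build_doc_index(corpus_rows: Iterable[dict]) -> Dict[str, str]:
    """Map each corpus doc_id to its snippet text."""
    return {row["doc_id"]: row["snippets"] for row in corpus_rows}


def passages_for_question(qa_row: dict, doc_index: Dict[str, str]) -> List[str]:
    """Return the snippets for a QA row's doc_id list, skipping unknown ids."""
    return [doc_index[d] for d in qa_row["doc_id"] if d in doc_index]


# Invented toy rows that only mirror the field names from the card.
toy_corpus = [
    {"doc_id": "d1", "snippets": "Paris is the capital of France."},
    {"doc_id": "d2", "snippets": "France is a country in Europe."},
]
toy_qa = {
    "query_id": "q1",
    "question": "This city is the capital of France",
    "answer": "Paris",
    "doc_id": ["d1", "d2", "d9"],  # "d9" is deliberately missing from toy_corpus
}

index = build_doc_index(toy_corpus)
ground_truth_passages = passages_for_question(toy_qa, index)
```

The full `corpus` config has 14120776 rows, so an in-memory dictionary like this may be too large in practice. An on-disk key-value store or a vector store would be a more realistic choice there.
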
Provider:
NomaDamas
Summary of original information
Dataset overview
Dataset configurations

- corpus
  - Features: query_id (string), snippets (string), air_date (string), category (string), value (string), round (string), show_number (int32), doc_id (string), __index_level_0__ (int64)
  - Splits: train (6252715344 bytes, 14120776 examples)
  - Download size: 3271155810 bytes
  - Dataset size: 6252715344 bytes
- qa_data
  - Features: query_id (string), question (string), answer (string), search_results (struct of related_links, snippets, titles, and urls, each a sequence of strings), doc_id (sequence of strings), __index_level_0__ (int64)
  - Splits: train (6503932619 bytes, 173397 examples), test (1830028629 bytes, 43350 examples)
  - Download size: 5008413626 bytes
  - Dataset size: 8333961248 bytes

Data files

- corpus
  - train: corpus/train-*
- qa_data
  - train: qa_data/train-*
  - test: qa_data/test-*
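
The configurations and splits listed above can be loaded with the Hugging Face `datasets` library. The following is a minimal sketch, assuming the repository ID `NomaDamas/split_search_qa` from the page title. `CONFIG_SPLITS` and `load_split` are hypothetical helper names, and streaming is used so that the multi-gigabyte files are not downloaded up front.

```python
from typing import Dict, Tuple

# Repository ID taken from the page title; treat it as an assumption.
REPO_ID = "NomaDamas/split_search_qa"

# Configs and splits as listed in the card's `configs` section.
CONFIG_SPLITS: Dict[str, Tuple[str, ...]] = {
    "corpus": ("train",),
    "qa_data": ("train", "test"),
}


def load_split(config: str, split: str, streaming: bool = True):
    """Validate the config/split pair, then load it from the Hugging Face Hub."""
    if config not in CONFIG_SPLITS:
        raise ValueError(f"unknown config {config!r}; expected one of {sorted(CONFIG_SPLITS)}")
    if split not in CONFIG_SPLITS[config]:
        raise ValueError(f"config {config!r} has no split {split!r}")
    # Imported lazily so the validation above works without the library installed.
    from datasets import load_dataset

    return load_dataset(REPO_ID, config, split=split, streaming=streaming)
```

With network access, `load_split("qa_data", "test")` returns an iterable dataset, and `next(iter(...))` on it yields one row as a dictionary keyed by the feature names.
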



