NomaDamas/split_search_qa

Name: NomaDamas/split_search_qa
Creator: NomaDamas
Published: 2024-01-04 13:52:53
License: No description available

Hugging Face · Updated 2024-01-04 · Indexed 2024-03-04

Download link:

https://hf-mirror.com/datasets/NomaDamas/split_search_qa

Resource description:

---
license: unknown
dataset_info:
- config_name: corpus
  features:
  - name: query_id
    dtype: string
  - name: snippets
    dtype: string
  - name: air_date
    dtype: string
  - name: category
    dtype: string
  - name: value
    dtype: string
  - name: round
    dtype: string
  - name: show_number
    dtype: int32
  - name: doc_id
    dtype: string
  - name: __index_level_0__
    dtype: int64
  splits:
  - name: train
    num_bytes: 6252715344
    num_examples: 14120776
  download_size: 3271155810
  dataset_size: 6252715344
- config_name: qa_data
  features:
  - name: query_id
    dtype: string
  - name: question
    dtype: string
  - name: answer
    dtype: string
  - name: search_results
    struct:
    - name: related_links
      sequence: string
    - name: snippets
      sequence: string
    - name: titles
      sequence: string
    - name: urls
      sequence: string
  - name: doc_id
    sequence: string
  - name: __index_level_0__
    dtype: int64
  splits:
  - name: train
    num_bytes: 6503932619
    num_examples: 173397
  - name: test
    num_bytes: 1830028629
    num_examples: 43350
  download_size: 5008413626
  dataset_size: 8333961248
configs:
- config_name: corpus
  data_files:
  - split: train
    path: corpus/train-*
- config_name: qa_data
  data_files:
  - split: train
    path: qa_data/train-*
  - split: test
    path: qa_data/test-*
---

# preprocessed_SearchQA

The SearchQA question-answer pairs originate from J! Archive, which comprehensively archives all question-answer pairs from the renowned television show Jeopardy! The passages are sourced from Google search web page snippets. We provide passage metadata, including 'air_date', 'category', 'value', 'round', and 'show_number', which you can use to improve retrieval performance at your discretion.

For further details about SearchQA, please refer to the links below.

[Github](https://github.com/nyu-dl/dl4ir-searchQA)
[Paper](https://arxiv.org/abs/1704.05179)

The dataset is derived from [SearchQA](https://huggingface.co/datasets/search_qa).

This preprocessed dataset is for RAG. For more information about our task, visit our [repository](https://github.com/NomaDamas/RAGchain)!

SearchQA dataset preprocessing code for the RAG benchmark. For more information, refer to this link: [huggingface](https://huggingface.co/datasets/NomaDamas/search_qa_split)
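One way the passage metadata might be used is to narrow the candidate passages before ranking them. The following is a minimal sketch, assuming corpus rows are available as plain dictionaries with the field names from the `corpus` config. The sample rows and the `filter_passages` helper are hypothetical and are not part of the dataset or of RAGchain.

```python
from typing import Any, Dict, Iterable, Iterator

# Hypothetical rows shaped like the `corpus` config; all values are invented.
SAMPLE_CORPUS = [
    {"doc_id": "d1", "snippets": "Paris is the capital of France.",
     "category": "WORLD CAPITALS", "round": "Jeopardy!", "show_number": 1},
    {"doc_id": "d2", "snippets": "Canberra is the capital of Australia.",
     "category": "WORLD CAPITALS", "round": "Double Jeopardy!", "show_number": 2},
    {"doc_id": "d3", "snippets": "Hamlet is a tragedy by Shakespeare.",
     "category": "LITERATURE", "round": "Jeopardy!", "show_number": 1},
]


def filter_passages(rows: Iterable[Dict[str, Any]],
                    **criteria: Any) -> Iterator[Dict[str, Any]]:
    """Yield rows whose metadata fields equal every given criterion."""
    for row in rows:
        if all(row.get(field) == wanted for field, wanted in criteria.items()):
            yield row


# Keep only passages from one category, then also restrict by round.
capitals = list(filter_passages(SAMPLE_CORPUS, category="WORLD CAPITALS"))
first_round_capitals = list(
    filter_passages(SAMPLE_CORPUS, category="WORLD CAPITALS", round="Jeopardy!")
)
```

Exact-match filtering is only one option. Whether it helps retrieval on this dataset depends on how the metadata values are actually formatted, which should be checked against real rows.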

Provider:

NomaDamas

Summary of original information

Dataset overview

Dataset configurations

corpus
- Features
  - query_id: string
  - snippets: string
  - air_date: string
  - category: string
  - value: string
  - round: string
  - show_number: integer (int32)
  - doc_id: string
  - __index_level_0__: integer (int64)
- Splits
  - train: 6252715344 bytes, 14120776 examples
- Download size: 3271155810 bytes
- Dataset size: 6252715344 bytes
qa_data
- Features
  - query_id: string
  - question: string
  - answer: string
  - search_results: struct
    - related_links: sequence of strings
    - snippets: sequence of strings
    - titles: sequence of strings
    - urls: sequence of strings
  - doc_id: sequence of strings
  - __index_level_0__: integer (int64)
- Splits
  - train: 6503932619 bytes, 173397 examples
  - test: 1830028629 bytes, 43350 examples
- Download size: 5008413626 bytes
- Dataset size: 8333961248 bytes
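The two configurations share a `doc_id` field: a string in `corpus` and a sequence of strings in `qa_data`. A plausible reading, which the source does not state explicitly, is that each `qa_data` row lists the `corpus` passages associated with its question. Under that assumption, a minimal sketch of resolving a question to its passages could look like this. The helper names and sample rows are hypothetical.

```python
from typing import Any, Dict, Iterable, List


def build_passage_index(corpus_rows: Iterable[Dict[str, Any]]) -> Dict[str, str]:
    """Map each corpus doc_id to its snippet text."""
    return {row["doc_id"]: row["snippets"] for row in corpus_rows}


def gold_passages(qa_row: Dict[str, Any], index: Dict[str, str]) -> List[str]:
    """Return the snippets for a QA row's doc_ids, skipping unknown ids."""
    return [index[doc_id] for doc_id in qa_row["doc_id"] if doc_id in index]


# Hypothetical rows; the values are invented for illustration only.
corpus_rows = [
    {"doc_id": "d1", "query_id": "q1", "snippets": "Paris is the capital of France."},
    {"doc_id": "d2", "query_id": "q1", "snippets": "France's capital city is Paris."},
    {"doc_id": "d3", "query_id": "q2", "snippets": "Hamlet is a tragedy by Shakespeare."},
]
qa_row = {
    "query_id": "q1",
    "question": "This city is the capital of France",
    "answer": "Paris",
    "doc_id": ["d1", "d2", "missing-id"],
}

index = build_passage_index(corpus_rows)
passages = gold_passages(qa_row, index)
```

With roughly 14 million corpus rows, an in-memory dictionary like this may be too large on some machines, so an on-disk index or a streamed join would be a reasonable alternative.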

Data files

corpus
- train: corpus/train-*
qa_data
- train: qa_data/train-*
- test: qa_data/test-*
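The configuration and split names above map directly onto the arguments of `load_dataset` from the Hugging Face `datasets` library. The sketch below assumes that library is installed and that the repository id `NomaDamas/split_search_qa` from this page is reachable. The `load_split` wrapper and the `CONFIG_SPLITS` table are hypothetical conveniences, not part of any library.

```python
from typing import Dict, Tuple

# Configurations and splits as listed in the dataset card.
CONFIG_SPLITS: Dict[str, Tuple[str, ...]] = {
    "corpus": ("train",),
    "qa_data": ("train", "test"),
}


def load_split(config: str, split: str, streaming: bool = True):
    """Load one split of one configuration, validating the names first."""
    if config not in CONFIG_SPLITS:
        raise ValueError(f"unknown config {config!r}; expected one of {sorted(CONFIG_SPLITS)}")
    if split not in CONFIG_SPLITS[config]:
        raise ValueError(f"config {config!r} has no split {split!r}")
    # Imported lazily so the name checks work without the library installed.
    from datasets import load_dataset
    # streaming=True avoids downloading the multi-gigabyte files up front.
    return load_dataset("NomaDamas/split_search_qa", config,
                        split=split, streaming=streaming)
```

For example, `load_split("qa_data", "test")` would return an iterable over the test questions, while `load_split("corpus", "test")` raises `ValueError` because the corpus only has a train split.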
