lnwang/retrieval_qa
Source: Hugging Face. Updated 2023-12-22. Indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/lnwang/retrieval_qa
Resource description:
---
language:
- en
- zh
- ja
- es
- de
- ru
license: apache-2.0
size_categories:
- 1K<n<10K
dataset_info:
- config_name: de
  features:
  - name: region
    dtype: string
  - name: doc
    dtype: string
  - name: query
    dtype: string
  - name: choice
    sequence:
      sequence: string
  - name: answer
    dtype: string
  splits:
  - name: test
    num_bytes: 268775
    num_examples: 196
  download_size: 0
  dataset_size: 268775
- config_name: default
  features:
  - name: region
    dtype: string
  - name: doc
    dtype: string
  - name: query
    dtype: string
  - name: choice
    sequence:
      sequence: string
  - name: answer
    dtype: string
  splits:
  - name: test
    num_bytes: 233289
    num_examples: 196
  download_size: 0
  dataset_size: 233289
- config_name: en
  features:
  - name: region
    dtype: string
  - name: doc
    dtype: string
  - name: query
    dtype: string
  - name: choice
    sequence:
      sequence: string
  - name: answer
    dtype: string
  splits:
  - name: test
    num_bytes: 233289
    num_examples: 196
  download_size: 0
  dataset_size: 233289
- config_name: es
  features:
  - name: region
    dtype: string
  - name: doc
    dtype: string
  - name: query
    dtype: string
  - name: choice
    sequence:
      sequence: string
  - name: answer
    dtype: string
  splits:
  - name: test
    num_bytes: 267456
    num_examples: 196
  download_size: 0
  dataset_size: 267456
- config_name: ja
  features:
  - name: region
    dtype: string
  - name: doc
    dtype: string
  - name: query
    dtype: string
  - name: choice
    sequence:
      sequence: string
  - name: answer
    dtype: string
  splits:
  - name: test
    num_bytes: 268010
    num_examples: 196
  download_size: 0
  dataset_size: 268010
- config_name: ru
  features:
  - name: region
    dtype: string
  - name: doc
    dtype: string
  - name: query
    dtype: string
  - name: choice
    sequence:
      sequence: string
  - name: answer
    dtype: string
  splits:
  - name: test
    num_bytes: 413438
    num_examples: 196
  download_size: 191766
  dataset_size: 413438
- config_name: zh_cn
  features:
  - name: region
    dtype: string
  - name: doc
    dtype: string
  - name: query
    dtype: string
  - name: choice
    sequence:
      sequence: string
  - name: answer
    dtype: string
  splits:
  - name: test
    num_bytes: 200707
    num_examples: 196
  download_size: 0
  dataset_size: 200707
- config_name: zh_tw
  features:
  - name: region
    dtype: string
  - name: doc
    dtype: string
  - name: query
    dtype: string
  - name: choice
    sequence:
      sequence: string
  - name: answer
    dtype: string
  splits:
  - name: test
    num_bytes: 201205
    num_examples: 196
  download_size: 0
  dataset_size: 201205
configs:
- config_name: de
  data_files:
  - split: test
    path: de/test-*
- config_name: default
  data_files:
  - split: test
    path: data/test-*
- config_name: en
  data_files:
  - split: test
    path: en/test-*
- config_name: es
  data_files:
  - split: test
    path: es/test-*
- config_name: ja
  data_files:
  - split: test
    path: ja/test-*
- config_name: ru
  data_files:
  - split: test
    path: ru/test-*
- config_name: zh_cn
  data_files:
  - split: test
    path: zh_cn/test-*
- config_name: zh_tw
  data_files:
  - split: test
    path: zh_tw/test-*
tags:
- art
---
# Retrieval_QA: A Simple Multilingual Benchmark For Retrieval Encoder Models
This dataset provides a simple, easy-to-use benchmark for retrieval encoder models. It helps researchers quickly select the most effective retrieval encoder for text extraction and achieve optimal results in downstream retrieval tasks such as retrieval-augmented generation (RAG). The dataset contains multiple document-question pairs: each document is a short text about the history, culture, or other information of a country or region, and each question is a query relevant to the content of the corresponding document.
## Dataset Details
### Dataset Description
Users can select a retrieval encoder model to encode each document and query into embeddings, and then use a vector search library such as FAISS to identify the most relevant documents for each query as the retrieval results.
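The encode-then-match workflow can be sketched as follows. This is only a minimal illustration under stated assumptions, not the project's evaluation code: `toy_encode` is a hypothetical bag-of-words stand-in for a real retrieval encoder, the brute-force cosine search stands in for a FAISS index, the three records are invented for the example (they are not taken from the dataset), and the scoring assumes that the query in row `i` should retrieve the document in row `i`.

```python
import math
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def build_vocab(texts):
    """Map every distinct token to a vector dimension."""
    vocab = {}
    for text in texts:
        for token in tokenize(text):
            vocab.setdefault(token, len(vocab))
    return vocab


def toy_encode(text, vocab):
    """Hypothetical stand-in for a retrieval encoder: an L2-normalised bag-of-words vector."""
    vec = [0.0] * len(vocab)
    for token in tokenize(text):
        if token in vocab:
            vec[vocab[token]] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def retrieve(query_vecs, doc_vecs, k=1):
    """Brute-force cosine search (stand-in for FAISS): top-k document indices per query."""
    results = []
    for q in query_vecs:
        scores = [sum(a * b for a, b in zip(q, d)) for d in doc_vecs]
        ranked = sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)
        results.append(ranked[:k])
    return results


def top_k_accuracy(results):
    """Fraction of queries whose paired document (same row index) is among the retrieved ones."""
    hits = sum(1 for i, ranked in enumerate(results) if i in ranked)
    return hits / len(results)


# Invented toy records, for illustration only.
docs = [
    "Mount Fuji is the highest mountain in Japan.",
    "The Amazon river flows through Brazil.",
    "Berlin is the capital city of Germany.",
]
queries = [
    "What is the highest mountain in Japan?",
    "Which river flows through Brazil?",
    "What is the capital of Germany?",
]

vocab = build_vocab(docs + queries)
doc_vecs = [toy_encode(d, vocab) for d in docs]
query_vecs = [toy_encode(q, vocab) for q in queries]

results = retrieve(query_vecs, doc_vecs, k=1)
accuracy = top_k_accuracy(results)
```

To benchmark a real encoder, one would replace `toy_encode` with the model's embedding function and, for larger corpora, swap the brute-force loop for a vector index; the accuracy computation stays the same.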
+ **Curated by**: <a href='https://wln20.github.io'>Luning Wang</a>
+ **Language(s)**: English, Chinese (Simplified, Traditional), Japanese, Spanish, German, Russian
+ **License**: Apache-2.0
### Dataset Sources
- **Repository:** https://github.com/wln20/Retrieval_QA
- **Paper:** TBD
- **Demo:** TBD
## Uses
The dataset is available on 🤗 Hugging Face, and you can conveniently load it in Python with 🤗 Datasets:
```python
from datasets import load_dataset
dataset_en = load_dataset('lnwang/retrieval_qa', name='en')
# dataset_zh_cn = load_dataset('lnwang/retrieval_qa', name='zh_cn')
# dataset_zh_tw = load_dataset('lnwang/retrieval_qa', name='zh_tw')
```
We currently support seven languages: English (en), Simplified Chinese (zh_cn), Traditional Chinese (zh_tw), Japanese (ja), Spanish (es), German (de), and Russian (ru). Specify the `name` argument in `load_dataset()` to get the corresponding subset.
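As a small sketch of selecting a subset by name, the helper below validates the language code before loading. `load_subset` and `SUPPORTED_CONFIGS` are hypothetical names introduced for this example; the only library call is `load_dataset` from 🤗 Datasets, and requesting `split="test"` relies on the card metadata, which lists a single `test` split per subset.

```python
# Language codes listed in this card (hypothetical constant name).
SUPPORTED_CONFIGS = ("en", "zh_cn", "zh_tw", "ja", "es", "de", "ru")


def load_subset(name):
    """Load the test split of one language subset (hypothetical helper)."""
    if name not in SUPPORTED_CONFIGS:
        raise ValueError(f"Unknown subset {name!r}; expected one of {SUPPORTED_CONFIGS}")
    # Imported lazily so that the validation above works without the library installed.
    from datasets import load_dataset

    return load_dataset("lnwang/retrieval_qa", name=name, split="test")


# Example usage (downloads data, so it is left commented out):
# subset = load_subset("ja")
# docs, queries = subset["doc"], subset["query"]
```

Each row is expected to carry the `region`, `doc`, `query`, `choice`, and `answer` fields declared in the metadata above.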
For more usage examples, see the GitHub repository of this project.
## Dataset Creation
The raw data was generated by GPT-3.5-turbo using prompts carefully designed by humans. The data was then cleaned to remove controversial and incorrect information.
Provider:
lnwang
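Original Information Summary
Dataset Overview
Dataset Description
This dataset aims to provide a simple, easy-to-use benchmark for retrieval encoder models, helping researchers quickly select the most effective retrieval encoder for text extraction and achieve the best results in downstream retrieval tasks such as retrieval-augmented generation (RAG). The dataset contains multiple document-question pairs; each document is a short text about the history, culture, or other information of a country or region, and each question is a query about the content of the corresponding document.
Dataset Details
Languages
- English (en)
- Simplified Chinese (zh_cn)
- Traditional Chinese (zh_tw)
- Japanese (ja)
- Spanish (es)
- German (de)
- Russian (ru)
License
- Apache-2.0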
Dataset Configurations
Every configuration has the same features: region (string), doc (string), query (string), choice (sequence of string sequences), answer (string). Each configuration has a single test split with 196 examples.
- config_name: de
  - num_bytes: 268775
  - download_size: 0
  - dataset_size: 268775
- config_name: default
  - num_bytes: 233289
  - download_size: 0
  - dataset_size: 233289
- config_name: en
  - num_bytes: 233289
  - download_size: 0
  - dataset_size: 233289
- config_name: es
  - num_bytes: 267456
  - download_size: 0
  - dataset_size: 267456
- config_name: ja
  - num_bytes: 268010
  - download_size: 0
  - dataset_size: 268010
- config_name: ru
  - num_bytes: 413438
  - download_size: 191766
  - dataset_size: 413438
- config_name: zh_cn
  - num_bytes: 200707
  - download_size: 0
  - dataset_size: 200707
- config_name: zh_tw
  - num_bytes: 201205
  - download_size: 0
  - dataset_size: 201205
Data File Configuration
- config_name: de
  - split: test
  - path: de/test-*
- config_name: default
  - split: test
  - path: data/test-*
- config_name: en
  - split: test
  - path: en/test-*
- config_name: es
  - split: test
  - path: es/test-*
- config_name: ja
  - split: test
  - path: ja/test-*
- config_name: ru
  - split: test
  - path: ru/test-*
- config_name: zh_cn
  - split: test
  - path: zh_cn/test-*
- config_name: zh_tw
  - split: test
  - path: zh_tw/test-*
Tags
- art



