lnwang/retrieval_qa

Source: Hugging Face · Updated 2023-12-22 · Indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/lnwang/retrieval_qa
Resource description:
---
language:
- en
- zh
- ja
- es
- de
- ru
license: apache-2.0
size_categories:
- 1K<n<10K
dataset_info:
- config_name: de
  features:
  - name: region
    dtype: string
  - name: doc
    dtype: string
  - name: query
    dtype: string
  - name: choice
    sequence:
      sequence: string
  - name: answer
    dtype: string
  splits:
  - name: test
    num_bytes: 268775
    num_examples: 196
  download_size: 0
  dataset_size: 268775
- config_name: default
  features:
  - name: region
    dtype: string
  - name: doc
    dtype: string
  - name: query
    dtype: string
  - name: choice
    sequence:
      sequence: string
  - name: answer
    dtype: string
  splits:
  - name: test
    num_bytes: 233289
    num_examples: 196
  download_size: 0
  dataset_size: 233289
- config_name: en
  features:
  - name: region
    dtype: string
  - name: doc
    dtype: string
  - name: query
    dtype: string
  - name: choice
    sequence:
      sequence: string
  - name: answer
    dtype: string
  splits:
  - name: test
    num_bytes: 233289
    num_examples: 196
  download_size: 0
  dataset_size: 233289
- config_name: es
  features:
  - name: region
    dtype: string
  - name: doc
    dtype: string
  - name: query
    dtype: string
  - name: choice
    sequence:
      sequence: string
  - name: answer
    dtype: string
  splits:
  - name: test
    num_bytes: 267456
    num_examples: 196
  download_size: 0
  dataset_size: 267456
- config_name: ja
  features:
  - name: region
    dtype: string
  - name: doc
    dtype: string
  - name: query
    dtype: string
  - name: choice
    sequence:
      sequence: string
  - name: answer
    dtype: string
  splits:
  - name: test
    num_bytes: 268010
    num_examples: 196
  download_size: 0
  dataset_size: 268010
- config_name: ru
  features:
  - name: region
    dtype: string
  - name: doc
    dtype: string
  - name: query
    dtype: string
  - name: choice
    sequence:
      sequence: string
  - name: answer
    dtype: string
  splits:
  - name: test
    num_bytes: 413438
    num_examples: 196
  download_size: 191766
  dataset_size: 413438
- config_name: zh_cn
  features:
  - name: region
    dtype: string
  - name: doc
    dtype: string
  - name: query
    dtype: string
  - name: choice
    sequence:
      sequence: string
  - name: answer
    dtype: string
  splits:
  - name: test
    num_bytes: 200707
    num_examples: 196
  download_size: 0
  dataset_size: 200707
- config_name: zh_tw
  features:
  - name: region
    dtype: string
  - name: doc
    dtype: string
  - name: query
    dtype: string
  - name: choice
    sequence:
      sequence: string
  - name: answer
    dtype: string
  splits:
  - name: test
    num_bytes: 201205
    num_examples: 196
  download_size: 0
  dataset_size: 201205
configs:
- config_name: de
  data_files:
  - split: test
    path: de/test-*
- config_name: default
  data_files:
  - split: test
    path: data/test-*
- config_name: en
  data_files:
  - split: test
    path: en/test-*
- config_name: es
  data_files:
  - split: test
    path: es/test-*
- config_name: ja
  data_files:
  - split: test
    path: ja/test-*
- config_name: ru
  data_files:
  - split: test
    path: ru/test-*
- config_name: zh_cn
  data_files:
  - split: test
    path: zh_cn/test-*
- config_name: zh_tw
  data_files:
  - split: test
    path: zh_tw/test-*
tags:
- art
---

# Retrieval_QA: A Simple Multilingual Benchmark For Retrieval Encoder Models

The purpose of this dataset is to provide a simple, easy-to-use benchmark for retrieval encoder models. It helps researchers quickly select the most effective retrieval encoder for text extraction and achieve optimal results in subsequent retrieval tasks such as retrieval-augmented generation (RAG). The dataset contains multiple document-question pairs, where each document is a short text about the history, culture, or other information of a country or region, and each question is a query relevant to the content of the corresponding document.

## Dataset Details

### Dataset Description

Users may select a retrieval encoder model to encode each document and query into embeddings, and then use a vector matching method such as FAISS to identify the most relevant documents for each query as retrieval results.

+ **Curated by**: <a href='https://wln20.github.io'>Luning Wang</a>
+ **Language(s)**: English, Chinese (Simplified, Traditional), Japanese, Spanish, German, Russian
+ **License**: Apache-2.0

### Dataset Sources

- **Repository:** https://github.com/wln20/Retrieval_QA
- **Paper:** TBD
- **Demo:** TBD

## Uses

The dataset is available on 🤗 Hugging Face. You can conveniently use it in Python with 🤗 Datasets:

```python
from datasets import load_dataset

# Load the English subset; pass a different `name` to select another language.
dataset_en = load_dataset('lnwang/retrieval_qa', name='en')
# dataset_zh_cn = load_dataset('lnwang/retrieval_qa', name='zh_cn')
# dataset_zh_tw = load_dataset('lnwang/retrieval_qa', name='zh_tw')
```

Seven language subsets are currently supported: English (en), Simplified Chinese (zh_cn), Traditional Chinese (zh_tw), Japanese (ja), Spanish (es), German (de), and Russian (ru). Specify the `name` argument in `load_dataset()` to get the corresponding subset.

For more usage examples, see the GitHub repository of this project.

## Dataset Creation

The raw data was generated by GPT-3.5-turbo using carefully designed, human-written prompts. The data was also cleaned to remove controversial and incorrect information.
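
As an illustration of the workflow outlined under Dataset Description (encode every document and query, then match vectors), here is a minimal sketch of a top-1 retrieval check. It makes several assumptions that are not part of the dataset card: `toy_encode` is a bag-of-words stand-in for a real encoder model, brute-force cosine similarity replaces an index such as FAISS, the sample rows are invented, and each row's `doc` is treated as the gold document for its `query`.

```python
import math
import re
from collections import Counter


def toy_encode(text):
    """Stand-in encoder (assumption): a bag-of-words count vector."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, docs, encode=toy_encode):
    """Return the index of the document most similar to the query."""
    query_vec = encode(query)
    scores = [cosine(query_vec, encode(doc)) for doc in docs]
    return max(range(len(docs)), key=scores.__getitem__)


def top1_accuracy(rows, encode=toy_encode):
    """Fraction of rows whose query retrieves that row's own document."""
    docs = [row["doc"] for row in rows]
    hits = sum(retrieve(row["query"], docs, encode) == i
               for i, row in enumerate(rows))
    return hits / len(rows)


# Invented rows that only mirror the `doc` and `query` fields.
rows = [
    {"doc": "The Eiffel Tower in Paris was completed in 1889.",
     "query": "When was the Eiffel Tower completed?"},
    {"doc": "Mount Fuji is the highest mountain in Japan.",
     "query": "What is the highest mountain in Japan?"},
    {"doc": "The Amazon river flows through Brazil.",
     "query": "Which river flows through Brazil?"},
]
accuracy = top1_accuracy(rows)
```

To compare real encoders, one would pass a different `encode` callable and feed in rows from a loaded subset; the scoring function would then need to match whatever vector type that encoder returns.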
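Provider:
lnwang
Summary of original information

Dataset overview

Dataset description

This dataset aims to provide a simple, easy-to-use benchmark for retrieval encoder models, helping researchers quickly select the most effective retrieval encoder for text extraction and achieve the best results in subsequent retrieval tasks such as retrieval-augmented generation (RAG). The dataset contains multiple document-question pairs: each document is a short text about the history, culture, or other information of a country or region, and each question is a query about the content of the corresponding document.

Dataset details

Languages

  • English (en)
  • Simplified Chinese (zh_cn)
  • Traditional Chinese (zh_tw)
  • Japanese (ja)
  • Spanish (es)
  • German (de)
  • Russian (ru)

License

  • Apache-2.0

Dataset configurations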

  • config_name: de

    • Features:
      • region: string
      • doc: string
      • query: string
      • choice: sequence of string
      • answer: string
    • Splits:
      • test:
        • num_bytes: 268775
        • num_examples: 196
    • download_size: 0
    • dataset_size: 268775
  • config_name: default

    • Features:
      • region: string
      • doc: string
      • query: string
      • choice: sequence of string
      • answer: string
    • Splits:
      • test:
        • num_bytes: 233289
        • num_examples: 196
    • download_size: 0
    • dataset_size: 233289
  • config_name: en

    • Features:
      • region: string
      • doc: string
      • query: string
      • choice: sequence of string
      • answer: string
    • Splits:
      • test:
        • num_bytes: 233289
        • num_examples: 196
    • download_size: 0
    • dataset_size: 233289
  • config_name: es

    • Features:
      • region: string
      • doc: string
      • query: string
      • choice: sequence of string
      • answer: string
    • Splits:
      • test:
        • num_bytes: 267456
        • num_examples: 196
    • download_size: 0
    • dataset_size: 267456
  • config_name: ja

    • Features:
      • region: string
      • doc: string
      • query: string
      • choice: sequence of string
      • answer: string
    • Splits:
      • test:
        • num_bytes: 268010
        • num_examples: 196
    • download_size: 0
    • dataset_size: 268010
  • config_name: ru

    • Features:
      • region: string
      • doc: string
      • query: string
      • choice: sequence of string
      • answer: string
    • Splits:
      • test:
        • num_bytes: 413438
        • num_examples: 196
    • download_size: 191766
    • dataset_size: 413438
  • config_name: zh_cn

    • Features:
      • region: string
      • doc: string
      • query: string
      • choice: sequence of string
      • answer: string
    • Splits:
      • test:
        • num_bytes: 200707
        • num_examples: 196
    • download_size: 0
    • dataset_size: 200707
  • config_name: zh_tw

    • Features:
      • region: string
      • doc: string
      • query: string
      • choice: sequence of string
      • answer: string
    • Splits:
      • test:
        • num_bytes: 201205
        • num_examples: 196
    • download_size: 0
    • dataset_size: 201205
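
Every configuration above lists the same five fields, so one row-level check can cover all of them. The following is a minimal sketch, not an official validator: `validate_row` and the sample row are hypothetical, and because the listing leaves it unclear whether `choice` is a flat or a nested sequence of strings, the check accepts both shapes.

```python
# Fields that every configuration lists as plain strings.
STRING_FIELDS = ("region", "doc", "query", "answer")


def is_string_sequence(value):
    """True if value is a list or tuple containing only strings."""
    return (isinstance(value, (list, tuple))
            and all(isinstance(item, str) for item in value))


def validate_row(row):
    """Return a list of problems; an empty list means the row looks valid."""
    problems = []
    for field in STRING_FIELDS:
        if not isinstance(row.get(field), str):
            problems.append(f"{field}: expected a string")
    choice = row.get("choice")
    flat = is_string_sequence(choice)
    nested = (isinstance(choice, (list, tuple))
              and all(is_string_sequence(item) for item in choice))
    if not (flat or nested):
        problems.append("choice: expected a flat or nested sequence of strings")
    return problems


# Invented row for illustration only; real rows come from the dataset.
sample_row = {
    "region": "Japan",
    "doc": "Mount Fuji is the highest mountain in Japan.",
    "query": "What is the highest mountain in Japan?",
    "choice": [["A", "Mount Fuji"], ["B", "Mount Aso"]],
    "answer": "A",
}
problems = validate_row(sample_row)
```

Running such a check over all 196 rows of a subset would be a cheap way to confirm the schema before building an evaluation on top of it.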

Data file configurations

  • config_name: de

    • data_files:
      • split: test
        • path: de/test-*
  • config_name: default

    • data_files:
      • split: test
        • path: data/test-*
  • config_name: en

    • data_files:
      • split: test
        • path: en/test-*
  • config_name: es

    • data_files:
      • split: test
        • path: es/test-*
  • config_name: ja

    • data_files:
      • split: test
        • path: ja/test-*
  • config_name: ru

    • data_files:
      • split: test
        • path: ru/test-*
  • config_name: zh_cn

    • data_files:
      • split: test
        • path: zh_cn/test-*
  • config_name: zh_tw

    • data_files:
      • split: test
        • path: zh_tw/test-*
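
Each `path` above is a glob pattern that selects the files making up a configuration's test split. The sketch below approximates that matching with Python's `fnmatch`; this is an assumption made for illustration, since the hosting library resolves patterns with its own logic, and the shard file names are invented.

```python
from fnmatch import fnmatchcase

# Path patterns copied from the data file configurations above.
CONFIG_PATTERNS = {
    "de": "de/test-*",
    "default": "data/test-*",
    "en": "en/test-*",
    "es": "es/test-*",
    "ja": "ja/test-*",
    "ru": "ru/test-*",
    "zh_cn": "zh_cn/test-*",
    "zh_tw": "zh_tw/test-*",
}


def files_for_config(config_name, repo_files):
    """Return the files whose path matches the config's test-split pattern."""
    pattern = CONFIG_PATTERNS[config_name]
    return [path for path in repo_files if fnmatchcase(path, pattern)]


# Hypothetical repository listing; the shard names are not from the source.
repo_files = [
    "README.md",
    "data/test-00000-of-00001.parquet",
    "en/test-00000-of-00001.parquet",
    "zh_cn/test-00000-of-00001.parquet",
]
en_files = files_for_config("en", repo_files)
```

Because each pattern is anchored to its own directory, the `en` pattern does not pick up files under `zh_cn/`, and the `default` configuration reads from `data/` rather than from a language directory.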

Tags

  • art