dkoterwa/mlqa_filtered

Hugging Face | Updated 2024-04-20 | Indexed 2024-06-12
Download link:
https://hf-mirror.com/datasets/dkoterwa/mlqa_filtered
Description:
---
dataset_info:
  features:
  - name: context
    dtype: string
  - name: question
    dtype: string
  - name: id
    dtype: string
  - name: lang
    dtype: string
  - name: answer
    dtype: string
  splits:
  - name: train
    num_bytes: 46553350
    num_examples: 41019
  download_size: 24529122
  dataset_size: 46553350
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
---

# mlqa filtered version

For a better dataset description, please visit the official site of the source dataset: [LINK](https://huggingface.co/datasets/mlqa)

**This dataset was prepared by converting the mlqa dataset.** I concatenated the versions of the dataset for the languages of interest and extracted the text answers from the "answers" column.

**I additionally share the code I used to convert the original dataset, to make everything clearer.**

```
import pandas as pd
from datasets import load_dataset
from tqdm import tqdm


def download_mlqa(subset_name):
    # Merge the validation and test splits of one MLQA subset into a single DataFrame.
    dataset_valid = load_dataset("mlqa", subset_name, split="validation").to_pandas()
    dataset_test = load_dataset("mlqa", subset_name, split="test").to_pandas()
    full_dataset = pd.concat([dataset_valid, dataset_test])
    full_dataset.reset_index(drop=True, inplace=True)
    return full_dataset


needed_langs = ["mlqa.en.en", "mlqa.de.de", "mlqa.ar.ar", "mlqa.es.es", "mlqa.vi.vi", "mlqa.zh.zh"]
datasets = []
for lang in tqdm(needed_langs):
    dataset = download_mlqa(lang)
    # "mlqa.en.en" -> "en"
    dataset["lang"] = lang.split(".")[2]
    datasets.append(dataset)

full_mlqa = pd.concat(datasets)
full_mlqa.reset_index(drop=True, inplace=True)
# Keep only the first answer text of each example, then drop the original column.
full_mlqa["answer"] = [answer_dict["text"][0] for answer_dict in full_mlqa["answers"]]
full_mlqa.drop("answers", axis=1, inplace=True)
```

**How to download**

```
from datasets import load_dataset

data = load_dataset("dkoterwa/mlqa_filtered")
```
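To make the two per-row transformations in the conversion code easier to follow, here is a minimal sketch of them in plain Python, without pandas or a network download. The helper names (`lang_from_subset`, `first_answer_text`, `convert`) and the sample record are hypothetical and made up for illustration; the sketch assumes the `answers` field is a dict with a `text` list, as the conversion code implies.

```python
def lang_from_subset(subset_name):
    """Return the language code from a subset name such as 'mlqa.de.de'."""
    return subset_name.split(".")[2]


def first_answer_text(answers):
    """Return the first answer string from an answers dict with a 'text' list."""
    return answers["text"][0]


def convert(record, subset_name):
    """Mirror the conversion: add 'lang' and 'answer', drop 'answers'."""
    converted = {key: value for key, value in record.items() if key != "answers"}
    converted["lang"] = lang_from_subset(subset_name)
    converted["answer"] = first_answer_text(record["answers"])
    return converted


# Hypothetical record, invented for this example; not taken from the dataset.
record = {
    "id": "example-0",
    "context": "Berlin ist die Hauptstadt von Deutschland.",
    "question": "Was ist die Hauptstadt von Deutschland?",
    "answers": {"answer_start": [0], "text": ["Berlin"]},
}

converted = convert(record, "mlqa.de.de")
print(converted["lang"], converted["answer"])  # prints: de Berlin
```

Note that only the first entry of `text` is kept, so any additional reference answers in the source data would not survive the conversion.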
Provided by:
dkoterwa
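Summary of original information

Dataset overview

Dataset features

  • context: string
  • question: string
  • id: string
  • lang: string
  • answer: string

Dataset splits

  • train:
    • Number of examples: 41019
    • Data size: 46553350 bytes

Dataset size

  • Download size: 24529122 bytes
  • Total dataset size: 46553350 bytes

Configuration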

  • config_name: default
    • data_files:
      • split: train
        • path: data/train-*
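
The schema summarized above (five string fields) can be checked against a loaded row with a small helper. This is a sketch only: `EXPECTED_FIELDS` and `schema_problems` are hypothetical names, and the sample rows are invented for illustration rather than taken from the dataset.

```python
# Field names as listed in the dataset card; every field is a string.
EXPECTED_FIELDS = ("context", "question", "id", "lang", "answer")


def schema_problems(row):
    """Return a list of human-readable schema problems for one row (empty if none)."""
    problems = []
    for field in EXPECTED_FIELDS:
        if field not in row:
            problems.append(f"missing field: {field}")
        elif not isinstance(row[field], str):
            problems.append(f"{field} is {type(row[field]).__name__}, expected str")
    for field in row:
        if field not in EXPECTED_FIELDS:
            problems.append(f"unexpected field: {field}")
    return problems


# Invented rows for illustration.
good_row = {"context": "c", "question": "q", "id": "1", "lang": "en", "answer": "a"}
bad_row = {"context": "c", "question": "q", "id": "1", "lang": 5}

print(schema_problems(good_row))  # prints: []
print(schema_problems(bad_row))
```

A row obtained from `load_dataset("dkoterwa/mlqa_filtered")["train"]` is a plain dict, so it could be passed to `schema_problems` directly.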