dkoterwa/mlqa_filtered
Hugging Face | Updated 2024-04-20 | Indexed 2024-06-12
Download link:
https://hf-mirror.com/datasets/dkoterwa/mlqa_filtered
Resource description:
---
dataset_info:
  features:
  - name: context
    dtype: string
  - name: question
    dtype: string
  - name: id
    dtype: string
  - name: lang
    dtype: string
  - name: answer
    dtype: string
  splits:
  - name: train
    num_bytes: 46553350
    num_examples: 41019
  download_size: 24529122
  dataset_size: 46553350
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
---
# mlqa filtered version
For a better dataset description, please visit the official site of the source dataset: [LINK](https://huggingface.co/datasets/mlqa) <br>
<br>
**This dataset was prepared by converting the mlqa dataset.** I concatenated the versions of the dataset for the languages of interest and extracted the text answers from the "answers" column.
**I am also sharing the code I used to convert the original dataset, to make everything clearer.**
```
import pandas as pd
from datasets import load_dataset
from tqdm import tqdm


def download_mlqa(subset_name):
    # Merge the validation and test splits of one subset into a single frame.
    dataset_valid = load_dataset("mlqa", subset_name, split="validation").to_pandas()
    dataset_test = load_dataset("mlqa", subset_name, split="test").to_pandas()
    full_dataset = pd.concat([dataset_valid, dataset_test])
    full_dataset.reset_index(drop=True, inplace=True)
    return full_dataset


# Subsets for the six languages of interest.
needed_langs = ["mlqa.en.en", "mlqa.de.de", "mlqa.ar.ar", "mlqa.es.es", "mlqa.vi.vi", "mlqa.zh.zh"]
datasets = []
for lang in tqdm(needed_langs):
    dataset = download_mlqa(lang)
    dataset["lang"] = lang.split(".")[2]  # "mlqa.en.en" -> "en"
    datasets.append(dataset)

full_mlqa = pd.concat(datasets)
full_mlqa.reset_index(drop=True, inplace=True)
# Keep only the first answer text from each "answers" entry.
full_mlqa["answer"] = [answer_dict["text"][0] for answer_dict in full_mlqa["answers"]]
full_mlqa.drop("answers", axis=1, inplace=True)
```
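The two small transformations in the conversion code (deriving `lang` from the subset name and keeping the first answer text) can be tried without downloading anything. This is only a sketch: the helper names `lang_from_subset` and `first_answer_text` and the sample rows are hypothetical, and the `answers` shape (a dict with a `text` list, plus an assumed `answer_start` list) is inferred from how the code above indexes it.

```python
def lang_from_subset(subset_name):
    """Return the third dot-separated component, e.g. "de" for "mlqa.de.de"."""
    return subset_name.split(".")[2]


def first_answer_text(answers):
    """Return the first answer string from an `answers` dict."""
    return answers["text"][0]


# Hypothetical sample rows; the `answer_start` key is an assumption and unused here.
rows = [
    {"question": "Who wrote Faust?",
     "answers": {"text": ["Goethe"], "answer_start": [0]}},
    {"question": "What is the capital of Spain?",
     "answers": {"text": ["Madrid", "the city of Madrid"], "answer_start": [12, 4]}},
]

# Replace the nested `answers` dict with a flat `answer` string.
flattened = [
    {"question": row["question"], "answer": first_answer_text(row["answers"])}
    for row in rows
]

print(flattened)
print(lang_from_subset("mlqa.de.de"))
```

Only the first entry of `text` is kept, so any further reference answers in a row are dropped by this conversion.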
**How to download**
```
from datasets import load_dataset
data = load_dataset("dkoterwa/mlqa_filtered")
```
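Since every language ends up in the single `train` split, a likely next step is selecting rows by the `lang` column. The sketch below uses hypothetical in-memory records with the same five string fields rather than the real data; `filter_by_lang` is an illustrative helper, not part of any library.

```python
from collections import Counter

# Hypothetical records mirroring the dataset's five string columns.
records = [
    {"id": "1", "lang": "en", "context": "Paris is in France.",
     "question": "Where is Paris?", "answer": "France"},
    {"id": "2", "lang": "de", "context": "Berlin liegt in Deutschland.",
     "question": "Wo liegt Berlin?", "answer": "Deutschland"},
    {"id": "3", "lang": "en", "context": "Water boils at 100 C.",
     "question": "When does water boil?", "answer": "100 C"},
]


def filter_by_lang(rows, lang):
    """Keep only the rows whose `lang` field equals the requested code."""
    return [row for row in rows if row["lang"] == lang]


english = filter_by_lang(records, "en")
per_lang = Counter(row["lang"] for row in records)

print(len(english))
print(dict(per_lang))

# With the loaded dataset, a call along these lines should do the same:
# data["train"].filter(lambda row: row["lang"] == "en")
```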
Provider:
dkoterwa
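Original information summary
Dataset overview
Dataset features
- context: string
- question: string
- id: string
- lang: string
- answer: string
Dataset splits
- train:
  - Number of examples: 41019
  - Data size: 46553350 bytes
Dataset size
- Download size: 24529122 bytes
- Total dataset size: 46553350 bytes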
Configuration
- config_name: default
- data_files:
  - split: train
  - path: data/train-*
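The summary lists five features, all of type string. A minimal sketch of checking a record against that schema follows; `EXPECTED_FIELDS`, `schema_problems`, and both sample records are hypothetical names and data introduced only for illustration.

```python
# The five feature names listed for this dataset, all expected to be strings.
EXPECTED_FIELDS = ("context", "question", "id", "lang", "answer")


def schema_problems(row):
    """Return a list of human-readable problems found in one record."""
    problems = []
    for field in EXPECTED_FIELDS:
        if field not in row:
            problems.append(f"missing field: {field}")
        elif not isinstance(row[field], str):
            problems.append(f"{field} is {type(row[field]).__name__}, expected str")
    return problems


# Hypothetical records: one valid, one with a non-string id and no answer.
good = {"context": "text", "question": "q?", "id": "7", "lang": "en", "answer": "a"}
bad = {"context": "text", "question": "q?", "id": 7, "lang": "en"}

print(schema_problems(good))
print(schema_problems(bad))
```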



