nthakur/miracl-raft-instruct-v0.3
Source: Hugging Face. Updated 2024-05-15; indexed 2024-06-12.
Download link:
https://hf-mirror.com/datasets/nthakur/miracl-raft-instruct-v0.3
Resource description:
---
dataset_info:
- config_name: ar
  features:
  - name: prompt
    dtype: string
  - name: query_id
    dtype: string
  - name: positive_ids
    sequence: string
  - name: negative_ids
    sequence: 'null'
  - name: outputs
    list:
    - name: model
      dtype: string
    - name: output
      dtype: string
  - name: tydiqa_answer
    sequence: string
  splits:
  - name: train
    num_bytes: 14982233
    num_examples: 3468
  download_size: 6045510
  dataset_size: 14982233
- config_name: bn
  features:
  - name: prompt
    dtype: string
  - name: query_id
    dtype: string
  - name: positive_ids
    sequence: string
  - name: negative_ids
    sequence: 'null'
  - name: outputs
    list:
    - name: model
      dtype: string
    - name: output
      dtype: string
  - name: tydiqa_answer
    sequence: string
  splits:
  - name: train
    num_bytes: 11733406
    num_examples: 1624
  download_size: 3813737
  dataset_size: 11733406
- config_name: en
  features:
  - name: outputs
    list:
    - name: model
      dtype: string
    - name: output
      dtype: string
  - name: prompt
    dtype: string
  - name: query_id
    dtype: string
  - name: positive_ids
    sequence: string
  - name: negative_ids
    sequence: 'null'
  - name: tydiqa_answer
    sequence: string
  splits:
  - name: train
    num_bytes: 10683026
    num_examples: 2857
  download_size: 5023962
  dataset_size: 10683026
- config_name: es
  features:
  - name: prompt
    dtype: string
  - name: query_id
    dtype: string
  - name: positive_ids
    sequence: string
  - name: negative_ids
    sequence: 'null'
  - name: outputs
    list:
    - name: model
      dtype: string
    - name: output
      dtype: string
  - name: tydiqa_answer
    sequence: 'null'
  splits:
  - name: train
    num_bytes: 11354520
    num_examples: 2159
  download_size: 5487056
  dataset_size: 11354520
- config_name: fa
  features:
  - name: prompt
    dtype: string
  - name: query_id
    dtype: string
  - name: positive_ids
    sequence: string
  - name: negative_ids
    sequence: 'null'
  - name: outputs
    list:
    - name: model
      dtype: string
    - name: output
      dtype: string
  - name: tydiqa_answer
    sequence: 'null'
  splits:
  - name: train
    num_bytes: 9649594
    num_examples: 2104
  download_size: 3663263
  dataset_size: 9649594
- config_name: fi
  features:
  - name: prompt
    dtype: string
  - name: query_id
    dtype: string
  - name: positive_ids
    sequence: string
  - name: negative_ids
    sequence: 'null'
  - name: outputs
    list:
    - name: model
      dtype: string
    - name: output
      dtype: string
  - name: tydiqa_answer
    sequence: string
  splits:
  - name: train
    num_bytes: 7499617
    num_examples: 2878
  download_size: 3460902
  dataset_size: 7499617
- config_name: fr
  features:
  - name: prompt
    dtype: string
  - name: query_id
    dtype: string
  - name: positive_ids
    sequence: string
  - name: negative_ids
    sequence: 'null'
  - name: outputs
    list:
    - name: model
      dtype: string
    - name: output
      dtype: string
  - name: tydiqa_answer
    sequence: 'null'
  splits:
  - name: train
    num_bytes: 3949017
    num_examples: 1137
  download_size: 1814688
  dataset_size: 3949017
- config_name: hi
  features:
  - name: prompt
    dtype: string
  - name: query_id
    dtype: string
  - name: positive_ids
    sequence: string
  - name: negative_ids
    sequence: 'null'
  - name: outputs
    list:
    - name: model
      dtype: string
    - name: output
      dtype: string
  - name: tydiqa_answer
    sequence: 'null'
  splits:
  - name: train
    num_bytes: 7562791
    num_examples: 1165
  download_size: 2435919
  dataset_size: 7562791
- config_name: id
  features:
  - name: prompt
    dtype: string
  - name: query_id
    dtype: string
  - name: positive_ids
    sequence: string
  - name: negative_ids
    sequence: 'null'
  - name: outputs
    list:
    - name: model
      dtype: string
    - name: output
      dtype: string
  - name: tydiqa_answer
    sequence: string
  splits:
  - name: train
    num_bytes: 16239206
    num_examples: 4054
  download_size: 0
  dataset_size: 16239206
- config_name: ja
  features:
  - name: prompt
    dtype: string
  - name: query_id
    dtype: string
  - name: positive_ids
    sequence: string
  - name: negative_ids
    sequence: 'null'
  - name: outputs
    list:
    - name: model
      dtype: string
    - name: output
      dtype: string
  - name: tydiqa_answer
    sequence: string
  splits:
  - name: train
    num_bytes: 11660195
    num_examples: 3466
  download_size: 5364233
  dataset_size: 11660195
- config_name: ko
  features:
  - name: prompt
    dtype: string
  - name: query_id
    dtype: string
  - name: positive_ids
    sequence: string
  - name: negative_ids
    sequence: 'null'
  - name: outputs
    list:
    - name: model
      dtype: string
    - name: output
      dtype: string
  - name: tydiqa_answer
    sequence: string
  splits:
  - name: train
    num_bytes: 2695533
    num_examples: 859
  download_size: 1216566
  dataset_size: 2695533
- config_name: ru
  features:
  - name: prompt
    dtype: string
  - name: query_id
    dtype: string
  - name: positive_ids
    sequence: string
  - name: negative_ids
    sequence: 'null'
  - name: outputs
    list:
    - name: model
      dtype: string
    - name: output
      dtype: string
  - name: tydiqa_answer
    sequence: string
  splits:
  - name: train
    num_bytes: 21509628
    num_examples: 4567
  download_size: 9123295
  dataset_size: 21509628
- config_name: sw
  features:
  - name: outputs
    list:
    - name: model
      dtype: string
    - name: output
      dtype: string
  - name: prompt
    dtype: string
  - name: query_id
    dtype: string
  - name: positive_ids
    sequence: string
  - name: negative_ids
    sequence: 'null'
  - name: tydiqa_answer
    sequence: string
  splits:
  - name: train
    num_bytes: 3513668
    num_examples: 1866
  download_size: 1352345
  dataset_size: 3513668
- config_name: te
  features:
  - name: outputs
    list:
    - name: model
      dtype: string
    - name: output
      dtype: string
  - name: prompt
    dtype: string
  - name: query_id
    dtype: string
  - name: positive_ids
    sequence: string
  - name: negative_ids
    sequence: 'null'
  - name: tydiqa_answer
    sequence: string
  splits:
  - name: train
    num_bytes: 11932878
    num_examples: 3283
  download_size: 3701406
  dataset_size: 11932878
- config_name: th
  features:
  - name: prompt
    dtype: string
  - name: query_id
    dtype: string
  - name: positive_ids
    sequence: string
  - name: negative_ids
    sequence: 'null'
  - name: outputs
    list:
    - name: model
      dtype: string
    - name: output
      dtype: string
  - name: tydiqa_answer
    sequence: string
  splits:
  - name: train
    num_bytes: 16713912
    num_examples: 2965
  download_size: 0
  dataset_size: 16713912
- config_name: zh
  features:
  - name: prompt
    dtype: string
  - name: query_id
    dtype: string
  - name: positive_ids
    sequence: string
  - name: negative_ids
    sequence: 'null'
  - name: outputs
    list:
    - name: model
      dtype: string
    - name: output
      dtype: string
  - name: tydiqa_answer
    sequence: 'null'
  splits:
  - name: train
    num_bytes: 4033934
    num_examples: 1311
  download_size: 2011543
  dataset_size: 4033934
configs:
- config_name: ar
  data_files:
  - split: train
    path: ar/train-*
- config_name: bn
  data_files:
  - split: train
    path: bn/train-*
- config_name: en
  data_files:
  - split: train
    path: en/train-*
- config_name: es
  data_files:
  - split: train
    path: es/train-*
- config_name: fa
  data_files:
  - split: train
    path: fa/train-*
- config_name: fi
  data_files:
  - split: train
    path: fi/train-*
- config_name: fr
  data_files:
  - split: train
    path: fr/train-*
- config_name: hi
  data_files:
  - split: train
    path: hi/train-*
- config_name: id
  data_files:
  - split: train
    path: id/train-*
- config_name: ja
  data_files:
  - split: train
    path: ja/train-*
- config_name: ko
  data_files:
  - split: train
    path: ko/train-*
- config_name: ru
  data_files:
  - split: train
    path: ru/train-*
- config_name: sw
  data_files:
  - split: train
    path: sw/train-*
- config_name: te
  data_files:
  - split: train
    path: te/train-*
- config_name: th
  data_files:
  - split: train
    path: th/train-*
- config_name: zh
  data_files:
  - split: train
    path: zh/train-*
---
# Dataset Card for "miracl-raft-instruct-v0.3"
[More Information needed](https://github.com/huggingface/datasets/blob/main/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
The dataset provides 16 language configurations: Arabic (ar), Bengali (bn), English (en), Spanish (es), Persian (fa), Finnish (fi), French (fr), Hindi (hi), Indonesian (id), Japanese (ja), Korean (ko), Russian (ru), Swahili (sw), Telugu (te), Thai (th), and Chinese (zh). Each configuration has a single train split with the fields prompt, query_id, positive_ids, negative_ids, outputs (a list of entries, each with a model and an output string), and tydiqa_answer. In the metadata, negative_ids is typed as a sequence of null in every configuration, and tydiqa_answer is typed as a sequence of null for es, fa, fr, hi, and zh and as a sequence of strings elsewhere. The metadata also records the number of examples and the size in bytes of each train split. The tydiqa_answer field suggests a question-answering use case, although the card itself does not state the intended task.
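As an illustration of the record layout described above, here is a minimal Python sketch. The type names `ModelOutput` and `Record`, the helpers `output_for` and `load_config`, and the sample record are all hypothetical and introduced only for this example; the field names and types follow the card's metadata. `load_config` assumes the `datasets` package is installed and that the repository is reachable under the identifier shown in the title.

```python
from typing import List, Optional, TypedDict


class ModelOutput(TypedDict):
    # One entry of the `outputs` list.
    model: str
    output: str


class Record(TypedDict):
    # Field names and types mirror the features listed in the card metadata.
    prompt: str
    query_id: str
    positive_ids: List[str]
    negative_ids: List[None]            # typed as a sequence of null in every config
    outputs: List[ModelOutput]
    tydiqa_answer: List[Optional[str]]  # typed as a sequence of null in es, fa, fr, hi, zh


def output_for(record: Record, model: str) -> Optional[str]:
    """Return the output text produced by `model`, or None if it is absent."""
    for item in record["outputs"]:
        if item["model"] == model:
            return item["output"]
    return None


def load_config(lang: str):
    """Load the train split of one language config (needs `datasets` and network access)."""
    from datasets import load_dataset  # imported lazily so the rest runs offline

    return load_dataset("nthakur/miracl-raft-instruct-v0.3", lang, split="train")


# Hypothetical record, made up for illustration; not taken from the dataset.
example: Record = {
    "prompt": "Question and passages go here",
    "query_id": "q-0",
    "positive_ids": ["doc-1"],
    "negative_ids": [],
    "outputs": [
        {"model": "model-a", "output": "answer A"},
        {"model": "model-b", "output": "answer B"},
    ],
    "tydiqa_answer": ["answer A"],
}

print(output_for(example, "model-b"))
```

Calling `load_config("ar")` would be one way to fetch the Arabic configuration; the rows it returns should be dictionaries with the fields sketched in `Record`.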
Provider:
nthakur
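Summary of original information
Dataset overview
Dataset configurations
- config_name: the configuration name, one of ar, bn, en, es, fa, fi, fr, hi, id, ja, ko, ru, sw, te, th, zh.
Dataset features
- prompt: string.
- query_id: string.
- positive_ids: sequence of strings.
- negative_ids: sequence type, declared as null in every configuration.
- outputs: list, each entry containing:
  - model: string.
  - output: string.
- tydiqa_answer: sequence of strings; declared as null in the es, fa, fr, hi, and zh configurations.
Dataset splits
- train: the training split; its size and number of examples differ per configuration.
Dataset sizes
- The training split size (num_bytes) and download size (download_size) differ per configuration; the values are listed for each configuration below.
Per-configuration figures
- ar: train split of 14982233 bytes, 3468 examples, download size 6045510 bytes.
- bn: train split of 11733406 bytes, 1624 examples, download size 3813737 bytes.
- en: train split of 10683026 bytes, 2857 examples, download size 5023962 bytes.
- es: train split of 11354520 bytes, 2159 examples, download size 5487056 bytes.
- fa: train split of 9649594 bytes, 2104 examples, download size 3663263 bytes.
- fi: train split of 7499617 bytes, 2878 examples, download size 3460902 bytes.
- fr: train split of 3949017 bytes, 1137 examples, download size 1814688 bytes.
- hi: train split of 7562791 bytes, 1165 examples, download size 2435919 bytes.
- id: train split of 16239206 bytes, 4054 examples, download size reported as 0 bytes.
- ja: train split of 11660195 bytes, 3466 examples, download size 5364233 bytes.
- ko: train split of 2695533 bytes, 859 examples, download size 1216566 bytes.
- ru: train split of 21509628 bytes, 4567 examples, download size 9123295 bytes.
- sw: train split of 3513668 bytes, 1866 examples, download size 1352345 bytes.
- te: train split of 11932878 bytes, 3283 examples, download size 3701406 bytes.
- th: train split of 16713912 bytes, 2965 examples, download size reported as 0 bytes.
- zh: train split of 4033934 bytes, 1311 examples, download size 2011543 bytes.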
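The per-configuration figures can be aggregated with a few lines of Python. The sketch below copies the numbers from the card metadata into a dictionary; the variable names (`stats`, `total_examples`, `largest`, `zero_download`) are chosen only for this example.

```python
# (num_examples, num_bytes, download_size) per config, copied from the card metadata.
stats = {
    "ar": (3468, 14982233, 6045510),
    "bn": (1624, 11733406, 3813737),
    "en": (2857, 10683026, 5023962),
    "es": (2159, 11354520, 5487056),
    "fa": (2104, 9649594, 3663263),
    "fi": (2878, 7499617, 3460902),
    "fr": (1137, 3949017, 1814688),
    "hi": (1165, 7562791, 2435919),
    "id": (4054, 16239206, 0),
    "ja": (3466, 11660195, 5364233),
    "ko": (859, 2695533, 1216566),
    "ru": (4567, 21509628, 9123295),
    "sw": (1866, 3513668, 1352345),
    "te": (3283, 11932878, 3701406),
    "th": (2965, 16713912, 0),
    "zh": (1311, 4033934, 2011543),
}

# Total number of training examples across all 16 configurations.
total_examples = sum(n for n, _, _ in stats.values())

# Configuration with the most training examples.
largest = max(stats, key=lambda lang: stats[lang][0])

# Configurations whose download_size is reported as 0 in the metadata.
zero_download = [lang for lang, (_, _, dl) in stats.items() if dl == 0]

print(total_examples)  # 39763
print(largest)         # ru
print(zero_download)   # ['id', 'th']

# Average uncompressed bytes per example, per configuration.
for lang, (n, size, _) in stats.items():
    print(f"{lang}: {size / n:.0f} bytes/example")
```

The two zero values for id and th are reproduced as listed in the metadata; the card gives no explanation for them.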



