mugezhang/xstorycloze_eval_multirepr
Source: Hugging Face · Updated 2026-04-02 · Indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/mugezhang/xstorycloze_eval_multirepr
Description:
---
dataset_info:
- config_name: ar
  features:
  - name: story_id
    dtype: string
  - name: input_sentence_1
    dtype: string
  - name: input_sentence_2
    dtype: string
  - name: input_sentence_3
    dtype: string
  - name: input_sentence_4
    dtype: string
  - name: sentence_quiz1
    dtype: string
  - name: sentence_quiz2
    dtype: string
  - name: answer_right_ending
    dtype: int32
  - name: input_sentence_1_phonemes
    dtype: string
  - name: input_sentence_1_ipa_stripped
    dtype: string
  - name: input_sentence_1_romanized
    dtype: string
  - name: input_sentence_2_phonemes
    dtype: string
  - name: input_sentence_2_ipa_stripped
    dtype: string
  - name: input_sentence_2_romanized
    dtype: string
  - name: input_sentence_3_phonemes
    dtype: string
  - name: input_sentence_3_ipa_stripped
    dtype: string
  - name: input_sentence_3_romanized
    dtype: string
  - name: input_sentence_4_phonemes
    dtype: string
  - name: input_sentence_4_ipa_stripped
    dtype: string
  - name: input_sentence_4_romanized
    dtype: string
  - name: sentence_quiz1_phonemes
    dtype: string
  - name: sentence_quiz1_ipa_stripped
    dtype: string
  - name: sentence_quiz1_romanized
    dtype: string
  - name: sentence_quiz2_phonemes
    dtype: string
  - name: sentence_quiz2_ipa_stripped
    dtype: string
  - name: sentence_quiz2_romanized
    dtype: string
  splits:
  - name: train
    num_bytes: 585234
    num_examples: 360
  - name: eval
    num_bytes: 2411312
    num_examples: 1511
  download_size: 1870986
  dataset_size: 2996546
- config_name: en
  features:
  - name: story_id
    dtype: string
  - name: input_sentence_1
    dtype: string
  - name: input_sentence_2
    dtype: string
  - name: input_sentence_3
    dtype: string
  - name: input_sentence_4
    dtype: string
  - name: sentence_quiz1
    dtype: string
  - name: sentence_quiz2
    dtype: string
  - name: answer_right_ending
    dtype: int32
  - name: input_sentence_1_phonemes
    dtype: string
  - name: input_sentence_1_ipa_stripped
    dtype: string
  - name: input_sentence_1_romanized
    dtype: string
  - name: input_sentence_2_phonemes
    dtype: string
  - name: input_sentence_2_ipa_stripped
    dtype: string
  - name: input_sentence_2_romanized
    dtype: string
  - name: input_sentence_3_phonemes
    dtype: string
  - name: input_sentence_3_ipa_stripped
    dtype: string
  - name: input_sentence_3_romanized
    dtype: string
  - name: input_sentence_4_phonemes
    dtype: string
  - name: input_sentence_4_ipa_stripped
    dtype: string
  - name: input_sentence_4_romanized
    dtype: string
  - name: sentence_quiz1_phonemes
    dtype: string
  - name: sentence_quiz1_ipa_stripped
    dtype: string
  - name: sentence_quiz1_romanized
    dtype: string
  - name: sentence_quiz2_phonemes
    dtype: string
  - name: sentence_quiz2_ipa_stripped
    dtype: string
  - name: sentence_quiz2_romanized
    dtype: string
  splits:
  - name: train
    num_bytes: 505474
    num_examples: 360
  - name: eval
    num_bytes: 2111146
    num_examples: 1511
  download_size: 1674876
  dataset_size: 2616620
- config_name: es
  features:
  - name: story_id
    dtype: string
  - name: input_sentence_1
    dtype: string
  - name: input_sentence_2
    dtype: string
  - name: input_sentence_3
    dtype: string
  - name: input_sentence_4
    dtype: string
  - name: sentence_quiz1
    dtype: string
  - name: sentence_quiz2
    dtype: string
  - name: answer_right_ending
    dtype: int32
  - name: input_sentence_1_phonemes
    dtype: string
  - name: input_sentence_1_ipa_stripped
    dtype: string
  - name: input_sentence_1_romanized
    dtype: string
  - name: input_sentence_2_phonemes
    dtype: string
  - name: input_sentence_2_ipa_stripped
    dtype: string
  - name: input_sentence_2_romanized
    dtype: string
  - name: input_sentence_3_phonemes
    dtype: string
  - name: input_sentence_3_ipa_stripped
    dtype: string
  - name: input_sentence_3_romanized
    dtype: string
  - name: input_sentence_4_phonemes
    dtype: string
  - name: input_sentence_4_ipa_stripped
    dtype: string
  - name: input_sentence_4_romanized
    dtype: string
  - name: sentence_quiz1_phonemes
    dtype: string
  - name: sentence_quiz1_ipa_stripped
    dtype: string
  - name: sentence_quiz1_romanized
    dtype: string
  - name: sentence_quiz2_phonemes
    dtype: string
  - name: sentence_quiz2_ipa_stripped
    dtype: string
  - name: sentence_quiz2_romanized
    dtype: string
  splits:
  - name: train
    num_bytes: 521883
    num_examples: 360
  - name: eval
    num_bytes: 2173097
    num_examples: 1511
  download_size: 1770808
  dataset_size: 2694980
- config_name: hi
  features:
  - name: story_id
    dtype: string
  - name: input_sentence_1
    dtype: string
  - name: input_sentence_2
    dtype: string
  - name: input_sentence_3
    dtype: string
  - name: input_sentence_4
    dtype: string
  - name: sentence_quiz1
    dtype: string
  - name: sentence_quiz2
    dtype: string
  - name: answer_right_ending
    dtype: int32
  - name: input_sentence_1_phonemes
    dtype: string
  - name: input_sentence_1_ipa_stripped
    dtype: string
  - name: input_sentence_1_romanized
    dtype: string
  - name: input_sentence_2_phonemes
    dtype: string
  - name: input_sentence_2_ipa_stripped
    dtype: string
  - name: input_sentence_2_romanized
    dtype: string
  - name: input_sentence_3_phonemes
    dtype: string
  - name: input_sentence_3_ipa_stripped
    dtype: string
  - name: input_sentence_3_romanized
    dtype: string
  - name: input_sentence_4_phonemes
    dtype: string
  - name: input_sentence_4_ipa_stripped
    dtype: string
  - name: input_sentence_4_romanized
    dtype: string
  - name: sentence_quiz1_phonemes
    dtype: string
  - name: sentence_quiz1_ipa_stripped
    dtype: string
  - name: sentence_quiz1_romanized
    dtype: string
  - name: sentence_quiz2_phonemes
    dtype: string
  - name: sentence_quiz2_ipa_stripped
    dtype: string
  - name: sentence_quiz2_romanized
    dtype: string
  splits:
  - name: train
    num_bytes: 783950
    num_examples: 360
  - name: eval
    num_bytes: 3287031
    num_examples: 1511
  download_size: 2129787
  dataset_size: 4070981
- config_name: ru
  features:
  - name: story_id
    dtype: string
  - name: input_sentence_1
    dtype: string
  - name: input_sentence_2
    dtype: string
  - name: input_sentence_3
    dtype: string
  - name: input_sentence_4
    dtype: string
  - name: sentence_quiz1
    dtype: string
  - name: sentence_quiz2
    dtype: string
  - name: answer_right_ending
    dtype: int32
  - name: input_sentence_1_phonemes
    dtype: string
  - name: input_sentence_1_ipa_stripped
    dtype: string
  - name: input_sentence_1_romanized
    dtype: string
  - name: input_sentence_2_phonemes
    dtype: string
  - name: input_sentence_2_ipa_stripped
    dtype: string
  - name: input_sentence_2_romanized
    dtype: string
  - name: input_sentence_3_phonemes
    dtype: string
  - name: input_sentence_3_ipa_stripped
    dtype: string
  - name: input_sentence_3_romanized
    dtype: string
  - name: input_sentence_4_phonemes
    dtype: string
  - name: input_sentence_4_ipa_stripped
    dtype: string
  - name: input_sentence_4_romanized
    dtype: string
  - name: sentence_quiz1_phonemes
    dtype: string
  - name: sentence_quiz1_ipa_stripped
    dtype: string
  - name: sentence_quiz1_romanized
    dtype: string
  - name: sentence_quiz2_phonemes
    dtype: string
  - name: sentence_quiz2_ipa_stripped
    dtype: string
  - name: sentence_quiz2_romanized
    dtype: string
  splits:
  - name: train
    num_bytes: 647677
    num_examples: 360
  - name: eval
    num_bytes: 2710016
    num_examples: 1511
  download_size: 2070414
  dataset_size: 3357693
- config_name: zh
  features:
  - name: story_id
    dtype: string
  - name: input_sentence_1
    dtype: string
  - name: input_sentence_2
    dtype: string
  - name: input_sentence_3
    dtype: string
  - name: input_sentence_4
    dtype: string
  - name: sentence_quiz1
    dtype: string
  - name: sentence_quiz2
    dtype: string
  - name: answer_right_ending
    dtype: int32
  - name: input_sentence_1_phonemes
    dtype: string
  - name: input_sentence_1_ipa_stripped
    dtype: string
  - name: input_sentence_1_romanized
    dtype: string
  - name: input_sentence_2_phonemes
    dtype: string
  - name: input_sentence_2_ipa_stripped
    dtype: string
  - name: input_sentence_2_romanized
    dtype: string
  - name: input_sentence_3_phonemes
    dtype: string
  - name: input_sentence_3_ipa_stripped
    dtype: string
  - name: input_sentence_3_romanized
    dtype: string
  - name: input_sentence_4_phonemes
    dtype: string
  - name: input_sentence_4_ipa_stripped
    dtype: string
  - name: input_sentence_4_romanized
    dtype: string
  - name: sentence_quiz1_phonemes
    dtype: string
  - name: sentence_quiz1_ipa_stripped
    dtype: string
  - name: sentence_quiz1_romanized
    dtype: string
  - name: sentence_quiz2_phonemes
    dtype: string
  - name: sentence_quiz2_ipa_stripped
    dtype: string
  - name: sentence_quiz2_romanized
    dtype: string
  splits:
  - name: train
    num_bytes: 635275
    num_examples: 360
  - name: eval
    num_bytes: 2652217
    num_examples: 1511
  download_size: 1825805
  dataset_size: 3287492
configs:
- config_name: ar
  data_files:
  - split: train
    path: ar/train-*
  - split: eval
    path: ar/eval-*
- config_name: en
  data_files:
  - split: train
    path: en/train-*
  - split: eval
    path: en/eval-*
- config_name: es
  data_files:
  - split: train
    path: es/train-*
  - split: eval
    path: es/eval-*
- config_name: hi
  data_files:
  - split: train
    path: hi/train-*
  - split: eval
    path: hi/eval-*
- config_name: ru
  data_files:
  - split: train
    path: ru/train-*
  - split: eval
    path: ru/eval-*
- config_name: zh
  data_files:
  - split: train
    path: zh/train-*
  - split: eval
    path: zh/eval-*
---
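The feature names declared in the metadata above follow a regular pattern: six text fields, each with `_phonemes`, `_ipa_stripped`, and `_romanized` variants, plus `story_id` and `answer_right_ending`. The Python sketch below rebuilds the column list from that pattern. The helper names (`TEXT_FIELDS`, `SUFFIXES`, `expected_columns`, `columns_for`) are illustrative assumptions and are not part of the dataset.

```python
# Illustrative helper names; only the column names themselves come from the card.
TEXT_FIELDS = [
    "input_sentence_1",
    "input_sentence_2",
    "input_sentence_3",
    "input_sentence_4",
    "sentence_quiz1",
    "sentence_quiz2",
]
SUFFIXES = ["phonemes", "ipa_stripped", "romanized"]


def expected_columns():
    """Return the column names in the order the card declares them."""
    columns = ["story_id", *TEXT_FIELDS, "answer_right_ending"]
    for field in TEXT_FIELDS:
        for suffix in SUFFIXES:
            columns.append(f"{field}_{suffix}")
    return columns


def columns_for(representation):
    """Pick the six text columns for one representation.

    `representation` is "text" for the original fields, or one of SUFFIXES.
    """
    if representation == "text":
        return list(TEXT_FIELDS)
    if representation not in SUFFIXES:
        raise ValueError(f"unknown representation: {representation!r}")
    return [f"{field}_{representation}" for field in TEXT_FIELDS]


if __name__ == "__main__":
    print(len(expected_columns()))
    print(columns_for("romanized"))
```

Comparing `expected_columns()` with the column names of a loaded split is one quick way to confirm that a configuration matches the declared schema.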
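The dataset provides six language configurations: Arabic (ar), English (en), Spanish (es), Hindi (hi), Russian (ru), and Chinese (zh). Details for each follow.
1. Configuration: ar (Arabic)
It has 26 feature fields:
1. story_id: story ID, string
2. input_sentence_1: input sentence 1, string
3. input_sentence_2: input sentence 2, string
4. input_sentence_3: input sentence 3, string
5. input_sentence_4: input sentence 4, string
6. sentence_quiz1: candidate ending 1, string
7. sentence_quiz2: candidate ending 2, string
8. answer_right_ending: index of the correct ending, int32
9. input_sentence_1_phonemes: phoneme sequence of input sentence 1, string
10. input_sentence_1_ipa_stripped: stripped International Phonetic Alphabet (IPA) transcription of input sentence 1, string
11. input_sentence_1_romanized: romanized transliteration of input sentence 1, string
12. input_sentence_2_phonemes: phoneme sequence of input sentence 2, string
13. input_sentence_2_ipa_stripped: stripped IPA transcription of input sentence 2, string
14. input_sentence_2_romanized: romanized transliteration of input sentence 2, string
15. input_sentence_3_phonemes: phoneme sequence of input sentence 3, string
16. input_sentence_3_ipa_stripped: stripped IPA transcription of input sentence 3, string
17. input_sentence_3_romanized: romanized transliteration of input sentence 3, string
18. input_sentence_4_phonemes: phoneme sequence of input sentence 4, string
19. input_sentence_4_ipa_stripped: stripped IPA transcription of input sentence 4, string
20. input_sentence_4_romanized: romanized transliteration of input sentence 4, string
21. sentence_quiz1_phonemes: phoneme sequence of candidate ending 1, string
22. sentence_quiz1_ipa_stripped: stripped IPA transcription of candidate ending 1, string
23. sentence_quiz1_romanized: romanized transliteration of candidate ending 1, string
24. sentence_quiz2_phonemes: phoneme sequence of candidate ending 2, string
25. sentence_quiz2_ipa_stripped: stripped IPA transcription of candidate ending 2, string
26. sentence_quiz2_romanized: romanized transliteration of candidate ending 2, string
Splits for this configuration: the training split (train) is 585234 bytes with 360 examples; the evaluation split (eval) is 2411312 bytes with 1511 examples. The download size is 1870986 bytes and the total dataset size is 2996546 bytes.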
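2. Configuration: en (English)
Features are identical to the Arabic configuration. Splits: train is 505474 bytes with 360 examples; eval is 2111146 bytes with 1511 examples. The download size is 1674876 bytes and the total dataset size is 2616620 bytes.
3. Configuration: es (Spanish)
Features are identical to the configurations above. Splits: train is 521883 bytes with 360 examples; eval is 2173097 bytes with 1511 examples. The download size is 1770808 bytes and the total dataset size is 2694980 bytes.
4. Configuration: hi (Hindi)
Features are identical to the configurations above. Splits: train is 783950 bytes with 360 examples; eval is 3287031 bytes with 1511 examples. The download size is 2129787 bytes and the total dataset size is 4070981 bytes.
5. Configuration: ru (Russian)
Features are identical to the configurations above. Splits: train is 647677 bytes with 360 examples; eval is 2710016 bytes with 1511 examples. The download size is 2070414 bytes and the total dataset size is 3357693 bytes.
6. Configuration: zh (Chinese)
Features are identical to the configurations above. Splits: train is 635275 bytes with 360 examples; eval is 2652217 bytes with 1511 examples. The download size is 1825805 bytes and the total dataset size is 3287492 bytes.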
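The per-configuration figures listed above can be cross-checked with a few lines of Python. The sketch below copies the numbers from the card and verifies that each `dataset_size` equals the sum of its two split sizes. The names `CARD_SIZES`, `inconsistent_configs`, and `total_examples` are illustrative assumptions, not anything shipped with the dataset.

```python
# Figures copied from the card: (train_bytes, eval_bytes, download_size, dataset_size).
CARD_SIZES = {
    "ar": (585234, 2411312, 1870986, 2996546),
    "en": (505474, 2111146, 1674876, 2616620),
    "es": (521883, 2173097, 1770808, 2694980),
    "hi": (783950, 3287031, 2129787, 4070981),
    "ru": (647677, 2710016, 2070414, 3357693),
    "zh": (635275, 2652217, 1825805, 3287492),
}
# Every configuration reports the same example counts.
TRAIN_EXAMPLES = 360
EVAL_EXAMPLES = 1511


def inconsistent_configs(sizes):
    """Return configs whose dataset_size differs from train_bytes + eval_bytes."""
    return [
        name
        for name, (train_bytes, eval_bytes, _download, total) in sizes.items()
        if train_bytes + eval_bytes != total
    ]


def total_examples(sizes):
    """Total number of rows across all configurations and both splits."""
    return len(sizes) * (TRAIN_EXAMPLES + EVAL_EXAMPLES)


if __name__ == "__main__":
    print(inconsistent_configs(CARD_SIZES))
    print(total_examples(CARD_SIZES))
```

In every configuration the download size is smaller than the dataset size, which is consistent with the downloaded files being stored in a compressed format, although the card itself does not say so.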
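All language configurations use the same data-file mapping: training files match `{config_name}/train-*` and evaluation files match `{config_name}/eval-*`. For example, the Arabic (ar) configuration reads its training split from `ar/train-*` and its evaluation split from `ar/eval-*`; the other configurations follow the same rule.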
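The path rule above can be written as a small helper, together with a loading function. This is a minimal sketch: `data_file_patterns` and `load_split` are hypothetical names, and `load_split` assumes the Hugging Face `datasets` library is installed and that the repository is reachable over the network.

```python
REPO_ID = "mugezhang/xstorycloze_eval_multirepr"
CONFIGS = ("ar", "en", "es", "hi", "ru", "zh")
SPLITS = ("train", "eval")


def data_file_patterns(config):
    """Map each split name to the glob pattern declared in the card."""
    if config not in CONFIGS:
        raise ValueError(f"unknown config: {config!r}")
    return {split: f"{config}/{split}-*" for split in SPLITS}


def load_split(config, split="eval"):
    """Load one split of one configuration (requires `datasets` and network access)."""
    # Imported lazily so data_file_patterns() works without the library installed.
    from datasets import load_dataset

    if config not in CONFIGS:
        raise ValueError(f"unknown config: {config!r}")
    if split not in SPLITS:
        raise ValueError(f"unknown split: {split!r}")
    return load_dataset(REPO_ID, config, split=split)


if __name__ == "__main__":
    print(data_file_patterns("ar"))
```

Calling `load_split("hi", "train")` would then return the 360-example Hindi training split, provided the assumptions above hold.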
Provided by:
mugezhang



