
mugezhang/xstorycloze_eval_multirepr

Hugging Face | Updated 2026-04-02 | Indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/mugezhang/xstorycloze_eval_multirepr
Description:
---
dataset_info:
- config_name: ar
  features:
  - name: story_id
    dtype: string
  - name: input_sentence_1
    dtype: string
  - name: input_sentence_2
    dtype: string
  - name: input_sentence_3
    dtype: string
  - name: input_sentence_4
    dtype: string
  - name: sentence_quiz1
    dtype: string
  - name: sentence_quiz2
    dtype: string
  - name: answer_right_ending
    dtype: int32
  - name: input_sentence_1_phonemes
    dtype: string
  - name: input_sentence_1_ipa_stripped
    dtype: string
  - name: input_sentence_1_romanized
    dtype: string
  - name: input_sentence_2_phonemes
    dtype: string
  - name: input_sentence_2_ipa_stripped
    dtype: string
  - name: input_sentence_2_romanized
    dtype: string
  - name: input_sentence_3_phonemes
    dtype: string
  - name: input_sentence_3_ipa_stripped
    dtype: string
  - name: input_sentence_3_romanized
    dtype: string
  - name: input_sentence_4_phonemes
    dtype: string
  - name: input_sentence_4_ipa_stripped
    dtype: string
  - name: input_sentence_4_romanized
    dtype: string
  - name: sentence_quiz1_phonemes
    dtype: string
  - name: sentence_quiz1_ipa_stripped
    dtype: string
  - name: sentence_quiz1_romanized
    dtype: string
  - name: sentence_quiz2_phonemes
    dtype: string
  - name: sentence_quiz2_ipa_stripped
    dtype: string
  - name: sentence_quiz2_romanized
    dtype: string
  splits:
  - name: train
    num_bytes: 585234
    num_examples: 360
  - name: eval
    num_bytes: 2411312
    num_examples: 1511
  download_size: 1870986
  dataset_size: 2996546
- config_name: en
  features:
  - name: story_id
    dtype: string
  - name: input_sentence_1
    dtype: string
  - name: input_sentence_2
    dtype: string
  - name: input_sentence_3
    dtype: string
  - name: input_sentence_4
    dtype: string
  - name: sentence_quiz1
    dtype: string
  - name: sentence_quiz2
    dtype: string
  - name: answer_right_ending
    dtype: int32
  - name: input_sentence_1_phonemes
    dtype: string
  - name: input_sentence_1_ipa_stripped
    dtype: string
  - name: input_sentence_1_romanized
    dtype: string
  - name: input_sentence_2_phonemes
    dtype: string
  - name: input_sentence_2_ipa_stripped
    dtype: string
  - name: input_sentence_2_romanized
    dtype: string
  - name: input_sentence_3_phonemes
    dtype: string
  - name: input_sentence_3_ipa_stripped
    dtype: string
  - name: input_sentence_3_romanized
    dtype: string
  - name: input_sentence_4_phonemes
    dtype: string
  - name: input_sentence_4_ipa_stripped
    dtype: string
  - name: input_sentence_4_romanized
    dtype: string
  - name: sentence_quiz1_phonemes
    dtype: string
  - name: sentence_quiz1_ipa_stripped
    dtype: string
  - name: sentence_quiz1_romanized
    dtype: string
  - name: sentence_quiz2_phonemes
    dtype: string
  - name: sentence_quiz2_ipa_stripped
    dtype: string
  - name: sentence_quiz2_romanized
    dtype: string
  splits:
  - name: train
    num_bytes: 505474
    num_examples: 360
  - name: eval
    num_bytes: 2111146
    num_examples: 1511
  download_size: 1674876
  dataset_size: 2616620
- config_name: es
  features:
  - name: story_id
    dtype: string
  - name: input_sentence_1
    dtype: string
  - name: input_sentence_2
    dtype: string
  - name: input_sentence_3
    dtype: string
  - name: input_sentence_4
    dtype: string
  - name: sentence_quiz1
    dtype: string
  - name: sentence_quiz2
    dtype: string
  - name: answer_right_ending
    dtype: int32
  - name: input_sentence_1_phonemes
    dtype: string
  - name: input_sentence_1_ipa_stripped
    dtype: string
  - name: input_sentence_1_romanized
    dtype: string
  - name: input_sentence_2_phonemes
    dtype: string
  - name: input_sentence_2_ipa_stripped
    dtype: string
  - name: input_sentence_2_romanized
    dtype: string
  - name: input_sentence_3_phonemes
    dtype: string
  - name: input_sentence_3_ipa_stripped
    dtype: string
  - name: input_sentence_3_romanized
    dtype: string
  - name: input_sentence_4_phonemes
    dtype: string
  - name: input_sentence_4_ipa_stripped
    dtype: string
  - name: input_sentence_4_romanized
    dtype: string
  - name: sentence_quiz1_phonemes
    dtype: string
  - name: sentence_quiz1_ipa_stripped
    dtype: string
  - name: sentence_quiz1_romanized
    dtype: string
  - name: sentence_quiz2_phonemes
    dtype: string
  - name: sentence_quiz2_ipa_stripped
    dtype: string
  - name: sentence_quiz2_romanized
    dtype: string
  splits:
  - name: train
    num_bytes: 521883
    num_examples: 360
  - name: eval
    num_bytes: 2173097
    num_examples: 1511
  download_size: 1770808
  dataset_size: 2694980
- config_name: hi
  features:
  - name: story_id
    dtype: string
  - name: input_sentence_1
    dtype: string
  - name: input_sentence_2
    dtype: string
  - name: input_sentence_3
    dtype: string
  - name: input_sentence_4
    dtype: string
  - name: sentence_quiz1
    dtype: string
  - name: sentence_quiz2
    dtype: string
  - name: answer_right_ending
    dtype: int32
  - name: input_sentence_1_phonemes
    dtype: string
  - name: input_sentence_1_ipa_stripped
    dtype: string
  - name: input_sentence_1_romanized
    dtype: string
  - name: input_sentence_2_phonemes
    dtype: string
  - name: input_sentence_2_ipa_stripped
    dtype: string
  - name: input_sentence_2_romanized
    dtype: string
  - name: input_sentence_3_phonemes
    dtype: string
  - name: input_sentence_3_ipa_stripped
    dtype: string
  - name: input_sentence_3_romanized
    dtype: string
  - name: input_sentence_4_phonemes
    dtype: string
  - name: input_sentence_4_ipa_stripped
    dtype: string
  - name: input_sentence_4_romanized
    dtype: string
  - name: sentence_quiz1_phonemes
    dtype: string
  - name: sentence_quiz1_ipa_stripped
    dtype: string
  - name: sentence_quiz1_romanized
    dtype: string
  - name: sentence_quiz2_phonemes
    dtype: string
  - name: sentence_quiz2_ipa_stripped
    dtype: string
  - name: sentence_quiz2_romanized
    dtype: string
  splits:
  - name: train
    num_bytes: 783950
    num_examples: 360
  - name: eval
    num_bytes: 3287031
    num_examples: 1511
  download_size: 2129787
  dataset_size: 4070981
- config_name: ru
  features:
  - name: story_id
    dtype: string
  - name: input_sentence_1
    dtype: string
  - name: input_sentence_2
    dtype: string
  - name: input_sentence_3
    dtype: string
  - name: input_sentence_4
    dtype: string
  - name: sentence_quiz1
    dtype: string
  - name: sentence_quiz2
    dtype: string
  - name: answer_right_ending
    dtype: int32
  - name: input_sentence_1_phonemes
    dtype: string
  - name: input_sentence_1_ipa_stripped
    dtype: string
  - name: input_sentence_1_romanized
    dtype: string
  - name: input_sentence_2_phonemes
    dtype: string
  - name: input_sentence_2_ipa_stripped
    dtype: string
  - name: input_sentence_2_romanized
    dtype: string
  - name: input_sentence_3_phonemes
    dtype: string
  - name: input_sentence_3_ipa_stripped
    dtype: string
  - name: input_sentence_3_romanized
    dtype: string
  - name: input_sentence_4_phonemes
    dtype: string
  - name: input_sentence_4_ipa_stripped
    dtype: string
  - name: input_sentence_4_romanized
    dtype: string
  - name: sentence_quiz1_phonemes
    dtype: string
  - name: sentence_quiz1_ipa_stripped
    dtype: string
  - name: sentence_quiz1_romanized
    dtype: string
  - name: sentence_quiz2_phonemes
    dtype: string
  - name: sentence_quiz2_ipa_stripped
    dtype: string
  - name: sentence_quiz2_romanized
    dtype: string
  splits:
  - name: train
    num_bytes: 647677
    num_examples: 360
  - name: eval
    num_bytes: 2710016
    num_examples: 1511
  download_size: 2070414
  dataset_size: 3357693
- config_name: zh
  features:
  - name: story_id
    dtype: string
  - name: input_sentence_1
    dtype: string
  - name: input_sentence_2
    dtype: string
  - name: input_sentence_3
    dtype: string
  - name: input_sentence_4
    dtype: string
  - name: sentence_quiz1
    dtype: string
  - name: sentence_quiz2
    dtype: string
  - name: answer_right_ending
    dtype: int32
  - name: input_sentence_1_phonemes
    dtype: string
  - name: input_sentence_1_ipa_stripped
    dtype: string
  - name: input_sentence_1_romanized
    dtype: string
  - name: input_sentence_2_phonemes
    dtype: string
  - name: input_sentence_2_ipa_stripped
    dtype: string
  - name: input_sentence_2_romanized
    dtype: string
  - name: input_sentence_3_phonemes
    dtype: string
  - name: input_sentence_3_ipa_stripped
    dtype: string
  - name: input_sentence_3_romanized
    dtype: string
  - name: input_sentence_4_phonemes
    dtype: string
  - name: input_sentence_4_ipa_stripped
    dtype: string
  - name: input_sentence_4_romanized
    dtype: string
  - name: sentence_quiz1_phonemes
    dtype: string
  - name: sentence_quiz1_ipa_stripped
    dtype: string
  - name: sentence_quiz1_romanized
    dtype: string
  - name: sentence_quiz2_phonemes
    dtype: string
  - name: sentence_quiz2_ipa_stripped
    dtype: string
  - name: sentence_quiz2_romanized
    dtype: string
  splits:
  - name: train
    num_bytes: 635275
    num_examples: 360
  - name: eval
    num_bytes: 2652217
    num_examples: 1511
  download_size: 1825805
  dataset_size: 3287492
configs:
- config_name: ar
  data_files:
  - split: train
    path: ar/train-*
  - split: eval
    path: ar/eval-*
- config_name: en
  data_files:
  - split: train
    path: en/train-*
  - split: eval
    path: en/eval-*
- config_name: es
  data_files:
  - split: train
    path: es/train-*
  - split: eval
    path: es/eval-*
- config_name: hi
  data_files:
  - split: train
    path: hi/train-*
  - split: eval
    path: hi/eval-*
- config_name: ru
  data_files:
  - split: train
    path: ru/train-*
  - split: eval
    path: ru/eval-*
- config_name: zh
  data_files:
  - split: train
    path: zh/train-*
  - split: eval
    path: zh/eval-*
---
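The size figures in the metadata can be sanity-checked with a little arithmetic: for each configuration, the `num_bytes` values of the `train` and `eval` splits should add up to `dataset_size`. The sketch below is only an illustration; the numbers are copied from the metadata above, while the names `SPLIT_BYTES` and `check_sizes` are hypothetical helpers and not part of the dataset.

```python
# Byte counts copied from the dataset_info metadata above.
# SPLIT_BYTES and check_sizes are illustrative names, not part of the dataset.
SPLIT_BYTES = {
    "ar": {"train": 585234, "eval": 2411312, "dataset_size": 2996546},
    "en": {"train": 505474, "eval": 2111146, "dataset_size": 2616620},
    "es": {"train": 521883, "eval": 2173097, "dataset_size": 2694980},
    "hi": {"train": 783950, "eval": 3287031, "dataset_size": 4070981},
    "ru": {"train": 647677, "eval": 2710016, "dataset_size": 3357693},
    "zh": {"train": 635275, "eval": 2652217, "dataset_size": 3287492},
}


def check_sizes(info):
    """Return the config names whose split sizes do not sum to dataset_size."""
    return [
        name
        for name, sizes in info.items()
        if sizes["train"] + sizes["eval"] != sizes["dataset_size"]
    ]


if __name__ == "__main__":
    # An empty list means every configuration is internally consistent.
    print(check_sizes(SPLIT_BYTES))
```

With the figures listed above, the check finds no mismatches in any of the six configurations.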

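The dataset contains six language configurations: Arabic (ar), English (en), Spanish (es), Hindi (hi), Russian (ru), and Chinese (zh).

1. Configuration: ar (Arabic)

The configuration has 26 feature fields:

1. story_id: story ID, string
2. input_sentence_1: input sentence 1, string
3. input_sentence_2: input sentence 2, string
4. input_sentence_3: input sentence 3, string
5. input_sentence_4: input sentence 4, string
6. sentence_quiz1: quiz sentence 1, string
7. sentence_quiz2: quiz sentence 2, string
8. answer_right_ending: the correct ending, int32
9. input_sentence_1_phonemes: phoneme sequence of input sentence 1, string
10. input_sentence_1_ipa_stripped: stripped International Phonetic Alphabet (IPA) form of input sentence 1, string
11. input_sentence_1_romanized: romanized transliteration of input sentence 1, string
12. input_sentence_2_phonemes: phoneme sequence of input sentence 2, string
13. input_sentence_2_ipa_stripped: stripped IPA form of input sentence 2, string
14. input_sentence_2_romanized: romanized transliteration of input sentence 2, string
15. input_sentence_3_phonemes: phoneme sequence of input sentence 3, string
16. input_sentence_3_ipa_stripped: stripped IPA form of input sentence 3, string
17. input_sentence_3_romanized: romanized transliteration of input sentence 3, string
18. input_sentence_4_phonemes: phoneme sequence of input sentence 4, string
19. input_sentence_4_ipa_stripped: stripped IPA form of input sentence 4, string
20. input_sentence_4_romanized: romanized transliteration of input sentence 4, string
21. sentence_quiz1_phonemes: phoneme sequence of quiz sentence 1, string
22. sentence_quiz1_ipa_stripped: stripped IPA form of quiz sentence 1, string
23. sentence_quiz1_romanized: romanized transliteration of quiz sentence 1, string
24. sentence_quiz2_phonemes: phoneme sequence of quiz sentence 2, string
25. sentence_quiz2_ipa_stripped: stripped IPA form of quiz sentence 2, string
26. sentence_quiz2_romanized: romanized transliteration of quiz sentence 2, string

Splits: the training split (train) is 585234 bytes with 360 examples; the evaluation split (eval) is 2411312 bytes with 1511 examples. The download size is 1870986 bytes and the total dataset size is 2996546 bytes.

2. Configuration: en (English)

The feature fields are identical to those of the Arabic configuration. Splits: train is 505474 bytes with 360 examples; eval is 2111146 bytes with 1511 examples. The download size is 1674876 bytes and the total dataset size is 2616620 bytes.

3. Configuration: es (Spanish)

The feature fields are identical to those of the configurations above. Splits: train is 521883 bytes with 360 examples; eval is 2173097 bytes with 1511 examples. The download size is 1770808 bytes and the total dataset size is 2694980 bytes.

4. Configuration: hi (Hindi)

The feature fields are identical to those of the configurations above. Splits: train is 783950 bytes with 360 examples; eval is 3287031 bytes with 1511 examples. The download size is 2129787 bytes and the total dataset size is 4070981 bytes.

5. Configuration: ru (Russian)

The feature fields are identical to those of the configurations above. Splits: train is 647677 bytes with 360 examples; eval is 2710016 bytes with 1511 examples. The download size is 2070414 bytes and the total dataset size is 3357693 bytes.

6. Configuration: zh (Chinese)

The feature fields are identical to those of the configurations above. Splits: train is 635275 bytes with 360 examples; eval is 2652217 bytes with 1511 examples. The download size is 1825805 bytes and the total dataset size is 3287492 bytes.

All language configurations share one data-file mapping rule: training files match `{config_name}/train-*` and evaluation files match `{config_name}/eval-*`. For example, the Arabic (ar) configuration reads its training split from `ar/train-*` and its evaluation split from `ar/eval-*`; the other languages follow the same pattern.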
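Because every sentence field comes in four parallel forms (the original text plus the `_phonemes`, `_ipa_stripped`, and `_romanized` variants), picking a representation amounts to picking a column suffix. The following is a minimal sketch of that idea, not the dataset author's tooling: `column_names`, `build_candidates`, and `load_split` are hypothetical helper names, and joining sentences with a single space is an assumption that may not suit every language. The listing does not say how `answer_right_ending` is indexed, so the sketch leaves the label uninterpreted.

```python
# Hypothetical helpers for working with the parallel representations.
# The suffixes follow the field names listed above.
REPRESENTATIONS = {
    "text": "",
    "phonemes": "_phonemes",
    "ipa_stripped": "_ipa_stripped",
    "romanized": "_romanized",
}
CONTEXT_FIELDS = [f"input_sentence_{i}" for i in range(1, 5)]
ENDING_FIELDS = ["sentence_quiz1", "sentence_quiz2"]


def column_names(representation):
    """Return (context columns, ending columns) for one representation."""
    suffix = REPRESENTATIONS[representation]
    context = [field + suffix for field in CONTEXT_FIELDS]
    endings = [field + suffix for field in ENDING_FIELDS]
    return context, endings


def build_candidates(record, representation="text"):
    """Build the two candidate stories (context + ending) from one record.

    Assumption: sentences are joined with a single space.
    """
    context_cols, ending_cols = column_names(representation)
    context = " ".join(record[col] for col in context_cols)
    return [f"{context} {record[col]}" for col in ending_cols]


def load_split(config="en", split="eval"):
    """Load one configuration and split with the Hugging Face `datasets` library.

    Needs `pip install datasets` and network access; imported lazily so the
    helpers above work without it.
    """
    from datasets import load_dataset

    return load_dataset("mugezhang/xstorycloze_eval_multirepr", config, split=split)
```

For instance, `build_candidates(load_split("hi")[0], "romanized")` would return the two romanized candidate stories for the first Hindi evaluation example, assuming the dataset is reachable.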
Provided by:
mugezhang