barto17/common_languages_preprocessed
Source: Hugging Face. Updated: 2023-09-25. Indexed: 2024-03-04.
Download link:
https://hf-mirror.com/datasets/barto17/common_languages_preprocessed
Resource description:
---
dataset_info:
  features:
  - name: labels
    dtype:
      class_label:
        names:
          '0': Arabic
          '1': Basque
          '2': Breton
          '3': Catalan
          '4': Chinese_China
          '5': Chinese_Hongkong
          '6': Chinese_Taiwan
          '7': Chuvash
          '8': Czech
          '9': Dhivehi
          '10': Dutch
          '11': English
          '12': Esperanto
          '13': Estonian
          '14': French
          '15': Frisian
          '16': Georgian
          '17': German
          '18': Greek
          '19': Hakha_Chin
          '20': Indonesian
          '21': Interlingua
          '22': Italian
          '23': Japanese
          '24': Kabyle
          '25': Kinyarwanda
          '26': Kyrgyz
          '27': Latvian
          '28': Maltese
          '29': Mangolian
          '30': Persian
          '31': Polish
          '32': Portuguese
          '33': Romanian
          '34': Romansh_Sursilvan
          '35': Russian
          '36': Sakha
          '37': Slovenian
          '38': Spanish
          '39': Swedish
          '40': Tamil
          '41': Tatar
          '42': Turkish
          '43': Ukranian
          '44': Welsh
  - name: input_ids
    sequence: int32
  - name: attention_mask
    sequence: int8
  splits:
  - name: train
    num_bytes: 2076244
    num_examples: 22194
  - name: test
    num_bytes: 559808
    num_examples: 5963
  download_size: 1604084
  dataset_size: 2636052
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: test
    path: data/test-*
---
# Dataset Card for "common_languages_preprocessed"
[More Information needed](https://github.com/huggingface/datasets/blob/main/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
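
The `class_label` names in the metadata assign each language an integer id from 0 to 44. As a minimal sketch, the mapping can be written out in plain Python. The names are copied verbatim from the metadata, including spellings such as `Mangolian` and `Ukranian`, which are the dataset's own identifiers. `LABEL_NAMES`, `id2label`, and `label2id` are illustrative names, not part of the dataset or any library.

```python
# Class names copied verbatim from the card's metadata, in id order.
# Spellings such as "Mangolian" and "Ukranian" are the dataset's identifiers
# and are kept unchanged so lookups match the stored labels.
LABEL_NAMES = [
    "Arabic", "Basque", "Breton", "Catalan", "Chinese_China",
    "Chinese_Hongkong", "Chinese_Taiwan", "Chuvash", "Czech", "Dhivehi",
    "Dutch", "English", "Esperanto", "Estonian", "French",
    "Frisian", "Georgian", "German", "Greek", "Hakha_Chin",
    "Indonesian", "Interlingua", "Italian", "Japanese", "Kabyle",
    "Kinyarwanda", "Kyrgyz", "Latvian", "Maltese", "Mangolian",
    "Persian", "Polish", "Portuguese", "Romanian", "Romansh_Sursilvan",
    "Russian", "Sakha", "Slovenian", "Spanish", "Swedish",
    "Tamil", "Tatar", "Turkish", "Ukranian", "Welsh",
]

# Hypothetical helper mappings for converting between ids and names.
id2label = dict(enumerate(LABEL_NAMES))
label2id = {name: idx for idx, name in id2label.items()}

print(len(LABEL_NAMES))   # number of classes
print(id2label[11])       # name for label id 11
print(label2id["Welsh"])  # id for the name "Welsh"
```

Keeping the names in a single ordered list means the id is simply the list index, so the two dictionaries cannot drift out of sync.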
Provider:
barto17
Summary of original information
Dataset overview
Feature information
- Name: labels
  - Data type: class_label
  - Class names:
    - 0: Arabic
    - 1: Basque
    - 2: Breton
    - 3: Catalan
    - 4: Chinese_China
    - 5: Chinese_Hongkong
    - 6: Chinese_Taiwan
    - 7: Chuvash
    - 8: Czech
    - 9: Dhivehi
    - 10: Dutch
    - 11: English
    - 12: Esperanto
    - 13: Estonian
    - 14: French
    - 15: Frisian
    - 16: Georgian
    - 17: German
    - 18: Greek
    - 19: Hakha_Chin
    - 20: Indonesian
    - 21: Interlingua
    - 22: Italian
    - 23: Japanese
    - 24: Kabyle
    - 25: Kinyarwanda
    - 26: Kyrgyz
    - 27: Latvian
    - 28: Maltese
    - 29: Mangolian
    - 30: Persian
    - 31: Polish
    - 32: Portuguese
    - 33: Romanian
    - 34: Romansh_Sursilvan
    - 35: Russian
    - 36: Sakha
    - 37: Slovenian
    - 38: Spanish
    - 39: Swedish
    - 40: Tamil
    - 41: Tatar
    - 42: Turkish
    - 43: Ukranian
    - 44: Welsh
- Name: input_ids
  - Sequence type: int32
- Name: attention_mask
  - Sequence type: int8
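
The `input_ids` and `attention_mask` features are integer sequences. The card does not say which tokenizer produced them or whether the stored sequences are padded, so the following is only a sketch of the common convention in which 1 marks a real token and 0 marks padding. The helper `pad_batch`, the pad id of 0, and the token ids are all assumptions made for illustration.

```python
from typing import List, Tuple


def pad_batch(
    sequences: List[List[int]], pad_id: int = 0
) -> Tuple[List[List[int]], List[List[int]]]:
    """Right-pad token id sequences and build matching attention masks.

    Hypothetical helper: pad_id=0 is an assumption, not taken from the card.
    """
    max_len = max((len(seq) for seq in sequences), default=0)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        # Real tokens first, then padding up to the longest sequence.
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 for each real token, 0 for each padding position.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask


# Made-up token ids, used only to show the shapes involved.
ids, mask = pad_batch([[101, 7, 8, 102], [101, 9, 102]])
print(ids)
print(mask)
```

Mask values are only ever 0 or 1, which fits the `int8` sequence type listed for `attention_mask`.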
Data splits
- Training set:
  - Bytes: 2076244
  - Examples: 22194
- Test set:
  - Bytes: 559808
  - Examples: 5963
Dataset size
- Download size: 1604084 bytes
- Dataset size: 2636052 bytes
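
The split figures can be cross-checked with a little arithmetic. This sketch uses only the numbers listed here: it sums the per-split byte counts, compares the total with the reported dataset size, and computes the share of examples held out for testing. The variable names are illustrative.

```python
# Figures copied from the split and size listings.
splits = {
    "train": {"num_bytes": 2076244, "num_examples": 22194},
    "test": {"num_bytes": 559808, "num_examples": 5963},
}
reported_dataset_size = 2636052  # bytes
reported_download_size = 1604084  # bytes

total_examples = sum(s["num_examples"] for s in splits.values())
total_bytes = sum(s["num_bytes"] for s in splits.values())
test_fraction = splits["test"]["num_examples"] / total_examples

print(total_examples)                        # 28157
print(total_bytes == reported_dataset_size)  # True
print(f"{test_fraction:.1%}")                # 21.2%
```

The per-split byte counts add up exactly to the reported dataset size, and the test split holds roughly one fifth of the examples. The download size is smaller than the dataset size, which suggests the files are stored in a compressed form.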
Configuration
- Configuration name: default
- Data files:
  - Training set path: data/train-*
  - Test set path: data/test-*
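
The two data-file paths are glob patterns, so each split may cover one or more files under `data/`. As a rough illustration, the sketch below uses Python's standard `fnmatch` module to assign file names to splits. The file names are hypothetical (the card does not list the actual files), `split_for` is an illustrative helper, and the Hugging Face tooling may resolve patterns differently from `fnmatch`.

```python
from fnmatch import fnmatchcase
from typing import Optional

# Patterns copied from the configuration listing.
SPLIT_PATTERNS = {"train": "data/train-*", "test": "data/test-*"}


def split_for(path: str) -> Optional[str]:
    """Return the first split whose pattern matches path, else None."""
    for split, pattern in SPLIT_PATTERNS.items():
        if fnmatchcase(path, pattern):
            return split
    return None


# Hypothetical file names, for illustration only.
files = [
    "data/train-00000-of-00001.parquet",
    "data/test-00000-of-00001.parquet",
    "README.md",
]
assignment = {name: split_for(name) for name in files}
print(assignment)
```

Files that match neither pattern, such as the README in this example, are not assigned to any split.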



