jpbello/common_language_preprocessed
Source: Hugging Face · Updated 2023-09-26 · Indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/jpbello/common_language_preprocessed
Resource description:
---
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: validation
    path: data/validation-*
  - split: test
    path: data/test-*
dataset_info:
  features:
  - name: client_id
    dtype: string
  - name: path
    dtype: string
  - name: sentence
    dtype: string
  - name: age
    dtype: string
  - name: gender
    dtype: string
  - name: label
    dtype:
      class_label:
        names:
          '0': Arabic
          '1': Basque
          '2': Breton
          '3': Catalan
          '4': Chinese_China
          '5': Chinese_Hongkong
          '6': Chinese_Taiwan
          '7': Chuvash
          '8': Czech
          '9': Dhivehi
          '10': Dutch
          '11': English
          '12': Esperanto
          '13': Estonian
          '14': French
          '15': Frisian
          '16': Georgian
          '17': German
          '18': Greek
          '19': Hakha_Chin
          '20': Indonesian
          '21': Interlingua
          '22': Italian
          '23': Japanese
          '24': Kabyle
          '25': Kinyarwanda
          '26': Kyrgyz
          '27': Latvian
          '28': Maltese
          '29': Mangolian
          '30': Persian
          '31': Polish
          '32': Portuguese
          '33': Romanian
          '34': Romansh_Sursilvan
          '35': Russian
          '36': Sakha
          '37': Slovenian
          '38': Spanish
          '39': Swedish
          '40': Tamil
          '41': Tatar
          '42': Turkish
          '43': Ukranian
          '44': Welsh
  - name: input_values
    sequence: float32
  - name: attention_mask
    sequence: int32
  splits:
  - name: train
    num_bytes: 13848986619
    num_examples: 22194
  - name: validation
    num_bytes: 3461442109
    num_examples: 5888
  - name: test
    num_bytes: 3473659131
    num_examples: 5963
  download_size: 8143061729
  dataset_size: 20784087859
---
# Dataset Card for "common_language_preprocessed"
[More Information needed](https://github.com/huggingface/datasets/blob/main/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
Provider:
jpbello
Summary of original information
Dataset overview
Dataset configuration
- Config name: default
- Data files:
  - Train: data/train-*
  - Validation: data/validation-*
  - Test: data/test-*
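The configuration above maps three named splits to shard globs under `data/`. A minimal sketch of loading one of them with the Hugging Face `datasets` library follows. It assumes the repository is still publicly reachable under the ID in the page title and that `datasets` is installed; `load_split` is a hypothetical helper name, not part of the card. Streaming is chosen here only to avoid fetching the full download before inspecting a record.

```python
REPO_ID = "jpbello/common_language_preprocessed"
SPLIT_NAMES = ("train", "validation", "test")  # split names listed in the card


def load_split(split: str = "validation", streaming: bool = True):
    """Return one split of the dataset (hypothetical helper, needs network access)."""
    if split not in SPLIT_NAMES:
        raise ValueError(f"unknown split {split!r}; expected one of {SPLIT_NAMES}")
    # Imported lazily so the module can be read without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(REPO_ID, split=split, streaming=streaming)


# Example usage (performs a network request):
#   ds = load_split("validation")
#   first = next(iter(ds))
#   print(sorted(first.keys()))
```

With `streaming=True` the call returns an iterable dataset, so records are read lazily rather than indexed by position.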
Dataset information
- Features:
  - client_id: string
  - path: string
  - sentence: string
  - age: string
  - gender: string
  - label: class label, one of the following languages:
    - 0: Arabic
    - 1: Basque
    - 2: Breton
    - 3: Catalan
    - 4: Chinese_China
    - 5: Chinese_Hongkong
    - 6: Chinese_Taiwan
    - 7: Chuvash
    - 8: Czech
    - 9: Dhivehi
    - 10: Dutch
    - 11: English
    - 12: Esperanto
    - 13: Estonian
    - 14: French
    - 15: Frisian
    - 16: Georgian
    - 17: German
    - 18: Greek
    - 19: Hakha_Chin
    - 20: Indonesian
    - 21: Interlingua
    - 22: Italian
    - 23: Japanese
    - 24: Kabyle
    - 25: Kinyarwanda
    - 26: Kyrgyz
    - 27: Latvian
    - 28: Maltese
    - 29: Mangolian
    - 30: Persian
    - 31: Polish
    - 32: Portuguese
    - 33: Romanian
    - 34: Romansh_Sursilvan
    - 35: Russian
    - 36: Sakha
    - 37: Slovenian
    - 38: Spanish
    - 39: Swedish
    - 40: Tamil
    - 41: Tatar
    - 42: Turkish
    - 43: Ukranian
    - 44: Welsh
  - input_values: sequence of floating-point values
  - attention_mask: sequence of integers
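The `label` feature stores an integer, so downstream code usually needs a mapping between ids and language names. The sketch below builds that mapping from the list above using only the standard library. The names are copied exactly as the card spells them (including `Mangolian` and `Ukranian`), since they act as identifiers; `id2label` and `label2id` are illustrative helper names, not part of the dataset.

```python
# Label names in id order, spelled exactly as in the dataset card.
LABEL_NAMES = [
    "Arabic", "Basque", "Breton", "Catalan", "Chinese_China",
    "Chinese_Hongkong", "Chinese_Taiwan", "Chuvash", "Czech", "Dhivehi",
    "Dutch", "English", "Esperanto", "Estonian", "French",
    "Frisian", "Georgian", "German", "Greek", "Hakha_Chin",
    "Indonesian", "Interlingua", "Italian", "Japanese", "Kabyle",
    "Kinyarwanda", "Kyrgyz", "Latvian", "Maltese", "Mangolian",
    "Persian", "Polish", "Portuguese", "Romanian", "Romansh_Sursilvan",
    "Russian", "Sakha", "Slovenian", "Spanish", "Swedish",
    "Tamil", "Tatar", "Turkish", "Ukranian", "Welsh",
]

# Reverse lookup table: name -> integer id.
_NAME_TO_ID = {name: idx for idx, name in enumerate(LABEL_NAMES)}


def id2label(idx: int) -> str:
    """Map an integer class id (0..44) to its language name."""
    if not 0 <= idx < len(LABEL_NAMES):
        raise IndexError(f"label id {idx} out of range")
    return LABEL_NAMES[idx]


def label2id(name: str) -> int:
    """Map a language name to its integer class id."""
    return _NAME_TO_ID[name]
```

For example, `id2label(11)` gives `"English"` and `label2id("Welsh")` gives `44`.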
Dataset splits
- Train:
  - Bytes: 13848986619
  - Examples: 22194
- Validation:
  - Bytes: 3461442109
  - Examples: 5888
- Test:
  - Bytes: 3473659131
  - Examples: 5963
Dataset size
- Download size: 8143061729 bytes
- Dataset size: 20784087859 bytes
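The figures above can be cross-checked with a little arithmetic. This sketch uses only the numbers listed in the card to compute the total example count, each split's share, the average stored size per example, and the ratio of download size to dataset size. The variable names are illustrative.

```python
# Per-split statistics copied from the card.
SPLIT_STATS = {
    "train": {"num_bytes": 13848986619, "num_examples": 22194},
    "validation": {"num_bytes": 3461442109, "num_examples": 5888},
    "test": {"num_bytes": 3473659131, "num_examples": 5963},
}
DOWNLOAD_SIZE = 8143061729   # bytes
DATASET_SIZE = 20784087859   # bytes

total_examples = sum(s["num_examples"] for s in SPLIT_STATS.values())
total_bytes = sum(s["num_bytes"] for s in SPLIT_STATS.values())

# The per-split byte counts add up exactly to the reported dataset size.
assert total_bytes == DATASET_SIZE

for name, s in SPLIT_STATS.items():
    share = s["num_examples"] / total_examples
    avg_kb = s["num_bytes"] / s["num_examples"] / 1e3
    print(f"{name:<10} {share:6.1%} of examples, ~{avg_kb:.0f} kB per example")

ratio = DOWNLOAD_SIZE / DATASET_SIZE
print(f"total examples: {total_examples}")
print(f"download / dataset size: {ratio:.2f}")
```

The split byte counts sum exactly to the reported dataset size, and the download is roughly 0.39 of that size.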



