rinabuoy/Khmer-ALT
Source: Hugging Face | Updated: 2024-01-18 | Indexed: 2024-03-04
Download link:
https://hf-mirror.com/datasets/rinabuoy/Khmer-ALT
Resource overview:
---
dataset_info:
  features:
  - name: SNT.URLID
    dtype: string
  - name: SNT.URLID.SNTID
    dtype: string
  - name: url
    dtype: string
  - name: translation
    dtype:
      translation:
        languages:
        - bg
        - en
        - en_tok
        - fil
        - hi
        - id
        - ja
        - khm
        - lo
        - ms
        - my
        - th
        - vi
        - zh
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 79348840
    num_examples: 18088
  - name: validation
    num_bytes: 4305969
    num_examples: 1000
  - name: test
    num_bytes: 4423852
    num_examples: 1019
  download_size: 40147709
  dataset_size: 88078661
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: validation
    path: data/validation-*
  - split: test
    path: data/test-*
---
Each record has five features: SNT.URLID, SNT.URLID.SNTID, url, translation, and text. The translation feature holds parallel text keyed by language code: Bengali (bg), English (en, plus a tokenized form, en_tok), Filipino (fil), Hindi (hi), Indonesian (id), Japanese (ja), Khmer (khm), Lao (lo), Malay (ms), Burmese (my), Thai (th), Vietnamese (vi), and Chinese (zh). The data is split into training, validation, and test sets of 18,088, 1,000, and 1,019 examples respectively. The download size is 40,147,709 bytes, and the dataset size is 88,078,661 bytes.
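As a minimal sketch of how the fields described above might be used: the snippet assumes the Hugging Face `datasets` library is installed and that each record's `translation` field is a dict keyed by the language codes in the metadata. `load_khmer_alt` and `extract_pair` are hypothetical helper names introduced here for illustration, and the sample record is hand-written rather than taken from the dataset.

```python
from typing import Optional, Tuple


def load_khmer_alt(split: str = "train"):
    """Load one split from the Hub (needs network access and the `datasets` package)."""
    from datasets import load_dataset  # imported lazily so extract_pair works offline

    return load_dataset("rinabuoy/Khmer-ALT", split=split)


def extract_pair(example: dict, src: str = "en", tgt: str = "khm") -> Optional[Tuple[str, str]]:
    """Return (source, target) sentences, or None if either side is missing or empty.

    Assumes example["translation"] maps language codes to strings.
    """
    translation = example.get("translation") or {}
    source, target = translation.get(src), translation.get(tgt)
    if not source or not target:
        return None
    return source, target


# Hand-written record in the assumed shape; the values are placeholders, not dataset content.
sample = {
    "SNT.URLID": "0",
    "SNT.URLID.SNTID": "0",
    "url": "http://example.com",
    "translation": {"en": "Hello.", "khm": "placeholder Khmer text"},
    "text": "",
}
print(extract_pair(sample))
```

With a real split, something like `pairs = [p for p in map(extract_pair, load_khmer_alt("validation")) if p]` would collect the English-Khmer pairs, provided the assumed record shape holds.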
Provider:
rinabuoy
Summary of original information
Dataset overview
Features
- SNT.URLID: string.
- SNT.URLID.SNTID: string.
- url: string.
- translation: translations in multiple languages, with the following language codes:
  - bg
  - en
  - en_tok
  - fil
  - hi
  - id
  - ja
  - khm
  - lo
  - ms
  - my
  - th
  - vi
  - zh
- text: string.
Splits
- train: 18088 examples, 79348840 bytes.
- validation: 1000 examples, 4305969 bytes.
- test: 1019 examples, 4423852 bytes.
Dataset size
- Download size: 40147709 bytes.
- Dataset size: 88078661 bytes.
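The split and size figures above can be cross-checked with a little arithmetic. The sketch below copies the numbers from the metadata, confirms that the per-split byte counts sum to the reported dataset size, and prints each split's share of the 20107 total examples (roughly 90/5/5). The ratio of download size to dataset size is shown only as a rough indicator of how compact the stored files are.

```python
# Figures copied from the split and size metadata above.
splits = {
    "train": {"num_examples": 18088, "num_bytes": 79348840},
    "validation": {"num_examples": 1000, "num_bytes": 4305969},
    "test": {"num_examples": 1019, "num_bytes": 4423852},
}
download_size = 40147709
dataset_size = 88078661

total_examples = sum(s["num_examples"] for s in splits.values())
total_bytes = sum(s["num_bytes"] for s in splits.values())

# The per-split byte counts add up to the reported dataset size.
assert total_bytes == dataset_size

for name, s in splits.items():
    share = s["num_examples"] / total_examples
    print(f"{name}: {s['num_examples']} examples ({share:.1%})")

print(f"total examples: {total_examples}")
print(f"download / dataset size: {download_size / dataset_size:.3f}")
```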
Configuration
- config_name: default
- data_files:
  - train: path data/train-*
  - validation: path data/validation-*
  - test: path data/test-*
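To illustrate how the glob patterns in the default config select files per split, here is a small sketch using Python's standard `fnmatch`. The shard file names are hypothetical, since the listing does not give the actual names, and the Hub resolves patterns with its own logic, so this is only an approximation of that behaviour. `files_for_split` is a helper name made up for this example.

```python
from fnmatch import fnmatch

# Split-to-pattern mapping copied from the default config.
data_files = {
    "train": "data/train-*",
    "validation": "data/validation-*",
    "test": "data/test-*",
}

# Hypothetical shard names for illustration; the real file names are not given here.
repo_files = [
    "data/train-00000-of-00001.parquet",
    "data/validation-00000-of-00001.parquet",
    "data/test-00000-of-00001.parquet",
    "README.md",
]


def files_for_split(split: str) -> list:
    """Return the repository files whose path matches the split's glob pattern."""
    pattern = data_files[split]
    return [path for path in repo_files if fnmatch(path, pattern)]


for split in data_files:
    print(split, files_for_split(split))
```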



