hac541309/open-lid-dataset
Source: Hugging Face. Updated: 2023-10-27. Indexed: 2024-03-04.
Download link:
https://hf-mirror.com/datasets/hac541309/open-lid-dataset
Description:
---
language:
- en
- ko
- fr
- aa
- hi
license: gpl-3.0
size_categories:
- 100M<n<1B
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
dataset_info:
  features:
  - name: src
    dtype: string
  - name: lang
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 22252477927
    num_examples: 121165414
  download_size: 16613981282
  dataset_size: 22252477927
---
This dataset is built from the open-source data accompanying ["An Open Dataset and Model for Language Identification" (Burchell et al., 2023)](https://arxiv.org/abs/2305.13820).
The repository containing the actual data can be found here: https://github.com/laurieburchell/open-lid-dataset.
The license for this recreation follows that of the original upstream dataset, GPLv3+.
However, the individual datasets within it each follow [their own licenses](https://github.com/laurieburchell/open-lid-dataset/blob/main/licenses.md).
The "src" column lists the source of each sentence. The "lang" column gives the language code in alpha-3/ISO 639-2 format followed by the script. The "text" column contains the sentence itself.
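As a minimal sketch of how the three columns might be consumed, the following streams rows with the Hugging Face `datasets` library and splits the "lang" value into its language and script parts. The underscore separator and sample values such as `eng_Latn` are assumptions inferred from the description above, not something this card states, and the helper names `split_lang`, `filter_by_lang`, `stream_rows` and `preview` are hypothetical.

```python
from typing import Dict, Iterable, Iterator, Tuple


def split_lang(code: str, sep: str = "_") -> Tuple[str, str]:
    """Split a 'lang' value into (language, script).

    Assumes the value looks like '<language><sep><script>'.
    The separator is an assumption; adjust it if the data differs.
    """
    language, _, script = code.partition(sep)
    return language, script


def filter_by_lang(rows: Iterable[Dict[str, str]], language: str) -> Iterator[Dict[str, str]]:
    """Yield only the rows whose 'lang' column has the given language part."""
    for row in rows:
        if split_lang(row["lang"])[0] == language:
            yield row


def stream_rows():
    """Stream the train split instead of downloading it in full.

    Requires `pip install datasets` and network access.
    """
    from datasets import load_dataset  # lazy import: the helpers above work without it

    return load_dataset("hac541309/open-lid-dataset", split="train", streaming=True)


def preview(n: int = 5) -> None:
    """Print the first n streamed rows with the 'lang' value split into parts."""
    for i, row in enumerate(stream_rows()):
        if i >= n:
            break
        print(row["src"], split_lang(row["lang"]), row["text"][:60])
```

Calling `preview()` prints a handful of rows, which is a quick way to check the separator assumption against the real data before filtering at scale.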
Conversion to a Hugging Face dataset and upload to the Hub were done by [Chris Ha](https://github.com/chris-ha458).
The original authors built the dataset to train language identification (LID) models covering 201 languages. I thought such a dataset could also be used to train a tokenizer for those 201 languages.
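The tokenizer idea could be sketched roughly as follows. This is not the uploader's procedure: byte-level BPE, the vocabulary size and the batch size are illustrative assumptions, and `text_batches` and `train_bpe_tokenizer` are hypothetical helper names. The sketch batches the "text" column from any iterable of rows and passes the batches to the Hugging Face `tokenizers` library.

```python
from itertools import islice
from typing import Dict, Iterable, Iterator, List


def text_batches(rows: Iterable[Dict[str, str]], batch_size: int = 1000) -> Iterator[List[str]]:
    """Group the 'text' column of rows into lists of at most batch_size strings."""
    texts = (row["text"] for row in rows)
    while True:
        batch = list(islice(texts, batch_size))
        if not batch:
            return
        yield batch


def train_bpe_tokenizer(rows: Iterable[Dict[str, str]], vocab_size: int = 64000):
    """Train a byte-level BPE tokenizer on the rows.

    Requires `pip install tokenizers`. The model type and vocab_size
    are assumptions chosen for illustration only.
    """
    from tokenizers import Tokenizer, models, pre_tokenizers, trainers

    tokenizer = Tokenizer(models.BPE())
    tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
    trainer = trainers.BpeTrainer(
        vocab_size=vocab_size,
        # Seed the vocabulary with all 256 byte symbols so no input is out of alphabet.
        initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
    )
    tokenizer.train_from_iterator(text_batches(rows), trainer=trainer)
    return tokenizer
```

Because `text_batches` only needs an iterable of dictionaries with a "text" key, it can be fed a streamed split, which avoids materializing all 121 million sentences in memory.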
This dataset was processed and uploaded using the Hugging Face `datasets` library.
[Link to original author](https://huggingface.co/laurievb/OpenLID)
Provider:
hac541309
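Summary of original information
Dataset overview
Basic information
- Languages: English (en), Korean (ko), French (fr), Afar (aa), Hindi (hi)
- License: GPL-3.0
- Size category: 100M<n<1B
Configuration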
- Default configuration (config_name: default)
  - Data file path: data/train-*
  - Split: train
Dataset details
- Features:
  - src: string
  - lang: string
  - text: string
- Splits:
  - Train split (name: train)
    - Bytes: 22252477927
    - Examples: 121165414
- Download size: 16613981282
- Dataset size: 22252477927
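To put the size figures above in perspective, here is a small worked calculation. It uses only the three numbers listed in the metadata; the function name `size_summary` and the choice of GiB as the unit are illustrative assumptions.

```python
NUM_BYTES = 22_252_477_927      # num_bytes / dataset_size of the train split
NUM_EXAMPLES = 121_165_414      # num_examples of the train split
DOWNLOAD_SIZE = 16_613_981_282  # download_size

GIB = 2 ** 30  # bytes per gibibyte


def size_summary(num_bytes: int, num_examples: int, download_size: int) -> dict:
    """Derive average row size, download-to-dataset ratio and sizes in GiB."""
    return {
        "bytes_per_example": num_bytes / num_examples,
        "download_ratio": download_size / num_bytes,
        "dataset_gib": num_bytes / GIB,
        "download_gib": download_size / GIB,
    }


summary = size_summary(NUM_BYTES, NUM_EXAMPLES, DOWNLOAD_SIZE)
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

The averages come out to a little under 200 bytes per row across all three columns, with a download about three quarters the size of the prepared dataset, so a full download needs roughly 15.5 GiB of transfer and about 21 GiB on disk once prepared.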



