hac541309/open-lid-dataset

Name: hac541309/open-lid-dataset
Creator: hac541309
Published: 2023-10-27 01:18:24
License: No description available

Hugging Face, updated 2023-10-27, indexed 2024-03-04

Download link:

https://hf-mirror.com/datasets/hac541309/open-lid-dataset

Resource description:

---
language:
- en
- ko
- fr
- aa
- hi
license: gpl-3.0
size_categories:
- 100M<n<1B
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
dataset_info:
  features:
  - name: src
    dtype: string
  - name: lang
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 22252477927
    num_examples: 121165414
  download_size: 16613981282
  dataset_size: 22252477927
---

This dataset is built from the open source data accompanying ["An Open Dataset and Model for Language Identification" (Burchell et al., 2023)](https://arxiv.org/abs/2305.13820). The repository containing the actual data can be found here: https://github.com/laurieburchell/open-lid-dataset.

The license for this recreation itself follows the original upstream dataset as GPLv3+. However, individual datasets within it follow [each of their own licenses](https://github.com/laurieburchell/open-lid-dataset/blob/main/licenses.md).

The "src" column lists the sources. The "lang" column lists the language code in alpha-3/ISO 639-2 format followed by the script. The "text" column contains the sentence.

Conversion to a Hugging Face dataset and upload to the hub were done by [Chris Ha](https://github.com/chris-ha458).

The original authors built the dataset for LID models for 201 languages. I thought such a dataset could also be used for a tokenizer for 201 languages.

This dataset was processed and uploaded using Hugging Face datasets.

[Link to original author](https://huggingface.co/laurievb/OpenLID)
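As a rough illustration of the three columns described above, the following Python sketch streams a few rows with the Hugging Face `datasets` library and splits a `lang` label into its language code and script. The helper names `split_lang_code` and `stream_rows` are hypothetical, and the underscore separator (as in `eng_Latn`) is an assumption about the label layout that the card does not state explicitly.

```python
from typing import Iterator, Tuple


def split_lang_code(lang: str) -> Tuple[str, str]:
    """Split a label such as 'eng_Latn' into (language code, script).

    Assumption: code and script are joined by a single underscore.
    """
    code, sep, script = lang.partition("_")
    if not sep or not code or not script:
        raise ValueError(f"unexpected lang label: {lang!r}")
    return code, script


def stream_rows(limit: int = 5) -> Iterator[dict]:
    """Yield the first `limit` rows without downloading the full archive."""
    # Imported lazily so split_lang_code works without the `datasets` package.
    from datasets import load_dataset

    ds = load_dataset(
        "hac541309/open-lid-dataset", split="train", streaming=True
    )
    for i, row in enumerate(ds):
        if i >= limit:
            break
        yield row  # dict with "src", "lang", and "text" keys
```

Streaming is used here because the full download is on the order of 16.6 GB. A caller might loop over `stream_rows(3)` and pass each `row["lang"]` to `split_lang_code` to check whether the assumed label layout holds for the actual data.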

Provided by:

hac541309

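Summary of original information

Dataset overview

Basic information

Languages: English (en), Korean (ko), French (fr), Afar (aa), Hindi (hi)
License: GPL-3.0
Size category: 100M<n<1B

Configuration

Default configuration (config_name: default)
- Data file path: data/train-*
- Split: train

Dataset details

Features:
- src: string
- lang: string
- text: string
Splits:
- Training split (name: train)
  - Bytes: 22252477927
  - Examples: 121165414
Download size: 16613981282 bytes
Dataset size: 22252477927 bytes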
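To put the raw figures above in more readable terms, a small Python sketch can convert them to GiB and derive the average example size and the download-to-dataset size ratio. It assumes that the download and dataset sizes are byte counts, as is usual for Hugging Face dataset metadata; the constant and function names are illustrative only.

```python
NUM_BYTES = 22_252_477_927      # dataset size (num_bytes) of the train split
NUM_EXAMPLES = 121_165_414      # number of examples in the train split
DOWNLOAD_SIZE = 16_613_981_282  # download size, assumed to be in bytes


def to_gib(n_bytes: int) -> float:
    """Convert a byte count to gibibytes (2**30 bytes)."""
    return n_bytes / 2**30


# Average size of one row, counting all three columns.
avg_bytes_per_example = NUM_BYTES / NUM_EXAMPLES

# Fraction of the in-memory dataset size that must be downloaded.
download_ratio = DOWNLOAD_SIZE / NUM_BYTES

print(f"dataset size:      {to_gib(NUM_BYTES):.2f} GiB")
print(f"download size:     {to_gib(DOWNLOAD_SIZE):.2f} GiB")
print(f"avg bytes/example: {avg_bytes_per_example:.1f}")
print(f"download ratio:    {download_ratio:.3f}")
```

The average works out to under 200 bytes per row, which is consistent with the card's statement that each `text` entry holds a single sentence.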
