
clt-dlsu/wikitext_tl39

Hugging Face | Updated 2024-01-18 | Indexed 2024-06-15
Download link:
https://hf-mirror.com/datasets/clt-dlsu/wikitext_tl39
Resource description:
---
annotations_creators:
- no-annotation
language_creators:
- found
language:
- fil
- tl
license:
- gpl-3.0
multilinguality:
- monolingual
size_categories:
- 1M<n<10M
source_datasets:
- original
task_categories:
- text-generation
- fill-mask
task_ids:
- language-modeling
- masked-language-modeling
paperswithcode_id: wikitext-tl-39
pretty_name: WikiText-TL-39
dataset_info:
  features:
  - name: text
    dtype: string
  config_name: wikitext-tl-39
  splits:
  - name: test
    num_bytes: 46182996
    num_examples: 376737
  - name: train
    num_bytes: 217182748
    num_examples: 1766072
  - name: validation
    num_bytes: 46256674
    num_examples: 381763
  download_size: 116335234
  dataset_size: 309622418
---

# Dataset Card for WikiText-TL-39

## Table of Contents

- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** [Filipino Text Benchmarks](https://github.com/jcblaisecruz02/Filipino-Text-Benchmarks)
- **Repository:**
- **Paper:** [Evaluating language model finetuning techniques for low-resource languages](https://arxiv.org/abs/1907.00409)
- **Leaderboard:**
- **Point of Contact:** Jan Christian Blaise Cruz (jan_christian_cruz@dlsu.edu.ph)

### Dataset Summary

Large scale, unlabeled text dataset with 39 Million tokens in the training set. Inspired by the original WikiText Long Term Dependency dataset (Merity et al., 2016). TL means "Tagalog." Published in Cruz & Cheng (2019).

### Supported Tasks and Leaderboards

[More Information Needed]

### Languages

Filipino/Tagalog

## Dataset Structure

### Data Instances

[More Information Needed]

### Data Fields

- `text` (`str`)

The dataset is in plaintext and only has one field ("text") as it is compiled for language modeling.

### Data Splits

Split | Documents | Tokens
------|-----------|-------
Train | 120,975 | 39M
Valid | 25,919 | 8M
Test | 25,921 | 8M

Please see the paper for more details on the dataset splits

## Dataset Creation

### Curation Rationale

[More Information Needed]

### Source Data

Tagalog Wikipedia

#### Initial Data Collection and Normalization

[More Information Needed]

#### Who are the source language producers?

[More Information Needed]

### Annotations

#### Annotation process

[More Information Needed]

#### Who are the annotators?

[More Information Needed]

### Personal and Sensitive Information

[More Information Needed]

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

[More Information Needed]

### Licensing Information

[More Information Needed]

### Citation Information

[More Information Needed]

### Contributions

Thanks to [@jcblaisecruz02](https://github.com/jcblaisecruz02) for adding this dataset.
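The split statistics in the card's YAML header can be sanity-checked with a little arithmetic. The sketch below copies the `num_bytes` and `num_examples` values from that header, confirms that the per-split byte counts add up to `dataset_size`, and computes each split's share of the examples. The helper names (`total_bytes`, `example_shares`) are illustrative and not part of any library. Note that `num_examples` is far larger than the document counts in the Data Splits table, so an "example" here presumably corresponds to something finer-grained than a document (for instance a line of text); the card does not say so explicitly.

```python
# Split statistics copied from the dataset card's YAML header.
SPLITS = {
    "train": {"num_bytes": 217_182_748, "num_examples": 1_766_072},
    "validation": {"num_bytes": 46_256_674, "num_examples": 381_763},
    "test": {"num_bytes": 46_182_996, "num_examples": 376_737},
}
DATASET_SIZE = 309_622_418  # dataset_size from the same header


def total_bytes(splits):
    """Sum num_bytes over all splits."""
    return sum(s["num_bytes"] for s in splits.values())


def example_shares(splits):
    """Return each split's fraction of the total number of examples."""
    total = sum(s["num_examples"] for s in splits.values())
    return {name: s["num_examples"] / total for name, s in splits.items()}


# The per-split byte counts should add up to the declared dataset_size.
print("bytes consistent:", total_bytes(SPLITS) == DATASET_SIZE)

for name, share in example_shares(SPLITS).items():
    print(f"{name}: {share:.1%} of examples")
```

The byte totals match the header exactly, and the shares come out to roughly a 70/15/15 train/validation/test split by example count.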
Provider:
clt-dlsu
Summary of original information

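WikiText-TL-39 Dataset Overview

Dataset Description

Dataset Summary

WikiText-TL-39 is a large-scale, unlabeled text dataset with 39 million tokens in the training set. It was inspired by the original WikiText Long Term Dependency dataset (Merity et al., 2016); TL stands for "Tagalog". The dataset was published in Cruz & Cheng (2019).

Supported Tasks and Leaderboards

[More Information Needed]

Languages

Filipino/Tagalog

Dataset Structure

Data Instances

[More Information Needed]

Data Fields

  • text (str)

The dataset is in plain text and has only one field ("text"), as it is compiled for language modeling.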
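Because every record carries just the one `text` field, reading the data is straightforward. The following is a minimal sketch, not an official loader: it assumes the Hugging Face `datasets` package is installed, that the repository id shown on this page (`clt-dlsu/wikitext_tl39`) resolves, and that the installed `datasets` version can load it without extra arguments. The helper names `load_split` and `nonblank_texts` are hypothetical, and skipping blank records is an assumption based on WikiText-style corpora usually containing empty separator lines.

```python
def load_split(split="validation", repo_id="clt-dlsu/wikitext_tl39"):
    """Download one split from the Hugging Face Hub (requires network access)."""
    # Imported lazily so the rest of this sketch works without `datasets`.
    from datasets import load_dataset

    return load_dataset(repo_id, split=split)


def nonblank_texts(records):
    """Yield the stripped "text" field of each record, skipping blank ones."""
    for record in records:
        text = record["text"].strip()
        if text:
            yield text


def preview(records, limit=5):
    """Return the first `limit` non-blank texts as a list."""
    out = []
    for text in nonblank_texts(records):
        out.append(text)
        if len(out) >= limit:
            break
    return out
```

Calling `preview(load_split("validation"))` would download the validation split and return its first few non-blank texts; `nonblank_texts` and `preview` also work on any iterable of dictionaries with a `text` key, which makes them easy to try offline.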

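Data Splits

Split | Documents | Tokens
------|-----------|-------
Train | 120,975 | 39M
Validation | 25,919 | 8M
Test | 25,921 | 8M

See the paper for more details on the dataset splits.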

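Dataset Creation

Curation Rationale

[More Information Needed]

Source Data

Tagalog Wikipedia

Initial Data Collection and Normalization

[More Information Needed]

Source Language Producers

[More Information Needed]

Annotations

Annotation Process

[More Information Needed]

Annotators

[More Information Needed]

Personal and Sensitive Information

[More Information Needed]

Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

[More Information Needed]

Additional Information

Dataset Curators

[More Information Needed]

Licensing Information

[More Information Needed]

Citation Information

[More Information Needed]

Contributions

Thanks to @jcblaisecruz02 for adding this dataset.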
