clt-dlsu/wikitext_tl39
Source: Hugging Face. Updated 2024-01-18; indexed 2024-06-15.
Download link:
https://hf-mirror.com/datasets/clt-dlsu/wikitext_tl39
Resource description:
---
annotations_creators:
- no-annotation
language_creators:
- found
language:
- fil
- tl
license:
- gpl-3.0
multilinguality:
- monolingual
size_categories:
- 1M<n<10M
source_datasets:
- original
task_categories:
- text-generation
- fill-mask
task_ids:
- language-modeling
- masked-language-modeling
paperswithcode_id: wikitext-tl-39
pretty_name: WikiText-TL-39
dataset_info:
  features:
  - name: text
    dtype: string
  config_name: wikitext-tl-39
  splits:
  - name: test
    num_bytes: 46182996
    num_examples: 376737
  - name: train
    num_bytes: 217182748
    num_examples: 1766072
  - name: validation
    num_bytes: 46256674
    num_examples: 381763
  download_size: 116335234
  dataset_size: 309622418
---
# Dataset Card for WikiText-TL-39
## Table of Contents
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)
## Dataset Description
- **Homepage:** [Filipino Text Benchmarks](https://github.com/jcblaisecruz02/Filipino-Text-Benchmarks)
- **Repository:**
- **Paper:** [Evaluating language model finetuning techniques for low-resource languages](https://arxiv.org/abs/1907.00409)
- **Leaderboard:**
- **Point of Contact:** Jan Christian Blaise Cruz (jan_christian_cruz@dlsu.edu.ph)
### Dataset Summary
WikiText-TL-39 is a large-scale, unlabeled text dataset with 39 million tokens in the training set. It was inspired by the original WikiText Long Term Dependency dataset (Merity et al., 2016). "TL" stands for "Tagalog". The dataset was published in Cruz & Cheng (2019).
### Supported Tasks and Leaderboards
[More Information Needed]
### Languages
Filipino/Tagalog
## Dataset Structure
### Data Instances
[More Information Needed]
### Data Fields
- `text` (`str`)
The dataset is plain text and has a single field, `text`, because it was compiled for language modeling.
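Because each record carries only a `text` string, downstream code can treat a split as a stream of strings. The following is a minimal sketch, assuming records arrive as dictionaries with a `text` key (the shape the field list above implies). The helper name `count_whitespace_tokens` and the sample strings are illustrative placeholders, not rows from the dataset, and whitespace splitting is an assumption rather than the tokenization used in the paper.

```python
from typing import Dict, Iterable


def count_whitespace_tokens(records: Iterable[Dict[str, str]]) -> int:
    """Count whitespace-separated tokens across the `text` field of all records."""
    total = 0
    for record in records:
        # Each record is expected to hold exactly one field, "text".
        total += len(record["text"].split())
    return total


# Placeholder records for illustration only; these are not rows from the dataset.
sample_records = [
    {"text": "Ito ay isang halimbawa ."},
    {"text": ""},
    {"text": "Isa pang linya ng teksto ."},
]

token_count = count_whitespace_tokens(sample_records)
print(token_count)
```

Any iterable of such dictionaries can be passed in, so the same helper should apply to a loaded split as long as its rows expose a `text` key.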
### Data Splits
Split | Documents | Tokens
------|-----------|-------
Train | 120,975 | 39M
Valid | 25,919 | 8M
Test | 25,921 | 8M
Please see the paper for more details on the dataset splits.
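As a quick sanity check, the proportions implied by the table can be worked out directly, along with a cross-check against the sizes in the card's metadata header. This sketch uses only numbers that appear in this card; the variable names are illustrative. The observation that an "example" is probably a smaller unit than a document is an inference from the counts, not something the card states.

```python
# Document counts from the Data Splits table above.
documents = {"train": 120_975, "valid": 25_919, "test": 25_921}

total_documents = sum(documents.values())
split_share = {name: count / total_documents for name, count in documents.items()}

for name, share in split_share.items():
    print(f"{name}: {share:.1%} of documents")

# Example counts and byte sizes from the card's metadata header.
num_examples = {"train": 1_766_072, "validation": 381_763, "test": 376_737}
num_bytes = {"train": 217_182_748, "validation": 46_256_674, "test": 46_182_996}
dataset_size = 309_622_418

# The per-split byte sizes should add up to the reported dataset_size.
bytes_match = sum(num_bytes.values()) == dataset_size
print("num_bytes sums to dataset_size:", bytes_match)

# Examples per training document; a value well above 1 suggests that an
# "example" is a smaller unit (such as a line) than a document. This is an
# inference from the numbers, not a statement from the card.
train_examples_per_document = num_examples["train"] / documents["train"]
print(f"train examples per document: {train_examples_per_document:.1f}")
```

The document counts correspond to roughly a 70/15/15 split across train, validation, and test.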
## Dataset Creation
### Curation Rationale
[More Information Needed]
### Source Data
Tagalog Wikipedia
#### Initial Data Collection and Normalization
[More Information Needed]
#### Who are the source language producers?
[More Information Needed]
### Annotations
#### Annotation process
[More Information Needed]
#### Who are the annotators?
[More Information Needed]
### Personal and Sensitive Information
[More Information Needed]
## Considerations for Using the Data
### Social Impact of Dataset
[More Information Needed]
### Discussion of Biases
[More Information Needed]
### Other Known Limitations
[More Information Needed]
## Additional Information
### Dataset Curators
[More Information Needed]
### Licensing Information
[More Information Needed]
### Citation Information
[More Information Needed]
### Contributions
Thanks to [@jcblaisecruz02](https://github.com/jcblaisecruz02) for adding this dataset.
Provider:
clt-dlsu



