Birchlabs/c4-t5-ragged
Source: Hugging Face. Updated 2024-02-16, indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/Birchlabs/c4-t5-ragged
---
pretty_name: C4
annotations_creators:
- no-annotation
language_creators:
- found
language:
- en
license:
- odc-by
size_categories:
- n<1K
- 1K<n<10K
- 10K<n<100K
- 100K<n<1M
- 1M<n<10M
- 10M<n<100M
- 100M<n<1B
- 1B<n<10B
source_datasets:
- original
task_categories:
- text-generation
- fill-mask
task_ids:
- language-modeling
- masked-language-modeling
paperswithcode_id: c4
---
# C4, T5 tokenized, in ragged array format
Processed distribution of Google's [C4](https://www.tensorflow.org/datasets/catalog/c4) dataset: a colossal, cleaned version of [Common Crawl](https://commoncrawl.org)'s web crawl corpus.
Uses the text data from [`allenai/c4`](https://huggingface.co/datasets/allenai/c4).
Includes `en` subset only.
T5 tokenizer was applied to the text.
Distributed as a ragged array.
Converted via [`json_to_ragged.py`](https://github.com/Birch-san/pre-tokenize/blob/main/script/json_to_ragged.py).
Download size of all shards:
| Split | Data+Lengths Size | Divided across `n` Shards | Typical shard size: `data.npy` | Typical shard size: `len.npy` |
|-|-|-|-|-|
| Train | 293G | 1024 | 344M | 1.4M |
| Validation | 299M | 8 | 44M | 179K |
| **Total** | **296G** | _N/A_ | _N/A_ | _N/A_ |
The data is uncompressed, in order to preserve support for random-seeking.
`.data.npy` would probably benefit from compression, because token sequences exhibit patterns.
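Because the shards are uncompressed `.npy` files, they can be memory-mapped rather than loaded whole. The following is a minimal sketch of that idea, not the author's own reader: it writes a toy array to a temporary file (the file name `toy.data.npy`, the `uint16` dtype and the contents are assumptions for illustration) and reads a small window back through `np.load(..., mmap_mode="r")`.

```python
import os
import tempfile

import numpy as np

# Toy stand-in for a shard's `.data.npy`; dtype and contents are assumed, not taken from the dataset.
toy_tokens = np.arange(1000, dtype=np.uint16)

with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "toy.data.npy")
    np.save(toy_path, toy_tokens)

    # mmap_mode="r" maps the file read-only instead of reading it all into RAM,
    # so slicing only touches the pages that back the requested range.
    data = np.load(toy_path, mmap_mode="r")
    window = np.array(data[500:510])  # copy a small slice out of the mapping

    # Drop the mapping before the temporary directory is cleaned up.
    del data

print(window)
```

With a compressed file this kind of seek would not be possible without decompressing first, which is the trade-off the paragraph above describes.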
Tokenization achieves a ~44% compression ratio.
Allen AI's original gzipped JSONL text data achieved a ~61% compression ratio.
So the tokenized data is about 13% bigger than the gzipped text.
Download everything via:
```bash
pip install hf_transfer "huggingface_hub[cli]"
HF_HUB_ENABLE_HF_TRANSFER=True huggingface-cli download --repo-type dataset --local-dir . --local-dir-use-symlinks False Birchlabs/c4-t5-ragged .
```
Download a single ragged array to try it out:
```bash
huggingface-cli download --repo-type dataset --local-dir . --local-dir-use-symlinks False Birchlabs/c4-t5-ragged en/validation/c4-validation.00000-of-00008.{data,len}.npy
```
Read ragged arrays like so:
https://github.com/Birch-san/pre-tokenize/blob/main/script/read_ragged.py
The basic idea is:
`data.npy` is a very long 1D numpy array of tokens.
`len.npy` is a 1D numpy array describing how long each sample in `data.npy` is.
To read sample 0 from `data.npy`, you would:
- start at index 0 in `data.npy`
- check sample 0's length (position 0 in `len.npy`)
- read from index 0 to index 0 + length-of-sample-0
To read sample 1 from `data.npy`, you would:
- start at the end of sample 0.
- check sample 1's length (position 1 in `len.npy`)
- read from end-of-sample-0 to end-of-sample-0 + length-of-sample-1
We can obtain an index of sample ending positions by adding each sample length as we go along (`lengths.cumsum()`).
We can obtain an index of sample starting positions by prepending the aforementioned endings index with a 0.
[`read_ragged.py`](https://github.com/Birch-san/pre-tokenize/blob/main/script/read_ragged.py) demonstrates how to create this index, and use it to achieve random access.
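The indexing scheme described above can be sketched in a few lines of numpy. This is an illustrative sketch rather than the contents of `read_ragged.py`: the toy `data` and `lengths` arrays, their dtypes, and the helper name `read_sample` are assumptions made for the example.

```python
import numpy as np

# Toy arrays standing in for one shard's `data.npy` and `len.npy` (values and dtypes assumed).
data = np.array([11, 12, 13, 21, 22, 31, 32, 33, 34], dtype=np.uint16)
lengths = np.array([3, 2, 4], dtype=np.int64)

# Ending position of each sample: running total of the lengths.
ends = lengths.cumsum(dtype=np.int64)

# Prepend a 0 to the endings index; entry i is then the start of sample i
# and entry i + 1 is its end.
offsets = np.concatenate(([0], ends))


def read_sample(i: int) -> np.ndarray:
    """Return the tokens of sample i (hypothetical helper for illustration)."""
    return data[offsets[i]:offsets[i + 1]]


for i in range(len(lengths)):
    print(i, read_sample(i))
```

Building `offsets` once costs a single pass over `len.npy`; after that, any sample can be sliced out directly without scanning the samples before it.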
**This isn't ready for use in torch DataLoader.**
This dataset format is intended as a _precursor_, from which you could create a dataset in a different format.
For example, you might want to iterate over every sample here, chunking by a fixed context length, and output the samples via .parquet chunks for use with torch DataLoader.
That's an easy way out, but your disk won't thank you if you do fully-random access.
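As a rough sketch of the chunking step only (writing the chunks out to .parquet is left aside), one could walk the samples in order and split each at a fixed context length. The function name `iter_chunks` and the toy inputs are assumptions for illustration; the last chunk of a sample is simply left shorter here, which may or may not suit your training setup.

```python
from typing import Iterator

import numpy as np


def iter_chunks(data: np.ndarray, lengths: np.ndarray, context_len: int) -> Iterator[np.ndarray]:
    """Yield each sample split into consecutive pieces of at most context_len tokens."""
    lengths = np.asarray(lengths, dtype=np.int64)  # avoid unsigned/signed mixing below
    ends = np.cumsum(lengths, dtype=np.int64)
    starts = ends - lengths
    for start, end in zip(starts, ends):
        start, end = int(start), int(end)
        for chunk_start in range(start, end, context_len):
            # Never read past the end of the current sample.
            yield data[chunk_start:min(chunk_start + context_len, end)]


# Toy shard: three samples of lengths 5, 1 and 4 (values assumed for illustration).
toy_data = np.arange(10, dtype=np.uint16)
toy_lengths = np.array([5, 1, 4])
chunks = [chunk.tolist() for chunk in iter_chunks(toy_data, toy_lengths, context_len=2)]
print(chunks)
```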
An approach that hits the disk less and requires less RAM would be to implement an `IterableDataset`, where you iterate sequentially over shards but shuffle within each shard (or within a smaller-than-shard buffer).
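The sequential-shards, shuffled-within-shard idea can be sketched as a plain generator. This is a minimal sketch under stated assumptions: shards are passed in as already-loaded `(data, lengths)` pairs, and the name `iter_shuffled` and the toy values are made up for the example. A generator like this could serve as the body of `__iter__` in a `torch.utils.data.IterableDataset` subclass, but torch is deliberately left out so the example depends only on numpy.

```python
from typing import Iterator, Sequence, Tuple

import numpy as np


def iter_shuffled(
    shards: Sequence[Tuple[np.ndarray, np.ndarray]], seed: int = 0
) -> Iterator[np.ndarray]:
    """Visit shards in order, yielding each shard's samples in a random order."""
    rng = np.random.default_rng(seed)
    for data, lengths in shards:
        ends = np.cumsum(lengths, dtype=np.int64)
        starts = np.concatenate(([0], ends[:-1]))
        # Shuffle sample indices only; the token data itself stays where it is.
        for i in rng.permutation(len(lengths)):
            yield data[starts[i]:ends[i]]


# Two toy shards (values assumed for illustration).
shard_a = (np.array([1, 1, 2, 2, 2, 3], dtype=np.uint16), np.array([2, 3, 1]))
shard_b = (np.array([4, 5, 5], dtype=np.uint16), np.array([1, 2]))
samples = [s.tolist() for s in iter_shuffled([shard_a, shard_b], seed=0)]
print(samples)
```

Only one shard needs to be resident (or memory-mapped) at a time, and reads stay local to that shard, which is what keeps disk seeks and RAM use down compared with fully random access.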
You might also want to perform analyses over the `.len.npy` to decide how to pack these sequences (e.g. packing a 128 and 384 sequence into a 512 context length).
You can do such an analysis via GraphCore's [packedBERT](https://github.com/graphcore/tutorials/tree/sdk-release-2.1/blogs_code/packedBERT).
Then you could process the data into a "packed" dataset.
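To make the packing idea concrete, here is a small sketch using plain first-fit decreasing bin packing over sample lengths. This is a swapped-in, simpler technique, not one of the histogram-based algorithms from packedBERT, and the function name `first_fit_decreasing` and the toy lengths are assumptions for illustration. It scans every open pack per sample, so it is only meant for small experiments rather than a full pass over the dataset.

```python
from typing import List

import numpy as np


def first_fit_decreasing(lengths: np.ndarray, context_len: int) -> List[List[int]]:
    """Group sample indices into packs whose total length fits in context_len."""
    order = np.argsort(lengths)[::-1]  # longest samples first
    packs: List[List[int]] = []
    free: List[int] = []  # remaining room in each pack
    for i in order:
        n = int(lengths[i])
        if n > context_len:
            raise ValueError(f"sample {int(i)} is longer than the context length")
        for p, room in enumerate(free):
            if n <= room:
                packs[p].append(int(i))
                free[p] -= n
                break
        else:
            # No existing pack has room, so open a new one.
            packs.append([int(i)])
            free.append(context_len - n)
    return packs


# Toy lengths, e.g. as read from a `.len.npy` shard (values assumed for illustration).
toy_lengths = np.array([128, 384, 512, 256, 200])
packs = first_fit_decreasing(toy_lengths, context_len=512)
print(packs)
```

In this toy run the 128-token and 384-token samples end up sharing one 512-token pack, mirroring the example in the text above.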
### Source Data
#### Initial Data Collection and Normalization
The C4 and mC4 datasets are collections of text sourced from the public Common Crawl web scrape. Their construction applies heuristics to extract only natural language (as opposed to boilerplate and other gibberish), in addition to extensive deduplication. You can find the code that was used to build this dataset in [c4.py](https://github.com/tensorflow/datasets/blob/5952d3d60d60e1727786fa7a9a23d24bb463d4d6/tensorflow_datasets/text/c4.py) by TensorFlow Datasets.
The C4 dataset was explicitly designed to be English only: any page that was not given a probability of at least 99% of being English by [langdetect](https://github.com/Mimino666/langdetect) was discarded.
To build mC4, the authors used [CLD3](https://github.com/google/cld3) to identify over 100 languages.
### Licensing Information
We are releasing this dataset under the terms of [ODC-BY](https://opendatacommons.org/licenses/by/1-0/). By using this, you are also bound by the [Common Crawl terms of use](https://commoncrawl.org/terms-of-use/) in respect of the content contained in the dataset.
### Acknowledgements
Big ups to the good folks at [Common Crawl](https://commoncrawl.org) whose data made this possible ([consider donating](http://commoncrawl.org/donate/)!), to Google for creating the code that curates and filters the data, and to Hugging Face, who had no issue with hosting these 3TB of data for public download!
Thanks to [Allen AI](https://allenai.org/) for sharing the text that was processed to make this dataset.
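Provided by:
Birchlabs
Summary of original information
Dataset overview
Basic information
- Dataset name: C4
- Annotation creators: no annotation
- Language creators: found
- Language: English
- License: ODC-BY
- Size categories:
  - n<1K
  - 1K<n<10K
  - 10K<n<100K
  - 100K<n<1M
  - 1M<n<10M
  - 10M<n<100M
  - 100M<n<1B
  - 1B<n<10B
- Source datasets: original
- Task categories:
  - text generation
  - fill-mask
- Task IDs:
  - language modeling
  - masked language modeling
- PapersWithCode ID: c4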
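Dataset description
- Data source: processed from Google's C4 dataset, a colossal, cleaned version of the Common Crawl web crawl corpus.
- Subset: `en` subset only.
- Processing: text tokenized with the T5 tokenizer and distributed in ragged array format.
- Conversion script: converted via the `json_to_ragged.py` script.
Data size
- Train set: 293G across 1024 shards; a typical shard is 344M (data) and 1.4M (lengths).
- Test (validation) set: 299M across 8 shards; a typical shard is 44M (data) and 179K (lengths).
- Total: 296G.
Data format
- Data files: uncompressed, to support random access.
- Tokenization compression ratio: ~44%.
- Original data compression ratio: ~61%.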
How to download
- Download everything: run `pip install hf_transfer "huggingface_hub[cli]"`, then `HF_HUB_ENABLE_HF_TRANSFER=True huggingface-cli download --repo-type dataset --local-dir . --local-dir-use-symlinks False Birchlabs/c4-t5-ragged .`
- Download a single shard: `huggingface-cli download --repo-type dataset --local-dir . --local-dir-use-symlinks False Birchlabs/c4-t5-ragged en/validation/c4-validation.00000-of-00008.{data,len}.npy`
Reading the data
- Reading script: `read_ragged.py`
- Data structure:
  - `data.npy`: a very long 1D numpy array containing all tokens.
  - `len.npy`: a 1D numpy array describing the length of each sample in `data.npy`.
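Dataset uses
- Preprocessing: this format is intended as a precursor from which datasets in other formats can be created.
- Analysis: the `.len.npy` files can be analysed to decide how to pack sequences.
- Optimization: an `IterableDataset` can be implemented to reduce disk access and memory requirements.
Source data
- Data collection and normalization:
  - Data source: the Common Crawl web scrape.
  - Processing: language detection and filtering, using langdetect for C4 and CLD3 for mC4.
  - Code: available in `c4.py` from TensorFlow Datasets.
Licensing information
- License: ODC-BY
- Additional terms: use of this dataset is also subject to the Common Crawl terms of use.
Acknowledgements
- Contributors: Common Crawl, Google, Hugging Face, Allen AI.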



