FrankCCCCC/lm1b
Source: Hugging Face. Updated 2026-01-28; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/FrankCCCCC/lm1b
---
license: apache-2.0
task_categories:
- text-generation
- fill-mask
language:
- en
size_categories:
- 1B<n<10B
---
# LM1B - One Billion Word Benchmark
## Dataset Description
The One Billion Word Benchmark is a large English corpus for training and evaluating language models.
It contains approximately one billion words of training data derived from news articles.
## How was this dataset built?
We download the full LM1B dataset from TensorFlow Datasets (TFDS) and convert it to the Hugging Face format automatically. The full script is in `lm1b.py`. The required environment is:
- tensorflow==2.20.0
- tensorflow-datasets==4.9.9
- huggingface_hub==1.3.3
- datasets==4.4.1
```bash
pip install tensorflow==2.20.0 tensorflow-datasets==4.9.9 huggingface_hub==1.3.3 datasets==4.4.1
python lm1b_builder.py --action all
```
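The conversion script itself is not reproduced here. As a rough sketch of what a TFDS-to-Hugging Face conversion can look like, the following assumes the TFDS dataset name `lm1b` with a single bytes-valued `text` feature. The helper names `decode_examples` and `build_split` are hypothetical and are not taken from the author's script.

```python
from typing import Dict, Iterable, Iterator


def decode_examples(records: Iterable[Dict[str, bytes]]) -> Iterator[Dict[str, str]]:
    """Turn TFDS-style records ({"text": bytes}) into {"text": str} rows."""
    for record in records:
        yield {"text": record["text"].decode("utf-8")}


def build_split(split: str):
    """Load one TFDS split and wrap it as a Hugging Face Dataset.

    Hypothetical helper. The heavy imports stay inside the function so the
    decoding logic above can be used without TensorFlow installed.
    """
    import tensorflow_datasets as tfds
    from datasets import Dataset

    def rows():
        # Assumption: "lm1b" is the TFDS name; first use downloads and prepares it.
        tf_ds = tfds.load("lm1b", split=split)
        # tfds.as_numpy yields plain dicts whose string features are bytes.
        yield from decode_examples(tfds.as_numpy(tf_ds))

    return Dataset.from_generator(rows)
```

Under these assumptions, `build_split("test")` would return a `Dataset` with one `text` column, which could then be uploaded with `push_to_hub`.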
## Dataset Structure
### Data Fields
- `text`: A string containing the text content
### Data Splits
| Split | Examples |
|-------|----------|
| train | 30,301,028 |
| test | 306,688 |
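
A minimal loading sketch, assuming the repository id `FrankCCCCC/lm1b` from the download link and the standard `datasets.load_dataset` API. The helper names `load_lm1b` and `held_out_fraction` are illustrative; the constants mirror the table above and can serve as a sanity check after download.

```python
# Split sizes copied from the Data Splits table.
EXPECTED_EXAMPLES = {"train": 30_301_028, "test": 306_688}


def held_out_fraction(sizes=EXPECTED_EXAMPLES) -> float:
    """Share of all examples that fall in the test split."""
    return sizes["test"] / (sizes["train"] + sizes["test"])


def load_lm1b(split: str = "test", streaming: bool = True):
    """Load one split from the Hub (illustrative helper, needs network access).

    Streaming avoids downloading the whole training split up front.
    """
    from datasets import load_dataset

    return load_dataset("FrankCCCCC/lm1b", split=split, streaming=streaming)
```

With streaming enabled, `next(iter(load_lm1b("test")))["text"]` should yield the first test example as a string. By the counts in the table, the test split holds roughly 1% of all examples.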
## Citation
```bibtex
@inproceedings{chelba2013one,
title={One billion word benchmark for measuring progress in statistical language modeling},
author={Chelba, Ciprian and Mikolov, Tomas and Schuster, Mike and Ge, Qi and Brants, Thorsten and Koehn, Phillipp and Robinson, Tony},
booktitle={Interspeech},
year={2014}
}
```
## License
Apache 2.0
Provider: FrankCCCCC



