FrankCCCCC/lm1b

Name: FrankCCCCC/lm1b
Creator: FrankCCCCC
Published: 2026-01-28 19:01:23
License: No description available

Hugging Face · Updated 2026-01-28 · Indexed 2026-03-29

Download link:

https://hf-mirror.com/datasets/FrankCCCCC/lm1b

Resource description:

---
license: apache-2.0
task_categories:
- text-generation
- fill-mask
language:
- en
size_categories:
- 1B<n<10B
---

# LM1B - One Billion Word Benchmark

## Dataset Description

The One Billion Word Benchmark is a large language modeling dataset. It contains approximately one billion words of training data derived from news articles.

## How was this dataset built?

We download the full LM1B dataset from TensorFlow Datasets (TFDS) and convert it to Hugging Face format automatically. The full script is in `lm1b.py`. The required environment is:

- tensorflow==2.20.0
- tensorflow-datasets==4.9.9
- huggingface_hub==1.3.3
- datasets==4.4.1

```bash
pip install tensorflow==2.20.0 tensorflow-datasets==4.9.9 huggingface_hub==1.3.3 datasets==4.4.1
python lm1b_builder.py --action all
```

## Dataset Structure

### Data Fields

- `text`: A string containing the text content

### Data Splits

| Split | Examples |
|-------|----------|
| train | 30,301,028 |
| test | 306,688 |

## Citation

```bibtex
@inproceedings{chelba2013one,
  title={One billion word benchmark for measuring progress in statistical language modeling},
  author={Chelba, Ciprian and Mikolov, Tomas and Schuster, Mike and Ge, Qi and Brants, Thorsten and Koehn, Philipp and Robinson, Tony},
  booktitle={Interspeech},
  year={2014}
}
```

## License

Apache 2.0
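## Example: loading and checking the splits

The data fields and split sizes listed in the card can be checked after download. The following is a minimal sketch, not the author's own tooling: it assumes the repository `FrankCCCCC/lm1b` loads with the default settings of `datasets.load_dataset`, and the helper names `split_mismatches` and `report` are hypothetical, introduced here only for illustration.

```python
from typing import Dict, List

# Split sizes copied from the "Data Splits" table in the dataset card.
EXPECTED_SPLITS: Dict[str, int] = {"train": 30_301_028, "test": 306_688}


def split_mismatches(actual: Dict[str, int]) -> List[str]:
    """Return readable mismatches between actual and expected split sizes."""
    problems: List[str] = []
    for name, expected in EXPECTED_SPLITS.items():
        if name not in actual:
            problems.append(f"missing split: {name}")
        elif actual[name] != expected:
            problems.append(f"{name}: expected {expected}, got {actual[name]}")
    return problems


def report() -> None:
    """Download the dataset and compare it with the card (needs network access).

    Assumption: the repo loads without extra arguments; this is not confirmed
    by the card itself.
    """
    # Imported lazily so the helper above works without `datasets` installed.
    from datasets import load_dataset

    ds = load_dataset("FrankCCCCC/lm1b")
    sizes = {name: split.num_rows for name, split in ds.items()}
    print(split_mismatches(sizes) or "split sizes match the card")
    # Each record is expected to expose a single `text` string field.
    print(ds["train"][0]["text"])
```

Calling `report()` triggers the full download, so the size check is kept in a separate pure function that can be exercised on its own, for example against sizes read from dataset metadata.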

Provider:

FrankCCCCC
