
FrankCCCCC/lm1b

Source: Hugging Face. Updated 2026-01-28. Indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/FrankCCCCC/lm1b
Description:
---
license: apache-2.0
task_categories:
- text-generation
- fill-mask
language:
- en
size_categories:
- 1B<n<10B
---

# LM1B - One Billion Word Benchmark

## Dataset Description

The One Billion Word Benchmark is a large language modeling dataset. It contains approximately one billion words of training data derived from news articles.

## How was this dataset built?

We download the full LM1B dataset from TensorFlow Datasets (TFDS) and convert it to HuggingFace format automatically. The full script is in `lm1b.py`. The required environment is:

- tensorflow==2.20.0
- tensorflow-datasets==4.9.9
- huggingface_hub==1.3.3
- datasets==4.4.1

```bash
pip install tensorflow==2.20.0 tensorflow-datasets==4.9.9 huggingface_hub==1.3.3 datasets==4.4.1
python lm1b_builder.py --action all
```

## Dataset Structure

### Data Fields

- `text`: A string containing the text content

### Data Splits

| Split | Examples |
|-------|----------|
| train | 30,301,028 |
| test | 306,688 |

## Citation

```bibtex
@inproceedings{chelba2013one,
  title={One billion word benchmark for measuring progress in statistical language modeling},
  author={Chelba, Ciprian and Mikolov, Tomas and Schuster, Mike and Ge, Qi and Brants, Thorsten and Koehn, Phillipp and Robinson, Tony},
  booktitle={Interspeech},
  year={2014}
}
```

## License

Apache 2.0
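
The conversion step described under "How was this dataset built?" can be illustrated with a minimal, standard-library-only sketch. It is not the author's build script, which is not reproduced here. It assumes that source records arrive as dictionaries whose `text` value may be a UTF-8 byte string, and it emits rows with the single `text` string field listed under Data Fields. The helper names `to_text_rows`, `to_jsonl`, and `held_out_fraction` are hypothetical, and JSON Lines is used only as a convenient intermediate format.

```python
import json
from typing import Iterable, Iterator

# Split sizes copied from the Data Splits table of the dataset card.
EXPECTED_EXAMPLES = {"train": 30_301_028, "test": 306_688}


def to_text_rows(records: Iterable[dict]) -> Iterator[dict]:
    """Yield {"text": str} rows from records whose "text" may be bytes or str."""
    for record in records:
        text = record["text"]
        if isinstance(text, bytes):
            # Assumption: byte strings are UTF-8 encoded.
            text = text.decode("utf-8")
        yield {"text": text}


def to_jsonl(records: Iterable[dict]) -> str:
    """Serialize the converted rows as JSON Lines, one object per line."""
    return "\n".join(
        json.dumps(row, ensure_ascii=False) for row in to_text_rows(records)
    )


def held_out_fraction(sizes: dict) -> float:
    """Return the share of all examples that belong to the test split."""
    return sizes["test"] / (sizes["train"] + sizes["test"])


# Tiny illustrative records; these are made-up sentences, not dataset content.
sample = [{"text": b"The quick brown fox ."}, {"text": "Already a string ."}]
print(to_jsonl(sample))
print(f"test share: {held_out_fraction(EXPECTED_EXAMPLES):.2%}")
```

With the split sizes from the table, the test split is roughly 1% of all examples, which is a quick sanity check to run after any conversion.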

Provider:
FrankCCCCC