finewiki-10M
ModelScope community, updated 2025-12-05, indexed 2025-11-08
Download link:
https://modelscope.cn/datasets/codelion/finewiki-10M
Resource description:
# FineWiki Sampled Dataset (10,000,000 tokens)
This is a sampled subset of [HuggingFaceFW/finewiki](https://huggingface.co/datasets/HuggingFaceFW/finewiki) containing approximately **10,000,000 tokens**.
## Dataset Details
### Source
- **Original Dataset**: HuggingFaceFW/finewiki (English subset, train split)
- **Sampling Method**: Reservoir sampling (unbiased random sampling)
- **Target Token Count**: 10,000,000 tokens
- **Tokenizer**: GPT-2 (50,257 vocabulary)
### Sampling Statistics
- **Documents Sampled**: 7,088
- **Average Tokens/Doc**: 1411.0
- **Random Seed**: 42
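The figures above can be cross-checked with a little arithmetic: the document count multiplied by the average length should land near the 10,000,000-token target. A minimal sketch, using only the numbers reported in this card (the variable names are illustrative):

```python
# Figures reported in the Source and Sampling Statistics sections of this card.
documents_sampled = 7_088
average_tokens_per_doc = 1411.0
target_tokens = 10_000_000

# Estimated total; the average is rounded to one decimal, so this is approximate.
estimated_total = documents_sampled * average_tokens_per_doc
relative_gap = abs(estimated_total - target_tokens) / target_tokens

print(f"estimated total tokens: {estimated_total:,.0f}")
print(f"relative gap to target: {relative_gap:.4%}")
```

The product comes to roughly 10,001,168 tokens, within about 0.01% of the target, so the reported statistics are mutually consistent.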
### Sampling Method
This dataset was created using **reservoir sampling**, which ensures:
- ✅ Unbiased random sample from the full dataset
- ✅ Every document has equal probability of being selected
- ✅ No distribution bias (early/late documents equally represented)
- ✅ Streaming-based (no need to download full dataset)
The sampling algorithm:
1. Streams through HuggingFaceFW/finewiki without downloading
2. Uses GPT-2 tokenizer to count tokens per document
3. Maintains a reservoir of documents using standard reservoir sampling
4. Stops when target token count is reached
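The steps above can be sketched with the textbook form of reservoir sampling (Algorithm R). This is an illustration under stated assumptions, not the script used to build the dataset: the card does not say how the token budget interacts with the reservoir, so the sketch uses a fixed capacity `k` in documents. The names `reservoir_sample` and `count_tokens` are hypothetical, and `count_tokens` is a whitespace-based stand-in for the GPT-2 tokenizer so that the example runs without extra dependencies.

```python
import random
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int, seed: int = 42) -> List[T]:
    """Algorithm R: keep a uniform random sample of up to k items from a stream."""
    rng = random.Random(seed)
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Replace a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # randint is inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


def count_tokens(text: str) -> int:
    """Assumption: whitespace split as a stand-in for GPT-2 token counting."""
    return len(text.split())


# Hypothetical toy stream standing in for the streamed FineWiki documents.
docs = ({"id": i, "text": "word " * (i % 5 + 1)} for i in range(1000))

sample = reservoir_sample(docs, k=10, seed=42)
total_tokens = sum(count_tokens(d["text"]) for d in sample)
print(len(sample), total_tokens)
```

With a fixed capacity `k` and a full pass over a stream of `n` documents, each document ends up in the sample with probability `k / n`, which is the equal-probability property listed above. That guarantee relies on reading the whole stream, so a run that stops early would need a different argument.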
## Usage
```python
from datasets import load_dataset
# Load the dataset
dataset = load_dataset("codelion/finewiki-10M")
# Access the training data
for example in dataset['train']:
    print(example['text'])
    print(example['title'])
    print(example['url'])
```
## Dataset Structure
Each example contains all fields from the original FineWiki dataset:
- **text** (string): The Wikipedia article text (primary content)
- **id** (string): Unique identifier
- **wikiname** (string): Wikipedia source name
- **page_id** (int64): Wikipedia page ID
- **title** (string): Article title
- **url** (string): Source Wikipedia URL
- **date_modified** (string): Last modification date
- **in_language** (string): Language code (always 'en' for this subset)
- **wikidata_id** (string): Wikidata identifier
- **bytes_html** (int64): Size of HTML content
- **wikitext** (string): Original wikitext markup
- **version** (int64): Article version number
- **infoboxes** (string): Extracted infobox data
- **has_math** (bool): Whether article contains mathematical formulas
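The field list above can be written down as a typed record, which is handy when filtering examples in plain Python. This is a sketch only: `FineWikiRecord`, `make_record`, and `math_titles` are hypothetical names, the type mapping simply mirrors the list (`string` to `str`, `int64` to `int`, `bool` to `bool`), and every value in the sample records is a made-up placeholder rather than real dataset content.

```python
from typing import Iterable, List, TypedDict


class FineWikiRecord(TypedDict):
    """One example, with field names and types taken from the list above."""
    text: str
    id: str
    wikiname: str
    page_id: int
    title: str
    url: str
    date_modified: str
    in_language: str
    wikidata_id: str
    bytes_html: int
    wikitext: str
    version: int
    infoboxes: str
    has_math: bool


def make_record(title: str, has_math: bool) -> FineWikiRecord:
    """Build a placeholder record; all values here are invented for illustration."""
    return FineWikiRecord(
        text="placeholder text", id="placeholder-id", wikiname="placeholder",
        page_id=0, title=title, url="placeholder-url", date_modified="",
        in_language="en", wikidata_id="", bytes_html=0, wikitext="",
        version=0, infoboxes="", has_math=has_math,
    )


def math_titles(records: Iterable[FineWikiRecord]) -> List[str]:
    """Return the titles of records flagged as containing mathematical formulas."""
    return [r["title"] for r in records if r["has_math"]]


records = [make_record("Example A", False), make_record("Example B", True)]
print(math_titles(records))
```

The same predicate (`r["has_math"]`) can be used when iterating over `dataset['train']` from the Usage section, since each example is a dictionary with these keys.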
## Use Cases
This sampled dataset is ideal for:
- 🔬 Small-scale language model pretraining experiments
- 📊 Dataset composition studies
- ⚡ Quick prototyping and testing
- 💰 Low-cost training runs
## Citation
If you use this dataset, please cite:
```bibtex
@article{sharma2025billion,
  title={The 1 Billion Token Challenge: Finding the Perfect Pre-training Mix},
  author={Sharma, Asankhaya},
  year={2025},
  url={https://huggingface.co/blog/codelion/optimal-dataset-mixing/}
}
```
For more details, see the [blog post](https://huggingface.co/blog/codelion/optimal-dataset-mixing/).
## License
Apache 2.0 (same as original FineWiki dataset)
## Dataset Card Authors
CodeLion
## Dataset Card Contact
For questions or issues, please open an issue on the dataset repository.
Provider:
maas
Created:
2025-11-02



