finewiki-1B
ModelScope community, updated 2025-12-05, indexed 2025-12-06
Download link:
https://modelscope.cn/datasets/codelion/finewiki-1B
Description:
# FineWiki Sampled Dataset (1,000,000,332 tokens)
This is a sampled subset of [HuggingFaceFW/finewiki](https://huggingface.co/datasets/HuggingFaceFW/finewiki) containing approximately 1 billion tokens (**1,000,000,332 tokens** as counted by the GPT-2 tokenizer).
## Dataset Details
### Source
- **Original Dataset**: HuggingFaceFW/finewiki (English subset, train split)
- **Sampling Method**: Reservoir sampling (unbiased random sampling)
- **Target Token Count**: 1,000,000,332 tokens
- **Tokenizer**: GPT-2 (50,257 vocabulary)
### Sampling Statistics
- **Documents Sampled**: 52,721
- **Average Tokens/Doc**: 18971.0
- **Random Seed**: 42
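As a quick sanity check on these figures, the average can be recomputed from the target token count and the document count. The snippet is only a sketch, and the variable names are introduced here for illustration. The recomputed value lands within a small fraction of a percent of the reported average; the card does not say why the two differ slightly.

```python
total_tokens = 1_000_000_332   # target token count reported above
documents = 52_721             # documents sampled, as reported above
reported_avg = 18_971.0        # average tokens/doc, as reported above

# Average tokens per document implied by the two counts.
recomputed_avg = total_tokens / documents

# Relative difference between the implied and the reported average.
relative_gap = abs(recomputed_avg - reported_avg) / reported_avg

print(f"recomputed average: {recomputed_avg:.1f}")
print(f"relative gap vs. reported: {relative_gap:.4%}")
```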
### Sampling Method
This dataset was created using **reservoir sampling**, which ensures:
- ✅ Unbiased random sample from the full dataset
- ✅ Every document has equal probability of being selected
- ✅ No distribution bias (early/late documents equally represented)
- ✅ Streaming-based (no need to download full dataset)
The sampling algorithm:
1. Streams through HuggingFaceFW/finewiki without downloading the full dataset first
2. Uses GPT-2 tokenizer to count tokens per document
3. Maintains a reservoir of documents using standard reservoir sampling
4. Stops when target token count is reached
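Step 3 refers to standard reservoir sampling. A minimal sketch of that technique (Algorithm R) for a fixed reservoir size `k` follows. It is an illustration under assumptions, not the script used to build this dataset: the function name `reservoir_sample`, the fixed `k`, and the toy stream are all hypothetical, and the card does not describe how the token budget in step 4 interacts with the reservoir.

```python
import random
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int, seed: int = 42) -> List[T]:
    """Return up to k items drawn uniformly at random from a stream of unknown length.

    Classic Algorithm R; a hypothetical helper, not the dataset author's code.
    """
    rng = random.Random(seed)  # seed 42 mirrors the random seed reported in the card
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i (0-based) replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir


# Toy stream standing in for streamed documents.
docs = (f"doc-{n}" for n in range(1000))
sample = reservoir_sample(docs, k=10)
print(len(sample))
```

In this sketch the equal-probability property applies to the items the loop has actually consumed, and memory use stays proportional to `k` rather than to the length of the stream.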
## Usage
```python
from datasets import load_dataset

# Load the dataset
dataset = load_dataset("codelion/finewiki-1B")

# Access the training data
for example in dataset['train']:
    print(example['text'])
    print(example['title'])
    print(example['url'])
```
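Printing every article in the training split produces a large amount of output. As a sketch, the helper below previews a few records from any iterable of examples. The name `preview` and its parameters are hypothetical, introduced here for illustration, and the sample records are made up.

```python
from itertools import islice
from typing import Any, Dict, Iterable, List, Sequence


def preview(
    examples: Iterable[Dict[str, Any]],
    n: int = 3,
    fields: Sequence[str] = ("title", "url"),
    max_chars: int = 80,
) -> List[Dict[str, str]]:
    """Collect the first n examples, keeping only `fields` and truncating long values."""
    rows = []
    for example in islice(examples, n):
        rows.append({name: str(example[name])[:max_chars] for name in fields})
    return rows


# Made-up records standing in for dataset rows.
fake_rows = [
    {"title": "Example A", "url": "https://example.org/a", "text": "..."},
    {"title": "Example B", "url": "https://example.org/b", "text": "..."},
]
for row in preview(fake_rows, n=1):
    print(row)
```

With the real dataset, the same helper can be applied to `dataset['train']`, or to a split loaded with `load_dataset("codelion/finewiki-1B", split="train", streaming=True)` so that records are read lazily instead of being downloaded up front.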
## Dataset Structure
Each example contains all fields from the original FineWiki dataset:
- **text** (string): The Wikipedia article text (primary content)
- **id** (string): Unique identifier
- **wikiname** (string): Wikipedia source name
- **page_id** (int64): Wikipedia page ID
- **title** (string): Article title
- **url** (string): Source Wikipedia URL
- **date_modified** (string): Last modification date
- **in_language** (string): Language code (always 'en' for this subset)
- **wikidata_id** (string): Wikidata identifier
- **bytes_html** (int64): Size of HTML content
- **wikitext** (string): Original wikitext markup
- **version** (int64): Article version number
- **infoboxes** (string): Extracted infobox data
- **has_math** (bool): Whether article contains mathematical formulas
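The field list above can be mirrored as a typed record. The sketch below is an illustration under assumptions: the class name `FineWikiRecord`, the helper `math_titles`, and the sample row are hypothetical, and the Python types are inferred from the type labels in the list.

```python
from typing import Iterable, List, TypedDict


class FineWikiRecord(TypedDict):
    """Hypothetical typed view of one example, following the field list above."""

    text: str
    id: str
    wikiname: str
    page_id: int
    title: str
    url: str
    date_modified: str
    in_language: str
    wikidata_id: str
    bytes_html: int
    wikitext: str
    version: int
    infoboxes: str
    has_math: bool


def math_titles(records: Iterable[FineWikiRecord]) -> List[str]:
    """Return the titles of records whose has_math flag is set."""
    return [r["title"] for r in records if r["has_math"]]


# Made-up row for illustration only; values do not come from the dataset.
sample_row: FineWikiRecord = {
    "text": "Example text", "id": "example-id", "wikiname": "examplewiki",
    "page_id": 1, "title": "Example", "url": "https://example.org/wiki/Example",
    "date_modified": "2025-01-01", "in_language": "en", "wikidata_id": "Q0",
    "bytes_html": 1234, "wikitext": "Example wikitext", "version": 1,
    "infoboxes": "[]", "has_math": True,
}
print(math_titles([sample_row]))
```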
## Use Cases
This sampled dataset is ideal for:
- 🔬 Small-scale language model pretraining experiments
- 📊 Dataset composition studies
- ⚡ Quick prototyping and testing
- 💰 Low-cost training runs
## Citation
If you use this dataset, please cite:
```bibtex
@article{sharma2025billion,
  title={The 1 Billion Token Challenge: Finding the Perfect Pre-training Mix},
  author={Sharma, Asankhaya},
  year={2025},
  url={https://huggingface.co/blog/codelion/optimal-dataset-mixing/}
}
```
For more details, see the [blog post](https://huggingface.co/blog/codelion/optimal-dataset-mixing/).
## License
Apache 2.0 (same as original FineWiki dataset)
## Dataset Card Authors
CodeLion
## Dataset Card Contact
For questions or issues, please open an issue on the dataset repository.
Provider:
maas
Created:
2025-11-03



