finewiki-10M

Name: finewiki-10M
Creator: maas
Published: 2025-12-05 16:55:51
License: No description provided

ModelScope community. Updated: 2025-12-05. Indexed: 2025-11-08.

Download link:

https://modelscope.cn/datasets/codelion/finewiki-10M


Resource description:

# FineWiki Sampled Dataset (10,000,000 tokens)

This is a sampled subset of [HuggingFaceFW/finewiki](https://huggingface.co/datasets/HuggingFaceFW/finewiki) containing approximately **10,000,000 tokens**.

## Dataset Details

### Source

- **Original Dataset**: HuggingFaceFW/finewiki (English subset, train split)
- **Sampling Method**: Reservoir sampling (unbiased random sampling)
- **Target Token Count**: 10,000,000 tokens
- **Tokenizer**: GPT-2 (50,257 vocabulary)

### Sampling Statistics

- **Documents Sampled**: 7,088
- **Average Tokens/Doc**: 1411.0
- **Random Seed**: 42

### Sampling Method

This dataset was created using **reservoir sampling**, which ensures:

- ✅ Unbiased random sample from the full dataset
- ✅ Every document has equal probability of being selected
- ✅ No distribution bias (early/late documents equally represented)
- ✅ Streaming-based (no need to download full dataset)

The sampling algorithm:

1. Streams through HuggingFaceFW/finewiki without downloading
2. Uses GPT-2 tokenizer to count tokens per document
3. Maintains a reservoir of documents using standard reservoir sampling
4. Stops when target token count is reached

## Usage

```python
from datasets import load_dataset

# Load the dataset
dataset = load_dataset("codelion/finewiki-10M")

# Access the training data
for example in dataset['train']:
    print(example['text'])
    print(example['title'])
    print(example['url'])
```

## Dataset Structure

Each example contains all fields from the original FineWiki dataset:

- **text** (string): The Wikipedia article text (primary content)
- **id** (string): Unique identifier
- **wikiname** (string): Wikipedia source name
- **page_id** (int64): Wikipedia page ID
- **title** (string): Article title
- **url** (string): Source Wikipedia URL
- **date_modified** (string): Last modification date
- **in_language** (string): Language code (always 'en' for this subset)
- **wikidata_id** (string): Wikidata identifier
- **bytes_html** (int64): Size of HTML content
- **wikitext** (string): Original wikitext markup
- **version** (int64): Article version number
- **infoboxes** (string): Extracted infobox data
- **has_math** (bool): Whether article contains mathematical formulas

## Use Cases

This sampled dataset is ideal for:

- 🔬 Small-scale language model pretraining experiments
- 📊 Dataset composition studies
- ⚡ Quick prototyping and testing
- 💰 Low-cost training runs

## Citation

If you use this model/dataset, please cite:

```bibtex
@article{sharma2025billion,
  title={The 1 Billion Token Challenge: Finding the Perfect Pre-training Mix},
  author={Sharma, Asankhaya},
  year={2025},
  url={https://huggingface.co/blog/codelion/optimal-dataset-mixing/}
}
```

For more details, see the [blog post](https://huggingface.co/blog/codelion/optimal-dataset-mixing/).

## License

Apache 2.0 (same as original FineWiki dataset)

## Dataset Card Authors

CodeLion

## Dataset Card Contact

For questions or issues, please open an issue on the dataset repository.
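
## Sampling Sketch (illustrative)

The card names "standard reservoir sampling" but does not include the sampling script. As a rough illustration of that technique only, the sketch below implements the textbook fixed-size variant (Algorithm R) over an arbitrary stream. The function name `reservoir_sample`, the parameter `k`, and the synthetic `docs` stream are assumptions made for this example and are not taken from the dataset's actual pipeline. How the authors combine the reservoir with the 10,000,000-token stopping rule is not specified in the card, so it is not modeled here.

```python
import random
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int, seed: int = 42) -> List[T]:
    """Pick k items uniformly at random from a stream of unknown length.

    Hypothetical helper (Algorithm R); not the dataset authors' code.
    """
    rng = random.Random(seed)  # seed 42 mirrors the value listed in the card
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Replace a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


# Synthetic stand-in for a streamed dataset (assumption: dicts with an "id" key).
docs = ({"id": n, "text": f"article {n}"} for n in range(100_000))
sample = reservoir_sample(docs, k=5)
print([d["id"] for d in sample])
```

Because the stream is consumed one item at a time and only `k` items are held in memory, the same pattern would apply to a dataset read in streaming mode rather than downloaded in full.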


Provider: maas

Created: 2025-11-02
