
finewiki-10M

ModelScope community. Updated: 2025-12-05. Indexed: 2025-11-08.
Download link:
https://modelscope.cn/datasets/codelion/finewiki-10M
Resource description:
# FineWiki Sampled Dataset (10,000,000 tokens)

This is a sampled subset of [HuggingFaceFW/finewiki](https://huggingface.co/datasets/HuggingFaceFW/finewiki) containing approximately **10,000,000 tokens**.

## Dataset Details

### Source

- **Original Dataset**: HuggingFaceFW/finewiki (English subset, train split)
- **Sampling Method**: Reservoir sampling (unbiased random sampling)
- **Target Token Count**: 10,000,000 tokens
- **Tokenizer**: GPT-2 (50,257 vocabulary)

### Sampling Statistics

- **Documents Sampled**: 7,088
- **Average Tokens/Doc**: 1411.0
- **Random Seed**: 42

### Sampling Method

This dataset was created using **reservoir sampling**, which ensures:

- ✅ Unbiased random sample from the full dataset
- ✅ Every document has equal probability of being selected
- ✅ No distribution bias (early/late documents equally represented)
- ✅ Streaming-based (no need to download full dataset)

The sampling algorithm:

1. Streams through HuggingFaceFW/finewiki without downloading
2. Uses GPT-2 tokenizer to count tokens per document
3. Maintains a reservoir of documents using standard reservoir sampling
4. Stops when target token count is reached

## Usage

```python
from datasets import load_dataset

# Load the dataset
dataset = load_dataset("codelion/finewiki-10M")

# Access the training data
for example in dataset['train']:
    print(example['text'])
    print(example['title'])
    print(example['url'])
```

## Dataset Structure

Each example contains all fields from the original FineWiki dataset:

- **text** (string): The Wikipedia article text (primary content)
- **id** (string): Unique identifier
- **wikiname** (string): Wikipedia source name
- **page_id** (int64): Wikipedia page ID
- **title** (string): Article title
- **url** (string): Source Wikipedia URL
- **date_modified** (string): Last modification date
- **in_language** (string): Language code (always 'en' for this subset)
- **wikidata_id** (string): Wikidata identifier
- **bytes_html** (int64): Size of HTML content
- **wikitext** (string): Original wikitext markup
- **version** (int64): Article version number
- **infoboxes** (string): Extracted infobox data
- **has_math** (bool): Whether article contains mathematical formulas

## Use Cases

This sampled dataset is ideal for:

- 🔬 Small-scale language model pretraining experiments
- 📊 Dataset composition studies
- ⚡ Quick prototyping and testing
- 💰 Low-cost training runs

## Citation

If you use this model/dataset, please cite:

```bibtex
@article{sharma2025billion,
  title={The 1 Billion Token Challenge: Finding the Perfect Pre-training Mix},
  author={Sharma, Asankhaya},
  year={2025},
  url={https://huggingface.co/blog/codelion/optimal-dataset-mixing/}
}
```

For more details, see the [blog post](https://huggingface.co/blog/codelion/optimal-dataset-mixing/).

## License

Apache 2.0 (same as original FineWiki dataset)

## Dataset Card Authors

CodeLion

## Dataset Card Contact

For questions or issues, please open an issue on the dataset repository.
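## Appendix: Reservoir Sampling Sketch

As a rough illustration of the reservoir sampling named under Sampling Method, here is a minimal sketch of the textbook fixed-size variant (often called Algorithm R). It is not the script used to build this dataset: the card does not say how the reservoir relates to the 10,000,000-token target, so the function name `reservoir_sample`, the fixed reservoir size `k`, and the toy documents with whitespace token counts (standing in for the GPT-2 tokenizer) are all assumptions made for illustration.

```python
import random
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int, seed: int = 42) -> List[T]:
    """Draw k items uniformly at random from a stream of unknown length.

    Hypothetical helper: textbook Algorithm R with a fixed reservoir size,
    not the dataset author's actual implementation.
    """
    rng = random.Random(seed)  # local RNG so the result is reproducible
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Pick a slot in [0, i]; the new item replaces an old one
            # with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir


# Toy stream of documents (assumed shape: dicts with "id" and "text").
docs = [{"id": str(i), "text": "word " * (i % 5 + 1)} for i in range(1000)]
sample = reservoir_sample(docs, k=10)

# Whitespace splitting is a stand-in for a real tokenizer here.
total_tokens = sum(len(d["text"].split()) for d in sample)
print(len(sample), total_tokens)
```

With this variant, each item seen so far stays in the reservoir with probability k/n after n items, without knowing n in advance. That guarantee holds only if the whole stream is consumed; a token-budgeted variant such as the one the card describes would need an additional rule for sizing or trimming the reservoir.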

Provider: maas
Created: 2025-11-02