pengxiang/nap-parallel-packing-demo
Source: Hugging Face. Updated 2026-04-08; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/pengxiang/nap-parallel-packing-demo
Description:
---
license: apache-2.0
task_categories:
- text-generation
language:
- en
tags:
- parallel-packing
- pretraining
- fineweb
size_categories:
- 1K<n<10K
---
# NAP Parallel Packing Demo
Parallel-packed pretraining data built from [FineWeb sample-10BT](https://huggingface.co/datasets/HuggingFaceFW/fineweb).
**Core idea**: blocks within each sample are semantically related but not duplicates; block order is shuffled to break privileged sequential ordering.
## Format
Each line in `train.jsonl` is a JSON object:
```json
{
"text": "<blk>block 1 text</blk><blk>block 2 text</blk><blk>block 3 text</blk>",
"blocks": ["block 1 text", "block 2 text", "block 3 text"],
"metadata": {
"source_para_ids": [123, 456, 789],
"num_blocks": 3,
"total_tokens": 1200,
"type": "parallel"
}
}
```
Two sample types:
- **parallel** (50%): 3 semantically related paragraphs from different documents, shuffled
- **sequential** (50%): single document text preserving natural order
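As a minimal sketch of how a record in this format could be produced and read back, the snippet below shuffles a list of blocks, wraps each one in `<blk>` tags, and then recovers the blocks from the `text` field. The helper names (`assemble_sample`, `split_blocks`, `is_consistent`, `iter_records`) are hypothetical and not part of the dataset's tooling. It is also an assumption that `source_para_ids` follows the shuffled block order. `total_tokens` is left out because the card does not name the tokenizer used to count tokens.

```python
import json
import random
import re

# Non-greedy match so each <blk>...</blk> pair yields one block.
BLK_PATTERN = re.compile(r"<blk>(.*?)</blk>", re.DOTALL)


def assemble_sample(blocks, para_ids, rng):
    """Shuffle blocks together with their paragraph IDs and wrap them in <blk> tags."""
    order = list(range(len(blocks)))
    rng.shuffle(order)
    shuffled = [blocks[i] for i in order]
    return {
        "text": "".join(f"<blk>{b}</blk>" for b in shuffled),
        "blocks": shuffled,
        "metadata": {
            # Assumption: IDs are stored in the same order as the shuffled blocks.
            "source_para_ids": [para_ids[i] for i in order],
            "num_blocks": len(shuffled),
            "type": "parallel",
        },
    }


def split_blocks(text):
    """Recover the list of blocks from a tagged `text` field."""
    return BLK_PATTERN.findall(text)


def is_consistent(record):
    """Check that `text`, `blocks`, and `metadata.num_blocks` agree."""
    blocks = record["blocks"]
    return (
        split_blocks(record["text"]) == blocks
        and record["metadata"]["num_blocks"] == len(blocks)
    )


def iter_records(path):
    """Yield one parsed JSON object per non-empty line of a JSONL file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)


sample = assemble_sample(
    ["block 1 text", "block 2 text", "block 3 text"],
    [123, 456, 789],
    random.Random(0),  # fixed seed so the shuffle is reproducible
)
restored = json.loads(json.dumps(sample))  # round trip through one JSONL line
print(is_consistent(restored))
```

With a local copy of the data, `iter_records("train.jsonl")` could be combined with `is_consistent` to sanity-check every line. Whether sequential samples also carry `<blk>` tags is not stated in the card, so the check may need relaxing for them.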
## Pipeline
1. Paragraph splitting from FineWeb (quality-filtered)
2. Embedding with Qwen3-Embedding-0.6B (Matryoshka 1024d + 256d)
3. FAISS IVF-PQ index on 256d vectors
4. kNN retrieval + cosine dedup + MMR diversity selection
5. Block assembly with `<blk>` tags + length/balance filtering
6. 50/50 mixing with sequential data
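Step 4 can be illustrated with a small, dependency-free sketch of cosine dedup followed by MMR (maximal marginal relevance) selection. This is not the author's implementation: the real pipeline retrieves neighbours through a FAISS index, whereas this sketch scores a short candidate list by brute force. The function name `mmr_select` and the parameters `k`, `lam`, and `dedup_threshold`, including their default values, are assumptions chosen for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mmr_select(query, candidates, k=2, lam=0.5, dedup_threshold=0.95):
    """Return indices of up to k candidates chosen by MMR.

    Candidates whose similarity to the query is at or above `dedup_threshold`
    are treated as near-duplicates and dropped first (hypothetical threshold).
    """
    relevance = [cosine(query, c) for c in candidates]
    pool = [i for i, r in enumerate(relevance) if r < dedup_threshold]
    selected = []
    while pool and len(selected) < k:

        def mmr_score(i):
            # Penalise similarity to anything already selected.
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            return lam * relevance[i] - (1.0 - lam) * redundancy

        best = max(pool, key=mmr_score)
        selected.append(best)
        pool.remove(best)
    return selected


query = [1.0, 0.0]
candidates = [
    [1.0, 0.0],    # near-duplicate of the query, removed by dedup
    [0.8, 0.6],    # related
    [0.79, 0.61],  # related, but almost identical to the previous candidate
    [0.6, -0.8],   # related, and dissimilar to the other candidates
]
picked = mmr_select(query, candidates, k=2)
print(picked)
```

In this toy example the second pick skips the candidate that nearly repeats the first one, which is the behaviour the "semantically related but not duplicates" goal calls for.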
## Stats
- Source: 5,000 FineWeb documents
- Paragraphs: 4,677
- Parallel samples: 821
- Total samples: 1,642
- Block-pair cosine similarity: ~0.33 mean
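
The card does not say how the block-pair similarity figure was aggregated. One plausible reading, sketched below under that assumption, is the mean cosine similarity over all unordered pairs of block embeddings within a sample. The function name `mean_pairwise_cosine` and the toy vectors are hypothetical; real inputs would be the block embeddings from the pipeline.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mean_pairwise_cosine(embeddings):
    """Average cosine similarity over all unordered pairs of vectors."""
    pairs = list(combinations(embeddings, 2))
    if not pairs:
        return 0.0
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)


# Toy 2-d vectors standing in for the embeddings of three blocks.
toy_blocks = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(round(mean_pairwise_cosine(toy_blocks), 4))
```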
Provider: pengxiang