
aaa23123/eagle-data-curation

Source: Hugging Face. Updated 2026-03-25; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/aaa23123/eagle-data-curation
Resource description:
# Merged Dataset Standard Filtered

This folder contains the final training-ready dataset produced by the current `standard` filtering pipeline.

## Files

- `merged_dataset.filtered.standard.back.jsonl`: final filtered dataset, schema-consistent with the raw input

## Filtering Strategy

The current pipeline uses the `standard` strategy defined in:

- `/home/dhz/eagle-data-curation/configs/process-open-perfectblend.standard.yaml`

Applied operators and parameters:

```yaml
process:
  - text_length_filter:
      min_len: 20
      max_len: 24000
  - alphanumeric_filter:
      tokenization: false
      min_ratio: 0.02
  - character_repetition_filter:
      rep_len: 10
      max_ratio: 0.6
  - document_deduplicator:
      lowercase: true
      ignore_non_character: true
  - document_simhash_deduplicator:
      tokenization: space
      window_size: 6
      lowercase: true
      ignore_pattern: '\\p{P}'
      num_blocks: 10
      hamming_distance: 3
```

## Data Integrity

The final output keeps the same schema as the raw dataset.

Top-level fields:

- `id`
- `conversations`
- `reasoning_effort`
- `status`

Conversation message fields:

- user messages: `role`, `content`
- assistant messages: `role`, `content`, `thinking`

Validation result on the full output:

- top-level schema mismatches: `0`
- user message schema mismatches: `0`
- assistant message schema mismatches: `0`
- assistant messages missing `thinking`: `0`
- empty conversations: `0`

## Counts

- raw samples: `1,411,259`
- kept samples: `1,326,396`
- dropped samples: `84,863`
- keep ratio: `93.9867%`

## Generation Commands

```bash
conda activate data-juicer
cd /home/dhz/eagle-data-curation
python scripts/prepare_perfectblend.py
python scripts/run_dj_filter.py --config configs/process-open-perfectblend.standard.yaml
```

The second command runs `dj-process` and then automatically restores the filtered output into the final schema-consistent training file.
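
The Data Integrity and Counts figures above could be re-checked with a short script along the following lines. This is a minimal sketch, not the validation code used by the pipeline: the function names `validate_records` and `keep_ratio` are hypothetical, the role strings `"user"` and `"assistant"` are assumed, and a "mismatch" is assumed to mean that a record's key set differs from the documented field list.

```python
import json
from collections import Counter

# Field lists taken from the Data Integrity section.
TOP_LEVEL_FIELDS = {"id", "conversations", "reasoning_effort", "status"}
USER_FIELDS = {"role", "content"}
ASSISTANT_FIELDS = {"role", "content", "thinking"}


def validate_records(lines):
    """Count schema mismatches over an iterable of JSONL lines.

    Assumption: roles are spelled "user" and "assistant"; messages with
    any other role are ignored.
    """
    stats = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        stats["samples"] += 1
        if set(record) != TOP_LEVEL_FIELDS:
            stats["top_level_mismatches"] += 1
        conversations = record.get("conversations") or []
        if not conversations:
            stats["empty_conversations"] += 1
        for message in conversations:
            role = message.get("role")
            if role == "user":
                if set(message) != USER_FIELDS:
                    stats["user_mismatches"] += 1
            elif role == "assistant":
                if "thinking" not in message:
                    stats["assistant_missing_thinking"] += 1
                if set(message) != ASSISTANT_FIELDS:
                    stats["assistant_mismatches"] += 1
    return stats


def keep_ratio(kept, raw):
    """Return the share of kept samples as a percentage."""
    return 100.0 * kept / raw
```

To run the check on the real file, pass an open handle, for example `validate_records(open("merged_dataset.filtered.standard.back.jsonl", encoding="utf-8"))`. The function reads line by line, so the full file does not need to fit in memory. Calling `keep_ratio(1326396, 1411259)` reproduces the reported keep ratio when rounded to four decimal places.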
Provider:
aaa23123