
Luxel/hua-qwen35-tinker-training-data

Source: Hugging Face. Updated 2026-04-28; indexed 2026-05-03.
Download link:
https://hf-mirror.com/datasets/Luxel/hua-qwen35-tinker-training-data
Resource description:
---
tags:
- eval-awareness
- qwen
- tinker
- synthetic-data
- fine-tuning
pretty_name: Hua Qwen3.5 Tinker Training Data
size_categories:
- 100K<n<1M
---

# Hua Qwen3.5 Tinker Training Data

This dataset contains the Qwen3.5 identity-rewritten, Tinker-rendered training inputs prepared in the `Non-verbal-Eval-Awareness` project for a Hua-style model-organism reproduction.

It mirrors the local generated files from `pipelines/training/data/`:

- `data/derived/sdf_all_raw_qwen.jsonl`: all 311,083 normalized synthetic-document fine-tuning rows, rewritten from Llama/Nemotron identity artifacts to Qwen identity artifacts.
- `data/derived/ei_replay_full_trace_qwen.jsonl`: 41,290 paper-replay expert-iteration full-trace rows from the four accepted EI files.
- `data/analysis/token_counts_qwen35_tinker_qwen_rewrite.json`: exact Qwen3/Tinker rendered token counts.
- `data/manifests/qwen35_identity_rewrite_summary.json`: rewrite provenance and validation summary.
- audit files under `data/analysis/` documenting before/after identity terms and remaining bare `Llama`/`LLaMA` contexts.

## Counts

- SDF rows: 311,083
- EI rows: 41,290
- Total rows: 352,373
- Total rendered train tokens: 230,278,727
- Total loss tokens: 222,091,269
- Estimated Tinker cost at $0.67/M rendered train tokens: $154.29

## Download

From the project repo:

```bash
uv run python pipelines/training/scripts/download_hua_qwen_data.py --core-only
```

Or with `huggingface_hub` directly:

```python
from huggingface_hub import hf_hub_download

hf_hub_download(
    repo_id="Luxel/hua-qwen35-tinker-training-data",
    repo_type="dataset",
    filename="data/derived/sdf_all_raw_qwen.jsonl",
)
```

## Provenance

This is a derived research artifact based on public Hua et al. training-data releases:

- `timhua/evalwood_sdf_1stpart`
- `timhua/second_half_training`
- `timhua/expert_iter_2`

The upstream Hugging Face dataset cards did not declare a license at upload time. This repository therefore does not assert a new license over upstream-derived content; use should respect the upstream source terms and research context.

## Integrity

See `checksums.sha256` and `manifest.json` for file hashes and byte sizes.
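
The integrity check can be sketched as follows. This is a minimal illustration, not the project's own tooling: it assumes `checksums.sha256` uses the usual `sha256sum` layout (one `<hex digest>  <relative path>` entry per line), which the card does not confirm, and the helper names `sha256_of` and `verify_checksums` are hypothetical.

```python
from __future__ import annotations

import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large JSONL files are not loaded into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksums(checksum_file: Path, root: Path) -> dict[str, bool]:
    """Return {relative_path: matches} for every entry in the checksum file.

    ASSUMPTION: each non-empty line is "<hex digest>  <relative path>",
    the layout written by `sha256sum`. Adjust the parsing if the real
    file is laid out differently.
    """
    results: dict[str, bool] = {}
    for line in checksum_file.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        expected, name = line.split(maxsplit=1)
        name = name.lstrip("*").strip()  # sha256sum marks binary mode with '*'
        target = root / name
        # A missing file counts as a failed check rather than raising.
        results[name] = target.is_file() and sha256_of(target) == expected.lower()
    return results
```

Calling `verify_checksums(root / "checksums.sha256", root)` on a local copy of the repository, where `root` is the download directory, would flag any file whose hash differs or that is missing. Files fetched with `--core-only` may be a subset, so entries for files that were not downloaded would be reported as `False`.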
Provider:
Luxel