Luxel/hua-qwen35-tinker-training-data
Hugging Face | Updated 2026-04-28 | Indexed 2026-05-03
Download link:
https://hf-mirror.com/datasets/Luxel/hua-qwen35-tinker-training-data
Resource description:
---
tags:
- eval-awareness
- qwen
- tinker
- synthetic-data
- fine-tuning
pretty_name: Hua Qwen3.5 Tinker Training Data
size_categories:
- 100K<n<1M
---
# Hua Qwen3.5 Tinker Training Data
This dataset contains the Qwen3.5 identity-rewritten, Tinker-rendered training inputs prepared in the `Non-verbal-Eval-Awareness` project for a Hua-style model-organism reproduction.
It mirrors the local generated files from `pipelines/training/data/`:
- `data/derived/sdf_all_raw_qwen.jsonl`: all 311,083 normalized synthetic-document fine-tuning rows, rewritten from Llama/Nemotron identity artifacts to Qwen identity artifacts.
- `data/derived/ei_replay_full_trace_qwen.jsonl`: 41,290 paper-replay expert-iteration full-trace rows from the four accepted EI files.
- `data/analysis/token_counts_qwen35_tinker_qwen_rewrite.json`: exact Qwen3/Tinker rendered token counts.
- `data/manifests/qwen35_identity_rewrite_summary.json`: rewrite provenance and validation summary.
- audit files under `data/analysis/` documenting before/after identity terms and remaining bare `Llama`/`LLaMA` contexts.
## Counts
- SDF rows: 311,083
- EI rows: 41,290
- Total rows: 352,373
- Total rendered train tokens: 230,278,727
- Total loss tokens: 222,091,269
- Estimated Tinker cost at $0.67/M rendered train tokens: $154.29
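The totals above follow directly from the per-file figures. As a quick sanity check, the arithmetic can be reproduced with the numbers copied from this card. The variable names are illustrative, and the loss-token share is a derived figure that the card itself does not state:

```python
# Figures copied from the Counts section of this card.
sdf_rows = 311_083
ei_rows = 41_290
rendered_train_tokens = 230_278_727
loss_tokens = 222_091_269
price_per_million_usd = 0.67  # rate quoted in the card

total_rows = sdf_rows + ei_rows
estimated_cost_usd = rendered_train_tokens / 1_000_000 * price_per_million_usd
# Derived here for illustration: share of rendered tokens that carry loss.
loss_fraction = loss_tokens / rendered_train_tokens

print(total_rows)                     # 352373
print(f"${estimated_cost_usd:.2f}")   # $154.29
print(f"{loss_fraction:.1%}")         # 96.4%
```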
## Download
From the project repo:
```bash
uv run python pipelines/training/scripts/download_hua_qwen_data.py --core-only
```
Or with `huggingface_hub` directly:
```python
from huggingface_hub import hf_hub_download

# Download one file from the dataset repo; returns the local cache path.
path = hf_hub_download(
    repo_id="Luxel/hua-qwen35-tinker-training-data",
    repo_type="dataset",
    filename="data/derived/sdf_all_raw_qwen.jsonl",
)
print(path)
```
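Once a file is downloaded, its row count can be compared against the Counts section. The following is a minimal sketch, not part of the project repo: `summarize_jsonl` is a hypothetical helper, and because this card does not document the row schema, it only counts rows and tallies whatever top-level keys it finds.

```python
import json
from collections import Counter
from pathlib import Path


def summarize_jsonl(path, max_rows=None):
    """Count rows and tally top-level keys, streaming line by line."""
    n_rows = 0
    key_counts = Counter()
    with Path(path).open("r", encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            if max_rows is not None and n_rows >= max_rows:
                break
            row = json.loads(line)
            n_rows += 1
            if isinstance(row, dict):
                key_counts.update(row.keys())
    return n_rows, key_counts
```

Calling `summarize_jsonl(path)` on the full SDF file should report 311,083 rows if the download matches this card; passing `max_rows=1000` gives a quick look at the keys without reading the whole file.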
## Provenance
This is a derived research artifact based on public Hua et al. training-data releases:
- `timhua/evalwood_sdf_1stpart`
- `timhua/second_half_training`
- `timhua/expert_iter_2`
The upstream Hugging Face dataset cards did not declare a license at upload time. This repository therefore does not assert a new license over upstream-derived content; use should respect the upstream source terms and research context.
## Integrity
See `checksums.sha256` and `manifest.json` for file hashes and byte sizes.
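A downloaded copy can be checked against the published hashes. The sketch below assumes that `checksums.sha256` uses the usual `sha256sum` layout (one `<hex digest>  <relative path>` pair per line); this card does not specify the format, so adjust the parsing if the file differs. `sha256_of` and `verify_checksums` are hypothetical helper names.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large JSONL files are not read into memory."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksums(checksum_file, root):
    """Return {relative_path: bool} for each entry in the checksum file.

    Assumes sha256sum-style lines: '<hex digest>  <relative path>'.
    """
    results = {}
    for line in Path(checksum_file).read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        expected, rel_path = line.split(maxsplit=1)
        rel_path = rel_path.lstrip("*").strip()  # drop binary-mode marker
        target = Path(root) / rel_path
        results[rel_path] = target.is_file() and sha256_of(target) == expected.lower()
    return results
```

Here `root` is the directory that the relative paths in the checksum file resolve against, for example the local snapshot directory of the dataset. Entries whose file is missing are reported as `False` rather than raising.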
Provider:
Luxel



