
TNSA/PT-HF500B

Hugging Face | Updated 2026-03-21 | Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/TNSA/PT-HF500B
Resource description:
---
language:
- en
license: odc-by
tags:
- synthetic-data
- instruction-tuning
- large-scale
- TNSA
- NGen
annotations_creators:
- machine-generated
language_creators:
- found
pretty_name: FinePhrase Synthetic Corpus
size_categories:
- n>1M
source_datasets:
- fineweb-edu (sample-350BT)
task_categories:
- text-generation
task_ids:
- language-modeling
configs:
- config_name: all
  data_files:
  - split: train
    path:
    - faq/**/*.parquet
    - math/**/*.parquet
    - table/**/*.parquet
    - tutorial/**/*.parquet
- config_name: faq
  data_files:
  - split: train
    path: faq/**/*.parquet
- config_name: math
  data_files:
  - split: train
    path: math/**/*.parquet
- config_name: table
  data_files:
  - split: train
    path: table/**/*.parquet
- config_name: tutorial
  data_files:
  - split: train
    path: tutorial/**/*.parquet
train-eval-index:
- config: all
  task: text-generation
  task_id: language-modeling
  splits:
    train_split: train
  col_mapping:
    text: text
---

# PT-HF500B (FinePhrase)

## Overview

**FinePhrase** is a large-scale synthetic dataset designed for high-quality language modeling, reasoning, and instruction-following tasks. It transforms raw educational web data into structured, instruction-rich formats suitable for training advanced language models.

This dataset has been extensively used in the pre-training pipeline of TNSA models, including:

* NGen-3
* NGen-4
* NGen-4-OW

It plays a critical role in improving reasoning ability, structured output generation, and multi-format understanding.

---

## Dataset Composition

FinePhrase is built by transforming raw documents into four distinct prompt-driven formats:

### 1. FAQ Format

* Converts content into structured question-answer pairs
* Enhances retrieval-style reasoning and clarity

### 2. Mathematical Reasoning

* Converts text into multi-step math problems
* Includes step-by-step solutions
* Improves numerical reasoning and logical chains

### 3. Tabular Understanding

* Extracts structured data into tables
* Generates question-answer pairs from tabular data
* Strengthens structured data interpretation

### 4. Tutorial / Instructional

* Rewrites content into step-by-step guides
* Improves procedural reasoning and instruction following

---

## Scale

* Input Documents: ~339 Million
* Generated Samples: ~1.35 Billion
* Total Tokens Generated: ~486 Billion

| Config    | Samples   | Tokens (Completion) | Avg Tokens |
| --------- | --------- | ------------------- | ---------- |
| FAQ       | 338.9M    | 148.1B              | 436.9      |
| Math      | 338.7M    | 98.4B               | 290.5      |
| Table     | 338.5M    | 92.4B               | 272.9      |
| Tutorial  | 337.7M    | 147.4B              | 436.4      |
| **Total** | **1.35B** | **486.3B**          | **359.2**  |

---

## Data Schema

Each sample includes:

* `id`: unique identifier
* `text`: original source content
* `rollout_results`: generated outputs
  * `text`: transformed output
  * `finish_reason`: generation termination reason
  * `usage`: token statistics

---

## Generation Process

* Built using a high-throughput synthetic data pipeline
* Based on large-scale educational web data
* Uses instruction-driven transformations
* Supports long-context generation (up to ~8K tokens)

---

## Use Cases

* Pre-training large language models
* Instruction tuning
* Reasoning benchmarks
* Structured output generation
* Synthetic data augmentation

---

## Limitations

* Fully synthetic outputs may include hallucinations
* Some long documents are truncated due to context limits
* Quality depends on transformation prompts and generation settings

---

## Licensing

* ODC-BY (Open Data Commons Attribution License)

---

## Attribution

This dataset originates from large-scale educational web corpora and has been transformed using automated synthetic data generation pipelines.
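
The data schema and the truncation caveat above suggest a simple filtering step when consuming samples. The following is a minimal sketch, assuming `rollout_results` is a list of dictionaries that each carry `text` and `finish_reason`. It also assumes that a `finish_reason` of `"stop"` marks a completion that ended naturally and `"length"` marks one cut off by the context limit. Those two values, the helper name `completed_outputs`, and the sample record are illustrative assumptions; the card does not list them.

```python
from typing import Any, Dict, Iterable, List


def completed_outputs(
    sample: Dict[str, Any],
    ok_reasons: Iterable[str] = ("stop",),
) -> List[str]:
    """Return transformed texts whose generation ended for an accepted reason.

    Assumes `rollout_results` is a list of dicts with `text` and
    `finish_reason` keys (the nesting is inferred, not confirmed).
    """
    accepted = set(ok_reasons)
    outputs = []
    # A missing or null `rollout_results` field is treated as "no outputs".
    for rollout in sample.get("rollout_results") or []:
        if rollout.get("finish_reason") in accepted:
            outputs.append(rollout["text"])
    return outputs


# Hypothetical record, made up for illustration only.
example = {
    "id": "doc-0",
    "text": "Original source passage.",
    "rollout_results": [
        {"text": "Q: ...? A: ...", "finish_reason": "stop", "usage": {}},
        {"text": "Q: ...? A: (cut off", "finish_reason": "length", "usage": {}},
    ],
}

kept = completed_outputs(example)
print(f"kept {len(kept)} of {len(example['rollout_results'])} rollouts")
```

Passing `ok_reasons=("stop", "length")` keeps truncated generations as well, which may be acceptable for pre-training but less so for instruction tuning. Check the actual `finish_reason` values in the data before relying on this filter.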
---

## Notes

FinePhrase represents a foundation-scale synthetic dataset optimized for next-generation AI systems, particularly in improving:

* reasoning depth
* structured thinking
* instruction adherence
* multi-format understanding

It serves as a core dataset in the development of TNSA’s advanced language models.
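
As a sanity check on the Scale table, the average completion length for each config should equal its completion tokens divided by its samples. The sketch below recomputes those averages from the rounded figures in the table. The names `SCALE` and `avg_tokens` are illustrative, and because the inputs are already rounded, the computed values can differ from the listed averages by roughly 0.1 token.

```python
# Rounded figures copied from the Scale table:
# config -> (samples in millions, completion tokens in billions, listed average)
SCALE = {
    "faq": (338.9, 148.1, 436.9),
    "math": (338.7, 98.4, 290.5),
    "table": (338.5, 92.4, 272.9),
    "tutorial": (337.7, 147.4, 436.4),
}


def avg_tokens(samples_millions: float, tokens_billions: float) -> float:
    """Average completion tokens per sample."""
    return (tokens_billions * 1e9) / (samples_millions * 1e6)


# Totals across the four configs, in the same units as the table.
total_samples_m = sum(samples for samples, _, _ in SCALE.values())
total_tokens_b = sum(tokens for _, tokens, _ in SCALE.values())

for name, (samples_m, tokens_b, listed) in SCALE.items():
    computed = avg_tokens(samples_m, tokens_b)
    print(f"{name}: computed {computed:.1f}, listed {listed}")

overall = avg_tokens(total_samples_m, total_tokens_b)
print(
    f"total: {total_samples_m:.1f}M samples, "
    f"{total_tokens_b:.1f}B tokens, average {overall:.1f}"
)
```

The per-config sums agree with the Total row of the table (about 1.35B samples and 486.3B completion tokens).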
Provider:
TNSA