TNSA/PT-HF500B
Source: Hugging Face. Updated 2026-03-21; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/TNSA/PT-HF500B
Resource description:
---
language:
- en
license: odc-by
tags:
- synthetic-data
- instruction-tuning
- large-scale
- TNSA
- NGen
annotations_creators:
- machine-generated
language_creators:
- found
pretty_name: FinePhrase Synthetic Corpus
size_categories:
- n>1M
source_datasets:
- fineweb-edu (sample-350BT)
task_categories:
- text-generation
task_ids:
- language-modeling
configs:
- config_name: all
  data_files:
  - split: train
    path:
    - faq/**/*.parquet
    - math/**/*.parquet
    - table/**/*.parquet
    - tutorial/**/*.parquet
- config_name: faq
  data_files:
  - split: train
    path: faq/**/*.parquet
- config_name: math
  data_files:
  - split: train
    path: math/**/*.parquet
- config_name: table
  data_files:
  - split: train
    path: table/**/*.parquet
- config_name: tutorial
  data_files:
  - split: train
    path: tutorial/**/*.parquet
train-eval-index:
- config: all
  task: text-generation
  task_id: language-modeling
  splits:
    train_split: train
  col_mapping:
    text: text
---
# PT-HF500B (FinePhrase)
## Overview
**FinePhrase** is a large-scale synthetic dataset for language modeling, reasoning, and instruction-following tasks. It transforms educational web documents from FineWeb-Edu (`sample-350BT`) into structured, instruction-rich formats suitable for training language models.
This dataset has been extensively used in the pre-training pipeline of TNSA models, including:
* NGen-3
* NGen-4
* NGen-4-OW
It plays a critical role in improving reasoning ability, structured output generation, and multi-format understanding.
---
## Dataset Composition
FinePhrase is built by transforming raw documents into four distinct prompt-driven formats:
### 1. FAQ Format
* Converts content into structured question-answer pairs
* Enhances retrieval-style reasoning and clarity
### 2. Mathematical Reasoning
* Converts text into multi-step math problems
* Includes step-by-step solutions
* Improves numerical reasoning and logical chains
### 3. Tabular Understanding
* Extracts structured data into tables
* Generates question-answer pairs from tabular data
* Strengthens structured data interpretation
### 4. Tutorial / Instructional
* Rewrites content into step-by-step guides
* Improves procedural reasoning and instruction following
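The four formats correspond to the `faq`, `math`, `table`, and `tutorial` configs declared in the card's front matter, with `all` combining them. The following is a minimal sketch of selecting one, assuming the Hugging Face `datasets` library is installed and that the repository ID `TNSA/PT-HF500B` from the download link resolves on the Hub. The helper names `config_globs` and `stream_config` are hypothetical and not part of the dataset.

```python
from typing import Iterator

# Format names taken from the configs in the card's front matter.
FORMAT_CONFIGS = ("faq", "math", "table", "tutorial")


def config_globs(name: str) -> list[str]:
    """Return the parquet glob patterns that a config name covers."""
    if name == "all":
        return [f"{fmt}/**/*.parquet" for fmt in FORMAT_CONFIGS]
    if name in FORMAT_CONFIGS:
        return [f"{name}/**/*.parquet"]
    raise ValueError(f"unknown config: {name!r}")


def stream_config(name: str, repo_id: str = "TNSA/PT-HF500B") -> Iterator[dict]:
    """Lazily iterate over one config without downloading it in full."""
    config_globs(name)  # validate the name before touching the network
    # Imported lazily so the helper above works without the library installed.
    from datasets import load_dataset

    # Assumption: the repo ID and config names resolve as the card describes.
    return iter(load_dataset(repo_id, name, split="train", streaming=True))
```

Under these assumptions, `next(stream_config("faq"))` would fetch the first FAQ sample. Streaming is used here because each config holds hundreds of millions of rows, which is impractical to download before inspecting a few records.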
---
## Scale
* Input Documents: ~339 Million
* Generated Samples: ~1.35 Billion
* Total Tokens Generated: ~486 Billion
| Config    | Samples   | Completion Tokens | Avg Tokens per Sample |
| --------- | --------- | ----------------- | --------------------- |
| FAQ       | 338.9M    | 148.1B            | 436.9                 |
| Math      | 338.7M    | 98.4B             | 290.5                 |
| Table     | 338.5M    | 92.4B             | 272.9                 |
| Tutorial  | 337.7M    | 147.4B            | 436.4                 |
| **Total** | **1.35B** | **486.3B**        | **359.2**             |
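The totals and averages in the table can be cross-checked with a few lines of arithmetic. This sketch uses only the rounded figures reported above, so recomputed per-config averages may differ from the table by a fraction of a token.

```python
# (samples in millions, completion tokens in billions), as reported in the table.
SCALE = {
    "faq": (338.9, 148.1),
    "math": (338.7, 98.4),
    "table": (338.5, 92.4),
    "tutorial": (337.7, 147.4),
}


def avg_tokens(samples_millions: float, tokens_billions: float) -> float:
    """Average completion tokens per generated sample."""
    return tokens_billions * 1e9 / (samples_millions * 1e6)


total_samples = sum(s for s, _ in SCALE.values())  # millions
total_tokens = sum(t for _, t in SCALE.values())   # billions

for name, (samples, tokens) in SCALE.items():
    print(f"{name:9s} {avg_tokens(samples, tokens):6.1f}")
print(f"{'total':9s} {avg_tokens(total_samples, total_tokens):6.1f}")
```

The sums come to about 1,353.8 million samples and 486.3 billion tokens, which matches the **Total** row after rounding.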
---
## Data Schema
Each sample includes:
* `id`: unique identifier
* `text`: original source content
* `rollout_results`: generated outputs, each containing:
  * `text`: transformed output
  * `finish_reason`: generation termination reason
  * `usage`: token statistics
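To illustrate how these fields might be consumed, here is a minimal sketch that pulls the generated text out of one sample. It assumes `rollout_results` is a list of records that each carry their own `text`, `finish_reason`, and `usage`. The sample below is invented for illustration, and the `finish_reason` values `"stop"` and `"length"` and the `completion_tokens` key follow a common API convention rather than anything the card confirms.

```python
from typing import Any


def completions(
    sample: dict[str, Any],
    keep_reasons: tuple[str, ...] = ("stop",),
) -> list[str]:
    """Return generated texts whose finish_reason is in keep_reasons."""
    return [
        rollout["text"]
        for rollout in sample.get("rollout_results", [])
        if rollout.get("finish_reason") in keep_reasons
    ]


# Hypothetical sample, invented for illustration only.
sample = {
    "id": "doc-0",
    "text": "Water boils at 100 °C at sea level.",
    "rollout_results": [
        {
            "text": "Q: At what temperature does water boil at sea level?\nA: 100 °C.",
            "finish_reason": "stop",  # assumed value for a normal finish
            "usage": {"completion_tokens": 21},  # assumed key name
        },
        {
            "text": "Q: At what temperature does",
            "finish_reason": "length",  # assumed value for a cut-off output
            "usage": {"completion_tokens": 6},
        },
    ],
}

print(completions(sample))
```

Filtering on `finish_reason` is one way to drop generations that were cut off before completing. Passing `keep_reasons=("stop", "length")` keeps both rollouts instead.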
---
## Generation Process
* Built using a high-throughput synthetic data pipeline
* Based on large-scale educational web data
* Uses instruction-driven transformations
* Supports long-context generation (up to ~8K tokens)
---
## Use Cases
* Pre-training large language models
* Instruction tuning
* Reasoning benchmarks
* Structured output generation
* Synthetic data augmentation
---
## Limitations
* Fully synthetic outputs may include hallucinations
* Some long documents are truncated due to context limits
* Quality depends on transformation prompts and generation settings
---
## Licensing
* ODC-BY (Open Data Commons Attribution License)
---
## Attribution
This dataset is derived from FineWeb-Edu (`sample-350BT`), a large-scale educational web corpus, and was transformed using automated synthetic data generation pipelines.
---
## Notes
FinePhrase represents a foundation-scale synthetic dataset optimized for next-generation AI systems, particularly in improving:
* reasoning depth
* structured thinking
* instruction adherence
* multi-format understanding
It serves as a core dataset in the development of TNSA’s advanced language models.
Provider:
TNSA



