MichaelR207/rephraser_late_check_0225
收藏Hugging Face2026-02-25 更新2026-03-29 收录
下载链接:
https://hf-mirror.com/datasets/MichaelR207/rephraser_late_check_0225
下载链接
链接失效反馈官方服务:
资源简介:
---
configs:
- config_name: default
data_files:
- split: train
path: data/train-*
- split: validation
path: data/validation-*
dataset_info:
features:
- name: messages
list:
- name: role
dtype: string
- name: content
dtype: string
- name: warc_file
dtype: string
- name: doc_id
dtype: string
- name: spec_id
dtype: string
- name: spec
dtype: string
splits:
- name: train
num_bytes: 55543499089
num_examples: 818780
- name: validation
num_bytes: 7179391
num_examples: 100
download_size: 14739434831
dataset_size: 55550678480
---
## Token Statistics
Token counts computed using the **gpt-oss-120b** tokenizer.
- **Input tokens**: tokens in the prompt sent to the model.
- **Reasoning tokens**: tokens used for chain-of-thought reasoning (included in the API's `completion_tokens`).
- **Output tokens**: non-reasoning completion tokens (`completion_tokens - reasoning_tokens`), i.e. the actual document text.
- **Total**: `prompt_tokens + completion_tokens` (reasoning is NOT double-counted).
| Metric | Train | Validation | Total |
|--------|------:|-----------:|------:|
| Input tokens | 15,881,873,090 | 1,662,315 | 15,883,535,405 |
| Reasoning tokens | 318,231,613 | 82,441 | 318,314,054 |
| Output tokens | 188,592,633 | 212,407 | 188,805,040 |
| **Total** | **16,388,697,336** | **1,957,163** | **16,390,654,499** |
提供机构:
MichaelR207



