MichaelR207/rephraser_kimi_v1_0331_output
Source: Hugging Face · Updated 2026-04-01 · Indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/MichaelR207/rephraser_kimi_v1_0331_output
Description:
---
dataset_info:
  features:
  - name: output
    dtype: string
  - name: spec
    dtype: string
  - name: spec_id
    dtype: string
  - name: model
    dtype: string
  - name: warc_file
    dtype: string
  - name: doc_id
    dtype: string
  splits:
  - name: train
    num_examples: 23077
  - name: validation
    num_examples: 100
license: cc-by-4.0
---
# rephraser_kimi_v1_0331_output
Clean extracted text from web pages, produced by **Kimi-K2.5**.
This dataset contains only the model-generated extraction output (no prompts,
no HTML, no reasoning traces). It is suitable for text quality analysis,
downstream NLP tasks, and use as training data.
## Source
Derived from [MichaelR207/rephraser_kimi_v1_0331](https://huggingface.co/datasets/MichaelR207/rephraser_kimi_v1_0331)
by extracting the assistant response, stripping `<think>` reasoning and DSPy
field markers, and dropping rows containing `[NO_USEFUL_CONTENT]`.
## Processing
- Reasoning (`<think>...</think>`) stripped
- DSPy markers (`[[ ## text ## ]]`, `[[ ## completed ## ]]`) stripped
- Rows with `[NO_USEFUL_CONTENT]` dropped (34,722 rows removed)
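The three steps above can be sketched roughly as follows. This is a minimal illustration, not the script actually used to build the dataset: the function name `clean_response`, the regular expressions, the order of operations, and the final whitespace trimming are all assumptions.

```python
import re
from typing import Optional

# Assumed patterns; the real pipeline may match these differently.
THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL)
MARKER_RE = re.compile(r"\[\[ ## (?:text|completed) ## \]\]")
NO_CONTENT = "[NO_USEFUL_CONTENT]"


def clean_response(raw: str) -> Optional[str]:
    """Return the cleaned extraction, or None if the row should be dropped."""
    # 1. Strip the reasoning trace.
    text = THINK_RE.sub("", raw)
    # 2. Strip the DSPy field markers.
    text = MARKER_RE.sub("", text)
    # 3. Drop rows flagged as having no useful content
    #    (checked after stripping, which is an assumption).
    if NO_CONTENT in text:
        return None
    # Trimming surrounding whitespace is also an assumption.
    return text.strip()
```

For example, a response of the form `<think>...</think>[[ ## text ## ]]\nHello world\n[[ ## completed ## ]]` would reduce to `Hello world` under this sketch, while a response whose body is `[NO_USEFUL_CONTENT]` would return `None` and be dropped.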
## Schema
| Column | Description |
|--------|-------------|
| `output` | Clean extracted text |
| `spec` | Extraction specification used |
| `spec_id` | Specification identifier (0-999) |
| `model` | Model that generated the extraction |
| `warc_file` | Source Common Crawl WARC file |
| `doc_id` | Document identifier within WARC |
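A row can be checked against this schema with a small helper like the one below. It is a sketch under stated assumptions: `check_row` and `load_split` are hypothetical names, and the check assumes `spec_id` is a decimal integer stored as a string, which the card implies (string dtype, range 0-999) but does not state outright. Loading uses `load_dataset` from the third-party `datasets` library and requires network access.

```python
from typing import List, Mapping

# Column names taken from the schema table; all are declared as strings.
COLUMNS = ("output", "spec", "spec_id", "model", "warc_file", "doc_id")


def check_row(row: Mapping[str, object]) -> List[str]:
    """Return a list of schema problems found in one row (empty if none)."""
    problems = []
    for name in COLUMNS:
        if name not in row:
            problems.append(f"missing column: {name}")
        elif not isinstance(row[name], str):
            problems.append(f"{name} is not a string")
    spec_id = row.get("spec_id")
    if isinstance(spec_id, str):
        # Assumption: spec_id holds a decimal integer in the range 0-999.
        try:
            in_range = 0 <= int(spec_id) <= 999
        except ValueError:
            in_range = False
        if not in_range:
            problems.append(f"spec_id out of expected range: {spec_id!r}")
    return problems


def load_split(split: str = "validation"):
    """Load one split from the Hub (needs the `datasets` package)."""
    from datasets import load_dataset
    return load_dataset("MichaelR207/rephraser_kimi_v1_0331_output", split=split)
```

Calling `check_row(load_split("validation")[0])` should return an empty list if the first validation row matches the schema as described.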
## License
This dataset is licensed under [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/).
The output text consists of model-generated extractions from public web content.
## Stats
| Split | Rows |
|-------|-----:|
| Train | 23,077 |
| Validation | 100 |
| Total | 23,177 |
| Dropped (no useful content) | 34,722 |
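The figures in the table can be cross-checked with a few lines of arithmetic. The sketch below assumes that dropping `[NO_USEFUL_CONTENT]` rows was the only filter applied, so the implied size of the source dataset is an inference rather than a number stated in this card.

```python
# Row counts copied from the Stats table.
train_rows = 23_077
validation_rows = 100
dropped_rows = 34_722

# Rows kept in this dataset.
kept_rows = train_rows + validation_rows

# Implied source size, assuming no other rows were filtered out.
implied_source_rows = kept_rows + dropped_rows

# Share of source rows that were dropped under that assumption.
drop_rate = dropped_rows / implied_source_rows

print(f"kept={kept_rows} implied_source={implied_source_rows} drop_rate={drop_rate:.1%}")
```

Under that assumption, roughly 60% of the source rows were dropped.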
Provider:
MichaelR207



