MichaelR207/rephraser_late_check_0225_output

Name: MichaelR207/rephraser_late_check_0225_output
Creator: MichaelR207
Published: 2026-04-01 16:58:27
License: No description provided

Source: Hugging Face. Updated 2026-04-01; indexed 2026-04-12.

Download link:

https://hf-mirror.com/datasets/MichaelR207/rephraser_late_check_0225_output


Description:

---
dataset_info:
  features:
  - name: output
    dtype: string
  - name: spec
    dtype: string
  - name: spec_id
    dtype: string
  - name: model
    dtype: string
  - name: warc_file
    dtype: string
  - name: doc_id
    dtype: string
  splits:
  - name: train
    num_examples: 328635
  - name: validation
    num_examples: 100
license: cc-by-4.0
---

# rephraser_late_check_0225_output

Clean extracted text from web pages, produced by **Kimi-K2.5**.

This dataset contains only the model-generated extraction output (no prompts, no HTML, no reasoning traces). Suitable for text quality analysis, downstream NLP tasks, and training data.

## Source

Derived from [MichaelR207/rephraser_late_check_0225](https://huggingface.co/datasets/MichaelR207/rephraser_late_check_0225) by extracting the assistant response, stripping `<think>` reasoning and DSPy field markers, and dropping rows containing `[NO_USEFUL_CONTENT]`.

## Processing

- Reasoning (`<think>...</think>`) stripped
- DSPy markers (`[[ ## text ## ]]`, `[[ ## completed ## ]]`) stripped
- Rows with `[NO_USEFUL_CONTENT]` dropped (490,145 rows removed)

## Schema

| Column | Description |
|--------|-------------|
| `output` | Clean extracted text |
| `spec` | Extraction specification used |
| `spec_id` | Specification identifier (0-999) |
| `model` | Model that generated the extraction |
| `warc_file` | Source Common Crawl WARC file |
| `doc_id` | Document identifier within WARC |

## License

This dataset is licensed under [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/). The output text is model-generated extraction from public web content.

## Stats

| Split | Rows |
|-------|-----:|
| Train | 328,635 |
| Validation | 100 |
| Total | 328,735 |
| Dropped (no useful content) | 490,145 |
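
The three steps listed under Processing could be sketched roughly as follows. This is not the author's actual pipeline: the function name `clean_response`, the regular expressions, and the choice to check for the sentinel only after the reasoning has been removed are all assumptions made for illustration, based solely on the marker strings quoted in the card.

```python
import re

# Patterns are assumptions derived from the markers quoted in the card.
THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL)
DSPY_MARKER_RE = re.compile(r"\[\[ ## (?:text|completed) ## \]\]")
NO_CONTENT = "[NO_USEFUL_CONTENT]"


def clean_response(raw):
    """Hypothetical helper: return the cleaned output, or None to drop the row."""
    # 1. Strip the reasoning trace.
    text = THINK_RE.sub("", raw)
    # 2. Strip the DSPy field markers.
    text = DSPY_MARKER_RE.sub("", text)
    # 3. Drop rows the model flagged as having no useful content.
    if NO_CONTENT in text:
        return None
    return text.strip()


if __name__ == "__main__":
    sample = "<think>plan</think>[[ ## text ## ]]\nHello world\n[[ ## completed ## ]]"
    print(clean_response(sample))
```

Checking for the sentinel after removing the `<think>` span is a deliberate choice in this sketch: it avoids dropping a row merely because the reasoning trace mentioned the sentinel while the final answer did not.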

Provider:

MichaelR207
