
MichaelR207/rephraser_late_check_0225_output

Hugging Face · Updated: 2026-04-01 · Indexed: 2026-04-12
Download link:
https://hf-mirror.com/datasets/MichaelR207/rephraser_late_check_0225_output
Resource description:
---
dataset_info:
  features:
  - name: output
    dtype: string
  - name: spec
    dtype: string
  - name: spec_id
    dtype: string
  - name: model
    dtype: string
  - name: warc_file
    dtype: string
  - name: doc_id
    dtype: string
  splits:
  - name: train
    num_examples: 328635
  - name: validation
    num_examples: 100
license: cc-by-4.0
---

# rephraser_late_check_0225_output

Clean extracted text from web pages, produced by **Kimi-K2.5**. This dataset contains only the model-generated extraction output (no prompts, no HTML, no reasoning traces). Suitable for text quality analysis, downstream NLP tasks, and training data.

## Source

Derived from [MichaelR207/rephraser_late_check_0225](https://huggingface.co/datasets/MichaelR207/rephraser_late_check_0225) by extracting the assistant response, stripping `<think>` reasoning and DSPy field markers, and dropping rows containing `[NO_USEFUL_CONTENT]`.

## Processing

- Reasoning (`<think>...</think>`) stripped
- DSPy markers (`[[ ## text ## ]]`, `[[ ## completed ## ]]`) stripped
- Rows with `[NO_USEFUL_CONTENT]` dropped (490,145 rows removed)

## Schema

| Column | Description |
|--------|-------------|
| `output` | Clean extracted text |
| `spec` | Extraction specification used |
| `spec_id` | Specification identifier (0-999) |
| `model` | Model that generated the extraction |
| `warc_file` | Source Common Crawl WARC file |
| `doc_id` | Document identifier within WARC |

## License

This dataset is licensed under [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/). The output text is model-generated extraction from public web content.

## Stats

| Split | Rows |
|-------|-----:|
| Train | 328,635 |
| Validation | 100 |
| Total | 328,735 |
| Dropped (no useful content) | 490,145 |
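
The processing steps described in the dataset card can be illustrated with a minimal sketch. This is not the author's actual pipeline: the function name `clean_response`, the regular expressions, and the order of operations (strip reasoning first, then check for the drop marker) are assumptions based only on the marker strings quoted in the card.

```python
import re
from typing import Optional

# Assumed patterns, built from the marker strings quoted in the dataset card.
THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL)
DSPY_MARKER_RE = re.compile(r"\[\[ ## (?:text|completed) ## \]\]")
NO_CONTENT = "[NO_USEFUL_CONTENT]"


def clean_response(response: str) -> Optional[str]:
    """Hypothetical helper: return cleaned text, or None if the row is dropped."""
    # Strip the reasoning trace first (assumed order), so that only the
    # final answer is inspected for the drop marker.
    text = THINK_RE.sub("", response)
    if NO_CONTENT in text:
        return None
    # Remove the DSPy field markers and surrounding whitespace.
    text = DSPY_MARKER_RE.sub("", text)
    return text.strip()


raw = "<think>plan\nmore</think>[[ ## text ## ]]\nHello world\n\n[[ ## completed ## ]]"
kept = clean_response(raw)
dropped = clean_response("[[ ## text ## ]]\n[NO_USEFUL_CONTENT]\n[[ ## completed ## ]]")
```

Here `kept` holds only the extracted text and `dropped` is `None`, mirroring how rows flagged as having no useful content are excluded from the published splits.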
Provider:
MichaelR207