bingbangboom/editlens_iclr_binary_reasoning
Source: Hugging Face · Updated: 2026-04-18 · Indexed: 2026-04-26
Download link:
https://hf-mirror.com/datasets/bingbangboom/editlens_iclr_binary_reasoning
Resource description:
---
license: cc-by-nc-sa-4.0
language:
- en
tags:
- ai-detection
- synthetic-data
- reasoning
- chain-of-thought
source_datasets:
- pangram/editlens_iclr
task_categories:
- text-classification
- zero-shot-classification
---
# bingbangboom/editlens_iclr_binary_reasoning

This dataset is a binary-classification subset drawn from the training split of the [`pangram/editlens_iclr`](https://huggingface.co/datasets/pangram/editlens_iclr) dataset.
It contrasts purely human-written texts (`human_written`) with purely AI-generated texts (`ai_generated`), and excludes the intermediate `ai_edited` class so that the task is strictly binary.
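The filtering described above can be sketched as follows. This is only an illustration, not the author's actual script: it assumes the source rows expose a `text_type` column that includes the value `ai_edited`, and the helper name `keep_binary` and the toy rows are hypothetical.

```python
# Labels retained in the binary subset (taken from this card).
BINARY_LABELS = {"human_written", "ai_generated"}


def keep_binary(rows):
    """Keep only rows whose text_type is one of the two binary labels."""
    return [row for row in rows if row.get("text_type") in BINARY_LABELS]


# Toy rows standing in for the source train split (illustrative, not real data).
rows = [
    {"text": "a", "text_type": "human_written"},
    {"text": "b", "text_type": "ai_edited"},
    {"text": "c", "text_type": "ai_generated"},
]

binary_rows = keep_binary(rows)
print([row["text_type"] for row in binary_rows])
```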
The main addition in this dataset is **Reasoning Traces** (Chain of Thought). Every text is accompanied by a reasoning block that analyzes stylistic forensic artifacts (including but not limited to lexical diversity, burstiness, syntax uniformity, discourse markers, and pragmatic edges).
> This dataset was published as a submission to the [Uncharted Data Challenge](https://www.adaptionlabs.ai/blog/the-uncharted-data-challenge) powered by [Adaptive Data](https://www.adaptionlabs.ai/blog/adaption-launches-adaptive-data-beta).
## Dataset Structure
### Features
The data is stored in JSON Lines (`.jsonl`) format, one record per line.
- `id` (integer): Sequential numerical index.
- `system_prompt` (string): The analytic rule set given to the model to guide the reasoning traces.
- `text` (string): The original text snippet taken from the `editlens_iclr` dataset.
- `text_type` (string): Ground-truth label taken from the `editlens_iclr` dataset. Always either `"human_written"` or `"ai_generated"`.
- `raw_model_response` (string): The full model response, consisting of a `<think>` block that holds the reasoning trace, followed by the final classification label.
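Since `raw_model_response` packs the reasoning trace and the label into one string, consumers will usually want to split it. The following is a minimal sketch of one way to do that. The function name `split_response` is hypothetical, and stripping stray quote characters around the label is an assumption made here for robustness, not a documented property of the data.

```python
import re

VALID_LABELS = ("human_written", "ai_generated")

# Capture the text inside <think>...</think> and whatever follows it.
THINK_RE = re.compile(r"<think>(.*?)</think>(.*)", re.DOTALL)


def split_response(raw):
    """Return (reasoning, label); both are None if no <think> block is found."""
    match = THINK_RE.search(raw)
    if match is None:
        return None, None
    reasoning = match.group(1).strip()
    # Assumption: tolerate whitespace and stray quotes around the label.
    tail = match.group(2).strip().strip("\"'").strip()
    label = tail if tail in VALID_LABELS else None
    return reasoning, label


# Synthetic response string used only for illustration.
raw = "<think>\nShort, uneven sentences.\n</think>\n\nhuman_written\"\n"
reasoning, label = split_response(raw)
print(label)
```

If the text after `</think>` is not one of the two known labels, `label` comes back as `None`, so callers can decide how to treat unrecognized outputs.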
### Sample
```json
{
  "id": 1,
  "system_prompt": "You are a world-class forensic text analyst...Output only the correct classification label (\"human_written\" or \"ai_generated\" ) and nothing more.",
  "text": "A YouTube user ... alongside the boat .",
  "raw_model_response": "<think>\n1. **Analyze the Request:**\n * Role: World-class forensic text analyst.\n...\n</think>\n\nhuman_written\"\n",
  "text_type": "human_written"
}
```
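Records shaped like the sample above can be read with the standard library alone. The sketch below parses JSON Lines input and checks that each record carries the five fields listed under Features. The helper name `parse_jsonl` and the demo record are hypothetical, and the field check is an extra safeguard assumed here rather than part of the dataset tooling.

```python
import json

# Field names as documented in the Features list.
REQUIRED_KEYS = {"id", "system_prompt", "text", "text_type", "raw_model_response"}


def parse_jsonl(lines):
    """Parse an iterable of JSON Lines strings into a list of dicts."""
    records = []
    for line_number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        missing = REQUIRED_KEYS - record.keys()
        if missing:
            raise ValueError(f"line {line_number}: missing {sorted(missing)}")
        records.append(record)
    return records


# Synthetic record used only for illustration.
demo_lines = [
    json.dumps({
        "id": 1,
        "system_prompt": "You are a forensic text analyst.",
        "text": "Example text.",
        "raw_model_response": "<think>\nnotes\n</think>\n\nhuman_written",
        "text_type": "human_written",
    }),
    "",
]

records = parse_jsonl(demo_lines)
print(len(records))
```

Because an open text file iterates line by line, the same function can be given a file handle opened with `encoding="utf-8"` instead of a list of strings.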
## Dataset Creation
### Source Generation
The text samples (`text`) and their ground-truth labels (`text_type`) were taken from the train split of `pangram/editlens_iclr`.
### Reasoning Annotation Process
The `<think>` reasoning blocks inside `raw_model_response` were generated with `glm-5.1` in a zero-shot classification setting.
### Data Sanitization
Any reasoning trace that was blank, failed to parse, produced a hallucinated label, or arrived at an incorrect class was removed. As a result, every remaining trace concludes with the ground-truth label.
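The sanitization rule can be approximated as a filter that keeps a record only when the label parsed from `raw_model_response` equals `text_type`. The author's actual filtering code is not documented, so this is a sketch under assumptions: the names `predicted_label` and `sanitize` are hypothetical, and the tolerance for stray quotes around the label is a choice made here.

```python
VALID_LABELS = {"human_written", "ai_generated"}


def predicted_label(raw):
    """Return the label found after </think>, or None if absent or unrecognized."""
    if not raw or "</think>" not in raw:
        return None
    # Assumption: tolerate whitespace and stray quotes around the label.
    tail = raw.rsplit("</think>", 1)[1].strip().strip("\"'").strip()
    return tail if tail in VALID_LABELS else None


def sanitize(records):
    """Keep records whose parsed prediction matches the ground-truth label."""
    kept = []
    for record in records:
        label = predicted_label(record.get("raw_model_response"))
        if label is not None and label == record.get("text_type"):
            kept.append(record)
    return kept


# Synthetic records covering each removal case (illustrative, not real data).
demo = [
    {"id": 1, "text_type": "human_written",
     "raw_model_response": "<think>a</think>\n\nhuman_written"},
    {"id": 2, "text_type": "human_written",
     "raw_model_response": "<think>b</think>\n\nai_generated"},  # wrong class
    {"id": 3, "text_type": "ai_generated", "raw_model_response": ""},  # blank
    {"id": 4, "text_type": "ai_generated",
     "raw_model_response": "<think>c</think>\n\nai_edited"},  # hallucinated label
]

clean = sanitize(demo)
print([record["id"] for record in clean])
```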
## Citation and License
**License:** CC BY-NC-SA 4.0
https://creativecommons.org/licenses/by-nc-sa/4.0/
This dataset is derived from `pangram/editlens_iclr` by Pangram Labs, licensed under CC BY-NC-SA 4.0.
If you use this dataset, please cite:
- `bingbangboom/editlens_iclr_binary_reasoning` — augmented version
- `pangram/editlens_iclr` — base dataset
Provider:
bingbangboom



