
bingbangboom/editlens_iclr_binary_reasoning

Hugging Face | Updated 2026-04-18 | Indexed 2026-04-26
Download link:
https://hf-mirror.com/datasets/bingbangboom/editlens_iclr_binary_reasoning
Resource description:
---
license: cc-by-nc-sa-4.0
language:
- en
tags:
- ai-detection
- synthetic-data
- reasoning
- chain-of-thought
source_datasets:
- pangram/editlens_iclr
task_categories:
- text-classification
- zero-shot-classification
---

# bingbangboom/editlens_iclr_binary_reasoning

![Banner](https://huggingface.co/datasets/bingbangboom/editlens_iclr_binary_reasoning/resolve/main/banner.webp)

This dataset is a binary-classification subset drawn from the training split of the [`pangram/editlens_iclr`](https://huggingface.co/datasets/pangram/editlens_iclr) dataset. It contrasts purely human-written texts (`human_written`) with purely synthetic content (`ai_generated`) and excludes the intermediate `ai_edited` class, so it can be used directly for binary classification tasks.

The primary augmentation of this dataset is the inclusion of **Reasoning Traces** (Chain of Thought). Every text is accompanied by an analytical reasoning block detailing stylistic forensic artifacts (including but not limited to lexical diversity, burstiness, syntax uniformity, discourse markers, and pragmatic edges).

> This dataset was published as a submission to the [Uncharted Data Challenge](https://www.adaptionlabs.ai/blog/the-uncharted-data-challenge) powered by [Adaptive Data](https://www.adaptionlabs.ai/blog/adaption-launches-adaptive-data-beta).

## Dataset Structure

### Features

The data is stored in `.jsonl` format, one record per line, with the following fields:

- `id`: Sequential numerical index.
- `system_prompt` (string): An analytic rule set injected to align the reasoning traces.
- `text` (string): The original text snippet taken from the `editlens_iclr` dataset.
- `text_type` (string): Ground-truth label taken from the `editlens_iclr` dataset. Always either `"human_written"` or `"ai_generated"`.
- `raw_model_response` (string): The full classification response: a `<think>` block containing the reasoning trace, followed by the final output label at the end.
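To make the field list concrete, the following is a minimal sketch of reading the records with the Python standard library and checking each one against the fields described above. The function name `read_records` is a hypothetical helper, not part of the dataset or of any library, and the checks are only one plausible reading of the schema.

```python
import json

# Field names and label values are copied from the feature list above.
EXPECTED_FIELDS = {"id", "system_prompt", "text", "text_type", "raw_model_response"}
ALLOWED_LABELS = {"human_written", "ai_generated"}


def read_records(lines):
    """Yield one dict per non-empty JSONL line, checking it against the schema.

    `lines` can be any iterable of strings, such as an open file object.
    """
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        missing = EXPECTED_FIELDS - record.keys()
        if missing:
            raise ValueError(f"line {line_no}: missing fields {sorted(missing)}")
        if record["text_type"] not in ALLOWED_LABELS:
            raise ValueError(
                f"line {line_no}: unexpected label {record['text_type']!r}"
            )
        yield record
```

With a local copy of the data (the file name here is hypothetical), usage would look like `with open("train.jsonl", encoding="utf-8") as f: records = list(read_records(f))`.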

### Sample

```json
{
  "id": 1,
  "system_prompt": "You are a world-class forensic text analyst...Output only the correct classification label (\"human_written\" or \"ai_generated\" ) and nothing more.",
  "text": "A YouTube user ... alongside the boat .",
  "raw_model_response": "<think>\n1. **Analyze the Request:**\n * Role: World-class forensic text analyst.\n...\n</think>\n\nhuman_written\"\n",
  "text_type": "human_written"
}
```

## Dataset Creation

### Source Generation

The text samples (`text`) and their corresponding ground-truth labels (`text_type`) were taken from the train split of `pangram/editlens_iclr`.

### Reasoning Annotation Process

The `<think>` blocks inside `raw_model_response` were generated with `glm-5.1`, prompted for zero-shot classification.

### Data Sanitization

Any response with a blank or unparseable reasoning trace, a hallucinated label, or an incorrect predicted class was removed. As a result, every remaining trace arrives at the ground-truth label.

## Citation and License

**License:** CC BY-NC-SA 4.0 https://creativecommons.org/licenses/by-nc-sa/4.0/

This dataset is derived from `pangram/editlens_iclr` by Pangram Labs, licensed under CC BY-NC-SA 4.0.

If you use this dataset, please cite:

- `bingbangboom/editlens_iclr_binary_reasoning` (augmented version)
- `pangram/editlens_iclr` (base dataset)
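
The rule described under Data Sanitization can be sketched as a per-record filter. This is an illustration under stated assumptions, not the author's actual pipeline: the helper names `extract_label` and `keep_record` are hypothetical, the answer is assumed to be whatever follows the last `</think>` tag, and stray quote characters around the label are stripped because the sample record shows one.

```python
ALLOWED_LABELS = {"human_written", "ai_generated"}


def extract_label(raw_model_response):
    """Return the final label, or None if the response cannot be parsed.

    Assumption: the label is the text after the last </think> tag.
    """
    head, sep, tail = raw_model_response.rpartition("</think>")
    # Reasoning is whatever sits between <think> and the last </think>.
    reasoning = head.partition("<think>")[2]
    if not sep or not reasoning.strip():
        return None  # missing tags or a blank reasoning trace
    # Assumption: tolerate whitespace and stray quotes around the label.
    label = tail.strip().strip('"').strip()
    return label if label in ALLOWED_LABELS else None


def keep_record(record):
    """True only when the parsed label equals the ground-truth `text_type`."""
    return extract_label(record["raw_model_response"]) == record["text_type"]
```

Applied to a list of records, `[r for r in records if keep_record(r)]` would drop blank traces, labels outside the two allowed values, and predictions that disagree with `text_type`.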
Provider:
bingbangboom