LeTG/psychological-coercion-identification

Name: LeTG/psychological-coercion-identification
Creator: LeTG
Published: 2026-03-23 10:58:57
License: No description provided

Hugging Face | Updated 2026-03-23 | Indexed 2026-03-29

Download link:

https://hf-mirror.com/datasets/LeTG/psychological-coercion-identification

Resource description:

---
language:
- en
license: cc-by-sa-4.0
task_categories:
- text-classification
task_ids:
- multi-label-classification
tags:
- psychological-coercion
- psyop
- propaganda-detection
- influence-operations
- manipulation-detection
- nlp
- fine-tuning
- media-literacy
pretty_name: Psychological Coercion Identification
size_categories:
- 10K<n<100K
---

# Psychological Coercion Identification Dataset

## Overview

A labeled dataset of 12,655 text chunks derived from 24 publicly available YouTube interview transcripts, annotated for psychological coercion techniques and PSYOP (Psychological Operations) indicators. Built to support fine-tuning of open-source language models on coercive language detection, propaganda identification, and influence operation analysis.

---

## Intended Uses

- Fine-tuning NLP classifiers to detect psychological manipulation in text
- Research into propaganda, coercion, and influence operations
- Training content moderation systems for media literacy tools
- Academic study of persuasion and coercive communication patterns

---

## Dataset Statistics

| Metric | Value |
|---|---|
| Total labeled chunks | 12,655 |
| Source videos | 24 YouTube interviews |
| Chunk size | ~30 words per chunk |
| PSYOP present (label=1) | 1,317 (10.4%) |
| PSYOP absent (label=0) | 11,338 (89.6%) |
| Train split | 10,137 rows |
| Validation split | 1,254 rows |
| Test split | 1,264 rows |

---

## Technique Distribution

| Technique | Occurrences |
|---|---|
| loaded_language | 973 |
| us_vs_them | 886 |
| fear_appeal | 424 |
| false_dichotomy | 416 |
| identity_targeting | 148 |
| scarcity_urgency | 140 |
| authority_appeal | 121 |
| information_control | 98 |
| social_proof | 61 |
| repetition_conditioning | 59 |
| thought_termination | 52 |
| guilt_induction | 37 |
| gaslighting | 16 |
| victimhood_framing | 4 |

---

## Dataset Schema

| Field | Type | Description |
|---|---|---|
| chunk_id | string | Unique identifier for the chunk |
| source_url | string | YouTube URL the chunk was derived from |
| source_file | string | Transcript filename |
| chunk_index | int | Position of chunk within its source transcript |
| text | string | The raw ~30-word text chunk |
| word_count | int | Number of words in the chunk |
| psyop_present | int | Binary label: 1 = coercive language detected, 0 = absent |
| confidence | float | Annotator confidence score (0.0–1.0) |
| techniques | list[string] | Detected technique labels (see definitions below) |
| target | string | Coercion target: individual, group, or none |
| sentiment | string | positive, negative, neutral, or mixed |
| notes | string | Brief annotation reasoning |
| split | string | train, validation, or test |

---

## Technique Label Definitions

| Label | Definition |
|---|---|
| fear_appeal | Uses fear or threat of harm to influence behavior or belief |
| false_dichotomy | Presents only two options when more exist |
| loaded_language | Emotionally charged words used to influence perception |
| repetition_conditioning | Repeats phrases or ideas to normalize them over time |
| authority_appeal | Invokes authority figures to bypass critical thinking |
| social_proof | Implies that everyone believes or does something |
| scarcity_urgency | Creates artificial time pressure or scarcity to force decisions |
| identity_targeting | Exploits group identity or sense of belonging |
| guilt_induction | Uses guilt or shame to manipulate behavior |
| love_bombing | Overwhelming praise or affection to gain compliance |
| thought_termination | Uses clichés or platitudes to shut down critical thinking |
| us_vs_them | Creates in-group vs out-group division to polarize |
| gaslighting | Causes the target to question their own perception of reality |
| information_control | Restricts, distorts, or selectively presents information |

---

## Class Imbalance Note

This dataset reflects real-world distributions of coercive language in interview and documentary content. The 89.6% negative class majority is intentional and realistic: most speech is not overtly coercive even in content that contains manipulative elements. When fine-tuning models on this dataset, consider:

- Weighted loss functions (e.g. `class_weight='balanced'` in sklearn)
- Oversampling the minority class (e.g. SMOTE)
- Adjusting classification thresholds post-training

---

## Annotation Methodology

Text chunks were annotated using **Claude Haiku (claude-haiku-4-5)** via the Anthropic API with a controlled vocabulary system prompt enforcing consistent label taxonomy across all 14 technique categories. Chunk size was fixed at approximately 30 words to capture sentence-level coercive patterns.

All source material consists of publicly available YouTube content. Transcripts were extracted using yt-dlp with automatic deduplication applied to remove caption overlap artifacts inherent in YouTube's ASR system.

---

## License

[Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)](https://creativecommons.org/licenses/by-sa/4.0/)

You are free to share and adapt this dataset for any purpose, including commercial use, provided you give appropriate credit and distribute any derivative works under the same license.
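As a rough illustration of the Class Imbalance Note and the `techniques` field in the schema, the sketch below shows two preparation steps that might precede fine-tuning: multi-hot encoding the technique labels and deriving class weights from the counts in the statistics table. It is not the dataset author's pipeline. The helper names `multi_hot` and `balanced_class_weights` and the example row are hypothetical, the label vocabulary is taken from the Technique Label Definitions table (labels outside it, such as `victimhood_framing` from the distribution table, are skipped), and the weight formula is the `n_samples / (n_classes * count)` heuristic that sklearn documents for `class_weight='balanced'`.

```python
from collections import Counter

# Label vocabulary copied from the "Technique Label Definitions" table.
TECHNIQUES = [
    "fear_appeal", "false_dichotomy", "loaded_language",
    "repetition_conditioning", "authority_appeal", "social_proof",
    "scarcity_urgency", "identity_targeting", "guilt_induction",
    "love_bombing", "thought_termination", "us_vs_them",
    "gaslighting", "information_control",
]
INDEX = {name: i for i, name in enumerate(TECHNIQUES)}


def multi_hot(techniques):
    """Turn a list of technique labels into a fixed-length 0/1 vector."""
    vec = [0] * len(TECHNIQUES)
    for name in techniques:
        i = INDEX.get(name)
        if i is not None:  # assumption: silently skip out-of-vocabulary labels
            vec[i] = 1
    return vec


def balanced_class_weights(labels):
    """Per-class weight n_samples / (n_classes * count), as in sklearn's 'balanced' mode."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {c: n_samples / (n_classes * k) for c, k in counts.items()}


# Hypothetical row shaped like the schema; the text is invented for illustration.
row = {
    "text": "they want to take everything from people like us",
    "psyop_present": 1,
    "techniques": ["fear_appeal", "us_vs_them"],
}
target_vector = multi_hot(row["techniques"])

# Binary label counts taken from the Dataset Statistics table.
labels = [1] * 1317 + [0] * 11338
weights = balanced_class_weights(labels)

print(target_vector)
print({c: round(w, 3) for c, w in sorted(weights.items())})
# prints {0: 0.558, 1: 4.804}
```

With these counts the positive class is weighted roughly 8.6 times more heavily than the negative class, which is the effect a weighted loss would have during training. Whether to skip or reject unknown labels is a design choice; rejecting them would surface taxonomy mismatches earlier.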

Provider:

LeTG
