
LeTG/psychological-coercion-identification

Source: Hugging Face. Updated 2026-03-23; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/LeTG/psychological-coercion-identification
Description:
---
language:
- en
license: cc-by-sa-4.0
task_categories:
- text-classification
task_ids:
- multi-label-classification
tags:
- psychological-coercion
- psyop
- propaganda-detection
- influence-operations
- manipulation-detection
- nlp
- fine-tuning
- media-literacy
pretty_name: Psychological Coercion Identification
size_categories:
- 10K<n<100K
---

# Psychological Coercion Identification Dataset

## Overview

A labeled dataset of 12,655 text chunks derived from 24 publicly available YouTube interview transcripts, annotated for psychological coercion techniques and PSYOP (Psychological Operations) indicators. Built to support fine-tuning of open-source language models on coercive language detection, propaganda identification, and influence operation analysis.

---

## Intended Uses

- Fine-tuning NLP classifiers to detect psychological manipulation in text
- Research into propaganda, coercion, and influence operations
- Training content moderation systems for media literacy tools
- Academic study of persuasion and coercive communication patterns

---

## Dataset Statistics

| Metric | Value |
|---|---|
| Total labeled chunks | 12,655 |
| Source videos | 24 YouTube interviews |
| Chunk size | ~30 words per chunk |
| PSYOP present (label=1) | 1,317 (10.4%) |
| PSYOP absent (label=0) | 11,338 (89.6%) |
| Train split | 10,137 rows |
| Validation split | 1,254 rows |
| Test split | 1,264 rows |

---

## Technique Distribution

| Technique | Occurrences |
|---|---|
| loaded_language | 973 |
| us_vs_them | 886 |
| fear_appeal | 424 |
| false_dichotomy | 416 |
| identity_targeting | 148 |
| scarcity_urgency | 140 |
| authority_appeal | 121 |
| information_control | 98 |
| social_proof | 61 |
| repetition_conditioning | 59 |
| thought_termination | 52 |
| guilt_induction | 37 |
| gaslighting | 16 |
| victimhood_framing | 4 |

---

## Dataset Schema

| Field | Type | Description |
|---|---|---|
| chunk_id | string | Unique identifier for the chunk |
| source_url | string | YouTube URL the chunk was derived from |
| source_file | string | Transcript filename |
| chunk_index | int | Position of chunk within its source transcript |
| text | string | The raw ~30-word text chunk |
| word_count | int | Number of words in the chunk |
| psyop_present | int | Binary label: 1 = coercive language detected, 0 = absent |
| confidence | float | Annotator confidence score (0.0–1.0) |
| techniques | list[string] | Detected technique labels (see definitions below) |
| target | string | Coercion target: individual, group, or none |
| sentiment | string | positive, negative, neutral, or mixed |
| notes | string | Brief annotation reasoning |
| split | string | train, validation, or test |

---

## Technique Label Definitions

| Label | Definition |
|---|---|
| fear_appeal | Uses fear or threat of harm to influence behavior or belief |
| false_dichotomy | Presents only two options when more exist |
| loaded_language | Emotionally charged words used to influence perception |
| repetition_conditioning | Repeats phrases or ideas to normalize them over time |
| authority_appeal | Invokes authority figures to bypass critical thinking |
| social_proof | Implies that everyone believes or does something |
| scarcity_urgency | Creates artificial time pressure or scarcity to force decisions |
| identity_targeting | Exploits group identity or sense of belonging |
| guilt_induction | Uses guilt or shame to manipulate behavior |
| love_bombing | Overwhelming praise or affection to gain compliance |
| thought_termination | Uses clichés or platitudes to shut down critical thinking |
| us_vs_them | Creates in-group vs out-group division to polarize |
| gaslighting | Causes the target to question their own perception of reality |
| information_control | Restricts, distorts, or selectively presents information |

---

## Class Imbalance Note

This dataset reflects real-world distributions of coercive language in interview and documentary content. The 89.6% negative class majority is intentional and realistic: most speech is not overtly coercive even in content that contains manipulative elements. When fine-tuning models on this dataset, consider:

- Weighted loss functions (e.g. `class_weight='balanced'` in sklearn)
- Oversampling the minority class (e.g. SMOTE)
- Adjusting classification thresholds post-training

---

## Annotation Methodology

Text chunks were annotated using **Claude Haiku (claude-haiku-4-5)** via the Anthropic API with a controlled vocabulary system prompt enforcing consistent label taxonomy across all 14 technique categories. Chunk size was fixed at approximately 30 words to capture sentence-level coercive patterns.

All source material consists of publicly available YouTube content. Transcripts were extracted using YT-DLP with automatic deduplication applied to remove caption overlap artifacts inherent in YouTube's ASR system.

---

## License

[Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)](https://creativecommons.org/licenses/by-sa/4.0/)

You are free to share and adapt this dataset for any purpose, including commercial use, provided you give appropriate credit and distribute any derivative works under the same license.
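
As an illustration of the weighted-loss option mentioned in the class imbalance note, the following is a minimal sketch of the "balanced" weighting heuristic, `n_samples / (n_classes * class_count)`, applied to the `psyop_present` label counts from the statistics table. The helper name `balanced_class_weights` is hypothetical, and the counts cover the whole dataset rather than the train split alone, so real weights should be recomputed from the training rows.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return {label: weight} with weight = n_samples / (n_classes * count).

    Hypothetical helper illustrating the 'balanced' heuristic; it is not
    part of the dataset or of any library.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in counts.items()
    }


# Assumption: label counts taken from the Dataset Statistics table
# (all 12,655 chunks, not only the train split).
labels = [0] * 11338 + [1] * 1317

weights = balanced_class_weights(labels)

# The rare positive class receives the larger weight.
for label in sorted(weights):
    print(f"psyop_present={label}: weight={weights[label]:.3f}")
```

The resulting dictionary could be passed to a classifier or loss function that accepts per-class weights; whether weighting, oversampling, or threshold adjustment works best here would need to be checked on the validation split.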
Provider:
LeTG