LeTG/psychological-coercion-identification
Source: Hugging Face. Updated 2026-03-23; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/LeTG/psychological-coercion-identification
Resource description:

---
language:
- en
license: cc-by-sa-4.0
task_categories:
- text-classification
task_ids:
- multi-label-classification
tags:
- psychological-coercion
- psyop
- propaganda-detection
- influence-operations
- manipulation-detection
- nlp
- fine-tuning
- media-literacy
pretty_name: Psychological Coercion Identification
size_categories:
- 10K<n<100K
---
# Psychological Coercion Identification Dataset
## Overview
A labeled dataset of 12,655 text chunks derived from 24 publicly available YouTube
interview transcripts, annotated for psychological coercion techniques and PSYOP
(Psychological Operations) indicators.

Built to support fine-tuning of open-source language models on coercive language
detection, propaganda identification, and influence operation analysis.

---
## Intended Uses
- Fine-tuning NLP classifiers to detect psychological manipulation in text
- Research into propaganda, coercion, and influence operations
- Training content moderation systems and media literacy tools
- Academic study of persuasion and coercive communication patterns
---
## Dataset Statistics
| Metric | Value |
|---|---|
| Total labeled chunks | 12,655 |
| Source videos | 24 YouTube interviews |
| Chunk size | ~30 words per chunk |
| PSYOP present (label=1) | 1,317 (10.4%) |
| PSYOP absent (label=0) | 11,338 (89.6%) |
| Train split | 10,137 rows |
| Validation split | 1,254 rows |
| Test split | 1,264 rows |
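
The figures in the table are mutually consistent, which a few lines of arithmetic can confirm. The sketch below only re-derives the percentages and the split total from the counts listed above; the variable names are illustrative and not part of the dataset.

```python
# Counts copied from the Dataset Statistics table.
total = 12_655
positive, negative = 1_317, 11_338
splits = {"train": 10_137, "validation": 1_254, "test": 1_264}

# The two classes and the three splits should each cover every chunk.
assert positive + negative == total
assert sum(splits.values()) == total

# Class shares and split shares, rounded to one decimal place.
positive_pct = round(100 * positive / total, 1)
negative_pct = round(100 * negative / total, 1)
split_pct = {name: round(100 * n / total, 1) for name, n in splits.items()}

print(positive_pct, negative_pct)
print(split_pct)
```

The split sizes work out to roughly 80/10/10 for train, validation, and test.
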
---
## Technique Distribution
| Technique | Occurrences |
|---|---|
| loaded_language | 973 |
| us_vs_them | 886 |
| fear_appeal | 424 |
| false_dichotomy | 416 |
| identity_targeting | 148 |
| scarcity_urgency | 140 |
| authority_appeal | 121 |
| information_control | 98 |
| social_proof | 61 |
| repetition_conditioning | 59 |
| thought_termination | 52 |
| guilt_induction | 37 |
| gaslighting | 16 |
| victimhood_framing | 4 |
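
The occurrence counts sum to 3,435, which is more than the 1,317 chunks labeled as PSYOP-present, so chunks evidently carry more than one technique label on average. The sketch below makes that arithmetic explicit. It assumes, without confirmation from the card, that technique labels are attached only to chunks with `psyop_present = 1`.

```python
# Occurrence counts copied from the Technique Distribution table.
technique_counts = {
    "loaded_language": 973,
    "us_vs_them": 886,
    "fear_appeal": 424,
    "false_dichotomy": 416,
    "identity_targeting": 148,
    "scarcity_urgency": 140,
    "authority_appeal": 121,
    "information_control": 98,
    "social_proof": 61,
    "repetition_conditioning": 59,
    "thought_termination": 52,
    "guilt_induction": 37,
    "gaslighting": 16,
    "victimhood_framing": 4,
}

total_occurrences = sum(technique_counts.values())
positive_chunks = 1_317  # PSYOP-present chunks, from Dataset Statistics

# ASSUMPTION: techniques are only attached to PSYOP-present chunks.
avg_labels_per_positive = total_occurrences / positive_chunks

# Share of all occurrences taken by the two most frequent techniques.
top_two = sorted(technique_counts.values(), reverse=True)[:2]
top_two_share = sum(top_two) / total_occurrences

print(total_occurrences)
print(round(avg_labels_per_positive, 2), round(top_two_share, 2))
```

Under that assumption, a positive chunk carries about 2.6 labels on average, and `loaded_language` plus `us_vs_them` account for a little over half of all occurrences, so the rarer techniques may need separate attention during evaluation.
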
---
## Dataset Schema
| Field | Type | Description |
|---|---|---|
| chunk_id | string | Unique identifier for the chunk |
| source_url | string | YouTube URL the chunk was derived from |
| source_file | string | Transcript filename |
| chunk_index | int | Position of chunk within its source transcript |
| text | string | The raw ~30-word text chunk |
| word_count | int | Number of words in the chunk |
| psyop_present | int | Binary label: 1 = coercive language detected, 0 = absent |
| confidence | float | Confidence score reported by the annotating model (0.0–1.0) |
| techniques | list[string] | Detected technique labels (see definitions below) |
| target | string | Coercion target: individual, group, or none |
| sentiment | string | positive, negative, neutral, or mixed |
| notes | string | Brief annotation reasoning |
| split | string | train, validation, or test |
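
One way to work with this schema is to mirror it as a typed record and check each row against the value ranges the table describes. The following is a minimal sketch using only the standard library. The class name `ChunkRecord`, the `problems` method, and the example row are all hypothetical; the row is invented for illustration and is not taken from the dataset.

```python
from dataclasses import dataclass, field
from typing import List

# Allowed values as listed in the schema table.
ALLOWED_TARGETS = {"individual", "group", "none"}
ALLOWED_SENTIMENTS = {"positive", "negative", "neutral", "mixed"}
ALLOWED_SPLITS = {"train", "validation", "test"}


@dataclass
class ChunkRecord:
    chunk_id: str
    source_url: str
    source_file: str
    chunk_index: int
    text: str
    word_count: int
    psyop_present: int
    confidence: float
    techniques: List[str] = field(default_factory=list)
    target: str = "none"
    sentiment: str = "neutral"
    notes: str = ""
    split: str = "train"

    def problems(self) -> List[str]:
        """Return schema violations; an empty list means the row looks valid."""
        issues = []
        if self.psyop_present not in (0, 1):
            issues.append("psyop_present must be 0 or 1")
        if not 0.0 <= self.confidence <= 1.0:
            issues.append("confidence must be within 0.0-1.0")
        if self.target not in ALLOWED_TARGETS:
            issues.append("unexpected target: " + self.target)
        if self.sentiment not in ALLOWED_SENTIMENTS:
            issues.append("unexpected sentiment: " + self.sentiment)
        if self.split not in ALLOWED_SPLITS:
            issues.append("unexpected split: " + self.split)
        return issues


# Invented example row, NOT an actual dataset entry.
example = ChunkRecord(
    chunk_id="example_0001",
    source_url="https://www.youtube.com/watch?v=EXAMPLE",
    source_file="example_transcript.txt",
    chunk_index=0,
    text="either you stand with us or you stand with them",
    word_count=10,
    psyop_present=1,
    confidence=0.9,
    techniques=["false_dichotomy", "us_vs_them"],
    target="group",
    sentiment="negative",
    split="train",
)
print(example.problems())
```
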
---
## Technique Label Definitions
| Label | Definition |
|---|---|
| fear_appeal | Uses fear or threat of harm to influence behavior or belief |
| false_dichotomy | Presents only two options when more exist |
| loaded_language | Emotionally charged words used to influence perception |
| repetition_conditioning | Repeats phrases or ideas to normalize them over time |
| authority_appeal | Invokes authority figures to bypass critical thinking |
| social_proof | Implies that everyone believes or does something |
| scarcity_urgency | Creates artificial time pressure or scarcity to force decisions |
| identity_targeting | Exploits group identity or sense of belonging |
| guilt_induction | Uses guilt or shame to manipulate behavior |
| love_bombing | Overwhelming praise or affection to gain compliance |
| thought_termination | Uses clichés or platitudes to shut down critical thinking |
| us_vs_them | Creates in-group vs out-group division to polarize |
| gaslighting | Causes the target to question their own perception of reality |
| information_control | Restricts, distorts, or selectively presents information |
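
For multi-label classification, the `techniques` list is typically converted into a fixed-length multi-hot vector. The sketch below does this with the 14 labels defined above. Note that the Technique Distribution table lists `victimhood_framing`, which has no definition here, while `love_bombing` is defined but has no listed occurrences. How to treat such labels is a modeling decision; this sketch simply reports them separately, which is an assumption rather than the dataset author's method. The function name `encode_techniques` is hypothetical.

```python
# The 14 labels from the Technique Label Definitions table, in table order.
TECHNIQUES = [
    "fear_appeal",
    "false_dichotomy",
    "loaded_language",
    "repetition_conditioning",
    "authority_appeal",
    "social_proof",
    "scarcity_urgency",
    "identity_targeting",
    "guilt_induction",
    "love_bombing",
    "thought_termination",
    "us_vs_them",
    "gaslighting",
    "information_control",
]
INDEX = {name: i for i, name in enumerate(TECHNIQUES)}


def encode_techniques(labels):
    """Return (multi-hot vector, labels that are outside the vocabulary)."""
    vector = [0] * len(TECHNIQUES)
    unknown = []
    for label in labels:
        if label in INDEX:
            vector[INDEX[label]] = 1
        else:
            unknown.append(label)
    return vector, unknown


vector, unknown = encode_techniques(
    ["fear_appeal", "us_vs_them", "victimhood_framing"]
)
print(vector)
print(unknown)
```
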
---
## Class Imbalance Note
This dataset reflects real-world distributions of coercive language in interview
and documentary content. The 89.6% negative-class majority is intentional and
realistic: most speech is not overtly coercive, even in content that contains
manipulative elements.

When fine-tuning models on this dataset, consider:
- Weighted loss functions (e.g. `class_weight='balanced'` in sklearn)
- Oversampling the minority class (e.g. random oversampling, or SMOTE applied to vectorized features rather than raw text)
- Adjusting classification thresholds post-training
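
To make the first and last options concrete, the sketch below computes "balanced" class weights from the published class counts and applies a custom decision threshold to predicted probabilities. The weight formula, `n_samples / (n_classes * count)`, is the heuristic scikit-learn documents for `class_weight='balanced'`. The counts cover the whole dataset, whereas in practice the weights would be computed on the train split only; the probability scores and the 0.3 threshold are hypothetical values chosen for illustration.

```python
# Class counts from the Dataset Statistics table (whole dataset).
counts = {0: 11_338, 1: 1_317}
n_samples = sum(counts.values())
n_classes = len(counts)

# "Balanced" weighting: rarer classes receive proportionally larger weights.
class_weights = {
    label: n_samples / (n_classes * count) for label, count in counts.items()
}


def predict_with_threshold(probabilities, threshold=0.5):
    """Label a chunk as coercive (1) when its score reaches the threshold."""
    return [1 if p >= threshold else 0 for p in probabilities]


# Hypothetical model outputs, for illustration only.
scores = [0.15, 0.35, 0.62]
default_preds = predict_with_threshold(scores)
lenient_preds = predict_with_threshold(scores, threshold=0.3)

print({label: round(w, 3) for label, w in class_weights.items()})
print(default_preds, lenient_preds)
```

The positive class ends up weighted about 4.8 and the negative class about 0.56. Lowering the threshold trades precision for recall, so the value should be tuned on the validation split rather than the test split.
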
---
## Annotation Methodology
Text chunks were annotated using **Claude Haiku (claude-haiku-4-5)** via the
Anthropic API, with a controlled-vocabulary system prompt that enforces a
consistent label taxonomy across all 14 technique categories. Chunk size was
fixed at approximately 30 words to capture sentence-level coercive patterns.

All source material consists of publicly available YouTube content. Transcripts
were extracted using yt-dlp, with automatic deduplication applied to remove the
caption overlap artifacts inherent in YouTube's ASR system.

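
The card does not describe the exact deduplication or chunking procedure, so the following is only a rough sketch of how such preprocessing might look: repeated consecutive caption lines are dropped, and the remaining text is cut into fixed-size word windows. Both function names and the sample caption lines are hypothetical, and the actual pipeline may differ, for example in how it handles sentence boundaries.

```python
def drop_repeated_lines(lines):
    """Remove blank lines and consecutive duplicate caption lines."""
    kept = []
    for line in lines:
        line = line.strip()
        if line and (not kept or line != kept[-1]):
            kept.append(line)
    return kept


def chunk_words(text, size=30):
    """Split text into consecutive chunks of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


# Invented caption lines that imitate rolling-caption repetition.
captions = [
    "we need to talk",
    "we need to talk",
    "about what happens next",
    "about what happens next",
    "",
]
clean = drop_repeated_lines(captions)
chunks = chunk_words(" ".join(clean), size=5)  # small size for the demo
print(clean)
print(chunks)
```

With the default `size=30`, the final chunk of a transcript may be shorter than 30 words, which would be consistent with the "approximately 30 words" description and the separate `word_count` field.
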
---
## License
[Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)](https://creativecommons.org/licenses/by-sa/4.0/)

You are free to share and adapt this dataset for any purpose, including commercial
use, provided you give appropriate credit and distribute any derivative works
under the same license.

Provider: LeTG



