redsandr/cognifi-investor-bias
Source: Hugging Face. Updated 2026-03-22; indexed 2026-03-29.
Download link: https://hf-mirror.com/datasets/redsandr/cognifi-investor-bias
---
language:
- id
- en
license: mit
task_categories:
- text-classification
task_ids:
- multi-class-classification
tags:
- behavioral-finance
- cognitive-bias
- indonesian
- retail-investors
- fomo
- loss-aversion
- confirmation-bias
- code-switching
- low-resource
- nlp
size_categories:
- 1K<n<10K
---
# CogniFi — Cognitive Bias Detection in Indonesian Retail Investor Text
Labeled dataset of authentic Indonesian retail investor community posts for cognitive bias detection. Collected from Stockbit, X.com, Reddit, and Threads between 2025 and March 2026.
**Code:** [github.com/redsandr/cognifi](https://github.com/redsandr/cognifi)
---
## Dataset summary
1,193 labeled samples of authentic Indonesian retail investor posts across four classes: three cognitive biases plus `NONE`. All samples are real user-generated content; none are synthetic. 25.7% of samples (307 of 1,193) involve intra-sentential Indonesian-English code-switching, which makes the dataset a resource for low-resource NLP in emerging-market financial contexts.
**Inter-annotator agreement:** Cohen's Kappa = 0.7815 (substantial agreement, Landis & Koch 1977), across two domain-qualified annotators (n=100 samples).
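
For readers unfamiliar with the statistic, Cohen's Kappa compares the observed agreement between two annotators with the agreement expected by chance from each annotator's label frequencies. The sketch below is a minimal illustration; the function name `cohens_kappa` and the two short label lists are made up for this example and are not the dataset's actual annotation data.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's Kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and equal in length")
    n = len(labels_a)
    # Observed agreement: fraction of items given identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the marginal label frequencies, summed over labels.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Hypothetical toy annotations (NOT the real n=100 agreement sample).
annotator_1 = ["FOMO", "FOMO", "NONE", "LOSS_AVERSION", "NONE", "CONFIRMATION_BIAS"]
annotator_2 = ["FOMO", "NONE", "NONE", "LOSS_AVERSION", "NONE", "CONFIRMATION_BIAS"]
kappa = cohens_kappa(annotator_1, annotator_2)
print(round(kappa, 4))
```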
---
## Labels
| Label | Description | Example |
|-------|-------------|---------|
| `FOMO` | Fear of Missing Out — urgency-driven entry triggered by social proof rather than fundamentals | "Semua temen gue udah profit, masih sempet ga?" |
| `LOSS_AVERSION` | Holding losing positions with rationalizations to avoid realizing losses | "Belum rugi kalau belum dijual, nunggu balik modal" |
| `CONFIRMATION_BIAS` | Seeking validation for a prior belief rather than genuine analysis | "BBRI pasti naik kan? Analis setuju sama gue" |
| `NONE` | Analytical, educational, or genuinely neutral text | "Berapa P/E ratio yang wajar untuk saham perbankan?" |
---
## Dataset statistics
### Training set (1,043 samples)
| Label | Count | % |
|-------|-------|---|
| NONE | 285 | 27.3% |
| FOMO | 280 | 26.8% |
| LOSS_AVERSION | 245 | 23.5% |
| CONFIRMATION_BIAS | 233 | 22.3% |
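
The counts above can be recomputed from the training records. The sketch below assumes records shaped like the `train.json` entries described in this card; `label_distribution` is a hypothetical helper name, and the four inline records are stand-ins built from example posts quoted in this card rather than a real slice of the file.

```python
from collections import Counter


def label_distribution(records, field="expected"):
    """Return {value: (count, percent)} for one field, most frequent first."""
    counts = Counter(record[field] for record in records)
    total = sum(counts.values())
    return {
        value: (count, round(100 * count / total, 1))
        for value, count in counts.most_common()
    }


# Stand-in records using the train.json schema (only the fields needed here).
sample = [
    {"text": "Semua temen gue udah profit, masih sempet ga?", "expected": "FOMO"},
    {"text": "GOTO mau naik nih, semua orang pada beli, masih sempet ga?", "expected": "FOMO"},
    {"text": "Belum rugi kalau belum dijual, nunggu balik modal", "expected": "LOSS_AVERSION"},
    {"text": "Berapa P/E ratio yang wajar untuk saham perbankan?", "expected": "NONE"},
]
distribution = label_distribution(sample)
for label, (count, percent) in distribution.items():
    print(f"{label}: {count} ({percent}%)")
```

With the real file, passing the list returned by `json.load` gives the label table; passing `field="source"` on the same records should give the per-platform breakdown.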
### Source platform (training set, 1,043 samples)
| Platform | Count | % |
|----------|-------|---|
| Stockbit | 882 | 84.6% |
| X.com | 85 | 8.1% |
| Reddit | 32 | 3.1% |
| Threads | 44 | 4.2% |
### Language distribution (full dataset, 1,193 samples)
| Category | n | % |
|----------|---|---|
| Monolingual Indonesian | 844 | 70.7% |
| Intra-sentential code-switching | 307 | 25.7% |
| Monolingual English | 42 | 3.5% |
Language detection uses a hybrid of FastText language identification (LID) and domain-lexicon overlap.
---
## Files
| File | Split | Samples | Labels |
| ------------ | -------- | ------- | ---------- |
| `train.json` | Training | 1,043 | Included |
| `test.json` | Held-out | 150 | Withheld |
The held-out test set labels are withheld to prevent data leakage. Evaluation results are documented in the paper (94.7% accuracy, macro F1: 94.9%).
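
Because the test labels are withheld, the reported metrics can only be reproduced on a split whose labels you hold, for example a validation subset of `train.json`. The sketch below shows how accuracy and macro F1 (the unweighted mean of per-class F1) are computed; the function name and the toy label lists are illustrative assumptions, not the paper's evaluation code.

```python
def accuracy_and_macro_f1(y_true, y_pred):
    """Return (accuracy, macro F1) for two equal-length label sequences."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and equal in length")
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    f1_scores = []
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return accuracy, sum(f1_scores) / len(f1_scores)


# Hypothetical toy labels and predictions, for illustration only.
gold = ["FOMO", "FOMO", "NONE", "LOSS_AVERSION"]
predicted = ["FOMO", "NONE", "NONE", "LOSS_AVERSION"]
acc, macro_f1 = accuracy_and_macro_f1(gold, predicted)
print(f"accuracy={acc:.3f} macro_f1={macro_f1:.3f}")
```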
---
## Data fields
### train.json
```json
{
  "text": "GOTO mau naik nih, semua orang pada beli, masih sempet ga?",
  "expected": "FOMO",
  "original_index": 0,
  "source": "Stockbit",
  "type": "real"
}
```
| Field | Type | Description |
|-------|------|-------------|
| `text` | string | Original investor post text (Indonesian, English, or mixed) |
| `expected` | string | Bias label: `FOMO`, `LOSS_AVERSION`, `CONFIRMATION_BIAS`, or `NONE` |
| `original_index` | int | Index in original collection order |
| `source` | string | Platform source: `Stockbit`, `X.com`, `Reddit`, `Threads` |
| `type` | string | Always `real` — all samples are authentic user-generated content |
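
The field table above can be turned into a quick sanity check before training. This is a minimal sketch under the assumption that every record follows the schema exactly as documented here; `validate_train_record` is a hypothetical helper, not part of the CogniFi code.

```python
ALLOWED_LABELS = {"FOMO", "LOSS_AVERSION", "CONFIRMATION_BIAS", "NONE"}
ALLOWED_SOURCES = {"Stockbit", "X.com", "Reddit", "Threads"}


def validate_train_record(record):
    """Return a list of problems found in one train.json record (empty if valid)."""
    problems = []
    text = record.get("text")
    if not isinstance(text, str) or not text.strip():
        problems.append("text must be a non-empty string")
    if record.get("expected") not in ALLOWED_LABELS:
        problems.append(f"unexpected label: {record.get('expected')!r}")
    if not isinstance(record.get("original_index"), int):
        problems.append("original_index must be an int")
    if record.get("source") not in ALLOWED_SOURCES:
        problems.append(f"unexpected source: {record.get('source')!r}")
    if record.get("type") != "real":
        problems.append("type must be 'real'")
    return problems


# The example record shown in this card.
example = {
    "text": "GOTO mau naik nih, semua orang pada beli, masih sempet ga?",
    "expected": "FOMO",
    "original_index": 0,
    "source": "Stockbit",
    "type": "real",
}
print(validate_train_record(example))  # prints []
```

Running the check over every record and collecting the non-empty results is a cheap way to catch a truncated download or an edited file.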
### test.json
```json
{
  "no": 1,
  "original_index": 500,
  "text": "saudaranya ARTO udah pada naik, tinggal ini yang belum, semoga besok terbang"
}
```
Labels are withheld; each record contains only `no`, `original_index`, and `text`.
---
## Load dataset
```python
from datasets import load_dataset
ds = load_dataset("redsandr/cognifi-investor-bias")
print(ds["train"][0])
```
```python
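# Load the training set only, from a local copy of train.json
import json

with open("train.json", encoding="utf-8") as f:
    data = json.load(f)  # list of records with the fields described above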
```
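
Since the test labels are withheld, a local validation split has to come from the training records. The sketch below shows one way to hold out a stratified fraction per label; the 20% fraction, the seed, the helper name `stratified_split`, and the placeholder records are all assumptions for illustration and are not part of the dataset or its paper.

```python
import random
from collections import defaultdict


def stratified_split(records, val_fraction=0.2, seed=0, label_field="expected"):
    """Split records into (train, validation), keeping label proportions similar."""
    by_label = defaultdict(list)
    for record in records:
        by_label[record[label_field]].append(record)
    rng = random.Random(seed)  # local generator so the split is reproducible
    train, val = [], []
    for label in sorted(by_label):
        group = by_label[label][:]
        rng.shuffle(group)
        n_val = round(len(group) * val_fraction)
        val.extend(group[:n_val])
        train.extend(group[n_val:])
    return train, val


# Placeholder records (NOT real dataset content); with the real file, pass `data` instead.
placeholder = [
    {"text": f"placeholder post {i}", "expected": label}
    for label in ("FOMO", "NONE")
    for i in range(10)
]
train_part, val_part = stratified_split(placeholder)
print(len(train_part), len(val_part))
```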
---
## Key finding
A rule-based classifier (CogniFi) trained on this dataset outperforms Gemini 3 Flash zero-shot with medium reasoning by **18.3 percentage points** (94.7% vs 76.4%). The gap is largest on code-switching samples, where Gemini accuracy drops to 64.1% while CogniFi maintains 91.6%.
The failure is distributional, not architectural: terms like `masih sempet ga` (urgency framing) and `serok bareng` (coordinated accumulation) are absent from any general-purpose training corpus.
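
To make the idea of a lexicon-driven, rule-based classifier concrete, here is a deliberately tiny sketch. It is not CogniFi's actual rule set (see the linked repository for that). Only `masih sempet ga` and `serok bareng` are named in this card; assigning `serok bareng` to `FOMO` is an assumption, and the remaining phrases are lifted from the example posts in the Labels table.

```python
# Hypothetical phrase lists for illustration only; NOT the rules used by CogniFi.
BIAS_PHRASES = {
    "FOMO": ["masih sempet ga", "serok bareng"],
    "LOSS_AVERSION": ["belum rugi kalau belum dijual", "balik modal"],
    "CONFIRMATION_BIAS": ["pasti naik kan"],
}


def classify(text):
    """Return the label whose phrases match most often, or NONE if nothing matches."""
    lowered = text.lower()
    scores = {
        label: sum(phrase in lowered for phrase in phrases)
        for label, phrases in BIAS_PHRASES.items()
    }
    best_label, best_score = max(scores.items(), key=lambda item: item[1])
    return best_label if best_score > 0 else "NONE"


print(classify("GOTO mau naik nih, semua orang pada beli, masih sempet ga?"))
print(classify("Berapa P/E ratio yang wajar untuk saham perbankan?"))
```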
---
## Intended use
- Cognitive bias detection in financial NLP
- Low-resource code-switching text classification
- Behavioral finance AI research
- Emerging market NLP benchmarking
## Misuse warning
The same lexical patterns used for bias detection could theoretically be inverted to elicit bias. This dataset should not be used to build systems designed to manipulate investor decision-making.
---
## Acknowledgments
Inter-annotator agreement annotation was carried out by Nuril Aidil Alfajri (S.E., 5 years of investment experience) and Dawil Abrar (6 years of investment experience). The dataset was collected from Indonesian retail investor communities on Stockbit, X.com, Reddit, and Threads.
Provided by: redsandr



