danghaidang-passau/HateOWS-dataset-LREC2026
Source: Hugging Face. Updated 2026-03-18; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/danghaidang-passau/HateOWS-dataset-LREC2026
Resource description:
---
language:
- en
- de
- es
- vi
license: other
multilinguality: multilingual
pretty_name: OWS LREC2026 Raw + Ensemble Annotations
task_categories:
- text-classification
size_categories:
- 1M<n<10M
configs:
- config_name: raw
  data_files:
  - split: 4L
    path: raw/ows4L.parquet
  - split: deu
    path: raw/deu.parquet
  - split: eng
    path: raw/eng.parquet
  - split: spa
    path: raw/spa.parquet
- config_name: 46k_qwen
  data_files:
  - split: train
    path: annotatedOWS/46k_qwen.parquet
- config_name: annotatedOWS
  data_files:
  - split: train
    path: annotatedOWS/annotatedOWS.parquet
- config_name: LightGBM_dataset
  data_files:
  - split: train
    path: LightGBM_dataset/lgb_df.parquet
- config_name: 16-val-train
  data_files:
  - split: train
    path: human-train-val/train.parquet
  - split: test
    path: human-train-val/test.parquet
---
# OWS Data for LREC 2026
This repository contains the data resources used in the paper:
**Toward Generalized Cross-Lingual Hateful Language Detection with Web-Scale Data and Ensemble LLM Annotations** (LREC-COLING 2026).
It provides:
- multilingual OpenWebSearch (OWS) raw corpora (DEU, ENG, SPA, VIE)
- an annotated subset with four base LLM annotators + three ensemble labeling strategies
- a LightGBM-derived dataset (training probabilities / features)
- the 16 human-labelled datasets used for validation (train / val counts below)
---
## Splits
- Raw (config `raw`)
  - `4L`: 4-language OWS corpus (DEU, ENG, SPA, VIE)
  - `deu`: German unlabeled subset
  - `eng`: English unlabeled subset
  - `spa`: Spanish unlabeled subset
- Annotated
  - `annotatedOWS` (shortened to `annotated` below): subset with model probabilities and ensemble labels
- LightGBM
  - `LightGBM_dataset`: LightGBM training probabilities / features (7-dataset LGB training)
- Human 16
  - `16-val-train`: the 16 human-labelled datasets, exposed as `train` and `test` splits (counts in the table below)
---
## How to download splits
1) Load raw splits:
```python
from datasets import load_dataset
repo = "danghaidang-passau/HateOWS-dataset-LREC2026"
ds_raw = load_dataset(repo, "raw")
# access splits:
ds_4l = ds_raw["4L"]
ds_deu = ds_raw["deu"]
ds_eng = ds_raw["eng"]
ds_spa = ds_raw["spa"]
```
2) Load annotated splits:
```python
ds_annotated = load_dataset(repo, "annotatedOWS")
ds_annotated = ds_annotated["train"]
```
3) Load 16 human train/test splits:
```python
ds_16 = load_dataset(repo, "16-val-train")
ds_train = ds_16["train"]
ds_test = ds_16["test"]
```
4) Load LightGBM annotated splits:
```python
ds_lgb = load_dataset(repo, "LightGBM_dataset")
ds_lgb = ds_lgb["train"]
```
## Language stats (deu, eng, spa)
| split | language | rows | token_len_sum |
|---|---|---|---|
| deu | deu | 641830 | 51995390 |
| eng | eng | 1598372 | 119077423 |
| spa | spa | 1085275 | 110562498 |
## Language stats 4L (deu_eng_spa_vie)
| split | language | rows | token_len_sum |
|---|---|---|---|
| 4L | deu | 900000 | 47262107 |
| 4L | eng | 1200000 | 55352705 |
| 4L | spa | 500000 | 24724569 |
| 4L | vie | 186912 | 9273567 |
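
As a quick sanity check on the two tables above, the average length per row can be obtained by dividing `token_len_sum` by `rows`. The following is a minimal sketch with the table values copied in by hand; the names `STATS` and `mean_token_len` are illustrative and not part of the dataset.

```python
# Values copied from the two language-statistics tables above.
# Key: (split, language); value: (rows, token_len_sum).
STATS = {
    ("deu", "deu"): (641830, 51995390),
    ("eng", "eng"): (1598372, 119077423),
    ("spa", "spa"): (1085275, 110562498),
    ("4L", "deu"): (900000, 47262107),
    ("4L", "eng"): (1200000, 55352705),
    ("4L", "spa"): (500000, 24724569),
    ("4L", "vie"): (186912, 9273567),
}


def mean_token_len(rows: int, token_len_sum: int) -> float:
    """Average token_len per row for one (split, language) pair."""
    return token_len_sum / rows


for (split, language), (rows, tokens) in STATS.items():
    print(f"{split:>3} {language}: {mean_token_len(rows, tokens):.2f} tokens per row")

# Total number of rows in the 4L split, summed over its four languages.
total_4l_rows = sum(
    rows for (split, _), (rows, _) in STATS.items() if split == "4L"
)
print("4L total rows:", total_4l_rows)
```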
## Hate counts on `annotated`
The table below reports hate counts for the four base annotator models and the three ensemble methods.
For each base model, a row counts as hate when `prob_1 >= prob_2`, following the class ordering of the original pipeline; a tie therefore counts as hate.
| model | hate_count | total_rows | hate_ratio_pct |
|---|---|---|---|
| Qwen2.5-14B | 5823 | 240647 | 2.42 |
| Gemma2-9B | 99200 | 240647 | 41.22 |
| Llama3.1-8B | 534 | 240647 | 0.22 |
| Mistral-7B | 5012 | 240647 | 2.08 |
| LightGBM Ensemble | 3122 | 240647 | 1.30 |
| Mean Ensemble | 3987 | 240647 | 1.66 |
| Vote Ensemble | 4707 | 240647 | 1.96 |
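
The counting rule for base models can be sketched as follows. This is an illustration on hand-made toy rows, not the original pipeline: the helper names `is_hate` and `hate_stats` are hypothetical, and the rows are assumed to be plain dictionaries keyed by the probability columns of the annotated subset (for example `qwen_prob_1` and `qwen_prob_2`).

```python
from typing import Dict, Iterable

# Column prefixes of the four base annotators in the annotated subset.
BASE_MODELS = ["qwen", "gemma", "llama", "mistral"]


def is_hate(row: Dict[str, float], model: str) -> bool:
    """Apply the rule stated above: hate when prob_1 >= prob_2 (ties count as hate)."""
    return row[f"{model}_prob_1"] >= row[f"{model}_prob_2"]


def hate_stats(rows: Iterable[Dict[str, float]], model: str) -> Dict[str, float]:
    """Return hate_count, total_rows and hate_ratio_pct for one base model."""
    rows = list(rows)
    total = len(rows)
    hate_count = sum(is_hate(row, model) for row in rows)
    ratio = round(100 * hate_count / total, 2) if total else 0.0
    return {"hate_count": hate_count, "total_rows": total, "hate_ratio_pct": ratio}


# Toy rows (made up for illustration; not taken from the dataset).
toy_rows = [
    {"qwen_prob_1": 0.9, "qwen_prob_2": 0.1},  # hate
    {"qwen_prob_1": 0.2, "qwen_prob_2": 0.8},  # not hate
    {"qwen_prob_1": 0.5, "qwen_prob_2": 0.5},  # tie, counted as hate
    {"qwen_prob_1": 0.1, "qwen_prob_2": 0.9},  # not hate
]

print(hate_stats(toy_rows, "qwen"))
```

Because each row returned by `datasets` is a dictionary, the same helpers should apply to rows iterated from the `train` split of `annotatedOWS`, assuming the columns are stored as plain floats.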
## Annotation columns (short names)
- Base models: `qwen_prob_1`, `qwen_prob_2`, `gemma_prob_1`, `gemma_prob_2`, `llama_prob_1`, `llama_prob_2`, `mistral_prob_1`, `mistral_prob_2`
- Ensemble probs: `mean_prob_1`, `mean_prob_2`, `lgb_prob_1`, `lgb_prob_2`
- Ensemble labels: `mean_label`, `lgb_label`, `vote_label`
- Shared metadata: `text`, `language`, `token_len`
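
The card does not spell out how the mean and vote ensembles are computed. One plausible reading, shown here purely as an assumption, is that the mean ensemble averages the four base probabilities per class and the vote ensemble takes a majority over the four base decisions. The function names, the boolean return values, and the tie-breaking choice are all hypothetical; the stored `mean_label` and `vote_label` columns may use a different encoding or rule.

```python
from typing import Dict

# Column prefixes of the four base annotators listed above.
BASE_MODELS = ["qwen", "gemma", "llama", "mistral"]


def mean_ensemble(row: Dict[str, float]) -> Dict[str, object]:
    """Assumed mean ensemble: average each class probability over the base models."""
    n = len(BASE_MODELS)
    p1 = sum(row[f"{m}_prob_1"] for m in BASE_MODELS) / n
    p2 = sum(row[f"{m}_prob_2"] for m in BASE_MODELS) / n
    # Reuse the base-model rule (prob_1 >= prob_2) on the averaged probabilities.
    return {"mean_prob_1": p1, "mean_prob_2": p2, "is_hate": p1 >= p2}


def vote_ensemble(row: Dict[str, float]) -> bool:
    """Assumed vote ensemble: strict majority of the base-model decisions."""
    votes = sum(row[f"{m}_prob_1"] >= row[f"{m}_prob_2"] for m in BASE_MODELS)
    # Assumption: a 2-2 tie among the four models is treated as not hate.
    return votes > len(BASE_MODELS) / 2


# Toy row (made up for illustration; not taken from the dataset).
example = {
    "qwen_prob_1": 0.9, "qwen_prob_2": 0.1,
    "gemma_prob_1": 0.8, "gemma_prob_2": 0.2,
    "llama_prob_1": 0.3, "llama_prob_2": 0.7,
    "mistral_prob_1": 0.6, "mistral_prob_2": 0.4,
}

print(mean_ensemble(example))
print(vote_ensemble(example))
```

The LightGBM ensemble is not sketched here, since it depends on a trained model rather than a closed-form rule.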
## 16 human datasets — train / validation counts (with reference links)
The table below combines train and validation row counts with LightGBM usage; ✓ marks a dataset used for LightGBM training.
| Dataset | Language | Train rows | Val rows | LightGBM | Reference |
|---|---:|---:|---:|:---:|---|
| AHSD | eng | 21,783 | 3,000 | | https://ojs.aaai.org/index.php/ICWSM/article/view/14955 |
| HateXplain | eng | 15,299 | 3,846 | ✓ | https://ojs.aaai.org/index.php/AAAI/article/view/17745 |
| AbusEval | eng | 13,240 | 860 | | https://aclanthology.org/2020.lrec-1.760/ |
| Sexism | eng | 10,904 | 2,632 | ✓ | https://ojs.aaai.org/index.php/ICWSM/article/view/18085 |
| GermEval19 | deu | 9,698 | 2,507 | ✓ | https://www.zora.uzh.ch/server/api/core/bitstreams/2b6a9186-fb29-48fc-a9c2-e29cafd1949d/content |
| HateEval-eng | eng | 9,000 | 1,000 | | https://aclanthology.org/S19-2007/ |
| Gahd | deu | 8,797 | 2,198 | | https://aclanthology.org/2024.naacl-long.248/ |
| ViHSD | vie | 8,061 | 2,672 | ✓ | https://link.springer.com/chapter/10.1007/978-3-030-79457-6_35 |
| Chileno | spa | 7,572 | 1,928 | | https://aclanthology.org/2022.woah-1.12/ |
| HateEval-spa | spa | 5,309 | 1,286 | | https://aclanthology.org/S19-2007/ |
| GermEval18 | deu | 5,009 | 3,532 | | https://heidata.uni-heidelberg.de/dataset.xhtml?persistentId=doi:10.11588/DATA/0B5VML/ |
| Haternet | spa | 4,794 | 1,205 | | https://aclanthology.org/S19-2007/ |
| HASOC | deu | 2,373 | 526 | | https://dl.acm.org/doi/10.1145/3368567.3368584 |
| GermEval21 | deu | 2,071 | 2,085 | ✓ | https://aclanthology.org/2021.germeval-1.1/ |
| US_election | eng | 1,283 | 1,117 | ✓ | https://aclanthology.org/2021.wassa-1.18/ |
| Covid | eng | 1,282 | 971 | ✓ | https://dl.acm.org/doi/10.1145/3487351.3488324 |
Provided by:
danghaidang-passau



