danghaidang-passau/HateOWS-dataset-LREC2026

Name: danghaidang-passau/HateOWS-dataset-LREC2026
Creator: danghaidang-passau
Published: 2026-03-18 11:43:22
License: no description provided

Hugging Face · Updated 2026-03-18 · Indexed 2026-03-29

Download link:

https://hf-mirror.com/datasets/danghaidang-passau/HateOWS-dataset-LREC2026

Description:

---
language:
- en
- de
- es
- vi
license: other
multilinguality: multilingual
pretty_name: OWS LREC2026 Raw + Ensemble Annotations
task_categories:
- text-classification
size_categories:
- 1M<n<10M
configs:
- config_name: raw
  data_files:
  - split: 4L
    path: raw/ows4L.parquet
  - split: deu
    path: raw/deu.parquet
  - split: eng
    path: raw/eng.parquet
  - split: spa
    path: raw/spa.parquet
- config_name: 46k_qwen
  data_files:
  - split: train
    path: annotatedOWS/46k_qwen.parquet
- config_name: annotatedOWS
  data_files:
  - split: train
    path: annotatedOWS/annotatedOWS.parquet
- config_name: LightGBM_dataset
  data_files:
  - split: train
    path: LightGBM_dataset/lgb_df.parquet
- config_name: 16-val-train
  data_files:
  - split: train
    path: human-train-val/train.parquet
  - split: test
    path: human-train-val/test.parquet
---

# OWS Data for LREC 2026

This repository contains the data resources used in the paper:
**Toward Generalized Cross-Lingual Hateful Language Detection with Web-Scale Data and Ensemble LLM Annotations** (LREC-COLING 2026).

It provides:
- multilingual OpenWebSearch (OWS) raw corpora (DEU, ENG, SPA, VIE)
- an annotated subset with four base LLM annotators + three ensemble labeling strategies
- a LightGBM-derived dataset (training probabilities / features)
- the 16 human-labelled datasets used for validation (train / val counts below)

---

## Splits

- Raw
  - `4L`: 4-language OWS corpus (DEU, ENG, SPA, VIE)
  - `deu`: German unlabeled subset
  - `eng`: English unlabeled subset
  - `spa`: Spanish unlabeled subset
- Annotated
  - `annotated` (config `annotatedOWS`): subset with model probabilities and ensemble labels
- LightGBM
  - `LightGBM_dataset`: LightGBM training probabilities / features (7-dataset LGB training)
- Human 16
  - `16-val-train`: 16 datasets used for train and validation splits (tables below)

---

## How to download splits

1) Load raw splits:

```python
from datasets import load_dataset

repo = "danghaidang-passau/HateOWS-dataset-LREC2026"
ds_raw = load_dataset(repo, "raw")

# access splits:
ds_4l = ds_raw["4L"]
ds_deu = ds_raw["deu"]
ds_eng = ds_raw["eng"]
ds_spa = ds_raw["spa"]
```

2) Load annotated splits:

```python
ds_annotated = load_dataset(repo, "annotatedOWS")
ds_annotated = ds_annotated["train"]
```

3) Load 16 human train/test splits:

```python
ds_16 = load_dataset(repo, "16-val-train")
ds_train = ds_16["train"]
ds_test = ds_16["test"]
```

4) Load LightGBM annotated splits:

```python
ds_lgb = load_dataset(repo, "LightGBM_dataset")
ds_lgb = ds_lgb["train"]
```

## Language stats (deu, eng, spa)

| split | language | rows | token_len_sum |
|---|---|---|---|
| deu | deu | 641830 | 51995390 |
| eng | eng | 1598372 | 119077423 |
| spa | spa | 1085275 | 110562498 |

## Language stats 4L (deu_eng_spa_vie)

| split | language | rows | token_len_sum |
|---|---|---|---|
| 4L | deu | 900000 | 47262107 |
| 4L | eng | 1200000 | 55352705 |
| 4L | spa | 500000 | 24724569 |
| 4L | vie | 186912 | 9273567 |

## Hate counts on `annotated`

The table below reports hate counts for four base annotator models and three ensemble methods.
For base models, hate is derived as `prob_1 >= prob_2` (same class ordering as the original pipeline).

| model | hate_count | total_rows | hate_ratio_pct |
|---|---|---|---|
| Qwen2.5-14B | 5823 | 240647 | 2.42 |
| Gemma2-9B | 99200 | 240647 | 41.22 |
| Llama3.1-8B | 534 | 240647 | 0.22 |
| Mistral-7B | 5012 | 240647 | 2.08 |
| LightGBM Ensemble | 3122 | 240647 | 1.3 |
| Mean Ensemble | 3987 | 240647 | 1.66 |
| Vote Ensemble | 4707 | 240647 | 1.96 |

## Annotation columns (short names)

- Base models: `qwen_prob_1`, `qwen_prob_2`, `gemma_prob_1`, `gemma_prob_2`, `llama_prob_1`, `llama_prob_2`, `mistral_prob_1`, `mistral_prob_2`
- Ensemble probs: `mean_prob_1`, `mean_prob_2`, `lgb_prob_1`, `lgb_prob_2`
- Ensemble labels: `mean_label`, `lgb_label`, `vote_label`
- Shared metadata: `text`, `language`, `token_len`

## 16 human datasets: train / validation counts (with reference links)

Combined train / validation table and LightGBM usage (✓ indicates dataset used for LightGBM training)

| Dataset | Language | Train rows | Val rows | LightGBM | Reference |
|---|---:|---:|---:|:---:|---|
| AHSD | eng | 21,783 | 3,000 | | https://ojs.aaai.org/index.php/ICWSM/article/view/14955 |
| HateXplain | eng | 15,299 | 3,846 | ✓ | https://ojs.aaai.org/index.php/AAAI/article/view/17745 |
| AbusEval | eng | 13,240 | 860 | | https://aclanthology.org/2020.lrec-1.760/ |
| Sexism | eng | 10,904 | 2,632 | ✓ | https://ojs.aaai.org/index.php/ICWSM/article/view/18085 |
| GermEval19 | deu | 9,698 | 2,507 | ✓ | https://www.zora.uzh.ch/server/api/core/bitstreams/2b6a9186-fb29-48fc-a9c2-e29cafd1949d/content |
| HateEval-eng | eng | 9,000 | 1,000 | | https://aclanthology.org/S19-2007/ |
| Gahd | deu | 8,797 | 2,198 | | https://aclanthology.org/2024.naacl-long.248/ |
| ViHSD | vie | 8,061 | 2,672 | ✓ | https://link.springer.com/chapter/10.1007/978-3-030-79457-6_35 |
| Chileno | spa | 7,572 | 1,928 | | https://aclanthology.org/2022.woah-1.12/ |
| HateEval-spa | spa | 5,309 | 1,286 | | https://aclanthology.org/S19-2007/ |
| GermEval18 | deu | 5,009 | 3,532 | | https://heidata.uni-heidelberg.de/dataset.xhtml?persistentId=doi:10.11588/DATA/0B5VML/ |
| Haternet | spa | 4,794 | 1,205 | | https://aclanthology.org/S19-2007/ |
| HASOC | deu | 2,373 | 526 | | https://dl.acm.org/doi/10.1145/3368567.3368584 |
| GermEval21 | deu | 2,071 | 2,085 | ✓ | https://aclanthology.org/2021.germeval-1.1/ |
| US_election | eng | 1,283 | 1,117 | ✓ | https://aclanthology.org/2021.wassa-1.18/ |
| Covid | eng | 1,282 | 971 | ✓ | https://dl.acm.org/doi/10.1145/3487351.3488324 |
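
## Example: deriving base-model hate counts (sketch)

The hate-count table above states that, for base models, a row counts as hate when `prob_1 >= prob_2`. The snippet below is a minimal sketch of that rule in plain Python, not the authors' pipeline code. The helper names `is_hate` and `hate_stats` and the toy rows are hypothetical. The column names follow the "Annotation columns" list, and the assumption that each row behaves like a mapping from column name to value is mine.

```python
from typing import Iterable, Mapping, Tuple


def is_hate(row: Mapping[str, float], model: str) -> bool:
    """Apply the stated rule: a row is hate when prob_1 >= prob_2."""
    return row[f"{model}_prob_1"] >= row[f"{model}_prob_2"]


def hate_stats(rows: Iterable[Mapping[str, float]], model: str) -> Tuple[int, int, float]:
    """Return (hate_count, total_rows, hate_ratio_pct) for one base model."""
    hate_count = 0
    total = 0
    for row in rows:
        total += 1
        if is_hate(row, model):
            hate_count += 1
    ratio_pct = round(100.0 * hate_count / total, 2) if total else 0.0
    return hate_count, total, ratio_pct


# Toy rows (made up for illustration, not taken from the dataset).
toy_rows = [
    {"qwen_prob_1": 0.9, "qwen_prob_2": 0.1},
    {"qwen_prob_1": 0.2, "qwen_prob_2": 0.8},
    {"qwen_prob_1": 0.5, "qwen_prob_2": 0.5},  # ties count as hate under >=
    {"qwen_prob_1": 0.1, "qwen_prob_2": 0.9},
]

print(hate_stats(toy_rows, "qwen"))  # (2, 4, 50.0)
```

If the column names match the list above, the same function should work on the loaded annotated split, for example `hate_stats(ds_annotated, "qwen")`, since iterating a `datasets.Dataset` yields one dict per row. The ratio column in the table is consistent with this arithmetic: `100 * 5823 / 240647` rounds to 2.42.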

Provided by:

danghaidang-passau
