
turkish-nlp-suite/TurkishHateMap

Hugging Face | Updated 2024-11-01 | Indexed 2025-04-12
Download link:
https://hf-mirror.com/datasets/turkish-nlp-suite/TurkishHateMap
Resource summary:
---
annotations_creators:
- Duygu Altinok
language:
- tr
license:
- cc-by-sa-4.0
multilinguality:
- monolingual
size_categories:
- 100K<n<1M
source_datasets:
- original
task_categories:
- text-classification
task_ids:
- sentiment-classification
pretty_name: TurkishHateMap (Hate Map of Türkiye)
config_names:
- animals
- cities
- ethnicity
- lgbt
- misogyny
- occupations
- politics
- political-orientation
- refugees
- religion
- sects
- veganism
tags:
- sentiment
dataset_info:
- config_name: animals
  features:
  - {name: baslik, dtype: string}
  - {name: text, dtype: string}
  - name: label
    dtype:
      class_label:
        names: {0: offensive, 1: hate, 2: neutral, 3: civilized}
  splits:
  - {name: train, num_bytes: 1346938, num_examples: 996}
  - {name: validation, num_bytes: 133450, num_examples: 113}
  - {name: test, num_bytes: 176992, num_examples: 115}
- config_name: cities
  features:
  - {name: baslik, dtype: string}
  - {name: text, dtype: string}
  - name: label
    dtype:
      class_label:
        names: {0: offensive, 1: hate, 2: neutral, 3: civilized}
  splits:
  - {name: test, num_bytes: 118338, num_examples: 121}
  - {name: train, num_bytes: 979370, num_examples: 1042}
  - {name: validation, num_bytes: 100464, num_examples: 103}
- config_name: ethnicity
  features:
  - {name: baslik, dtype: string}
  - {name: text, dtype: string}
  - name: label
    dtype:
      class_label:
        names: {0: offensive, 1: hate, 2: neutral, 3: civilized}
  splits:
  - {name: test, num_bytes: 417886, num_examples: 456}
  - {name: train, num_bytes: 3765287, num_examples: 3519}
  - {name: validation, num_bytes: 375519, num_examples: 432}
- config_name: lgbt
  features:
  - {name: baslik, dtype: string}
  - {name: text, dtype: string}
  - name: label
    dtype:
      class_label:
        names: {0: offensive, 1: hate, 2: neutral, 3: civilized}
  splits:
  - {name: test, num_bytes: 120004, num_examples: 114}
  - {name: train, num_bytes: 1105912, num_examples: 949}
  - {name: validation, num_bytes: 125561, num_examples: 105}
- config_name: misogyny
  features:
  - {name: baslik, dtype: string}
  - {name: text, dtype: string}
  - name: label
    dtype:
      class_label:
        names: {0: offensive, 1: hate, 2: neutral, 3: civilized}
  splits:
  - {name: test, num_bytes: 1688127, num_examples: 1960}
  - {name: train, num_bytes: 15222910, num_examples: 16136}
  - {name: validation, num_bytes: 1787328, num_examples: 1902}
- config_name: occupations
  features:
  - {name: baslik, dtype: string}
  - {name: text, dtype: string}
  - name: label
    dtype:
      class_label:
        names: {0: offensive, 1: hate, 2: neutral, 3: civilized}
  splits:
  - {name: test, num_bytes: 90756, num_examples: 81}
  - {name: train, num_bytes: 785293, num_examples: 712}
  - {name: validation, num_bytes: 82215, num_examples: 79}
- config_name: politics
  features:
  - {name: baslik, dtype: string}
  - {name: text, dtype: string}
  - name: label
    dtype:
      class_label:
        names: {0: offensive, 1: hate, 2: neutral, 3: civilized}
  splits:
  - {name: test, num_bytes: 1247514, num_examples: 1182}
  - {name: train, num_bytes: 11384519, num_examples: 10249}
  - {name: validation, num_bytes: 1285706, num_examples: 1228}
- config_name: political-orientation
  features:
  - {name: baslik, dtype: string}
  - {name: text, dtype: string}
  - name: label
    dtype:
      class_label:
        names: {0: offensive, 1: hate, 2: neutral, 3: civilized}
  splits:
  - {name: test, num_bytes: 300627, num_examples: 305}
  - {name: train, num_bytes: 2949075, num_examples: 2772}
  - {name: validation, num_bytes: 343262, num_examples: 342}
- config_name: refugees
  features:
  - {name: baslik, dtype: string}
  - {name: text, dtype: string}
  - name: label
    dtype:
      class_label:
        names: {0: offensive, 1: hate, 2: neutral, 3: civilized}
  splits:
  - {name: test, num_bytes: 249765, num_examples: 203}
  - {name: train, num_bytes: 2012525, num_examples: 1688}
  - {name: validation, num_bytes: 245659, num_examples: 220}
- config_name: religion
  features:
  - {name: baslik, dtype: string}
  - {name: text, dtype: string}
  - name: label
    dtype:
      class_label:
        names: {0: offensive, 1: hate, 2: neutral, 3: civilized}
  splits:
  - {name: test, num_bytes: 322690, num_examples: 197}
  - {name: train, num_bytes: 2439952, num_examples: 1734}
  - {name: validation, num_bytes: 341733, num_examples: 213}
- config_name: sects
  features:
  - {name: baslik, dtype: string}
  - {name: text, dtype: string}
  - name: label
    dtype:
      class_label:
        names: {0: offensive, 1: hate, 2: neutral, 3: civilized}
  splits:
  - {name: test, num_bytes: 153362, num_examples: 145}
  - {name: train, num_bytes: 1423249, num_examples: 1278}
  - {name: validation, num_bytes: 132721, num_examples: 148}
- config_name: veganism
  features:
  - {name: baslik, dtype: string}
  - {name: text, dtype: string}
  - name: label
    dtype:
      class_label:
        names: {0: offensive, 1: hate, 2: neutral, 3: civilized}
  splits:
  - {name: test, num_bytes: 191878, num_examples: 121}
  - {name: train, num_bytes: 1772770, num_examples: 1100}
  - {name: validation, num_bytes: 200298, num_examples: 115}
configs:
- config_name: animals
  data_files:
  - {split: train, path: animals/train*}
  - {split: validation, path: animals/valid*}
  - {split: test, path: animals/test*}
- config_name: cities
  data_files:
  - {split: train, path: cities/train*}
  - {split: validation, path: cities/valid*}
  - {split: test, path: cities/test*}
- config_name: ethnicity
  data_files:
  - {split: train, path: ethnicity/train*}
  - {split: validation, path: ethnicity/valid*}
  - {split: test, path: ethnicity/test*}
- config_name: lgbt
  data_files:
  - {split: train, path: lgbt/train*}
  - {split: validation, path: lgbt/valid*}
  - {split: test, path: lgbt/test*}
- config_name: misogyny
  data_files:
  - {split: train, path: misogyny/train*}
  - {split: validation, path: misogyny/valid*}
  - {split: test, path: misogyny/test*}
- config_name: occupations
  data_files:
  - {split: train, path: occupations/train*}
  - {split: validation, path: occupations/valid*}
  - {split: test, path: occupations/test*}
- config_name: politics
  data_files:
  - {split: train, path: politics/train*}
  - {split: validation, path: politics/valid*}
  - {split: test, path: politics/test*}
- config_name: political-orientation
  data_files:
  - {split: train, path: political-orientation/train*}
  - {split: validation, path: political-orientation/valid*}
  - {split: test, path: political-orientation/test*}
- config_name: refugees
  data_files:
  - {split: train, path: refugees/train*}
  - {split: validation, path: refugees/valid*}
  - {split: test, path: refugees/test*}
- config_name: religion
  data_files:
  - {split: train, path: religion/train*}
  - {split: validation, path: religion/valid*}
  - {split: test, path: religion/test*}
- config_name: sects
  data_files:
  - {split: train, path: sects/train*}
  - {split: validation, path: sects/valid*}
  - {split: test, path: sects/test*}
- config_name: veganism
  data_files:
  - {split: train, path: veganism/train*}
  - {split: validation, path: veganism/valid*}
  - {split: test, path: veganism/test*}
---

# Turkish Hate Map - A Large Scale and Diverse Hate Speech Dataset for Turkish

<img src="https://raw.githubusercontent.com/turkish-nlp-suite/.github/main/profile/tuhamalogo.png" width="50%" height="50%">

## Dataset Summary

Turkish Hate Map (TuHaMa for short) is a large-scale Turkish hate speech dataset that covers diverse target groups such as misogyny, political animosity, animal aversion, vegan antipathy, ethnic group hostility, and more. The dataset contains a total of 52K instances across 13 target groups. It uses 4 labels: **offensive**, **hate**, **neutral** and **civilized**.

Here is the distribution of target groups:

| Target group | Size |
|---|---|
| Animals | 1.2K |
| Cities | 1.2K |
| Ethnic groups | 4.4K |
| LGBT | 1.1K |
| Misogyny | 19.9K |
| Occupations | 0.8K |
| Politics | 12.6K |
| Political orientation | 3.4K |
| Refugees | 2.1K |
| Religion | 2.1K |
| Sects | 1.5K |
| Veganism | 1.3K |
| Total | 52K |

All text was scraped from Eksisozluk.com in a targeted manner and then sampled. The annotations were done by the data company [Co-one](https://www.co-one.co/). For more details, please refer to the [research paper]().

## Dataset Instances

An instance looks like:

```
{
  "baslik": "soyleyecek-cok-seyi-oldugu-halde-susan-kadin",
  "text": "her susuşunda anlatmak istediği şeyi içine atan kadındır, zamanla hissettiği her şeyi tüketir. aynı zamanda çok cookdur kendisi.",
  "label": 2
}
```

## Data Split

| name |train|validation|test|
|---------|----:|---:|---:|
|Turkish Hate Map|42175|5000|5000|

## Benchmarking

This dataset is part of the [SentiTurca](https://huggingface.co/datasets/turkish-nlp-suite/SentiTurca) benchmark. In the benchmark, the subset name is **hate**, named according to the GLUE tasks. Model benchmarking information can be found in the SentiTurca HF repo, and benchmarking scripts can be found in the [SentiTurca Github repo](https://github.com/turkish-nlp-suite/SentiTurca).

For this dataset we benchmarked a transformer-based model, BERTurk, and a handful of LLMs. The results for each model are as follows:

| Model | acc./F1 |
|---|---|
| Gemini 1.0 Pro | 0.33/0.29 |
| GPT-4 Turbo | 0.38/0.32 |
| Claude 3 Sonnet | 0.16/0.29 |
| Llama 3 70B | 0.55/0.35 |
| Qwen2-72B | 0.70/0.35 |
| BERTurk | 0.61/0.58 |

For a critique of the results, misclassified instances and more, please consult the [research paper]().

## Citation

Coming soon!!
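
## Loading Sketch

The config names and label ids declared in the card's metadata suggest how a single target group could be loaded and its labels decoded. The following is a minimal sketch, not the authors' code: `load_config` and `label_distribution` are hypothetical helper names, and the sketch assumes the Hugging Face `datasets` package is installed and that the repository id `turkish-nlp-suite/TurkishHateMap` resolves on the Hub. The part that runs at import time uses only the example instance shown in the card, so nothing is downloaded unless `load_config` is called.

```python
from collections import Counter

# Label ids and names as declared in the card's class_label metadata.
LABEL_NAMES = {0: "offensive", 1: "hate", 2: "neutral", 3: "civilized"}

# Config names as listed under `config_names` in the card.
CONFIG_NAMES = (
    "animals", "cities", "ethnicity", "lgbt", "misogyny", "occupations",
    "politics", "political-orientation", "refugees", "religion", "sects",
    "veganism",
)

# Repository id taken from the page title; assumed to resolve on the Hub.
REPO_ID = "turkish-nlp-suite/TurkishHateMap"


def load_config(config_name, split="train"):
    """Load one split of one target-group config (hypothetical helper).

    Requires the `datasets` package and network access, so the import is
    done lazily inside the function.
    """
    if config_name not in CONFIG_NAMES:
        raise ValueError(f"unknown config: {config_name!r}")
    from datasets import load_dataset
    return load_dataset(REPO_ID, config_name, split=split)


def label_distribution(rows):
    """Count instances per label name.

    `rows` is any iterable of dicts that carry an integer `label` field.
    """
    counts = Counter(LABEL_NAMES[row["label"]] for row in rows)
    return {name: counts[name] for name in LABEL_NAMES.values()}


# The example instance from the card, decoded without any download.
sample = {
    "baslik": "soyleyecek-cok-seyi-oldugu-halde-susan-kadin",
    "text": "her susuşunda anlatmak istediği şeyi içine atan kadındır, "
            "zamanla hissettiği her şeyi tüketir. aynı zamanda çok cookdur kendisi.",
    "label": 2,
}
print(LABEL_NAMES[sample["label"]])   # neutral
print(label_distribution([sample]))
```

With network access, a call such as `load_config("animals", split="validation")` should return the 113 validation rows that the metadata lists for that config, and passing the result to `label_distribution` would give the per-label counts for that split.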
Provider:
turkish-nlp-suite