
phusroyal/ViHOS

Hugging Face | Updated 2023-09-23 | Indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/phusroyal/ViHOS
Resource overview:
---
annotations_creators:
- crowdsourced
license: mit
multilinguality:
- monolingual
source_datasets:
- original
task_ids:
- hate-speech-detection
task_categories:
- text-classification
- token-classification
language:
- vi
pretty_name: ViHOS - Vietnamese Hate and Offensive Spans Dataset
size_categories:
- 10K<n<100K
configs:
- config_name: default
  data_files:
  - split: train_sequence_labeling
    path:
    - "train_sequence_labeling/syllable/train_BIO_syllable.csv"
    - "train_sequence_labeling/syllable/dev_BIO_syllable.csv"
    - "train_sequence_labeling/syllable/test_BIO_syllable.csv"
    - "train_sequence_labeling/word/train_BIO_syllable.csv"
    - "train_sequence_labeling/word/dev_BIO_syllable.csv"
    - "train_sequence_labeling/word/test_BIO_syllable.csv"
  - split: train_span_extraction
    path:
    - 'train_span_extraction/train.csv'
    - 'train_span_extraction/dev.csv'
  - split: test
    path: "test/test.csv"
---

**Disclaimer**: This project contains real comments that could be considered profane, offensive, or abusive.

# Dataset Card for "ViHOS - Vietnamese Hate and Offensive Spans Dataset"

## Dataset Description

- **Repository:** [ViHOS](https://github.com/phusroyal/ViHOS)
- **Paper:** [EACL-ViHOS](https://aclanthology.org/2023.eacl-main.47/)
- **Total amount of disk used:** 2.6 MB

## Dataset Motivation

The rise in hateful and offensive language directed at other users is one of the adverse side effects of the increased use of social networking platforms. This could make it difficult for human moderators to review tagged comments filtered by classification systems.

To help address this issue, we present the ViHOS (**Vi**etnamese **H**ate and **O**ffensive **S**pans) dataset, the first human-annotated corpus containing 26k spans on 11k online comments.

Our goal is to create a dataset that captures the full range of hateful and offensive thoughts, meanings, or opinions expressed in the comments, rather than just a lexicon of hate and offensive terms.

We also provide definitions of hateful and offensive spans in Vietnamese comments as well as detailed annotation guidelines. Furthermore, our solutions for handling *nine different online foul linguistic phenomena* are provided in the [*paper*](https://aclanthology.org/2023.eacl-main.47/) (e.g. Teencodes; Metaphors, metonymies; Hyponyms; Puns...).

We hope that this dataset will be useful for researchers and practitioners in the field of hate speech detection in general and hate spans detection in particular.

## Dataset Summary

ViHOS contains 26,476 human-annotated spans on 11,056 comments (5,360 comments have hate and offensive spans, and 5,696 comments do not).

It is split into train, dev, and test sets as follows:

1. Train set: 8,844 comments
2. Dev set: 1,106 comments
3. Test set: 1,106 comments

## Data Instance

A span extraction-based example from 'test' (see Data Structure for more details) looks as follows:

```
{
  "content": "Thối CC chỉ không ngửi đuợc thôi",
  "index_spans": "[0, 1, 2, 3, 5, 6]"
}
```

A sequence labeling-based example from 'test' (see Data Structure for more details) looks as follows:

```
{
  "content": "Thối CC chỉ không ngửi đuợc thôi",
  "index_spans": ["B-T", "I-T", "O", "O", "O", "O", "O"]
}
```

## Data Structure

Here is our data folder structure:

```
.
└── data/
    ├── train_sequence_labeling/
    │   ├── syllable/
    │   │   ├── dev_BIO_syllable.csv
    │   │   ├── test_BIO_syllable.csv
    │   │   └── train_BIO_syllable.csv
    │   └── word/
    │       ├── dev_BIO_Word.csv
    │       ├── test_BIO_Word.csv
    │       └── train_BIO_Word.csv
    ├── train_span_extraction/
    │   ├── dev.csv
    │   └── train.csv
    └── test/
        └── test.csv
```

### Sequence labeling-based version

#### Syllable

Description:

- This folder contains the data for the sequence labeling-based version of the task. The data is divided into two files: train and dev. Each file contains the following columns:
  - **index**: The id of the word.
  - **word**: Words in the sentence after tokenization with the [VnCoreNLP](https://github.com/vncorenlp/VnCoreNLP) tokenizer, followed by underscore tokenization. The reason for this is that some words are badly formatted: e.g. "điện.thoại của tôi" is split into ["điện.thoại", "của", "tôi"] instead of ["điện", "thoại", "của", "tôi"] if we use space tokenization, which is not the correct syllable format. Therefore, we used VnCoreNLP to tokenize first and then split words into tokens, e.g. "điện.thoại của tôi" ---(VnCoreNLP)---> ["điện_thoại", "của", "tôi"] ---(split by "_")---> ["điện", "thoại", "của", "tôi"].
  - **tag**: The tag of the word. The tag is either B-T (beginning of a hate or offensive span), I-T (inside a span), or O (outside any span).
- The train_BIO_syllable and dev_BIO_syllable files are used for training and validation of the XLM-R model, respectively.
- The test_BIO_syllable file is for reference only. It is not used for testing the model. **Please use the test.csv file in the data/test folder for testing the model.**

#### Word

Description:

- This folder contains the data for the sequence labeling-based version of the task. The data is divided into two files: train and dev. Each file contains the following columns:
  - **index**: The id of the word.
  - **word**: Words in the sentence after tokenization with the [VnCoreNLP](https://github.com/vncorenlp/VnCoreNLP) tokenizer.
  - **tag**: The tag of the word. The tag is either B-T (beginning of a hate or offensive span), I-T (inside a span), or O (outside any span).
- The train_BIO_Word and dev_BIO_Word files are used for training and validation of the PhoBERT model, respectively.
- The test_BIO_Word file is for reference only. It is not used for testing the model. **Please use the test.csv file in the data/test folder for testing the model.**

### Span Extraction-based version

Description:

- This folder contains the data for the span extraction-based version of the task. The data is divided into two files: train and dev. Each file contains the following columns:
  - **content**: The content of the sentence.
  - **span_ids**: The index of the hate and offensive spans in the sentence. The index is in the format of [start, end] where start is the index of the first character of the hate and offensive span and end is the index of the last character of the hate and offensive span.
- The train and dev files are used for training and validation of the BiLSTM-CRF model, respectively.

### Citation Information

```
@inproceedings{hoang-etal-2023-vihos,
    title = "{V}i{HOS}: Hate Speech Spans Detection for {V}ietnamese",
    author = "Hoang, Phu Gia and
      Luu, Canh Duc and
      Tran, Khanh Quoc and
      Nguyen, Kiet Van and
      Nguyen, Ngan Luu-Thuy",
    booktitle = "Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics",
    month = may,
    year = "2023",
    address = "Dubrovnik, Croatia",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.eacl-main.47",
    doi = "10.18653/v1/2023.eacl-main.47",
    pages = "652--669",
    abstract = "The rise in hateful and offensive language directed at other users is one of the adverse side effects of the increased use of social networking platforms. This could make it difficult for human moderators to review tagged comments filtered by classification systems. To help address this issue, we present the ViHOS (Vietnamese Hate and Offensive Spans) dataset, the first human-annotated corpus containing 26k spans on 11k comments. We also provide definitions of hateful and offensive spans in Vietnamese comments as well as detailed annotation guidelines. Besides, we conduct experiments with various state-of-the-art models. Specifically, XLM-R{\_}Large achieved the best F1-scores in Single span detection and All spans detection, while PhoBERT{\_}Large obtained the highest in Multiple spans detection. Finally, our error analysis demonstrates the difficulties in detecting specific types of spans in our data for future research. Our dataset is released on GitHub.",
}
```
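Provider:
phusroyal
Summary of original information

Dataset card: ViHOS - Vietnamese Hate and Offensive Spans Dataset

Dataset description

  • Annotation creators: crowdsourced
  • License: MIT
  • Multilinguality: monolingual
  • Source datasets: original
  • Task IDs: hate-speech-detection
  • Task categories: text-classification, token-classification
  • Language: Vietnamese
  • Pretty name: ViHOS - Vietnamese Hate and Offensive Spans Dataset
  • Size category: 10K<n<100K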

Dataset configuration

  • Config name: default
    • Data files:
      • Split: train_sequence_labeling
        • Paths:
          • "train_sequence_labeling/syllable/train_BIO_syllable.csv"
          • "train_sequence_labeling/syllable/dev_BIO_syllable.csv"
          • "train_sequence_labeling/syllable/test_BIO_syllable.csv"
          • "train_sequence_labeling/word/train_BIO_syllable.csv"
          • "train_sequence_labeling/word/dev_BIO_syllable.csv"
          • "train_sequence_labeling/word/test_BIO_syllable.csv"
      • Split: train_span_extraction
        • Paths:
          • train_span_extraction/train.csv
          • train_span_extraction/dev.csv
      • Split: test
        • Path: "test/test.csv"
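The configured paths can be cross-checked against the folder tree shown in the card. The sketch below is only a consistency check on the two listings as they appear on this page; it does not inspect the actual repository, and the helper name `unlisted_paths` is hypothetical. As listed, the three `word/` entries use `*_BIO_syllable.csv` names, while the tree shows `*_BIO_Word.csv`, which may be a typo in the card.

```python
# Paths exactly as listed in the card's `configs` section.
CONFIG_PATHS = {
    "train_sequence_labeling": [
        "train_sequence_labeling/syllable/train_BIO_syllable.csv",
        "train_sequence_labeling/syllable/dev_BIO_syllable.csv",
        "train_sequence_labeling/syllable/test_BIO_syllable.csv",
        "train_sequence_labeling/word/train_BIO_syllable.csv",
        "train_sequence_labeling/word/dev_BIO_syllable.csv",
        "train_sequence_labeling/word/test_BIO_syllable.csv",
    ],
    "train_span_extraction": [
        "train_span_extraction/train.csv",
        "train_span_extraction/dev.csv",
    ],
    "test": ["test/test.csv"],
}

# Files exactly as shown in the card's folder tree (relative to data/).
TREE_FILES = {
    "train_sequence_labeling/syllable/dev_BIO_syllable.csv",
    "train_sequence_labeling/syllable/test_BIO_syllable.csv",
    "train_sequence_labeling/syllable/train_BIO_syllable.csv",
    "train_sequence_labeling/word/dev_BIO_Word.csv",
    "train_sequence_labeling/word/test_BIO_Word.csv",
    "train_sequence_labeling/word/train_BIO_Word.csv",
    "train_span_extraction/dev.csv",
    "train_span_extraction/train.csv",
    "test/test.csv",
}


def unlisted_paths(config, tree):
    """Return the configured paths that do not appear in the folder tree."""
    return sorted(p for paths in config.values() for p in paths if p not in tree)


missing = unlisted_paths(CONFIG_PATHS, TREE_FILES)
for path in missing:
    print("not in folder tree:", path)
```

If the mismatch is real, loading the `train_sequence_labeling` split through the default config could fail for the word-level files; checking the repository directly would settle it.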

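Dataset overview

ViHOS contains 26,476 human-annotated spans across 11,056 comments (5,360 comments contain hate and offensive spans; 5,696 do not).

It is split into train, dev, and test sets as follows:

  1. Train set: 8,844 comments
  2. Dev set: 1,106 comments
  3. Test set: 1,106 comments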
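The figures above are internally consistent, which a few lines of arithmetic confirm. This is a minimal sketch using only the numbers quoted in the overview; the variable names are illustrative.

```python
# Counts quoted in the dataset overview.
TOTAL_COMMENTS = 11_056
TOTAL_SPANS = 26_476
WITH_SPANS, WITHOUT_SPANS = 5_360, 5_696
SPLITS = {"train": 8_844, "dev": 1_106, "test": 1_106}

# The splits and the with/without-span groups each cover every comment.
assert sum(SPLITS.values()) == TOTAL_COMMENTS
assert WITH_SPANS + WITHOUT_SPANS == TOTAL_COMMENTS

# Share of each split in percent, rounded to one decimal place.
shares = {name: round(100 * n / TOTAL_COMMENTS, 1) for name, n in SPLITS.items()}

# Average number of spans per comment that has at least one span.
spans_per_flagged = round(TOTAL_SPANS / WITH_SPANS, 2)

print(shares)
print(spans_per_flagged)
```

The split is therefore roughly 80/10/10, and comments that contain spans carry just under five spans each on average.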

Data instances

A span extraction-based example (see Data structure for details):

    { "content": "Thối CC chỉ không ngửi đuợc thôi", "index_spans": "[0, 1, 2, 3, 5, 6]" }

A sequence labeling-based example (see Data structure for details):

    { "content": "Thối CC chỉ không ngửi đuợc thôi", "index_spans": ["B-T", "I-T", "O", "O", "O", "O", "O"] }
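In the span extraction example, `index_spans` is a string holding a list of character indices. The following sketch recovers the flagged substrings from it. It assumes the indices count Unicode code points in NFC-normalized text, which the card does not state; the function names are hypothetical.

```python
import ast
import unicodedata


def group_consecutive(indices):
    """Collapse character indices into inclusive (start, end) runs."""
    runs = []
    for i in sorted(set(indices)):
        if runs and i == runs[-1][1] + 1:
            runs[-1] = (runs[-1][0], i)  # extend the current run
        else:
            runs.append((i, i))  # start a new run
    return runs


def extract_spans(content, index_spans):
    """Return the runs and the substrings they cover.

    Assumption: indices refer to code points of the NFC-normalized text.
    """
    text = unicodedata.normalize("NFC", content)
    indices = ast.literal_eval(index_spans)  # the field is a stringified list
    runs = group_consecutive(indices)
    return runs, [text[start:end + 1] for start, end in runs]


runs, spans = extract_spans(
    "Thối CC chỉ không ngửi đuợc thôi", "[0, 1, 2, 3, 5, 6]"
)
print(runs)
print(spans)
```

Index 4 (the space) is absent from the list, so the example yields two character runs even though the sequence labeling view tags both tokens as one B-T/I-T span.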

Data structure

The data folder is organized as follows:

.
└── data/
    ├── train_sequence_labeling/
    │   ├── syllable/
    │   │   ├── dev_BIO_syllable.csv
    │   │   ├── test_BIO_syllable.csv
    │   │   └── train_BIO_syllable.csv
    │   └── word/
    │       ├── dev_BIO_Word.csv
    │       ├── test_BIO_Word.csv
    │       └── train_BIO_Word.csv
    ├── train_span_extraction/
    │   ├── dev.csv
    │   └── train.csv
    └── test/
        └── test.csv

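Sequence labeling-based version

Syllable

Description:

  • This folder contains the data for the sequence labeling-based version of the task. The data is divided into two files: train and dev. Each file contains the following columns:
    • index: The id of the word.
    • word: Words in the sentence after tokenization with the VnCoreNLP tokenizer, followed by underscore tokenization.
    • tag: The tag of the word. The tag is B-T (beginning of a hate or offensive span), I-T (inside a span), or O (outside any span).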
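The second step of the syllable pipeline, splitting word-segmented tokens on underscores, can be sketched as follows. The input list stands in for VnCoreNLP output and reuses the card's own example; VnCoreNLP itself is not called here, and `words_to_syllables` is a hypothetical helper name.

```python
def words_to_syllables(words):
    """Split underscore-joined word tokens into syllable tokens."""
    syllables = []
    for word in words:
        # "điện_thoại" -> ["điện", "thoại"]; empty pieces are dropped.
        syllables.extend(part for part in word.split("_") if part)
    return syllables


# Stand-in for the word-segmenter output given in the card's example.
words = ["điện_thoại", "của", "tôi"]
syllables = words_to_syllables(words)
print(syllables)
```

Segmenting first and splitting afterwards is what lets "điện.thoại" end up as two syllables, which plain whitespace splitting would not achieve.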

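Word

Description:

  • This folder contains the data for the sequence labeling-based version of the task. The data is divided into two files: train and dev. Each file contains the following columns:
    • index: The id of the word.
    • word: Words in the sentence after tokenization with the VnCoreNLP tokenizer.
    • tag: The tag of the word. The tag is B-T (beginning of a hate or offensive span), I-T (inside a span), or O (outside any span).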
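A B-T/I-T/O tag sequence can be decoded back into token ranges with a single pass. This is a generic BIO decoder written for illustration, not code from the ViHOS repository; treating an I-T with no preceding B-T as the start of a span is an assumption made here for robustness.

```python
def bio_to_spans(tags):
    """Return inclusive (start, end) token ranges for each tagged span."""
    spans = []
    start = None
    for i, tag in enumerate(tags):
        if tag == "B-T":
            if start is not None:
                spans.append((start, i - 1))  # close the previous span
            start = i
        elif tag == "I-T":
            if start is None:  # assumption: a stray I-T opens a span
                start = i
        else:  # "O"
            if start is not None:
                spans.append((start, i - 1))
                start = None
    if start is not None:
        spans.append((start, len(tags) - 1))
    return spans


tokens = "Thối CC chỉ không ngửi đuợc thôi".split()
tags = ["B-T", "I-T", "O", "O", "O", "O", "O"]
spans = bio_to_spans(tags)
texts = [" ".join(tokens[s:e + 1]) for s, e in spans]
print(spans, texts)
```

Applied to the card's example instance, the decoder returns one span covering the first two tokens.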

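Span extraction-based version

Description:

  • This folder contains the data for the span extraction-based version of the task. The data is divided into two files: train and dev. Each file contains the following columns:
    • content: The content of the sentence.
    • span_ids: The index of the hate and offensive spans in the sentence. The index is in the format [start, end], where start is the index of the first character of the span and end is the index of the last character.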
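The character-level and token-level views describe the same annotation, so one can be projected onto the other. The sketch below maps a list of flagged character indices (the form used in the example instance) onto whitespace tokens and emits B-T/I-T/O tags. It rests on assumptions not stated in the card: whitespace tokenization, NFC code-point indexing, and merging adjacent flagged tokens into one span. `char_indices_to_bio` is a hypothetical name.

```python
import re
import unicodedata


def char_indices_to_bio(content, char_indices):
    """Project flagged character indices onto whitespace tokens as BIO tags."""
    text = unicodedata.normalize("NFC", content)  # assumed indexing basis
    flagged = set(char_indices)
    tags = []
    prev_flagged = False
    for match in re.finditer(r"\S+", text):
        # A token is flagged if any of its characters is flagged.
        token_flagged = any(i in flagged for i in range(match.start(), match.end()))
        if not token_flagged:
            tags.append("O")
        elif prev_flagged:
            tags.append("I-T")  # continues the span opened by the previous token
        else:
            tags.append("B-T")
        prev_flagged = token_flagged
    return tags


tags = char_indices_to_bio(
    "Thối CC chỉ không ngửi đuợc thôi", [0, 1, 2, 3, 5, 6]
)
print(tags)
```

On the example instance this reproduces the tag sequence shown in the sequence labeling example, which suggests the merging assumption matches how that example was built; it may not hold for every comment in the dataset.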

