
w11wo/reddit_indonesia_sarcastic

Hugging Face · Updated 2023-12-21 · Indexed 2024-06-29
Download link:
https://hf-mirror.com/datasets/w11wo/reddit_indonesia_sarcastic
Description:
---
license: apache-2.0
language:
- ind
pretty_name: "Reddit Indonesia Sarcastic"
---

# Reddit Indonesia Sarcastic

Reddit Indonesia Sarcastic is a dataset intended for sarcasm detection in the Indonesian language. This dataset is inspired by the data collection procedure introduced in [Ranti, K.S., & Girsang, A.S. (2020)](http://www.warse.org/IJETER/static/pdf/file/ijeter10892020.pdf), whereby Reddit comments from the r/indonesia subreddit are collected and filtered by the existence of an `/s` tag at the end of the comment. We collected Reddit comments from 2020-01 to 2023-09 from [Academic Torrents](https://academictorrents.com/details/89d24ff9d5fbc1efcdaf9d7689d72b7548f699fc) and applied the aforementioned procedure. Further, we performed deduplication with MinHash LSH, PII masking to remove usernames, hashtags, emails, and URLs, and finally random sampling to limit the number of non-sarcastic comments. Following [SemEval-2022 Task 6: iSarcasmEval](https://aclanthology.org/2022.semeval-1.111/), we used a 1:3 ratio to balance sarcastic with non-sarcastic comments.

## Dataset Structure

### Data Instances

```py
{
    "author": "curuya",
    "created_utc": 1584876528,
    "score": 7,
    "permalink": "/r/indonesia/comments/fmxhfe/jangan_takut_sama_corona_takut_sama_allah/fl6n993/",
    "subreddit": "indonesia",
    "body": 'taat perintah tuhan : "kalau ada razia mendingan kabur" /s',
    "lang_fastText": "id",
    "label": 1,
    "text": 'taat perintah tuhan : "kalau ada razia mendingan kabur"',
}
```

### Data Fields

- `author`: Comment author.
- `created_utc`: Comment creation time, in UTC.
- `score`: Comment's Reddit voting score.
- `permalink`: Permalink to the Reddit comment.
- `subreddit`: Subreddit name.
- `body`: Raw Reddit comment content.
- `lang_fastText`: Language detected by [fasttext-langdetect](https://github.com/zafercavdar/fasttext-langdetect).
- `label`: `0` for non-sarcastic, `1` for sarcastic.
- `text`: Sarcastic tag-removed, PII-masked version of `body`.

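
The relationship between `body`, `label`, and `text` described above can be sketched as follows. This is only an illustration, not the authors' script: the function name `label_and_strip` and the exact regular expression are assumptions, and the PII masking step is left out.

```python
import re

# Assumption: the sarcasm tag is a literal "/s" at the very end of the
# comment, preceded by whitespace or the start of the string.
SARCASM_TAG = re.compile(r"(?:^|\s)/s\s*$")


def label_and_strip(body: str) -> tuple[int, str]:
    """Return (label, text) for a raw comment body (hypothetical helper)."""
    match = SARCASM_TAG.search(body)
    if match is None:
        # No trailing tag: treat the comment as non-sarcastic.
        return 0, body.strip()
    # Trailing tag found: label as sarcastic and drop the tag from the text.
    return 1, body[: match.start()].rstrip()


label, text = label_and_strip(
    'taat perintah tuhan : "kalau ada razia mendingan kabur" /s'
)
print(label, text)
```

Applied to the instance shown under Data Instances, this reproduces the `label` and `text` values of that record.
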
### Data Splits

| Split              | #sarcastic | #non sarcastic | #total  |
| ------------------ | :--------: | :------------: | :-----: |
| `train`            |    2470    |      7411      |  9881   |
| `test`             |    706     |      2118      |  2824   |
| `validation`       |    353     |      1058      |  1411   |
| Total (balanced)   |    3529    |     10587      |  14116  |
| Total (unbalanced) |    3529    |    2616335     | 2619864 |

### Dataset Directory

```sh
reddit_indonesia_sarcastic/
├── README.md
├── data # re-balanced dataset
│   ├── test.json
│   ├── train.json
│   └── validation.json
└── raw_data # raw unbalanced dataset
    └── reddit_indonesia_sarcastic.json
```

## Authors

Reddit Indonesia Sarcastic is prepared by:

<a href="https://github.com/w11wo">
    <img src="https://github.com/w11wo.png" alt="GitHub Profile" style="border-radius: 50%;width: 64px;border: solid 1px #fff;margin:0 4px;">
</a>

## References

```bibtex
@article{Ranti2020IndonesianSD,
  title={Indonesian Sarcasm Detection Using Convolutional Neural Network},
  author={Kiefer Stefano Ranti and Abba Suganda Girsang},
  journal={International Journal of Emerging Trends in Engineering Research},
  year={2020},
  url={https://doi.org/10.30534/ijeter/2020/10892020}
}

@article{academicReddit,
  title= {Reddit comments/submissions 2005-06 to 2023-09},
  journal= {},
  author= {stuck_in_the_matrix, Watchful1, RaiderBDev},
  year= {},
  url= {},
  abstract= {Reddit comments and submissions from 2005-06 to 2023-09 collected by pushshift and u/RaiderBDev. These are zstandard compressed ndjson files. Example python scripts for parsing the data can be found here https://github.com/Watchful1/PushshiftDumps},
  keywords= {reddit},
  terms= {},
  license= {},
  superseded= {}
}

@inproceedings{abu-farha-etal-2022-semeval,
    title = "{S}em{E}val-2022 Task 6: i{S}arcasm{E}val, Intended Sarcasm Detection in {E}nglish and {A}rabic",
    author = "Abu Farha, Ibrahim and Oprea, Silviu Vlad and Wilson, Steven and Magdy, Walid",
    editor = "Emerson, Guy and Schluter, Natalie and Stanovsky, Gabriel and Kumar, Ritesh and Palmer, Alexis and Schneider, Nathan and Singh, Siddharth and Ratan, Shyam",
    booktitle = "Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022)",
    month = jul,
    year = "2022",
    address = "Seattle, United States",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.semeval-1.111",
    doi = "10.18653/v1/2022.semeval-1.111",
    pages = "802--814",
}
```
Provider:
w11wo
Summary of Original Information

Reddit Indonesia Sarcastic

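Overview

  • Name: Reddit Indonesia Sarcastic
  • Language: Indonesian
  • Purpose: sarcasm detection in the Indonesian language
  • Data source: Reddit comments collected from the r/indonesia subreddit
  • Collection period: January 2020 to September 2023
  • Data processing:
    • Sarcastic comments identified by the /s tag
    • Deduplication with MinHash LSH
    • PII masking (removing usernames, hashtags, emails, and URLs)
    • Random sampling to balance sarcastic and non-sarcastic comments (1:3 ratio)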
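
The final sampling step can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' code: the function name `downsample_non_sarcastic`, the seed, and the toy records are hypothetical; only the 1:3 ratio comes from the dataset description.

```python
import random


def downsample_non_sarcastic(records, ratio=3, seed=42):
    """Keep all sarcastic records plus a random sample of non-sarcastic ones.

    `ratio` is the number of non-sarcastic records kept per sarcastic record.
    The seed value is an arbitrary choice made for reproducibility.
    """
    sarcastic = [r for r in records if r["label"] == 1]
    non_sarcastic = [r for r in records if r["label"] == 0]
    # Never ask for more non-sarcastic records than are available.
    k = min(len(non_sarcastic), ratio * len(sarcastic))
    rng = random.Random(seed)
    balanced = sarcastic + rng.sample(non_sarcastic, k)
    rng.shuffle(balanced)
    return balanced


# Toy input: 5 sarcastic and 100 non-sarcastic records.
toy = [{"text": f"s{i}", "label": 1} for i in range(5)]
toy += [{"text": f"n{i}", "label": 0} for i in range(100)]

balanced = downsample_non_sarcastic(toy)
print(len(balanced))
```

With the toy input, all 5 sarcastic records are kept and 15 of the 100 non-sarcastic records are sampled.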

Dataset Structure

Data Instances

```py
{
    "author": "curuya",
    "created_utc": 1584876528,
    "score": 7,
    "permalink": "/r/indonesia/comments/fmxhfe/jangan_takut_sama_corona_takut_sama_allah/fl6n993/",
    "subreddit": "indonesia",
    "body": 'taat perintah tuhan : "kalau ada razia mendingan kabur" /s',
    "lang_fastText": "id",
    "label": 1,
    "text": 'taat perintah tuhan : "kalau ada razia mendingan kabur"',
}
```

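Data Fields

  • author: Comment author
  • created_utc: Comment creation time (UTC)
  • score: Comment's Reddit voting score
  • permalink: Permalink to the comment
  • subreddit: Subreddit name
  • body: Raw Reddit comment content
  • lang_fastText: Language detected by fasttext-langdetect
  • label: Label (0 for non-sarcastic, 1 for sarcastic)
  • text: Comment content with the sarcasm tag removed and PII masked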
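
The dataset description says that usernames, hashtags, emails, and URLs are masked in `text`, but does not give the patterns or the replacement tokens. The sketch below shows one way such masking could look; the regular expressions, the placeholder strings (`<url>`, `<email>`, `<username>`, `<hashtag>`), and the function name `mask_pii` are all assumptions made for illustration.

```python
import re

# Hypothetical patterns and placeholders; the dataset's actual rules may differ.
# Order matters: URLs and emails are masked before usernames and hashtags.
PII_PATTERNS = [
    (re.compile(r"https?://\S+|www\.\S+"), "<url>"),
    (re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"), "<email>"),
    (re.compile(r"(?<!\w)(?:/?u/[A-Za-z0-9_-]+|@\w+)"), "<username>"),
    (re.compile(r"(?<!\w)#\w+"), "<hashtag>"),
]


def mask_pii(text: str) -> str:
    """Replace URLs, emails, usernames, and hashtags with placeholder tokens."""
    for pattern, placeholder in PII_PATTERNS:
        text = pattern.sub(placeholder, text)
    return text


print(mask_pii("cek https://example.com dan email a.b@example.org, cc u/someone #pemilu"))
```

Masking emails before usernames keeps the `@` inside an address from being read as a mention.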

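Data Splits

| Split              | #sarcastic | #non sarcastic | #total  |
| ------------------ | :--------: | :------------: | :-----: |
| train              |    2470    |      7411      |  9881   |
| test               |    706     |      2118      |  2824   |
| validation         |    353     |      1058      |  1411   |
| Total (balanced)   |    3529    |     10587      |  14116  |
| Total (unbalanced) |    3529    |    2616335     | 2619864 |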
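
As a quick consistency check on the table, the per-split counts can be summed and compared with the balanced totals. The counts below are copied from the table; the variable names are illustrative only.

```python
# (sarcastic, non-sarcastic) counts per split, copied from the table above.
splits = {
    "train": (2470, 7411),
    "test": (706, 2118),
    "validation": (353, 1058),
}

total_sarcastic = sum(s for s, _ in splits.values())
total_non_sarcastic = sum(n for _, n in splits.values())
total = total_sarcastic + total_non_sarcastic

for name, (s, n) in splits.items():
    share = (s + n) / total  # fraction of the balanced dataset in this split
    print(f"{name:<10} share={share:.3f} non-sarcastic per sarcastic={n / s:.2f}")

print(total_sarcastic, total_non_sarcastic, total)
```

The sums match the "Total (balanced)" row, the non-sarcastic total is exactly three times the sarcastic total, and the splits work out to roughly 70/20/10 (train/test/validation). The split proportions are not stated in the source; they are inferred here from the counts.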

Dataset Directory

```sh
reddit_indonesia_sarcastic/
├── README.md
├── data # re-balanced dataset
│   ├── test.json
│   ├── train.json
│   └── validation.json
└── raw_data # raw unbalanced dataset
    └── reddit_indonesia_sarcastic.json
```
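
Given the directory layout above, the balanced splits could be read from a local copy with the standard library alone. This is a hedged sketch: the helper names `load_split` and `load_dataset_dir` are hypothetical, and because the source does not say whether each `.json` file holds a single JSON array or one JSON object per line, the reader accepts both. For datasets hosted on the Hub, the Hugging Face `datasets` library is the more usual route.

```python
import json
from pathlib import Path


def load_split(path):
    """Load one split file as a list of dicts.

    Accepts either a single JSON array or JSON Lines (one object per line),
    since the on-disk format is not stated in the dataset description.
    """
    raw = Path(path).read_text(encoding="utf-8").strip()
    if raw.startswith("["):
        return json.loads(raw)
    return [json.loads(line) for line in raw.splitlines() if line.strip()]


def load_dataset_dir(root):
    """Load train/validation/test from `<root>/data/`, following the tree above."""
    data_dir = Path(root) / "data"
    return {
        name: load_split(data_dir / f"{name}.json")
        for name in ("train", "validation", "test")
    }


# Usage, assuming a local clone in ./reddit_indonesia_sarcastic:
# splits = load_dataset_dir("reddit_indonesia_sarcastic")
# print({name: len(rows) for name, rows in splits.items()})
```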
