w11wo/reddit_indonesia_sarcastic

Name: w11wo/reddit_indonesia_sarcastic
Creator: w11wo
Published: 2023-12-21 07:08:11
License: apache-2.0

Source: Hugging Face. Updated 2023-12-21; indexed 2024-06-29.

Download link:

https://hf-mirror.com/datasets/w11wo/reddit_indonesia_sarcastic

Description:

---
license: apache-2.0
language:
- ind
pretty_name: "Reddit Indonesia Sarcastic"
---

# Reddit Indonesia Sarcastic

Reddit Indonesia Sarcastic is a dataset intended for sarcasm detection in the Indonesian language. This dataset is inspired by the data collection procedure introduced in [Ranti, K.S., & Girsang, A.S. (2020)](http://www.warse.org/IJETER/static/pdf/file/ijeter10892020.pdf), whereby Reddit comments from the r/indonesia subreddit are collected and filtered by the presence of an `/s` tag at the end of the comment. We collected Reddit comments from 2020-01 to 2023-09 from [Academic Torrents](https://academictorrents.com/details/89d24ff9d5fbc1efcdaf9d7689d72b7548f699fc) and applied the aforementioned procedure. Further, we performed deduplication with minHash LSH, PII masking to remove usernames, hashtags, emails, and URLs, and finally random sampling to limit the number of non-sarcastic comments. Following [SemEval-2022 Task 6: iSarcasmEval](https://aclanthology.org/2022.semeval-1.111/), we used a 1:3 ratio to balance sarcastic with non-sarcastic comments.

## Dataset Structure

### Data Instances

```py
{
    "author": "curuya",
    "created_utc": 1584876528,
    "score": 7,
    "permalink": "/r/indonesia/comments/fmxhfe/jangan_takut_sama_corona_takut_sama_allah/fl6n993/",
    "subreddit": "indonesia",
    "body": 'taat perintah tuhan : "kalau ada razia mendingan kabur" /s',
    "lang_fastText": "id",
    "label": 1,
    "text": 'taat perintah tuhan : "kalau ada razia mendingan kabur"',
}
```

### Data Fields

- `author`: Comment author.
- `created_utc`: Comment creation time, in UTC.
- `score`: Comment's Reddit voting score.
- `permalink`: Permalink to the Reddit comment.
- `subreddit`: Subreddit name.
- `body`: Raw Reddit comment content.
- `lang_fastText`: Language detected by [fasttext-langdetect](https://github.com/zafercavdar/fasttext-langdetect).
- `label`: `0` for non-sarcastic, `1` for sarcastic.
- `text`: Sarcastic tag-removed, PII-masked version of `body`.

### Data Splits

| Split              | #sarcastic | #non sarcastic | #total  |
| ------------------ | :--------: | :------------: | :-----: |
| `train`            |    2470    |      7411      |  9881   |
| `test`             |    706     |      2118      |  2824   |
| `validation`       |    353     |      1058      |  1411   |
| Total (balanced)   |    3529    |     10587      |  14116  |
| Total (unbalanced) |    3529    |    2616335     | 2619864 |

### Dataset Directory

```sh
reddit_indonesia_sarcastic/
├── README.md
├── data # re-balanced dataset
│   ├── test.json
│   ├── train.json
│   └── validation.json
└── raw_data # raw unbalanced dataset
    └── reddit_indonesia_sarcastic.json
```

## Authors

Reddit Indonesia Sarcastic is prepared by:

<a href="https://github.com/w11wo"> <img src="https://github.com/w11wo.png" alt="GitHub Profile" style="border-radius: 50%;width: 64px;border: solid 1px #fff;margin:0 4px;"> </a>

## References

```bibtex
@article{Ranti2020IndonesianSD,
  title={Indonesian Sarcasm Detection Using Convolutional Neural Network},
  author={Kiefer Stefano Ranti and Abba Suganda Girsang},
  journal={International Journal of Emerging Trends in Engineering Research},
  year={2020},
  url={https://doi.org/10.30534/ijeter/2020/10892020}
}

@article{academicReddit,
  title= {Reddit comments/submissions 2005-06 to 2023-09},
  journal= {},
  author= {stuck_in_the_matrix, Watchful1, RaiderBDev},
  year= {},
  url= {},
  abstract= {Reddit comments and submissions from 2005-06 to 2023-09 collected by pushshift and u/RaiderBDev. These are zstandard compressed ndjson files. Example python scripts for parsing the data can be found here https://github.com/Watchful1/PushshiftDumps},
  keywords= {reddit},
  terms= {},
  license= {},
  superseded= {}
}

@inproceedings{abu-farha-etal-2022-semeval,
  title = "{S}em{E}val-2022 Task 6: i{S}arcasm{E}val, Intended Sarcasm Detection in {E}nglish and {A}rabic",
  author = "Abu Farha, Ibrahim and Oprea, Silviu Vlad and Wilson, Steven and Magdy, Walid",
  editor = "Emerson, Guy and Schluter, Natalie and Stanovsky, Gabriel and Kumar, Ritesh and Palmer, Alexis and Schneider, Nathan and Singh, Siddharth and Ratan, Shyam",
  booktitle = "Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022)",
  month = jul,
  year = "2022",
  address = "Seattle, United States",
  publisher = "Association for Computational Linguistics",
  url = "https://aclanthology.org/2022.semeval-1.111",
  doi = "10.18653/v1/2022.semeval-1.111",
  pages = "802--814",
}
```

Provider:

w11wo

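Summary of Original Information

Reddit Indonesia Sarcastic

Overview

Name: Reddit Indonesia Sarcastic
Language: Indonesian
Purpose: Sarcasm detection in the Indonesian language
Data source: Reddit comments collected from the r/indonesia subreddit
Collection period: January 2020 to September 2023
Data processing:
- Sarcastic comments identified by a trailing /s tag
- Deduplication with minHash LSH
- PII masking (removal of usernames, hashtags, emails, and URLs)
- Random sampling of non-sarcastic comments to reach a 1:3 sarcastic to non-sarcastic ratio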
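
The tagging and sampling steps listed above can be illustrated with a minimal sketch. It is not the dataset author's code: the regular expression, the helper names `label_comment` and `downsample`, and the fixed random seed are all assumptions made for illustration, and the deduplication and PII-masking steps are left out.

```python
import random
import re
from typing import List, Tuple

# Assumption: a sarcasm tag is a standalone "/s" at the very end of the comment.
SARCASM_TAG = re.compile(r"(?:^|\s)/s\s*$")


def label_comment(body: str) -> Tuple[str, int]:
    """Return (text with any trailing /s tag removed, label)."""
    if SARCASM_TAG.search(body):
        return SARCASM_TAG.sub("", body).strip(), 1
    return body.strip(), 0


def downsample(bodies: List[str], ratio: int = 3, seed: int = 0) -> List[Tuple[str, int]]:
    """Keep every sarcastic comment and at most `ratio` times as many others."""
    labelled = [label_comment(b) for b in bodies]
    sarcastic = [item for item in labelled if item[1] == 1]
    non_sarcastic = [item for item in labelled if item[1] == 0]
    rng = random.Random(seed)  # hypothetical seed, only for reproducibility
    keep = min(len(non_sarcastic), ratio * len(sarcastic))
    return sarcastic + rng.sample(non_sarcastic, keep)
```

Applied to the `body` of the example instance, `label_comment` yields the same string as the instance's `text` field together with label `1`.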

Data Structure

Data Instances

    {
        "author": "curuya",
        "created_utc": 1584876528,
        "score": 7,
        "permalink": "/r/indonesia/comments/fmxhfe/jangan_takut_sama_corona_takut_sama_allah/fl6n993/",
        "subreddit": "indonesia",
        "body": 'taat perintah tuhan : "kalau ada razia mendingan kabur" /s',
        "lang_fastText": "id",
        "label": 1,
        "text": 'taat perintah tuhan : "kalau ada razia mendingan kabur"',
    }

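Data Fields

author: Comment author
created_utc: Comment creation time (UTC)
score: The comment's Reddit voting score
permalink: Permalink to the comment
subreddit: Subreddit name
body: Raw Reddit comment content
lang_fastText: Language detected by fasttext-langdetect
label: Label (0 for non-sarcastic, 1 for sarcastic)
text: Comment content with the sarcasm tag removed and PII masked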
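
To make the relationship between `body` and `text` more concrete, here is a rough sketch of regex-based PII masking. The source does not say which patterns or placeholder tokens were used, so the patterns, the `<url>`-style placeholders, and the helper name `mask_pii` below are assumptions rather than the dataset's actual masking rules.

```python
import re

# Assumed patterns and placeholder tokens; the dataset's real ones are not documented here.
# Order matters: URLs and emails are masked before the looser username pattern.
PII_PATTERNS = [
    (re.compile(r"https?://\S+|www\.\S+"), "<url>"),
    (re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"), "<email>"),
    (re.compile(r"(?<!\w)(?:/?u/|@)[\w-]+"), "<username>"),
    (re.compile(r"(?<!\w)#\w+"), "<hashtag>"),
]


def mask_pii(text: str) -> str:
    """Replace URLs, emails, usernames, and hashtags with placeholder tokens."""
    for pattern, placeholder in PII_PATTERNS:
        text = pattern.sub(placeholder, text)
    return text
```

Masking before training keeps identifying strings out of the model input while preserving the position of the masked item in the sentence.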

Data Splits

Split	Sarcastic	Non-sarcastic	Total
`train`	2470	7411	9881
`test`	706	2118	2824
`validation`	353	1058	1411
Total (balanced)	3529	10587	14116
Total (unbalanced)	3529	2616335	2619864
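
As a quick consistency check on the table, the short calculation below recomputes the totals from the per-split counts. The observation that the splits come out at roughly 70/20/10 is derived from these numbers and is not stated in the source.

```python
# Split counts copied from the table above: (sarcastic, non-sarcastic).
SPLITS = {
    "train": (2470, 7411),
    "test": (706, 2118),
    "validation": (353, 1058),
}

total_sarcastic = sum(s for s, _ in SPLITS.values())
total_non_sarcastic = sum(n for _, n in SPLITS.values())
total = total_sarcastic + total_non_sarcastic

# Share of the balanced dataset taken by each split, as a rounded percentage.
split_share = {name: round(100 * (s + n) / total) for name, (s, n) in SPLITS.items()}

print(total_sarcastic, total_non_sarcastic, total)
print(split_share)
```

Overall, the balanced set holds exactly three non-sarcastic comments per sarcastic one (10587 = 3 × 3529); within individual splits the ratio can be off by a single comment.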

Dataset Directory

    reddit_indonesia_sarcastic/
    ├── README.md
    ├── data                # re-balanced dataset
    │   ├── test.json
    │   ├── train.json
    │   └── validation.json
    └── raw_data            # raw unbalanced dataset
        └── reddit_indonesia_sarcastic.json
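
Given this layout, a split file could be read with the standard library alone. The source does not say whether each `.json` file holds a single JSON array or one JSON object per line, so the hypothetical helper `load_split` below accepts either; treat it as a sketch under that assumption.

```python
import json
from pathlib import Path
from typing import Dict, List


def load_split(path: str) -> List[Dict]:
    """Load records from a file that is either a JSON array or JSON Lines."""
    raw = Path(path).read_text(encoding="utf-8").strip()
    if raw.startswith("["):
        # Whole file is one JSON array of records.
        return json.loads(raw)
    # Otherwise assume one JSON object per non-empty line.
    return [json.loads(line) for line in raw.splitlines() if line.strip()]
```

For example, `load_split("reddit_indonesia_sarcastic/data/train.json")` would return a list of dictionaries with the fields described earlier. The Hugging Face `datasets` library's `load_dataset("w11wo/reddit_indonesia_sarcastic")` is probably the more convenient route when network access is available.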
