
Caryslara444/SARC_Sarcasm

Hugging Face | Updated 2026-01-28 | Indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/Caryslara444/SARC_Sarcasm
Resource description:
---
dataset_info:
  features:
  - name: text
    dtype: string
  - name: author
    dtype: string
  - name: score
    dtype: int64
  - name: ups
    dtype: int64
  - name: downs
    dtype: int64
  - name: date
    dtype: string
  - name: created_utc
    dtype: int64
  - name: subreddit
    dtype: string
  - name: id
    dtype: string
  splits:
  - name: train
    num_bytes: 1764500045
    num_examples: 12704751
  download_size: 903559115
  dataset_size: 1764500045
license: cc-by-2.0
---

# SARC_Sarcasm

## Dataset Description

- **Paper:** [A Large Self-Annotated Corpus for Sarcasm](http://www.lrec-conf.org/proceedings/lrec2018/pdf/160.pdf)

## Dataset Summary

A large corpus for sarcasm research and for training and evaluating systems for sarcasm detection is presented. The corpus comprises 1.3 million sarcastic statements, a quantity that is tenfold more substantial than any preceding dataset, and includes many more instances of non-sarcastic statements. This allows for learning in both balanced and unbalanced label regimes. Each statement is self-annotated; that is to say, sarcasm is labeled by the author, not by an independent annotator, and is accompanied by user, topic, and conversation context. The accuracy of the corpus is evaluated, benchmarks for sarcasm detection are established, and baseline methods are assessed.

For the details of this dataset, we refer you to the original [paper](http://www.lrec-conf.org/proceedings/lrec2018/pdf/160.pdf).

Metadata in Creative Language Toolkit ([CLTK](https://github.com/liyucheng09/cltk))

- CL Type: Sarcasm
- Task Type: detection
- Size: 1.3M
- Created time: 2018

### Contributions

If you have any queries, please open an issue or direct your queries to [mail](mailto:yucheng.li@surrey.ac.uk).
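The feature list in the metadata above can be turned into a small sanity check. The following is a minimal sketch, not an official loader: `check_record`, `created_at`, and `stream_train` are hypothetical helper names, the repository id is taken from the download URL on this page, and treating `created_utc` as seconds since the Unix epoch is an assumption based on the column name. The listed features include no explicit sarcasm label column, so the sketch does not assume one.

```python
from datetime import datetime, timezone
from typing import Any, Dict, Iterator

# Column names and types as listed under dataset_info.features.
EXPECTED_FEATURES = {
    "text": str,
    "author": str,
    "score": int,
    "ups": int,
    "downs": int,
    "date": str,
    "created_utc": int,
    "subreddit": str,
    "id": str,
}


def check_record(record: Dict[str, Any]) -> None:
    """Raise ValueError if a record does not match the listed features."""
    missing = EXPECTED_FEATURES.keys() - record.keys()
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    for name, expected_type in EXPECTED_FEATURES.items():
        if not isinstance(record[name], expected_type):
            raise ValueError(f"{name!r} should be {expected_type.__name__}")


def created_at(record: Dict[str, Any]) -> datetime:
    """Convert created_utc to an aware UTC datetime.

    Assumption: created_utc holds seconds since the Unix epoch.
    """
    return datetime.fromtimestamp(record["created_utc"], tz=timezone.utc)


def stream_train(repo_id: str = "Caryslara444/SARC_Sarcasm") -> Iterator[Dict[str, Any]]:
    """Yield validated training records without downloading the full split.

    Assumption: the repository id resolves on the Hugging Face Hub and the
    `datasets` package is installed. It is imported lazily so the helpers
    above work without it.
    """
    from datasets import load_dataset

    for record in load_dataset(repo_id, split="train", streaming=True):
        check_record(record)
        yield record
```

Streaming is used here because the train split is listed at 12,704,751 examples and roughly 1.76 GB; iterating lazily lets a reader inspect a few records first. Real rows may contain null values, in which case `check_record` would need to be relaxed.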

Provided by:
Caryslara444