Caryslara444/SARC_Sarcasm
Source: Hugging Face. Updated 2026-01-28; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/Caryslara444/SARC_Sarcasm
Resource description:
---
dataset_info:
  features:
  - name: text
    dtype: string
  - name: author
    dtype: string
  - name: score
    dtype: int64
  - name: ups
    dtype: int64
  - name: downs
    dtype: int64
  - name: date
    dtype: string
  - name: created_utc
    dtype: int64
  - name: subreddit
    dtype: string
  - name: id
    dtype: string
  splits:
  - name: train
    num_bytes: 1764500045
    num_examples: 12704751
  download_size: 903559115
  dataset_size: 1764500045
license: cc-by-2.0
---
# SARC_Sarcasm
## Dataset Description
- **Paper:** [A Large Self-Annotated Corpus for Sarcasm](http://www.lrec-conf.org/proceedings/lrec2018/pdf/160.pdf)
## Dataset Summary
This is a large corpus for sarcasm research and for training and evaluating sarcasm detection systems. The corpus contains 1.3 million sarcastic statements, ten times more than any previous dataset, and includes many more non-sarcastic statements, which allows learning in both balanced and unbalanced label regimes. Each statement is self-annotated: sarcasm is labeled by the author rather than by an independent annotator, and each statement comes with user, topic, and conversation context. The paper evaluates the accuracy of the corpus, establishes benchmarks for sarcasm detection, and evaluates baseline methods.
For the details of this dataset, we refer you to the original [paper](http://www.lrec-conf.org/proceedings/lrec2018/pdf/160.pdf).
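The column list in the `dataset_info` block can be turned into a quick sanity check on individual rows. The sketch below is illustrative only: the helper names (`schema_problems`, `created_at`, `stream_head`) are hypothetical, the reading of `created_utc` as a Unix timestamp in seconds is an assumption based on the usual Reddit convention, and `stream_head` assumes the repository loads through the Hugging Face `datasets` library under the ID shown at the top of this page.

```python
from datetime import datetime, timezone

# Column names and Python-side types, copied from the dataset_info block of this card.
EXPECTED_FEATURES = {
    "text": str,
    "author": str,
    "score": int,
    "ups": int,
    "downs": int,
    "date": str,
    "created_utc": int,
    "subreddit": str,
    "id": str,
}


def schema_problems(record):
    """Return a list of schema mismatches for one row (empty list if none)."""
    problems = []
    for name, expected_type in EXPECTED_FEATURES.items():
        if name not in record:
            problems.append(f"missing column: {name}")
        elif not isinstance(record[name], expected_type):
            actual = type(record[name]).__name__
            problems.append(f"{name}: expected {expected_type.__name__}, got {actual}")
    return problems


def created_at(record):
    """Convert created_utc to an aware datetime.

    Assumption: created_utc holds Unix seconds; the card does not state the unit.
    """
    return datetime.fromtimestamp(record["created_utc"], tz=timezone.utc)


def stream_head(n=3, repo_id="Caryslara444/SARC_Sarcasm"):
    """Stream the first n rows of the train split instead of downloading all of it."""
    # Imported lazily so the helpers above work without the `datasets` package.
    from datasets import load_dataset

    rows = load_dataset(repo_id, split="train", streaming=True)
    return list(rows.take(n))
```

With network access and `datasets` installed, something like `for row in stream_head(): print(schema_problems(row))` would report any rows that deviate from the declared schema, for example rows with null values. Streaming is used here because the train split is listed at roughly 1.76 GB.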
### Metadata in Creative Language Toolkit ([CLTK](https://github.com/liyucheng09/cltk))
- CL Type: Sarcasm
- Task Type: detection
- Size: 1.3M
- Created time: 2018
### Contributions
If you have any questions, please open an issue or email [yucheng.li@surrey.ac.uk](mailto:yucheng.li@surrey.ac.uk).
Provider: Caryslara444



