Domain Shift Datasets in NLP Tasks

Name: Domain Shift Datasets in NLP Tasks
Creator: Ping Song
License: No description available

Indexed from IEEE on 2026-04-17

Download link:

https://ieee-dataport.org/documents/domain-shift-datasets-nlp-tasks

Description:

In this work, we introduce a unified benchmark for evaluating domain generalisation in text classification by bringing together seven diverse datasets spanning sentiment analysis and toxicity detection. For sentiment, we include (1) Amazon Reviews, covering multiple product categories; (2) SST-5, the five-way Stanford Sentiment Treebank drawn from film reviews; (3) the SemEval Twitter sentiment tasks; and (4) DynaSent, which provides context-sensitive sentiment judgments. For toxicity, we incorporate (5) Civil Comments, a large web-forum corpus labelled for toxicity; (6) Adversarial Civil Comments, which augments the Civil Comments set with human-crafted adversarial examples; and (7) ToxiGen, a collection of automatically generated toxic utterances designed to probe model robustness. Together, these datasets offer a spectrum of domains (e-commerce, social media, news comments) and annotation styles (star ratings, fine-grained labels, binary toxicity) that stress-test the ability of classifiers to generalise beyond their training distribution. We detail the construction, annotation schemes, and domain splits for each dataset, and demonstrate that state-of-the-art models exhibit significant performance degradation under cross-domain transfer, highlighting the critical need for robust domain-generalisation strategies in real-world NLP systems.
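The cross-domain transfer setting described above can be illustrated with a minimal sketch: train on one domain, evaluate on every domain, and compare the in-domain and out-of-domain scores. Everything in the block below is an assumption made for illustration. The function names (`train_keyword_model`, `predict`, `accuracy`, `transfer_matrix`), the word-count classifier, and the two toy domains are hypothetical and are not the benchmark's actual data, models, or evaluation protocol.

```python
from collections import Counter
from typing import Dict, List, Tuple

Example = Tuple[str, int]  # (text, label)


def train_keyword_model(examples: List[Example]) -> Dict[int, Counter]:
    """Toy stand-in for a real classifier: count word occurrences per label."""
    counts: Dict[int, Counter] = {}
    for text, label in examples:
        counts.setdefault(label, Counter()).update(text.lower().split())
    return counts


def predict(model: Dict[int, Counter], text: str) -> int:
    """Pick the label whose training words overlap most with the text."""
    tokens = text.lower().split()
    scores = {label: sum(c[t] for t in tokens) for label, c in model.items()}
    # Iterating labels in sorted order makes ties resolve to the smallest label.
    return max(sorted(scores), key=lambda label: scores[label])


def accuracy(model: Dict[int, Counter], examples: List[Example]) -> float:
    correct = sum(predict(model, text) == label for text, label in examples)
    return correct / len(examples)


def transfer_matrix(
    domains: Dict[str, List[Example]]
) -> Dict[Tuple[str, str], float]:
    """Train on each source domain and score on every target domain.

    For brevity the in-domain entries reuse the training examples; a real
    evaluation would use held-out splits for every domain.
    """
    results: Dict[Tuple[str, str], float] = {}
    for source, source_data in domains.items():
        model = train_keyword_model(source_data)
        for target, target_data in domains.items():
            results[(source, target)] = accuracy(model, target_data)
    return results


# Hypothetical toy data: two "domains" with binary sentiment labels.
domains = {
    "reviews": [
        ("great product works well", 1),
        ("terrible quality broke fast", 0),
    ],
    "tweets": [
        ("love this so much", 1),
        ("hate this so much", 0),
    ],
}

matrix = transfer_matrix(domains)
for (source, target), score in sorted(matrix.items()):
    print(f"train={source:8s} test={target:8s} accuracy={score:.2f}")
```

Reading the off-diagonal entries (source differs from target) against the diagonal gives a rough picture of the degradation under domain shift; with a real model, the same loop structure applies once the toy classifier is swapped out.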

Provided by:

Ping Song
