Probing Datasets for Noisy Texts

Name: Probing Datasets for Noisy Texts
Creator: federation.figshare.com
Published: 2021-03-14 00:00:00
License: No description available

federation.figshare.com, updated 2021-03-14, indexed 2025-03-26

Download link:

https://federation.figshare.com/articles/dataset/Probing_Datasets/14211878/4

Description:

Context

Probing tasks are popular among NLP researchers as a way to assess how much linguistic information an encoded representation carries. Each probing task is a classification problem, and the model’s performance varies with the richness of the linguistic properties encoded in the representation. This dataset contains five new probing datasets consisting of noisy texts (Tweets), which can serve as a benchmark for researchers studying the linguistic characteristics of unstructured and noisy texts.

File Structure

Format: A tab-separated text file
Column 1: train/test/validation split (tr-train, te-test, va-validation)
Column 2: class label (refer to the Content section for the class labels of each task file)
Column 3: Tweet message (text)
Column 4: a unique ID

Content

sent_len.tsv
In this classification task, the goal is to predict which of 8 possible bins (0-7) a sentence falls into based on its length; 0: (5-8), 1: (9-12), 2: (13-16), 3: (17-20), 4: (21-25), 5: (26-29), 6: (30-33), 7: (34-70). This task is called “SentLen” in the paper.

word_content.tsv
We consider a 10-way classification task with 10 words as targets, given the available manually annotated instances. The task is to predict which of the target words appears in the given sentence. Only words that appear in the BERT vocabulary are considered as target words. We constructed the data by picking the first 10 lower-cased words occurring in the corpus vocabulary, ordered by frequency and having a length of at least 4 characters (to remove noise). Each sentence contains a single target word, and the word occurs exactly once in the sentence. The task is referred to as “WC” in the paper.

bigram_shift.tsv
The purpose of the Bigram Shift task is to test whether an encoder is sensitive to legal word orders. Two adjacent words in a Tweet are inverted, and the classification model performs a binary classification to distinguish inverted (I) from non-inverted/original (O) Tweets. The task is referred to as “BShift” in the paper.

tree_depth.tsv
The Tree Depth task evaluates whether the sentence encoding captures hierarchical structure, by having the classification model predict the depth of the longest path from the root to any leaf in the Tweet's parse tree. The task is referred to as “TreeDepth” in the paper.

odd_man_out.tsv
The Tweets are modified by replacing a random noun or verb o with another noun or verb r. The task of the classifier is to identify whether the sentence has been modified by this change. Class label O refers to unmodified sentences, while C refers to modified sentences. The task is called “SOMO” in the paper.
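The file layout and two of the labelling rules described above can be sketched in Python. This is a minimal illustration, not the authors' code, and it rests on several assumptions: the files have no header row, Tweets may contain literal quote characters (so CSV quoting is disabled), and sentence length is counted in tokens (the description does not state the unit). The function names read_probing_tsv, sent_len_bin, and invert_bigram are hypothetical.

```python
import csv
import io
from collections import defaultdict

# Split codes as listed in the file-structure description.
SPLITS = {"tr": "train", "te": "test", "va": "validation"}

# SentLen bins from the description: list index = class label,
# value = inclusive (low, high) length range.
SENT_LEN_BINS = [(5, 8), (9, 12), (13, 16), (17, 20),
                 (21, 25), (26, 29), (30, 33), (34, 70)]


def read_probing_tsv(handle):
    """Group (label, text, id) tuples by split name.

    Assumes four tab-separated columns and no header row.
    """
    by_split = defaultdict(list)
    # QUOTE_NONE: treat quote characters inside Tweets as plain text.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != 4:
            continue  # assumption: silently skip malformed rows
        split, label, text, uid = row
        # Unknown split codes are kept under their raw value.
        by_split[SPLITS.get(split, split)].append((label, text, uid))
    return dict(by_split)


def sent_len_bin(length):
    """Map a sentence length to its SentLen label, or None if out of range."""
    for label, (low, high) in enumerate(SENT_LEN_BINS):
        if low <= length <= high:
            return label
    return None


def invert_bigram(tokens, index):
    """Return a copy of tokens with positions index and index + 1 swapped."""
    if not 0 <= index < len(tokens) - 1:
        raise ValueError("index must point at a token that has a successor")
    swapped = list(tokens)
    swapped[index], swapped[index + 1] = swapped[index + 1], swapped[index]
    return swapped


if __name__ == "__main__":
    # Made-up rows for illustration only; not taken from the dataset.
    sample = "tr\tO\tthis is an example tweet\tid-1\nte\tI\tthis an is example tweet\tid-2\n"
    data = read_probing_tsv(io.StringIO(sample))
    print(sorted(data))
    print(sent_len_bin(len("this is an example tweet".split())))
    print(invert_bigram("this is an example tweet".split(), 1))
```

For a real task file, the same reader could be applied to a handle opened with open(path, newline="", encoding="utf-8"); the encoding is likewise an assumption.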

Provider:

federation.figshare.com
