Probing Datasets for Noisy Texts

Name: Probing Datasets for Noisy Texts
Creator: Federation University Australia
Published: 2021-03-14 07:22:01
License: 暂无描述

DataCite Commons2021-03-14 更新2024-07-13 收录

下载链接：

https://federation.figshare.com/articles/dataset/Probing_Datasets/14211878

下载链接

链接失效反馈

官方服务：

资源简介：

ContextProbing tasks are popular among NLP researchers to assess the richness of the encoded representations of linguistic information. Each probing task is a classification problem, and the model’s performance shall vary depending on the richness of the linguistic properties crammed into the representation. This dataset contains five new probing datasets consist of noisy texts (Tweets) which can serve as a benchmark dataset for researchers to study the linguistic characteristics of unstructured and noisy texts. File StructureFormat: A tab-separated text file Column 1: train/test/validation split (tr-train, te-test, va-validation) Column 2: class label (refer to the content section for the class labels of each task file) Column 3: Tweet message (text) Column 4: a unique ID Contentsent_len.tsv In this classification task, the goal is to predict the sentence length in 8 possible bins (0-7) based on their lengths; 0: (5-8), 1: (9-12), 2: (13-16), 3: (17-20), 4: (21-25), 5: (26-29), 6: (30-33), 7: (34-70). This task is called “SentLen” in the paper.word_content.tsvWe consider a 10-way classifications task with 10 words as targets considering the available manually annotated instances. The task is predicting which of the target words appears on the given sentence. We have considered only the words that appear in the BERT vocabulary as target words. We constructed the data by picking the first 10 lower-cased words occurring in the corpus vocabulary ordered by frequency and having a length of at least 4 characters (to remove noise). Each sentence contains a single target word, and the word occurs precisely once in the sentence. The task is referred to as “WC” in the paper. bigram_shift.tsvThe purpose of the Bigram Shift task is to test whether an encoder is sensitive to legal word orders. Two adjacent words in a Tweet are inverted, and the classification model performs a binary classification to identify inverted (I) and non-inverted/original (O) Tweets. The task is referred to as “BShift” in the paper. tree_depth.tsvThe Tree Depth task evaluates the encoded sentence's ability to understand the hierarchical structure by allowing the classification model to predict the depth of the longest path from the root to any leaf in the Tweet's parser tree. The task is referred to as “TreeDepth” in the paper. odd_man_out.tsv The Tweets are modified by replacing a random noun or a verb o with another noun or verb r. The task of the classifier is to identify whether the sentence gets modified due to this change. Class label O refers to the unmodified sentences while C refers to modified sentences. The task is called “SOMO” in the paper.

提供机构：

Federation University Australia

创建时间：

2021-03-13

5,000+

优质数据集

54 个

任务类型

进入经典数据集