Resource overview:
To diagnose, analyze, and improve the text generation capabilities of pre-trained language models (PLMs), we propose TGEA 2.0, to date the largest dataset built on machine-generated text, with fine-grained semantic annotations covering a wide variety of pathological generation errors. We collected 170,000 nominal, phrasal, and sentential prompts from 6 million natural sentences across 3 domains and fed them into 4 generative PLMs, each using its best decoding strategy, to generate paragraphs. From these paragraphs we extracted 195,629 sentences for manual annotation; annotators detected 36,000 erroneous sentences and located 42,000 erroneous spans, each of which was assigned an error type from a two-level error taxonomy. For each erroneous span we define a Minimal Set of Error-related Words (MiSEW), which not only identifies the words associated with the error but also rationalizes the reasoning behind it. Quality control, consisting of pre-annotation and a feedback loop, was carried out before and throughout the annotation process. Building on this diagnostically annotated dataset, we propose 5 diagnostic benchmark tasks (erroneous text detection, MiSEW extraction, erroneous span location, erroneous span correction, and error type classification) and 2 pathology mitigation benchmark tasks (pairwise comparison and word prediction). Experimental results on these benchmark tasks show that TGEA 2.0 is a challenging dataset that can facilitate further research on automatic diagnosis and pathology mitigation for machine-generated text.
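As a rough illustration of how the annotation layers described above relate to one another, the sketch below models a single annotated sentence in Python. Every class name, field name, label string, and the toy English sentence is an assumption made for illustration; none of it is the released TGEA 2.0 file format or its actual taxonomy.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ErrorSpan:
    # Hypothetical fields, not the released TGEA 2.0 schema.
    start: int             # index of the first token in the span (inclusive)
    end: int               # index one past the last token in the span
    misew: List[int]       # indices of the error-related words (MiSEW)
    error_type: str        # placeholder for a label from the two-level taxonomy
    correction: List[str]  # tokens that replace the erroneous span


@dataclass
class AnnotatedSentence:
    tokens: List[str]
    spans: List[ErrorSpan]

    def is_erroneous(self) -> bool:
        # Target of erroneous text detection: does the sentence contain any error?
        return bool(self.spans)

    def corrected(self) -> List[str]:
        # Apply corrections right to left so earlier indices stay valid.
        out = list(self.tokens)
        for span in sorted(self.spans, key=lambda s: s.start, reverse=True):
            out[span.start:span.end] = span.correction
        return out


# Invented toy example: "flew" is the erroneous span, and "fish" is the
# word that makes it an error.
example = AnnotatedSentence(
    tokens=["The", "fish", "flew", "across", "the", "pond"],
    spans=[
        ErrorSpan(
            start=2,
            end=3,
            misew=[1],
            error_type="level1/level2",  # placeholder, not a real label
            correction=["swam"],
        )
    ],
)

print(example.is_erroneous())
print(" ".join(example.corrected()))
```

Under these assumptions, each of the five diagnostic tasks reads off one part of the record: detection uses `is_erroneous()`, MiSEW extraction uses `misew`, span location uses `start` and `end`, correction uses `correction`, and type classification uses `error_type`.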