
GYAFC (Grammarly’s Yahoo Answers Formality Corpus)

Source: OpenDataLab. Updated 2026-05-24. Indexed 2024-05-09.
Download link:
https://opendatalab.org.cn/OpenDataLab/GYAFC
Description:
Grammarly's Yahoo Answers Formality Corpus (GYAFC) is presented as the largest dataset for any style, containing a total of 110K informal/formal sentence pairs. Yahoo Answers is a question-and-answer forum that contains a large number of informal sentences and allows redistribution of its data. The authors built the GYAFC informal/formal sentence pairs from the Yahoo Answers L6 corpus.

To ensure a uniform distribution of data, they removed sentences that are questions, contain URLs, or are shorter than 5 words or longer than 25 words. After these preprocessing steps, 40 million sentences remained.

The Yahoo Answers corpus covers several distinct domains, such as Business, Entertainment & Music, Travel, and Food. Pavlick and Tetreault's formality classifier (PT16) shows that the level of formality varies considerably across these domains. To control for this variation, the authors worked with the two domains that contain the most informal sentences and reported results for training and testing within those categories.

To identify informal sentences, the authors used the PT16 formality classifier, trained on the Answers genre of the PT16 corpus. That corpus consists of nearly 5,000 randomly selected Yahoo Answers sentences, manually annotated on a scale from -3 (very informal) to 3 (very formal). They found that the Entertainment & Music and Family & Relationships domains contain the most informal sentences, and used these two domains to create the GYAFC dataset.
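The preprocessing filter described above (drop questions, sentences with URLs, and sentences outside the 5 to 25 word range) can be sketched roughly as follows. This is only an illustration, not the authors' code: the function name `keep_sentence`, the URL regular expression, whitespace tokenization, and the rule that a sentence ending in "?" counts as a question are all assumptions made for the example.

```python
import re

# Assumption: a simple pattern for spotting URLs; the original pipeline's rule is not documented here.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+", re.IGNORECASE)


def keep_sentence(sentence: str, min_words: int = 5, max_words: int = 25) -> bool:
    """Return True if the sentence passes the length, URL, and question filters."""
    text = sentence.strip()
    # Assumption: a trailing question mark is used as a stand-in for "is a question".
    if text.endswith("?"):
        return False
    # Drop any sentence that contains a URL.
    if URL_PATTERN.search(text):
        return False
    # Assumption: words are counted by splitting on whitespace.
    n_words = len(text.split())
    return min_words <= n_words <= max_words


if __name__ == "__main__":
    samples = [
        "i dont think that movie was any good lol",
        "what is your favorite song of all time?",
        "check out http://example.com for more info on this band",
        "too short here",
    ]
    for s in samples:
        print(keep_sentence(s), "|", s)
```

The length bounds are exposed as parameters so the same sketch can be reused with other thresholds; the defaults match the 5 and 25 word limits stated in the description.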
Provider:
OpenDataLab
Created:
2022-11-02
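Dataset introduction
Background overview
GYAFC is a text dataset built from the Yahoo Answers L6 corpus. It contains 110K informal/formal sentence pairs and is intended for style transfer research. It was released in 2018 by the University of Maryland, College Park and Grammarly. Sentences were selected through preprocessing and filtering with the PT16 formality classifier, with a focus on the informal Entertainment & Music and Family & Relationships domains. The dataset is suited to natural language processing tasks.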