AAVE_SAE_Paired_Dataset

Name: AAVE_SAE_Paired_Dataset
Creator: OpenDataLab
Published: 2026-05-17 07:30:11
License: No description available

OpenDataLab, updated 2026-05-17, indexed 2024-05-09

Download link:

https://opendatalab.org.cn/OpenDataLab/AAVE_SAE_Paired_Dataset

Resource description:

The growth of social media has encouraged the written use of African American Vernacular English (AAVE), a variety traditionally used only in oral contexts. However, because text corpora are most readily available for dominant English varieties such as Standard American English (SAE), NLP models have historically been developed on those varieties. We investigate GPT-2's performance on AAVE text by creating a dataset of intent-equivalent parallel AAVE/SAE tweet pairs, thereby isolating the syntactic structure and the AAVE- or SAE-specific language of each pair. We evaluate each sample and its GPT-2-generated text with a pretrained sentiment classifier and find that, while AAVE text yields more negative sentiment classifications than SAE text, using GPT-2 generally increases the occurrence of positive sentiment for both. Additionally, we conduct a human evaluation of GPT-2-generated AAVE and SAE text to compare contextual rigor and overall quality.
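The paired comparison described above can be sketched roughly as follows. This is a minimal illustration, not the dataset authors' code: the function name `negative_rates`, the `toy_classifier` stand-in, and the placeholder strings are all assumptions made for the example, and a real run would substitute an actual pretrained sentiment classifier and the dataset's own tweet pairs.

```python
from typing import Callable, Iterable, Tuple


def negative_rates(
    pairs: Iterable[Tuple[str, str]],
    classify: Callable[[str], str],
) -> Tuple[float, float]:
    """Return the share of texts labeled "negative" for each side of the pairs.

    `pairs` holds (aave_text, sae_text) tuples; `classify` maps a text to
    "positive" or "negative". Both names are hypothetical.
    """
    pairs = list(pairs)
    if not pairs:
        raise ValueError("no pairs given")
    aave_neg = sum(classify(aave) == "negative" for aave, _ in pairs)
    sae_neg = sum(classify(sae) == "negative" for _, sae in pairs)
    return aave_neg / len(pairs), sae_neg / len(pairs)


def toy_classifier(text: str) -> str:
    # Stand-in only: a real evaluation would call a pretrained sentiment model.
    return "negative" if "bad" in text.lower() else "positive"


# Synthetic placeholder strings, not taken from the dataset.
example_pairs = [
    ("placeholder a1 bad", "placeholder s1 bad"),
    ("placeholder a2", "placeholder s2"),
]
aave_rate, sae_rate = negative_rates(example_pairs, toy_classifier)
```

Running the same function on the GPT-2 continuations of each pair, rather than on the original tweets, would give the second half of the comparison the description mentions.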

Provided by:

OpenDataLab

Created:

2022-05-23

Collection summary

Dataset introduction