arbml/AOC_ALDi
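Dataset Overview
Dataset Name
- Name: AOC_ALDi
Dataset Description
- Summary: AOC_ALDi is derived from the AOC dataset and contains 127,835 sentences, 17% from news articles and 83% from user comments on those articles. Each sentence is manually labeled with its level of dialectness.
- Supported Tasks and Leaderboards: More information needed
- Languages: More information needed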
Dataset Structure
- Data Instances: More information needed
- Data Fields: More information needed
- Data Splits: More information needed
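Dataset Creation
- Curation Rationale: More information needed
- Source Data:
  - Initial Data Collection and Normalization: More information needed
  - Source Language Producers: More information needed
- Annotations:
  - Annotation Process: More information needed
  - Annotators: More information needed
- Personal and Sensitive Information: More information needed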
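Considerations for Using the Data
- Social Impact of Dataset: More information needed
- Discussion of Biases: More information needed
- Other Known Limitations: More information needed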
Additional Information
- Dataset Curators: More information needed
- Licensing Information: More information needed
- Citation Information:
@inproceedings{keleg-etal-2023-aldi,
    title = "{ALD}i: Quantifying the {A}rabic Level of Dialectness of Text",
    author = "Keleg, Amr and Goldwater, Sharon and Magdy, Walid",
    booktitle = "Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.emnlp-main.655",
    doi = "10.18653/v1/2023.emnlp-main.655",
    pages = "10597--10611",
    abstract = "Transcribed speech and user-generated text in Arabic typically contain a mixture of Modern Standard Arabic (MSA), the standardized language taught in schools, and Dialectal Arabic (DA), used in daily communications. To handle this variation, previous work in Arabic NLP has focused on Dialect Identification (DI) on the sentence or the token level. However, DI treats the task as binary, whereas we argue that Arabic speakers perceive a spectrum of dialectness, which we operationalize at the sentence level as the Arabic Level of Dialectness (ALDi), a continuous linguistic variable. We introduce the AOC-ALDi dataset (derived from the AOC dataset), containing 127,835 sentences (17{\%} from news articles and 83{\%} from user comments on those articles) which are manually labeled with their level of dialectness. We provide a detailed analysis of AOC-ALDi and show that a model trained on it can effectively identify levels of dialectness on a range of other corpora (including dialects and genres not included in AOC-ALDi), providing a more nuanced picture than traditional DI systems. Through case studies, we illustrate how ALDi can reveal Arabic speakers{'} stylistic choices in different situations, a useful property for sociolinguistic analyses.",
}
- Contributions: Thanks to @github-username for adding this dataset.




