arbml/AOC_ALDi

Name: arbml/AOC_ALDi
Creator: arbml
Published: 2024-03-23 12:52:55
License: No description available

Source: Hugging Face. Updated: 2024-03-23. Indexed: 2024-06-11.

Download link:

https://hf-mirror.com/datasets/arbml/AOC_ALDi
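
As a minimal sketch of how the link above might be used from Python, the snippet below builds the dataset page URL and wraps a download call. It assumes the Hugging Face `datasets` package is installed and that the mirror honors the `HF_ENDPOINT` environment variable; the helper names `dataset_page_url` and `load_aoc_aldi` are hypothetical and introduced only for illustration.

```python
import os

REPO_ID = "arbml/AOC_ALDi"        # dataset identifier shown on this page
MIRROR = "https://hf-mirror.com"  # mirror host taken from the download link


def dataset_page_url(repo_id: str = REPO_ID, endpoint: str = MIRROR) -> str:
    """Build the dataset page URL for a given Hub endpoint."""
    return f"{endpoint.rstrip('/')}/datasets/{repo_id}"


def load_aoc_aldi(use_mirror: bool = True):
    """Download the dataset (needs `pip install datasets` and network access)."""
    if use_mirror:
        # Assumption: the endpoint must be set before the Hugging Face
        # libraries are imported, so it is set here, ahead of the import.
        os.environ.setdefault("HF_ENDPOINT", MIRROR)
    from datasets import load_dataset  # lazy import so the variable takes effect
    return load_dataset(REPO_ID)


print(dataset_page_url())
```

This page does not list the splits or fields, so after calling `load_aoc_aldi()` it is safer to inspect the returned object than to assume particular split or column names.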

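Resource overview:

The AOC_ALDi dataset contains 127,835 sentences drawn from news articles and user comments, each manually annotated with its level of dialectness (ALDi). The dataset aims to quantify how dialectal an Arabic text is and to provide a more nuanced analysis than traditional dialect identification systems.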

Provided by:

arbml

Summary of original information

Dataset overview

Dataset name

Name: AOC_ALDi

Dataset description

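Summary: AOC_ALDi is derived from the AOC dataset and contains 127,835 sentences, 17% from news articles and 83% from user comments on those articles. Each sentence is manually labeled with its level of dialectness.
Supported tasks and leaderboards: More information needed
Languages: More information needed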
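
The 17% and 83% shares in the summary can be turned into rough sentence counts. The sketch below does that arithmetic; because the percentages are rounded, the results are approximations rather than the dataset's exact counts, and the function name `approximate_counts` is hypothetical.

```python
TOTAL_SENTENCES = 127_835  # total stated in the summary
NEWS_SHARE = 0.17          # rounded share of sentences from news articles


def approximate_counts(total: int, news_share: float) -> tuple[int, int]:
    """Return (news, comments): round the news count, give the rest to comments."""
    news = round(total * news_share)
    return news, total - news


news, comments = approximate_counts(TOTAL_SENTENCES, NEWS_SHARE)
print(f"news ~ {news:,}, comments ~ {comments:,}")
```

Under these rounded shares, roughly 21,732 sentences come from news articles and roughly 106,103 from user comments.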

Dataset structure

Data instances: More information needed
Data fields: More information needed
Data splits: More information needed

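Dataset creation

Curation rationale: More information needed
Source data:
- Initial data collection and normalization: More information needed
- Source language producers: More information needed
Annotations:
- Annotation process: More information needed
- Annotators: More information needed
Personal and sensitive information: More information needed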

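Considerations for using the data

Social impact of the dataset: More information needed
Discussion of biases: More information needed
Other known limitations: More information needed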

Additional information

Dataset curators: More information needed
Licensing information: More information needed
Citation information:

@inproceedings{keleg-etal-2023-aldi,
    title = "{ALD}i: Quantifying the {A}rabic Level of Dialectness of Text",
    author = "Keleg, Amr and Goldwater, Sharon and Magdy, Walid",
    booktitle = "Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.emnlp-main.655",
    doi = "10.18653/v1/2023.emnlp-main.655",
    pages = "10597--10611",
    abstract = "Transcribed speech and user-generated text in Arabic typically contain a mixture of Modern Standard Arabic (MSA), the standardized language taught in schools, and Dialectal Arabic (DA), used in daily communications. To handle this variation, previous work in Arabic NLP has focused on Dialect Identification (DI) on the sentence or the token level. However, DI treats the task as binary, whereas we argue that Arabic speakers perceive a spectrum of dialectness, which we operationalize at the sentence level as the Arabic Level of Dialectness (ALDi), a continuous linguistic variable. We introduce the AOC-ALDi dataset (derived from the AOC dataset), containing 127,835 sentences (17{\%} from news articles and 83{\%} from user comments on those articles) which are manually labeled with their level of dialectness. We provide a detailed analysis of AOC-ALDi and show that a model trained on it can effectively identify levels of dialectness on a range of other corpora (including dialects and genres not included in AOC-ALDi), providing a more nuanced picture than traditional DI systems. Through case studies, we illustrate how ALDi can reveal Arabic speakers{'} stylistic choices in different situations, a useful property for sociolinguistic analyses.",
}
Contributions: Thanks to @github-username for adding this dataset.

Collected summary

Dataset introduction

Background and challenges

Background overview

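AOC_ALDi is an Arabic text dataset focused on quantifying the Arabic Level of Dialectness (ALDi) of text rather than on traditional binary dialect identification. It contains about 128,000 rows drawn from news articles and user comments. It is intended for analyzing how Modern Standard Arabic and dialectal Arabic mix within a text, and it supports dialectness-level estimation in natural language processing tasks.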
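
To illustrate the contrast between a binary dialect label and a continuous level of dialectness, here is a minimal sketch. It assumes a score in [0, 1], with 0 for fully Modern Standard Arabic and 1 for fully dialectal; that range, the 0.5 threshold, the example scores, and the function name `binary_label` are all hypothetical and are not taken from this page.

```python
def binary_label(score: float, threshold: float = 0.5) -> str:
    """Collapse a continuous dialectness score into a two-way MSA/DA label."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score is assumed to lie in [0, 1]")
    return "DA" if score >= threshold else "MSA"


# Hypothetical scores chosen only to show what the binary view hides.
example_scores = [0.05, 0.45, 0.55, 0.95]

for score in example_scores:
    print(f"level={score:.2f} -> binary label {binary_label(score)}")
```

In this toy example, 0.45 and 0.55 receive different binary labels despite being close, while 0.55 and 0.95 share a label despite being far apart. That loss of detail is what a continuous measure is meant to avoid.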

The content above was collected and summarized by 遇见数据集.
