arabic_pos_dialect
Download link:
https://modelscope.cn/datasets/QCRI/arabic_pos_dialect
# Dataset Card for Arabic POS Dialect
## Table of Contents
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)
## Dataset Description
- **Homepage:** https://alt.qcri.org/resources/da_resources/
- **Repository:** https://github.com/qcri/dialectal_arabic_resources
- **Paper:** http://www.lrec-conf.org/proceedings/lrec2018/pdf/562.pdf
- **Contacts:**
  - Ahmed Abdelali < aabdelali @ hbku dot edu dot qa >
  - Kareem Darwish < kdarwish @ hbku dot edu dot qa >
  - Hamdy Mubarak < hmubarak @ hbku dot edu dot qa >
### Dataset Summary
This dataset was created to support part-of-speech (POS) tagging in dialects of Arabic. It contains 350 manually segmented and POS-tagged tweets for each of four dialects: Egyptian, Levantine, Gulf, and Maghrebi.
### Supported Tasks and Leaderboards
The dataset can be used to train models for token segmentation and part-of-speech tagging in Arabic dialects. Performance on this task is typically measured as tagging accuracy on a held-out set. Darwish et al. (2018) train a CRF model across all four dialects and report an average accuracy of 89.3%.
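As an illustration of the accuracy metric mentioned above, here is a minimal sketch of token-level accuracy. It assumes each token's tag string (for example `NOUN+PRON`) is compared as a whole; the card does not say whether Darwish et al. score at the word level or the segment level, and `token_accuracy` is a hypothetical helper name.

```python
from typing import Sequence


def token_accuracy(gold: Sequence[Sequence[str]],
                   pred: Sequence[Sequence[str]]) -> float:
    """Fraction of tokens whose predicted tag string equals the gold one.

    Assumption: a compound tag such as "NOUN+PRON" counts as correct only
    if the entire string matches.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must contain the same number of tweets")
    correct = 0
    total = 0
    for gold_tags, pred_tags in zip(gold, pred):
        if len(gold_tags) != len(pred_tags):
            raise ValueError("each tweet must have one predicted tag per token")
        total += len(gold_tags)
        correct += sum(g == p for g, p in zip(gold_tags, pred_tags))
    return correct / total if total else 0.0


# Gold tags are taken from the example instance in this card;
# the predictions are made up for illustration.
gold = [["PART", "PART", "V", "NOUN", "PREP", "NOUN+PRON"]]
pred = [["PART", "PART", "V", "NOUN", "PREP", "NOUN"]]
print(token_accuracy(gold, pred))
```

Scoring per segment instead would require splitting each tag string on `+` before comparing.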
### Languages
The BCP-47 code is `ar-Arab`. The dataset covers four dialects of Arabic, all written in Arabic script: Egyptian (EGY), Levantine (LEV), Gulf (GLF), and Maghrebi (MGR).
## Dataset Structure
### Data Instances
Below is a partial example from the Egyptian set:
```
- `Fold`: 4
- `SubFold`: A
- `Word`: [ليه, لما, تحب, حد, من, قلبك, ...]
- `Segmentation`: [ليه, لما, تحب, حد, من, قلب+ك, ...]
- `POS`: [PART, PART, V, NOUN, PREP, NOUN+PRON, ...]
```
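The parallel structure of the instance above can be made explicit with a short sketch. It uses only the six tokens shown (the original example is truncated) and the lowercase field names from the Data Fields section; `align_instance` is a hypothetical helper, and the sketch assumes that `+` never occurs as a literal character inside a segment.

```python
from typing import Dict, List, Tuple

# The six tokens shown in the example above; the rest of the tweet is elided in the card.
instance: Dict[str, object] = {
    "fold": 4,
    "subfold": "A",
    "words": ["ليه", "لما", "تحب", "حد", "من", "قلبك"],
    "segments": ["ليه", "لما", "تحب", "حد", "من", "قلب+ك"],
    "pos_tags": ["PART", "PART", "V", "NOUN", "PREP", "NOUN+PRON"],
}


def align_instance(inst: Dict[str, object]) -> List[Tuple[str, List[Tuple[str, str]]]]:
    """Pair every word with its (segment, tag) list.

    Assumption: '+' is used only as a separator, so splitting on it is safe.
    """
    rows = []
    for word, seg, tag in zip(inst["words"], inst["segments"], inst["pos_tags"]):
        pieces = seg.split("+")
        tags = tag.split("+")
        if len(pieces) != len(tags):
            raise ValueError(f"segment/tag count mismatch for word {word!r}")
        rows.append((word, list(zip(pieces, tags))))
    return rows


for word, pairs in align_instance(instance):
    print(word, pairs)
```

The last word, `قلبك`, splits into the noun `قلب` and the pronoun suffix `ك`, matching the `NOUN+PRON` tag.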
### Data Fields
The `fold` and `subfold` fields refer to the cross-validation splits used by Darwish et al., which can be generated using this [script](https://github.com/qcri/dialectal_arabic_resources/blob/master/generate_splits.sh).
- `fold`: An int32 indicating which cross-validation fold the instance belongs to
- `subfold`: A string, either 'A' or 'B', indicating which subfold the instance belongs to
- `words`: A sequence of strings, one per unsegmented token
- `segments`: A sequence of strings, one per token, with the token's segments separated by '+' when there is more than one
- `pos_tags`: A sequence of strings, one per token, with the POS tags of the token's segments separated by '+' when there is more than one
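The `fold` field can be used to rebuild a train/test partition. The sketch below holds out one fold and trains on the rest. This is an assumption about usage, not the authors' documented procedure: the card does not state how many folds exist or how the 'A'/'B' subfolds were used, so `subfold` is ignored here and the toy fold numbers are hypothetical.

```python
from typing import Dict, List, Tuple

Instance = Dict[str, object]


def split_by_fold(instances: List[Instance],
                  test_fold: int) -> Tuple[List[Instance], List[Instance]]:
    """Hold out every instance whose `fold` equals `test_fold`."""
    train = [x for x in instances if x["fold"] != test_fold]
    test = [x for x in instances if x["fold"] == test_fold]
    return train, test


# Toy instances with hypothetical fold numbers; real rows also carry
# `words`, `segments`, and `pos_tags`.
toy = [{"fold": f, "subfold": s} for f in (1, 2, 3) for s in ("A", "B")]
train, test = split_by_fold(toy, test_fold=2)
print(len(train), len(test))
```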
The POS tags consist of a set developed by [Darwish et al. (2017)](https://www.aclweb.org/anthology/W17-1316.pdf) for Modern Standard Arabic (MSA), plus six additional tags (2 dialect-specific tags and 4 tweet-specific tags).
| Tag | Purpose | Description |
| ----- | ------ | ----- |
| ADV | MSA | Adverb |
| ADJ | MSA | Adjective |
| CONJ | MSA | Conjunction |
| DET | MSA | Determiner |
| NOUN | MSA | Noun |
| NSUFF | MSA | Noun suffix |
| NUM | MSA | Number |
| PART | MSA | Particle |
| PREP | MSA | Preposition |
| PRON | MSA | Pronoun |
| PUNC | MSA | Punctuation |
| V | MSA | Verb |
| ABBREV | MSA | Abbreviation |
| CASE | MSA | Alef of tanween fatha |
| JUS | MSA | Jussification attached to verbs |
| VSUFF | MSA | Verb Suffix |
| FOREIGN | MSA | Non-Arabic as well as non-MSA words |
| FUR_PART | MSA | Future particle "s" prefix and "swf" |
| PROG_PART | Dialect | Progressive particle |
| NEG_PART | Dialect | Negation particle |
| HASH | Tweet | Hashtag |
| EMOT | Tweet | Emoticon/Emoji |
| MENTION | Tweet | Mention |
| URL | Tweet | URL |
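The table above can be turned into a lookup for sanity-checking tag strings. This is a minimal sketch: the tag names are copied from the table as written, while `TAGSET` and `unknown_tags` are hypothetical names introduced for illustration.

```python
from collections import Counter
from typing import Iterable, Set

# Tag -> purpose, copied from the table above.
TAGSET = {
    "ADV": "MSA", "ADJ": "MSA", "CONJ": "MSA", "DET": "MSA", "NOUN": "MSA",
    "NSUFF": "MSA", "NUM": "MSA", "PART": "MSA", "PREP": "MSA", "PRON": "MSA",
    "PUNC": "MSA", "V": "MSA", "ABBREV": "MSA", "CASE": "MSA", "JUS": "MSA",
    "VSUFF": "MSA", "FOREIGN": "MSA", "FUR_PART": "MSA",
    "PROG_PART": "Dialect", "NEG_PART": "Dialect",
    "HASH": "Tweet", "EMOT": "Tweet", "MENTION": "Tweet", "URL": "Tweet",
}


def unknown_tags(pos_tags: Iterable[str]) -> Set[str]:
    """Return every '+'-separated tag component that is not in TAGSET."""
    seen = {part for tag in pos_tags for part in tag.split("+")}
    return seen - TAGSET.keys()


purpose_counts = Counter(TAGSET.values())
print(dict(purpose_counts))
print(unknown_tags(["PART", "NOUN+PRON", "XYZ"]))
```

The counts agree with the description above: two dialect-specific and four tweet-specific tags on top of the MSA set.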
### Data Splits
The dataset is split by dialect.
| Dialect | Tweets | Words |
| ----- | ------ | ----- |
| Egyptian (EGY) | 350 | 7481 |
| Levantine (LEV) | 350 | 7221 |
| Gulf (GLF) | 350 | 6767 |
| Maghrebi (MGR) | 350 | 6400 |
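A quick calculation over the table above gives the overall size and the average tweet length per dialect. The numbers are copied from the table; `SPLITS` is a name introduced only for this sketch.

```python
# (tweets, words) per dialect, copied from the table above.
SPLITS = {
    "EGY": (350, 7481),
    "LEV": (350, 7221),
    "GLF": (350, 6767),
    "MGR": (350, 6400),
}

total_tweets = sum(tweets for tweets, _ in SPLITS.values())
total_words = sum(words for _, words in SPLITS.values())
avg_words = {d: words / tweets for d, (tweets, words) in SPLITS.items()}

print(f"total: {total_tweets} tweets, {total_words} words")
for dialect, avg in avg_words.items():
    print(f"{dialect}: {avg:.1f} words per tweet")
```

In total the four sets hold 1,400 tweets and 27,869 words, with averages ranging from about 18.3 (MGR) to about 21.4 (EGY) words per tweet.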
## Dataset Creation
### Curation Rationale
This dataset was created to address the lack of computational resources available for dialects of Arabic. These dialects are typically used in speech, while written Arabic is typically Modern Standard Arabic. Social media, however, has given people a venue for writing in dialect.
### Source Data
This dataset builds on the work of [Eldesouki et al. (2017)](https://arxiv.org/pdf/1708.05891.pdf) and [Samih et al. (2017b)](https://www.aclweb.org/anthology/K17-1043.pdf), who originally collected the tweets.
#### Initial Data Collection and Normalization
They started with 175 million Arabic tweets returned by the Twitter API using the query "lang:ar" in March 2014. They then filtered this set using author-identified locations and tokens that are unique to each dialect. Finally, they had native speakers of each dialect select 350 tweets that were heavily accented.
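The filtering step described above can be sketched roughly as follows. Everything concrete here is a placeholder: the location strings, the token lists, the tweet dictionary layout, and the choice to require both signals at once are assumptions for illustration, not the lexicons or rules used by the original authors.

```python
from typing import Dict, Optional, Set

# Hypothetical placeholders, NOT the authors' actual locations or lexicons.
DIALECT_LOCATIONS: Dict[str, Set[str]] = {
    "EGY": {"cairo"},
    "GLF": {"doha"},
}
DIALECT_TOKENS: Dict[str, Set[str]] = {
    "EGY": {"tok_egy"},
    "GLF": {"tok_glf"},
}


def candidate_dialect(tweet: Dict[str, str]) -> Optional[str]:
    """Return a dialect label if the author location and a dialect-specific
    token both point to the same dialect, else None.

    Assumption: both signals are required; the card does not say how the
    two filters were combined.
    """
    location = tweet.get("user_location", "").strip().lower()
    tokens = set(tweet.get("text", "").split())
    for dialect, locations in DIALECT_LOCATIONS.items():
        if location in locations and tokens & DIALECT_TOKENS[dialect]:
            return dialect
    return None


tweets = [
    {"user_location": "Cairo", "text": "x tok_egy y"},
    {"user_location": "Cairo", "text": "x y z"},
    {"user_location": "Doha", "text": "tok_glf"},
]
print([candidate_dialect(t) for t in tweets])
```

In the original work, native speakers then selected the final 350 tweets per dialect by hand, a step no automatic filter replaces.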
#### Who are the source language producers?
The source language producers are people who posted on Twitter in Arabic using dialectal words from countries where the dialects of interest were spoken, as identified in [Mubarak and Darwish (2014)](https://www.aclweb.org/anthology/W14-3601.pdf).
### Annotations
#### Annotation process
The segmentation guidelines are available at https://alt.qcri.org/resources1/da_resources/seg-guidelines.pdf. The tagging guidelines are not provided, but Darwish et al. note that there were multiple rounds of quality control and revision.
#### Who are the annotators?
The POS tags were annotated by native speakers of each dialect. Further information is not known.
### Personal and Sensitive Information
[More Information Needed]
## Considerations for Using the Data
### Social Impact of Dataset
Darwish et al. find that accuracy on the Maghrebi dataset suffered the most when the training set came from another dialect and, conversely, that training on Maghrebi yielded the worst results for all the other dialects. They suggest that Egyptian, Levantine, and Gulf may be more similar to each other, with Maghrebi the most dissimilar to all of them. They also find that training on Modern Standard Arabic (MSA) and testing on dialects yielded significantly lower results than training on dialects and testing on MSA. This suggests that dialectal variation should be a significant consideration for future work in Arabic NLP applications, particularly when working with social media text.
### Discussion of Biases
[More Information Needed]
### Other Known Limitations
[More Information Needed]
## Additional Information
### Dataset Curators
This dataset was curated by Kareem Darwish, Hamdy Mubarak, Mohamed Eldesouki and Ahmed Abdelali with the Qatar Computing Research Institute (QCRI), Younes Samih and Laura Kallmeyer with the University of Dusseldorf, Randah Alharbi and Walid Magdy with the University of Edinburgh, and Mohammed Attia with Google. No funding information was included.
### Licensing Information
This dataset is licensed under the [Apache License, Version 2.0](http://www.apache.org/licenses/LICENSE-2.0).
### Citation Information
Kareem Darwish, Hamdy Mubarak, Ahmed Abdelali, Mohamed Eldesouki, Younes Samih, Randah Alharbi, Mohammed Attia, Walid Magdy and Laura Kallmeyer (2018) Multi-Dialect Arabic POS Tagging: A CRF Approach. Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), May 7-12, 2018. Miyazaki, Japan.
```
@InProceedings{DARWISH18.562,
author = {Kareem Darwish and Hamdy Mubarak and Ahmed Abdelali and Mohamed Eldesouki and Younes Samih and Randah Alharbi and Mohammed Attia and Walid Magdy and Laura Kallmeyer},
title = {Multi-Dialect Arabic POS Tagging: A CRF Approach},
booktitle = {Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)},
year = {2018},
month = {may},
date = {7-12},
location = {Miyazaki, Japan},
editor = {Nicoletta Calzolari (Conference chair) and Khalid Choukri and Christopher Cieri and Thierry Declerck and Sara Goggi and Koiti Hasida and Hitoshi Isahara and Bente Maegaard and Joseph Mariani and Hélène Mazo and Asuncion Moreno and Jan Odijk and Stelios Piperidis and Takenobu Tokunaga},
publisher = {European Language Resources Association (ELRA)},
address = {Paris, France},
isbn = {979-10-95546-00-9},
language = {english}
}
```
### Contributions
Thanks to [@mcmillanmajora](https://github.com/mcmillanmajora) for adding this dataset.