QCRI/arabic_pos_dialect
Source: Hugging Face (updated 2024-01-09, indexed 2024-05-25)
Download link: https://hf-mirror.com/datasets/QCRI/arabic_pos_dialect
Resource description:
---
annotations_creators:
- expert-generated
language_creators:
- found
language:
- ar
license:
- apache-2.0
multilinguality:
- multilingual
size_categories:
- n<1K
source_datasets:
- extended
task_categories:
- token-classification
task_ids:
- part-of-speech
pretty_name: Arabic POS Dialect
dataset_info:
- config_name: egy
  features:
  - name: fold
    dtype: int32
  - name: subfold
    dtype: string
  - name: words
    sequence: string
  - name: segments
    sequence: string
  - name: pos_tags
    sequence: string
  splits:
  - name: train
    num_bytes: 269629
    num_examples: 350
  download_size: 89684
  dataset_size: 269629
- config_name: glf
  features:
  - name: fold
    dtype: int32
  - name: subfold
    dtype: string
  - name: words
    sequence: string
  - name: segments
    sequence: string
  - name: pos_tags
    sequence: string
  splits:
  - name: train
    num_bytes: 239883
    num_examples: 350
  download_size: 89178
  dataset_size: 239883
- config_name: lev
  features:
  - name: fold
    dtype: int32
  - name: subfold
    dtype: string
  - name: words
    sequence: string
  - name: segments
    sequence: string
  - name: pos_tags
    sequence: string
  splits:
  - name: train
    num_bytes: 263102
    num_examples: 350
  download_size: 97055
  dataset_size: 263102
- config_name: mgr
  features:
  - name: fold
    dtype: int32
  - name: subfold
    dtype: string
  - name: words
    sequence: string
  - name: segments
    sequence: string
  - name: pos_tags
    sequence: string
  splits:
  - name: train
    num_bytes: 245717
    num_examples: 350
  download_size: 90503
  dataset_size: 245717
configs:
- config_name: egy
  data_files:
  - split: train
    path: egy/train-*
- config_name: glf
  data_files:
  - split: train
    path: glf/train-*
- config_name: lev
  data_files:
  - split: train
    path: lev/train-*
- config_name: mgr
  data_files:
  - split: train
    path: mgr/train-*
---
# Dataset Card for Arabic POS Dialect
## Table of Contents
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)
## Dataset Description
- **Homepage:** https://alt.qcri.org/resources/da_resources/
- **Repository:** https://github.com/qcri/dialectal_arabic_resources
- **Paper:** http://www.lrec-conf.org/proceedings/lrec2018/pdf/562.pdf
- **Contacts:**
  - Ahmed Abdelali < aabdelali @ hbku dot edu dot qa >
  - Kareem Darwish < kdarwish @ hbku dot edu dot qa >
  - Hamdy Mubarak < hmubarak @ hbku dot edu dot qa >
### Dataset Summary
This dataset was created to support part-of-speech (POS) tagging in dialects of Arabic. It contains 350 manually segmented and POS-tagged tweets for each of four dialects: Egyptian, Levantine, Gulf, and Maghrebi.
### Supported Tasks and Leaderboards
The dataset can be used to train models for token segmentation and part-of-speech tagging in Arabic dialects. Success on this task is typically measured by accuracy on a held-out set. Darwish et al. (2018) train a CRF model across all four dialects and achieve an average accuracy of 89.3%.
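As a rough illustration of the metric, the sketch below computes token-level tagging accuracy over a held-out set. The function name `token_accuracy` and the toy tag sequences are illustrative assumptions; this is not the evaluation script used by Darwish et al. (2018).

```python
from typing import List, Sequence


def token_accuracy(gold: Sequence[List[str]], pred: Sequence[List[str]]) -> float:
    """Fraction of tokens whose predicted tag string equals the gold tag string."""
    correct = 0
    total = 0
    for gold_tags, pred_tags in zip(gold, pred):
        if len(gold_tags) != len(pred_tags):
            raise ValueError("gold and predicted sequences must have equal length")
        correct += sum(g == p for g, p in zip(gold_tags, pred_tags))
        total += len(gold_tags)
    return correct / total if total else 0.0


# Toy example (hypothetical predictions, not real model output).
gold = [["PART", "PART", "V", "NOUN"], ["PREP", "NOUN+PRON"]]
pred = [["PART", "ADV", "V", "NOUN"], ["PREP", "NOUN+PRON"]]
print(f"{token_accuracy(gold, pred):.3f}")
```

Here a compound tag such as `NOUN+PRON` counts as correct only when the whole string matches, which is one possible convention among several.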
### Languages
The BCP-47 code is ar-Arab. The dataset consists of four dialects of Arabic written in Arabic script: Egyptian (EGY), Levantine (LEV), Gulf (GLF), and Maghrebi (MGR).
## Dataset Structure
### Data Instances
Below is a partial example from the Egyptian set:
```
- `Fold`: 4
- `SubFold`: A
- `Word`: [ليه, لما, تحب, حد, من, قلبك, ...]
- `Segmentation`: [ليه, لما, تحب, حد, من, قلب+ك, ...]
- `POS`: [PART, PART, V, NOUN, PREP, NOUN+PRON, ...]
```
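To make the relationship between the segmentation and the tags concrete, the following sketch pairs each segment of a word with its tag by splitting both strings on `+`. The helper name `align_segments` is hypothetical, and the code assumes that `+` occurs only as a separator, which may not hold for every tweet.

```python
from typing import List, Tuple


def align_segments(segments: List[str], pos_tags: List[str]) -> List[List[Tuple[str, str]]]:
    """Pair every '+'-separated segment of a word with its '+'-separated tag."""
    aligned = []
    for seg, tag in zip(segments, pos_tags):
        parts = seg.split("+")
        tags = tag.split("+")
        if len(parts) != len(tags):
            raise ValueError(f"segment/tag count mismatch: {seg!r} vs {tag!r}")
        aligned.append(list(zip(parts, tags)))
    return aligned


# The first six tokens of the Egyptian example above.
segments = ["ليه", "لما", "تحب", "حد", "من", "قلب+ك"]
pos_tags = ["PART", "PART", "V", "NOUN", "PREP", "NOUN+PRON"]
aligned = align_segments(segments, pos_tags)
print(aligned[-1])  # the last word splits into a noun and a pronoun
```

With the Hugging Face `datasets` library installed, records with these fields can presumably be obtained through `load_dataset("QCRI/arabic_pos_dialect", "egy")`, where `egy` may be replaced by `glf`, `lev`, or `mgr`.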
### Data Fields
The `fold` and the `subfold` fields refer to the crossfold validation splits used by Darwish et al., which can be generated using this [script](https://github.com/qcri/dialectal_arabic_resources/blob/master/generate_splits.sh).
- `fold`: An int32 indicating which fold the instance was in for the crossfold validation
- `subfold`: A string, either 'A' or 'B', indicating which subfold the instance was in for the crossfold validation
- `words`: A sequence of strings, one per unsegmented token
- `segments`: A sequence of strings consisting of the segments of the word separated by '+' if there is more than one segment
- `pos_tags`: A sequence of strings of the part of speech tags of the segments separated by '+' if there is more than one segment
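A minimal sketch of how the `fold` field could drive cross-validation is given below. It assumes a simple leave-one-fold-out scheme and ignores `subfold`; how Darwish et al. actually combine folds and subfolds is not described here, so `split_by_fold` and the miniature records are illustrative only.

```python
from typing import Dict, Iterable, List, Tuple

Record = Dict[str, object]


def split_by_fold(records: Iterable[Record], held_out_fold: int) -> Tuple[List[Record], List[Record]]:
    """Return (train, test), where test holds every record of the chosen fold."""
    train: List[Record] = []
    test: List[Record] = []
    for record in records:
        (test if record["fold"] == held_out_fold else train).append(record)
    return train, test


# Hypothetical miniature records; real ones also carry words, segments and pos_tags.
records = [
    {"fold": 1, "subfold": "A"},
    {"fold": 1, "subfold": "B"},
    {"fold": 2, "subfold": "A"},
    {"fold": 3, "subfold": "B"},
]
train, test = split_by_fold(records, held_out_fold=1)
print(len(train), len(test))
```

Repeating the call once per distinct `fold` value gives one train/test pair per fold.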
The POS tags consist of a set developed by [Darwish et al. (2017)](https://www.aclweb.org/anthology/W17-1316.pdf) for Modern Standard Arabic (MSA) plus an additional 6 tags (2 dialect-specific tags and 4 tweet-specific tags).
| Tag | Purpose | Description |
| ----- | ------ | ----- |
| ADV | MSA | Adverb |
| ADJ | MSA | Adjective |
| CONJ | MSA | Conjunction |
| DET | MSA | Determiner |
| NOUN | MSA | Noun |
| NSUFF | MSA | Noun suffix |
| NUM | MSA | Number |
| PART | MSA | Particle |
| PREP | MSA | Preposition |
| PRON | MSA | Pronoun |
| PUNC | MSA | Punctuation |
| V | MSA | Verb |
| ABBREV | MSA | Abbreviation |
| CASE | MSA | Alef of tanween fatha |
| JUS | MSA | Jussification attached to verbs |
| VSUFF | MSA | Verb Suffix |
| FOREIGN | MSA | Non-Arabic as well as non-MSA words |
| FUT_PART | MSA | Future particle "s" prefix and "swf" |
| PROG_PART | Dialect | Progressive particle |
| NEG_PART | Dialect | Negation particle |
| HASH | Tweet | Hashtag |
| EMOT | Tweet | Emoticon/Emoji |
| MENTION | Tweet | Mention |
| URL | Tweet | URL |
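The grouping in the table can be applied programmatically. The sketch below labels each tag by purpose, treating the six dialect-specific and tweet-specific tags explicitly and assuming every other tag belongs to the MSA set. The function names and the example sequence are hypothetical.

```python
from collections import Counter
from typing import Iterable

# The six tags added on top of the MSA tag set, as listed in the table above.
DIALECT_TAGS = {"PROG_PART", "NEG_PART"}
TWEET_TAGS = {"HASH", "EMOT", "MENTION", "URL"}


def tag_purpose(tag: str) -> str:
    """Classify a single (non-compound) tag; anything else is assumed to be MSA."""
    if tag in DIALECT_TAGS:
        return "Dialect"
    if tag in TWEET_TAGS:
        return "Tweet"
    return "MSA"


def purpose_counts(pos_tags: Iterable[str]) -> Counter:
    """Count tag purposes over word-level tags, splitting compounds on '+'."""
    counts: Counter = Counter()
    for word_tag in pos_tags:
        for tag in word_tag.split("+"):
            counts[tag_purpose(tag)] += 1
    return counts


# Hypothetical tag sequence for illustration only.
example = ["MENTION", "NEG_PART+V", "NOUN+PRON", "HASH", "URL"]
print(purpose_counts(example))
```

Counts like these give a quick view of how much of a tweet consists of dialect-specific or tweet-specific material.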
### Data Splits
The dataset is split by dialect.
| Dialect | Tweets | Words |
| ----- | ------ | ----- |
| Egyptian (EGY) | 350 | 7481 |
| Levantine (LEV) | 350 | 7221 |
| Gulf (GLF) | 350 | 6767 |
| Maghrebi (MGR) | 350 | 6400 |
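The table's figures imply an average tweet length of roughly 18 to 21 words, depending on the dialect. The short calculation below derives these averages and the overall word count from the numbers in the table; the variable names are arbitrary.

```python
# Tweet and word counts copied from the table above.
SPLITS = {
    "EGY": (350, 7481),
    "LEV": (350, 7221),
    "GLF": (350, 6767),
    "MGR": (350, 6400),
}

# Average number of words per tweet for each dialect.
words_per_tweet = {dialect: words / tweets for dialect, (tweets, words) in SPLITS.items()}
total_words = sum(words for _, words in SPLITS.values())

for dialect, average in words_per_tweet.items():
    print(f"{dialect}: {average:.1f} words per tweet")
print(f"Total words: {total_words}")
```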
## Dataset Creation
### Curation Rationale
This dataset was created to address the lack of computational resources available for dialects of Arabic. These dialects are typically used in speech, while written forms of the language are typically in Modern Standard Arabic. Social media, however, has provided a venue for people to use dialects in written format.
### Source Data
This dataset builds on the work of [Eldesouki et al. (2017)](https://arxiv.org/pdf/1708.05891.pdf) and [Samih et al. (2017b)](https://www.aclweb.org/anthology/K17-1043.pdf), who originally collected the tweets.
#### Initial Data Collection and Normalization
They started with 175 million Arabic tweets returned by the Twitter API using the query "lang:ar" in March 2014. They then filtered this set using author-identified locations and tokens that are unique to each dialect. Finally, they had native speakers of each dialect select 350 tweets that were heavily accented.
#### Who are the source language producers?
The source language producers are people who posted on Twitter in Arabic using dialectal words from countries where the dialects of interest were spoken, as identified in [Mubarak and Darwish (2014)](https://www.aclweb.org/anthology/W14-3601.pdf).
### Annotations
#### Annotation process
The segmentation guidelines are available at https://alt.qcri.org/resources1/da_resources/seg-guidelines.pdf. The tagging guidelines are not provided, but Darwish et al. note that there were multiple rounds of quality control and revision.
#### Who are the annotators?
The POS tags were annotated by native speakers of each dialect. Further information is not known.
### Personal and Sensitive Information
[More Information Needed]
## Considerations for Using the Data
### Social Impact of Dataset
Darwish et al. find that accuracy on the Maghrebi dataset suffered the most when the training set came from another dialect and, conversely, that training on Maghrebi yielded the worst results for all the other dialects. They suggest that Egyptian, Levantine, and Gulf may be more similar to each other, with Maghrebi the most dissimilar to all of them. They also find that training on Modern Standard Arabic (MSA) and testing on dialects yielded significantly lower results than training on dialects and testing on MSA. This suggests that dialectal variation should be a significant consideration for future work in Arabic NLP applications, particularly when working with social media text.
### Discussion of Biases
[More Information Needed]
### Other Known Limitations
[More Information Needed]
## Additional Information
### Dataset Curators
This dataset was curated by Kareem Darwish, Hamdy Mubarak, Mohamed Eldesouki and Ahmed Abdelali with the Qatar Computing Research Institute (QCRI), Younes Samih and Laura Kallmeyer with the University of Dusseldorf, Randah Alharbi and Walid Magdy with the University of Edinburgh, and Mohammed Attia with Google. No funding information was included.
### Licensing Information
This dataset is licensed under the [Apache License, Version 2.0](http://www.apache.org/licenses/LICENSE-2.0).
### Citation Information
Kareem Darwish, Hamdy Mubarak, Ahmed Abdelali, Mohamed Eldesouki, Younes Samih, Randah Alharbi, Mohammed Attia, Walid Magdy and Laura Kallmeyer (2018) Multi-Dialect Arabic POS Tagging: A CRF Approach. Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), May 7-12, 2018. Miyazaki, Japan.
```
@InProceedings{DARWISH18.562,
author = {Kareem Darwish and Hamdy Mubarak and Ahmed Abdelali and Mohamed Eldesouki and Younes Samih and Randah Alharbi and Mohammed Attia and Walid Magdy and Laura Kallmeyer},
title = {Multi-Dialect Arabic POS Tagging: A CRF Approach},
booktitle = {Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)},
year = {2018},
month = {may},
date = {7-12},
location = {Miyazaki, Japan},
editor = {Nicoletta Calzolari (Conference chair) and Khalid Choukri and Christopher Cieri and Thierry Declerck and Sara Goggi and Koiti Hasida and Hitoshi Isahara and Bente Maegaard and Joseph Mariani and Hélène Mazo and Asuncion Moreno and Jan Odijk and Stelios Piperidis and Takenobu Tokunaga},
publisher = {European Language Resources Association (ELRA)},
address = {Paris, France},
isbn = {979-10-95546-00-9},
language = {english}
}
```
### Contributions
Thanks to [@mcmillanmajora](https://github.com/mcmillanmajora) for adding this dataset.
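Provider:
QCRI
Summary of the original information
Dataset overview
Name: Arabic POS Dialect
Language: Arabic (BCP-47 code: ar-Arab), covering four dialects: Egyptian (EGY), Levantine (LEV), Gulf (GLF), and Maghrebi (MGR).
License: Apache-2.0
Multilinguality: multilingual (specific to Arabic dialects)
Size category: fewer than 1K (350 tweets per dialect)
Task category: part-of-speech (POS) tagging
Dataset structure:
- Configurations:
  - Egyptian (EGY): train split of 269629 bytes, 350 examples, download size 89684 bytes.
  - Levantine (LEV): train split of 263102 bytes, 350 examples, download size 97055 bytes.
  - Gulf (GLF): train split of 239883 bytes, 350 examples, download size 89178 bytes.
  - Maghrebi (MGR): train split of 245717 bytes, 350 examples, download size 90503 bytes.
- Features:
  - fold: integer, the cross-validation fold.
  - subfold: string, the cross-validation subfold.
  - words: sequence of strings, the unsegmented words.
  - segments: sequence of strings, the segments of each word.
  - pos_tags: sequence of strings, the part-of-speech tags.
Dataset creation:
- Annotation creators: expert-generated
- Language creators: found
- Source data: tweets collected through the Twitter API by Eldesouki et al. (2017) and Samih et al. (2017b).
- Annotation process: POS tags were annotated by native speakers of each dialect; segmentation guidelines are available, while the tagging guidelines are not provided.
Considerations for use:
- Social impact: the dataset highlights differences among Arabic dialects; in POS tagging in particular, results differ markedly depending on which dialects are used for training and testing.
- Discussion of biases: no details provided.
- Other known limitations: no details provided.
Additional information:
- Dataset curators: Kareem Darwish, Hamdy Mubarak, Mohamed Eldesouki, Ahmed Abdelali, and others.
- Contributions: thanks to @mcmillanmajora for adding this dataset.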



