hbenayed/tunisian-msa-parallel-corpus-evaluated
Source: Hugging Face. Updated 2026-04-03; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/hbenayed/tunisian-msa-parallel-corpus-evaluated
Resource description:
---
task_categories:
- translation
- text-generation
language:
- ar
- aeb
pretty_name: Tunisian Arabic → MSA Synthetic Parallel Corpus
license: cc-by-4.0
train-eval-split: train
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
dataset_info:
  features:
  - name: chunk_id
    dtype: string
  - name: chunk_text
    dtype: string
  - name: original_text_id
    dtype: string
  - name: original_text
    dtype: string
  - name: position
    dtype: int64
  - name: num_chunks_in_doc
    dtype: int64
  - name: num_tokens
    dtype: int64
  - name: msa_translation
    dtype: string
  - name: cleaned_msa_translation
    dtype: string
  - name: semantic_similarity
    dtype: float32
  - name: fluency_score
    dtype: float32
  - name: composite_score
    dtype: float32
  - name: quality_flag
    dtype: bool
  splits:
  - name: train
    num_bytes: 1297114
    num_examples: 1000
  download_size: 523175
  dataset_size: 1297114
tags:
- translation
- tunisian
- arabic
---
### Dataset Description
This dataset is a **synthetic parallel corpus** of Tunisian Arabic (aeb) and Modern Standard Arabic (arb).
It was created with a **multi-stage pipeline** (dialect filtering, semantic chunking, LLM translation, post-processing, and automatic scoring) intended to improve quality and reproducibility and to address the scarcity of high-quality resources for Tunisian Arabic NLP.
The primary goals are to support:
* Machine translation between Tunisian Arabic and MSA.
* Research in dialect-aware text generation and evaluation.
* Cross-dialect representation learning in Arabic NLP.
This release is part of the Tunisia.AI community effort to build open, transparent resources for low-resource Arabic dialects.
---
### Dataset Status
This is an **initial release (`v0.1.0`)**.
The dataset is actively being expanded and refined. Future versions will include larger samples, refined evaluation metrics, and possibly human validation subsets.
---
### Dataset Structure
The dataset is stored in `JSONL` format. Each entry corresponds to one parallel segment, enriched with metadata.
| Column | Type | Description |
| ------------------------- | -------- | ------------------------------------------------ |
| `chunk_id` | `string` | Unique identifier for the chunk. |
| `chunk_text` | `string` | Tunisian Arabic segment after semantic chunking. |
| `original_text_id` | `string` | Identifier of the source document. |
| `original_text` | `string` | Original unprocessed Tunisian text. |
| `position` | `int` | Position of the chunk in the original text. |
| `num_chunks_in_doc` | `int` | Number of chunks extracted from the source. |
| `num_tokens` | `int` | Length of the chunk in tokens. |
| `msa_translation` | `string` | Raw MSA translation generated by LLMs. |
| `cleaned_msa_translation` | `string` | Post-processed clean MSA translation. |
| `semantic_similarity` | `float` | Embedding-based similarity score. |
| `fluency_score` | `float` | Fluency score from an Arabic LM. |
| `composite_score` | `float` | Weighted score combining fidelity & fluency. |
| `quality_flag` | `bool` | True if `composite_score >= 0.6`. |
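As a minimal sketch of how the columns above might be consumed, the following parses JSONL records with the Python standard library and keeps only the pairs whose `quality_flag` is true. The helper name `load_pairs` and the sample records are hypothetical placeholders, not rows from the dataset.

```python
import json
from typing import Iterable, List, Tuple


def load_pairs(lines: Iterable[str], only_flagged: bool = True) -> List[Tuple[str, str]]:
    """Return (Tunisian chunk, cleaned MSA translation) pairs from JSONL lines."""
    pairs = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        row = json.loads(line)
        if only_flagged and not row["quality_flag"]:
            continue  # skip rows below the quality threshold
        pairs.append((row["chunk_text"], row["cleaned_msa_translation"]))
    return pairs


# Hypothetical records using the documented column names (placeholder text).
sample = [
    json.dumps({"chunk_text": "aeb-1", "cleaned_msa_translation": "msa-1",
                "composite_score": 0.72, "quality_flag": True}),
    json.dumps({"chunk_text": "aeb-2", "cleaned_msa_translation": "msa-2",
                "composite_score": 0.41, "quality_flag": False}),
]
print(load_pairs(sample))
```

When working from the Hub instead of local files, `datasets.load_dataset("hbenayed/tunisian-msa-parallel-corpus-evaluated", split="train")` should expose the same fields, assuming the `datasets` library is installed and the repository is reachable.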
---
### Dataset Creation
#### 1. Data Collection
Raw Tunisian text was collected from public online sources.
#### 2. Filtering (Dialect Identification)
* Classified using [`Ammar-alhaj-ali/arabic-MARBERT-dialect-identification-city`](https://huggingface.co/Ammar-alhaj-ali/arabic-MARBERT-dialect-identification-city).
* Kept only samples labeled as `Tunis` or `Sfax`.
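The filtering step can be sketched as follows. This is an illustration under assumptions, not the authors' script: the function names are hypothetical, and the classifier is passed in as a callable so the filter logic can be exercised without downloading the model. `make_marbert_classifier` assumes the `transformers` library is installed and that the model emits the city names as its labels.

```python
from typing import Callable, Iterable, List

KEPT_LABELS = {"Tunis", "Sfax"}  # city labels kept, per the step above


def filter_tunisian(texts: Iterable[str],
                    classify: Callable[[str], str]) -> List[str]:
    """Keep texts whose predicted city label is in KEPT_LABELS."""
    return [t for t in texts if classify(t) in KEPT_LABELS]


def make_marbert_classifier() -> Callable[[str], str]:
    """Build a classifier from the model named above (needs `transformers`)."""
    from transformers import pipeline  # imported lazily; heavy dependency

    clf = pipeline(
        "text-classification",
        model="Ammar-alhaj-ali/arabic-MARBERT-dialect-identification-city",
    )
    # For a single string the pipeline returns a list with one
    # {"label": ..., "score": ...} dict; only the label is used here.
    return lambda text: clf(text)[0]["label"]
```

A call such as `filter_tunisian(raw_texts, make_marbert_classifier())` would then keep only the `Tunis` and `Sfax` predictions; a confidence threshold on the returned score is not described in this card and is left out.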
#### 3. Semantic Chunking
* Split by punctuation and Tunisian discourse markers.
* Discarded short chunks (< 7 tokens).
* Long segments (> 120 tokens) processed with sliding window (70% overlap).
* Adjacent chunks merged if cosine similarity ≥ 0.7 using multilingual MiniLM embeddings.
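The chunking rules above can be sketched as three small helpers. Several details are assumptions, because the card does not state them: tokens are whitespace-separated, the sliding window is as wide as the 120-token limit, merging is greedy from left to right, and `embed` is a hypothetical callable standing in for the multilingual MiniLM encoder (the exact checkpoint is not named).

```python
import math
from typing import Callable, List, Sequence

MIN_TOKENS = 7         # chunks shorter than this are discarded
MAX_TOKENS = 120       # longer segments are windowed
OVERLAP = 0.7          # fraction shared by consecutive windows
MERGE_THRESHOLD = 0.7  # cosine similarity needed to merge neighbours


def sliding_windows(tokens: Sequence[str], size: int = MAX_TOKENS,
                    overlap: float = OVERLAP) -> List[List[str]]:
    """Split a long token sequence into overlapping windows."""
    if len(tokens) <= size:
        return [list(tokens)]
    step = max(1, round(size * (1 - overlap)))  # 36 tokens for 120 / 70%
    windows = []
    start = 0
    while True:
        windows.append(list(tokens[start:start + size]))
        if start + size >= len(tokens):
            break  # the last window reached the end of the segment
        start += step
    return windows


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def merge_adjacent(chunks: List[str], embed: Callable[[str], Sequence[float]],
                   threshold: float = MERGE_THRESHOLD) -> List[str]:
    """Greedily merge neighbouring chunks whose embeddings are similar."""
    merged: List[str] = []
    for chunk in chunks:
        if merged and cosine(embed(merged[-1]), embed(chunk)) >= threshold:
            merged[-1] = merged[-1] + " " + chunk  # extend the running chunk
        else:
            merged.append(chunk)
    return merged


def drop_short(chunks: List[str], min_tokens: int = MIN_TOKENS) -> List[str]:
    """Discard chunks with fewer than min_tokens whitespace tokens."""
    return [c for c in chunks if len(c.split()) >= min_tokens]
```

With these defaults, a 200-token segment yields four windows starting at tokens 0, 36, 72, and 108. The order in which the three helpers are applied is left to the caller, since the card lists the rules without fixing a sequence.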
#### 4. Synthetic MSA Generation
* Used Groq API models (`allam-2-7b`, `llama-3.1-8b-instant`, `gemma2-9b-it`).
* Structured prompt guided translation.
* Stored raw outputs in `msa_translation`.
#### 5. Post-Processing
* Cleaned translations to remove artifacts, explanations, or repeated prompts.
* Final results stored in `cleaned_msa_translation`.
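The card does not list the actual cleaning rules, so the following is only a hypothetical illustration of the kind of heuristics such a step might use: dropping preamble lines that end with a colon, keeping the first remaining line, and stripping a leading label and surrounding quotes. The label list and the function name are assumptions.

```python
import re

# Hypothetical label prefixes an LLM might prepend to its answer.
_LABEL = re.compile(r"^\s*(?:translation|msa)\s*:\s*", re.IGNORECASE)
_QUOTES = "\"'«»“”"


def clean_translation(raw: str) -> str:
    """Heuristically strip wrappers around an LLM-produced translation."""
    # Drop empty lines and preamble lines such as "Here is the translation:".
    lines = [ln.strip() for ln in raw.splitlines()
             if ln.strip() and not ln.strip().endswith(":")]
    if not lines:
        return ""
    # Assume the first remaining line is the translation; later lines are
    # treated as explanations and ignored.
    text = _LABEL.sub("", lines[0])
    return text.strip(_QUOTES + " ")
```

Heuristics like these can discard legitimate multi-line translations, so comparing `msa_translation` with `cleaned_msa_translation` on a sample is a reasonable sanity check before relying on either column.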
#### 6. Automatic Evaluation
* **Semantic fidelity**: Cosine similarity of embeddings.
* **Fluency**: Log-likelihood from [`aubmindlab/aragpt2-base`](https://huggingface.co/aubmindlab/aragpt2-base).
* **Composite score**: `0.5 * semantic_similarity + 0.5 * normalized_fluency`.
* **Quality flag**: `True` if score ≥ 0.6.
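The scoring rule above can be sketched in a few lines. The composite formula and the 0.6 threshold come from this card; the min-max rescaling of log-likelihoods is an assumption, since the card does not say how fluency is normalized, and the function names are hypothetical.

```python
from typing import Dict, List, Sequence

THRESHOLD = 0.6  # quality_flag cut-off stated above


def min_max(values: Sequence[float]) -> List[float]:
    """Rescale values to [0, 1]; an assumed normalization for log-likelihoods."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # degenerate case: no spread to rescale
    return [(v - lo) / (hi - lo) for v in values]


def composite(similarity: float, normalized_fluency: float) -> float:
    """Equal-weight combination of fidelity and fluency."""
    return 0.5 * similarity + 0.5 * normalized_fluency


def score_rows(similarities: Sequence[float],
               log_likelihoods: Sequence[float],
               threshold: float = THRESHOLD) -> List[Dict[str, object]]:
    """Compute composite scores and quality flags for a batch of rows."""
    fluency = min_max(log_likelihoods)
    rows = []
    for sim, flu in zip(similarities, fluency):
        score = composite(sim, flu)
        rows.append({
            "semantic_similarity": sim,
            "normalized_fluency": flu,
            "composite_score": score,
            "quality_flag": score >= threshold,
        })
    return rows
```

Because min-max scaling depends on the whole batch, the same sentence can receive a different normalized fluency in a different corpus; a fixed reference range would make scores comparable across releases.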
---
### Licensing
Licensed under [Creative Commons Attribution 4.0 (CC-BY-4.0)](https://creativecommons.org/licenses/by/4.0/).
---
### Limitations and Biases
* **Synthetic translations**: Not human-verified; may contain mistranslations or artifacts.
* **Dialect coverage**: Focused on Tunis & Sfax varieties, not all Tunisian sub-dialects.
* **Domain bias**: Dependent on the types of public sources collected.
---
### Citation
If you use this dataset, please cite the following paper (placeholder until publication):
```bibtex
@inproceedings{tunisian_msa_synthetic_2025,
  author    = {Bouajila Hamza and Mahmoudi Nizar and others},
  title     = {{Creating a High-Quality Tunisian Arabic ↔ MSA Parallel Corpus with an Iterative Synthetic Data Generation Pipeline}},
  booktitle = {Proceedings of the Workshop on Arabic Natural Language Processing},
  year      = {2025},
  publisher = {Hugging Face Datasets},
}
```
### Contact
For any questions, bug reports, or collaboration inquiries, please open an issue on the repository.
Provider:
hbenayed



