hbenayed/tunisian-msa-parallel-corpus-evaluated

Name: hbenayed/tunisian-msa-parallel-corpus-evaluated
Creator: hbenayed
Published: 2026-04-03 22:15:36
License: cc-by-4.0

Hugging Face · Updated 2026-04-03 · Indexed 2026-04-12

Download link:

https://hf-mirror.com/datasets/hbenayed/tunisian-msa-parallel-corpus-evaluated


Resource description:

---
task_categories:
- translation
- text-generation
language:
- ar
- aeb
pretty_name: Tunisian Arabic → MSA Synthetic Parallel Corpus
license: cc-by-4.0
train-eval-split: train
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
dataset_info:
  features:
  - name: chunk_id
    dtype: string
  - name: chunk_text
    dtype: string
  - name: original_text_id
    dtype: string
  - name: original_text
    dtype: string
  - name: position
    dtype: int64
  - name: num_chunks_in_doc
    dtype: int64
  - name: num_tokens
    dtype: int64
  - name: msa_translation
    dtype: string
  - name: cleaned_msa_translation
    dtype: string
  - name: semantic_similarity
    dtype: float32
  - name: fluency_score
    dtype: float32
  - name: composite_score
    dtype: float32
  - name: quality_flag
    dtype: bool
  splits:
  - name: train
    num_bytes: 1297114
    num_examples: 1000
  download_size: 523175
  dataset_size: 1297114
tags:
- translation
- tunisian
- arabic
---

### Dataset Description

This dataset is a **synthetic parallel corpus** of Tunisian Arabic (aeb) and Modern Standard Arabic (arb). It was created with a **rigorous multi-stage pipeline** to maximize quality and reproducibility, addressing the scarcity of high-quality resources for Tunisian Arabic NLP.

The primary goals are to support:

* Machine translation between Tunisian Arabic and MSA.
* Research in dialect-aware text generation and evaluation.
* Cross-dialect representation learning in Arabic NLP.

This release is part of the Tunisia.AI community effort to build open, transparent resources for low-resource Arabic dialects.

---

### Dataset Status

This is an **initial release (`v0.1.0`)**. The dataset is actively being expanded and refined. Future versions will include larger samples, refined evaluation metrics, and possibly human validation subsets.

---

### Dataset Structure

The dataset is stored in `JSONL` format. Each entry corresponds to one parallel segment, enriched with metadata.

| Column                    | Type     | Description                                      |
| ------------------------- | -------- | ------------------------------------------------ |
| `chunk_id`                | `string` | Unique identifier for the chunk.                 |
| `chunk_text`              | `string` | Tunisian Arabic segment after semantic chunking. |
| `original_text_id`        | `string` | Identifier of the source document.               |
| `original_text`           | `string` | Original unprocessed Tunisian text.              |
| `position`                | `int`    | Position of the chunk in the original text.      |
| `num_chunks_in_doc`       | `int`    | Number of chunks extracted from the source.      |
| `num_tokens`              | `int`    | Length of the chunk in tokens.                   |
| `msa_translation`         | `string` | Raw MSA translation generated by LLMs.           |
| `cleaned_msa_translation` | `string` | Post-processed clean MSA translation.            |
| `semantic_similarity`     | `float`  | Embedding-based similarity score.                |
| `fluency_score`           | `float`  | Fluency score from an Arabic LM.                 |
| `composite_score`         | `float`  | Weighted score combining fidelity & fluency.     |
| `quality_flag`            | `bool`   | True if `composite_score >= 0.6`.                |

---

### Dataset Creation

#### 1. Data Collection

Raw Tunisian text was collected from public online sources.

#### 2. Filtering (Dialect Identification)

* Classified using [`Ammar-alhaj-ali/arabic-MARBERT-dialect-identification-city`](https://huggingface.co/Ammar-alhaj-ali/arabic-MARBERT-dialect-identification-city).
* Kept only samples labeled as `Tunis` or `Sfax`.

#### 3. Semantic Chunking

* Split by punctuation and Tunisian discourse markers.
* Discarded short chunks (< 7 tokens).
* Long segments (> 120 tokens) processed with a sliding window (70% overlap).
* Adjacent chunks merged if cosine similarity ≥ 0.7 using multilingual MiniLM embeddings.

#### 4. Synthetic MSA Generation

* Used Groq API models (`allam-2-7b`, `llama-3.1-8b-instant`, `gemma2-9b-it`).
* A structured prompt guided the translation.
* Stored raw outputs in `msa_translation`.

#### 5. Post-Processing

* Cleaned translations to remove artifacts, explanations, or repeated prompts.
* Final results stored in `cleaned_msa_translation`.

#### 6. Automatic Evaluation

* **Semantic fidelity**: Cosine similarity of embeddings.
* **Fluency**: Log-likelihood from [`aubmindlab/aragpt2-base`](https://huggingface.co/aubmindlab/aragpt2-base).
* **Composite score**: `0.5 * semantic_similarity + 0.5 * normalized_fluency`.
* **Quality flag**: `True` if score ≥ 0.6.

---

### Licensing

Licensed under [Creative Commons Attribution 4.0 (CC-BY-4.0)](https://creativecommons.org/licenses/by/4.0/).

---

### Limitations and Biases

* **Synthetic translations**: Not human-verified; may contain mistranslations or artifacts.
* **Dialect coverage**: Focused on Tunis & Sfax varieties, not all Tunisian sub-dialects.
* **Domain bias**: Dependent on the types of public sources collected.

---

### Citation

If you use this dataset, please cite the following paper (placeholder until publication):

```bibtex
@inproceedings{tunisian_msa_synthetic_2025,
  author    = {Bouajila Hamza et al. and Mahmoudi Nizar},
  title     = {{Creating a High-Quality Tunisian Arabic ↔ MSA Parallel Corpus with an Iterative Synthetic Data Generation Pipeline}},
  booktitle = {Proceedings of the Workshop on Arabic Natural Language Processing},
  year      = {2025},
  publisher = {Hugging Face Datasets},
}
```

### Contact

For any questions, bug reports, or collaboration inquiries, please open an issue on the repository.
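
### Example: Composite Score and Quality Flag

The composite score and quality flag described under Automatic Evaluation can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names are hypothetical, and it assumes the fluency value has already been normalized to the same 0 to 1 range as the similarity score (the card does not say how the log-likelihood is normalized, or whether the stored `fluency_score` column is the normalized value).

```python
from typing import Dict, List

# Weights and threshold as stated in the dataset card.
SEMANTIC_WEIGHT = 0.5
FLUENCY_WEIGHT = 0.5
QUALITY_THRESHOLD = 0.6


def composite_score(semantic_similarity: float, normalized_fluency: float) -> float:
    """Equal-weight combination of fidelity and fluency.

    Assumption: `normalized_fluency` is already scaled to [0, 1].
    """
    return SEMANTIC_WEIGHT * semantic_similarity + FLUENCY_WEIGHT * normalized_fluency


def quality_flag(score: float) -> bool:
    """True when the composite score reaches the card's 0.6 threshold."""
    return score >= QUALITY_THRESHOLD


def keep_high_quality(rows: List[Dict]) -> List[Dict]:
    """Hypothetical helper: keep only rows whose stored `quality_flag` is true."""
    return [row for row in rows if row["quality_flag"]]
```

In practice, filtering on the stored `quality_flag` column is likely the simplest way to select the higher-scoring pairs, since the flag is already precomputed for every row.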

Provider:

hbenayed
