
MBZUAI/instructpoet-ar

Hugging Face. Updated 2026-04-19. Indexed 2026-05-10.
Download link:
https://hf-mirror.com/datasets/MBZUAI/instructpoet-ar
Resource description:
---
pretty_name: Arabic Poetry IFT
language:
- ar
language_creators:
- found
license: other
multilinguality: multilingual
size_categories:
- 1M<n<10M
source_datasets:
- original
task_categories:
- text-generation
- question-answering
task_ids:
- text2text-generation
- multiple-choice-qa
annotations_creators:
- expert-generated
- machine-generated
tags:
- arabic
- poetry
- instruction-following
- dialects
- literary-text
configs:
- config_name: generation_templates
  default: true
  data_files:
  - split: generation_templates
    path: "Templates - Poetry Generation.csv"
- config_name: continuation_templates
  data_files:
  - split: continuation_templates
    path: "Templates - Poetry Continuation.csv"
- config_name: analysis_templates
  data_files:
  - split: analysis_templates
    path: "Templates - Poetry Analysis.csv"
- config_name: corruption_templates
  data_files:
  - split: corruption_templates
    path: "Templates - Poetry Corruption.csv"
---

# Arabic Poetry IFT

## Dataset Summary

Arabic Poetry IFT is a large-scale instruction-following dataset for Arabic poetry understanding and co-creation. It supports four task families: generation, continuation, revision/restoration, and multiple-choice analysis. The dataset covers Modern Standard Arabic (MSA) and four regional Arabic varieties used in the instruction layer: Gulf, Levantine, Nile Valley, and North African Arabic.

This release accompanies the ACL 2026 paper *Instruction-Guided Poetry Generation in Arabic and Its Dialects*.

**Dataset curators:** Abdelrahman Sadallah, Kareem Elozeiri, Mervat Abassy, Rania Elbadry, Mohamed Anwar, Abed Alhakim Freihat, Preslav Nakov, Fajri Koto.

### Supported Tasks

1. `generation`: compose a poem from constraints such as title, poet, era, genre, meter, rhyme, keywords, or key phrases
2. `continuation`: continue a partial poem while preserving poetic constraints
3. `revision` / `corruption`: restore a corrupted poem to its intended form
4. `analysis`: answer multiple-choice questions about poem metadata such as poet, title, keywords, meter, era, genre, and rhyme

### Language Coverage

- Modern Standard Arabic (MSA)
- Gulf Arabic
- Levantine Arabic
- Nile Valley Arabic
- North African Arabic

Most source poems are in MSA, while dialectal coverage is introduced primarily through manually written instruction templates and a smaller amount of dialectal source material.

## Dataset Construction

### Source Collection

The underlying poetry corpus was aggregated from public literary resources, unified into a common format, enriched with metadata, and deduplicated before instruction generation. Poems with only one verse were removed. Rhyme was automatically inferred when at least 70% of verse endings matched after normalization.

Training corpus statistics from the paper:

| Source | Train poems | Avg. verses |
|---|---:|---:|
| Ashaar | 123,581 | 19.81 |
| PoetsGate* | 112,482 | 15.58 |
| Adab* | 70,277 | 35.33 |
| AraPoems | 62,963 | 22.01 |
| Diwan* | 38,005 | 22.65 |
| Mawsooaa* | 18,002 | 10.25 |
| Arapoet* | 1,303 | 9.25 |
| Arabic Poetry Dataset | 662 | 19.41 |
| Arabic-Poetry-Melody | 48 | 21.44 |
| Adab World* | 6 | 93.33 |
| Other | 8 | 24.88 |
| **Total** | **427,337** | **21.39** |

\* Scraped sources, as reported in the paper.

Test benchmark statistics from the paper:

| Source | Test poems | Avg. verses |
|---|---:|---:|
| FannOrFlop (Al Ghallabi et al., 2025) | 6,984 | 17.97 |

### Metadata Enrichment

The corpus uses both original and derived metadata. Key fields used to build the instruction tasks include:

- `poem_text`
- `poem_title`
- `poet_name`
- `poet_era`
- `genre`
- `meter`
- `rhyme`
- `keywords`
- `key_phrases`

`keywords` and `key_phrases` were automatically generated using Gemini 2.5 Pro. The paper reports a manual quality check on 100 sampled poems, where 96% of extracted keywords were judged to be good quality.
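The rhyme-inference rule from Source Collection (a rhyme is assigned when at least 70% of verse endings match after normalization) can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes that a verse "ending" is its final letter and that normalization means stripping diacritics and the elongation character. The function names `normalize` and `infer_rhyme` are hypothetical.

```python
import re
from collections import Counter

# Assumed normalization: drop Arabic diacritics (U+064B..U+0652),
# the superscript alef (U+0670), and the tatweel/elongation mark (U+0640).
_DIACRITICS_AND_TATWEEL = re.compile("[\u064B-\u0652\u0670\u0640]")


def normalize(text: str) -> str:
    """Strip diacritics and tatweel, then trim surrounding whitespace."""
    return _DIACRITICS_AND_TATWEEL.sub("", text).strip()


def infer_rhyme(verses, threshold=0.70):
    """Return the most common final letter if it ends at least
    `threshold` of the non-empty verses, otherwise None."""
    endings = []
    for verse in verses:
        cleaned = normalize(verse)
        if cleaned:
            endings.append(cleaned[-1])  # simplification: last letter only
    if not endings:
        return None
    letter, count = Counter(endings).most_common(1)[0]
    return letter if count / len(endings) >= threshold else None
```

With four one-word "verses" of which three end in the same letter (75%), the function returns that letter; at 50% agreement it returns `None`. Real Arabic rhyme analysis considers more than the final letter, so this sketch only shows how the threshold is applied.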
### De-duplication and Leakage Control

- Intra-source duplicates were removed after normalization.
- Orthographic normalization included removing elongation, removing diacritics, and standardizing spelling variants.
- Any poem overlapping with the FannOrFlop benchmark was removed from training to avoid train/test leakage.

### Instruction Template Creation

The instruction layer was manually designed for four task families and then expanded across five Arabic varieties. The camera-ready paper reports:

| Task family | Base templates | Arabic varieties | Dialect-specific templates |
|---|---:|---:|---:|
| Generation | 246 | 5 | 1,230 |
| Continuation | 176 | 5 | 880 |
| Analysis | 214 | 5 | 1,070 |
| Revision | 8 | 5 | 40 |
| **Total** | **644** | **5** | **3,220** |

Dialect templates were written and revised by native speakers of the corresponding regional varieties.

## Dataset Statistics

### Overall IFT Size

The instruction-following dataset contains **1,350,897** training pairs and **24,815** test pairs, for **1,375,712** total examples.

| Task | Train examples | Test examples | Train subtasks | Test subtasks |
|---|---:|---:|---:|---:|
| Generation | 427,337 | 6,984 | 19 | 19 |
| Continuation | 427,276 | 6,984 | 11 | 11 |
| Revision | 68,947 | 3,863 | 8 | 8 |
| Analysis | 427,337 | 6,984 | 16 | 14 |

### Task-Specific Notes

- Continuation examples are created by splitting poems at random cut points between 10% and 90% of the poem length.
- Analysis is framed as multiple-choice question answering with 1 correct answer and 4 distractors.
- Revision examples are generated from automatically corrupted poems paired with their clean originals.
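The continuation split described in the notes above might look like the sketch below. It assumes that poem length is measured in verses, that the cut fraction is drawn uniformly, and that each side must keep at least one verse; none of these details are stated in the card. The name `split_for_continuation` is hypothetical.

```python
import random


def split_for_continuation(verses, rng, low=0.10, high=0.90):
    """Split a poem into (prompt_verses, target_verses) at a random cut
    point between `low` and `high` of the poem length."""
    n = len(verses)
    if n < 2:
        raise ValueError("need at least two verses to split")
    fraction = rng.uniform(low, high)
    # Clamp the cut so both the prompt and the target are non-empty.
    cut = min(max(round(fraction * n), 1), n - 1)
    return verses[:cut], verses[cut:]


# Usage: a seeded generator keeps the split reproducible.
poem = [f"verse {i}" for i in range(1, 21)]
prompt, target = split_for_continuation(poem, random.Random(0))
```

The prompt and target always concatenate back to the original poem, which makes the pair easy to validate after generation.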
The eight revision corruption types are:

- `rhyme_structure`
- `full_style`
- `rhyme_substitution`
- `rhyme_content`
- `era_corruption`
- `meter_transformation`
- `meter_destruction`
- `meter_inconsistency`

### Metadata Distribution Highlights

Top values reported in the paper:

- Meter: Al-Tawil 20.31%, Al-Kamil 16.10%, Al-Basit 12.27%
- Poet era: Modern 37.17%, Abbasid 22.59%, Mamluk 10.05%
- Genre: General 22.89%, Short 11.61%, Praise 8.01%

## Dataset Structure

### Release Organization

The Hugging Face dataset viewer is configured with four subsets so each template file appears separately in the subset dropdown:

- `generation_templates`
- `continuation_templates`
- `analysis_templates`
- `corruption_templates`

These subsets map to the following files:

- `generation_templates` -> `Templates - Poetry Generation.csv`
- `continuation_templates` -> `Templates - Poetry Continuation.csv`
- `analysis_templates` -> `Templates - Poetry Analysis.csv`
- `corruption_templates` -> `Templates - Poetry Corruption.csv`

At minimum, each example contains the columns provided in its source CSV file.

### Example Schema

Generation / continuation / analysis template files include columns such as:

- `Placeholder`
- `Output` or `output`
- `MSA`
- `Nile Valley`
- `North Africa`
- `Gulf`
- `Levant`

The corruption template file includes columns such as:

- `Placeholder`
- `corruption_type`
- `MSA`
- `Nile Valley`
- `North Africa`
- `Gulf`
- `Levant`

## Intended Uses

### Direct Use

- Instruction tuning for Arabic poetry-capable language models
- Evaluation of controllable Arabic poetry generation
- Study of dialectal prompt robustness in Arabic
- Research on meter-, rhyme-, style-, and metadata-conditioned generation
- Benchmarking poetry analysis with multiple-choice supervision
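Because the Example Schema notes that the output column may be spelled `Output` or `output`, a reader that normalizes the column name can simplify downstream code. The sketch below uses only the standard `csv` module. The config-to-file mapping and column names come from this card, while `read_templates` and the sample row values are hypothetical placeholders, not real templates.

```python
import csv
import io

# Config name -> CSV file, as listed under Release Organization.
TEMPLATE_FILES = {
    "generation_templates": "Templates - Poetry Generation.csv",
    "continuation_templates": "Templates - Poetry Continuation.csv",
    "analysis_templates": "Templates - Poetry Analysis.csv",
    "corruption_templates": "Templates - Poetry Corruption.csv",
}

# One column per Arabic variety, as listed under Example Schema.
VARIETY_COLUMNS = ["MSA", "Nile Valley", "North Africa", "Gulf", "Levant"]


def read_templates(handle):
    """Yield one dict per template row, exposing the output column as `output`."""
    for row in csv.DictReader(handle):
        if "Output" in row:
            row["output"] = row.pop("Output")
        yield row


# Hypothetical one-row sample standing in for a real template file.
SAMPLE = io.StringIO(
    "Placeholder,Output,MSA,Nile Valley,North Africa,Gulf,Levant\n"
    "title,poem,msa text,nile text,north africa text,gulf text,levant text\n"
)
rows = list(read_templates(SAMPLE))
```

For a downloaded file, the same function can be given a handle from `open(TEMPLATE_FILES["generation_templates"], newline="", encoding="utf-8")`.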
Provider:
MBZUAI