drelhaj/Tarab

Source: Hugging Face. Updated 2026-02-25; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/drelhaj/Tarab
---
license: cc-by-4.0
language:
- ar
task_categories:
- text-classification
- text-generation
size_categories:
- 10M<n<100M
pretty_name: "Tarab: A Multi-Dialect Corpus of Arabic Lyrics and Poetry"
dataset_info:
  features:
  - name: art_id
    dtype: int32
  - name: artist_id
    dtype: int32
  - name: artist_name
    dtype: string
  - name: art_title
    dtype: string
  - name: writer
    dtype: string
  - name: composer
    dtype: string
  - name: verse_order
    dtype: int32
  - name: verse_lyrics
    dtype: string
  - name: origin
    dtype: string
  - name: dialect
    dtype: string
  - name: type
    dtype: string
  - name: corpus_version
    dtype: string
  - name: word_count
    dtype: int32
  splits:
  - name: train
  - name: validation
  - name: test
---

# Tarab: A Multi-Dialect Corpus of Arabic Lyrics and Poetry

**Tarab** is a large-scale Arabic creative-text corpus that unifies **song lyrics** and **poetry** in a single verse-level representation. It contains **2,557,311 verses** and **13,509,336 tokens**, spanning **Classical Arabic**, **MSA**, and six major regional dialect groups, and covering both **modern countries** and **historical eras**.

## Dataset Overview

Each row corresponds to a **single verse** with structured metadata linking it to its parent work (song/poem).

| Column | Description |
|---|---|
| `art_id` | Work identifier (song/poem) |
| `artist_id`, `artist_name` | Creator identifier and name |
| `art_title` | Song/poem title |
| `writer`, `composer` | Credits when available |
| `verse_order` | Verse position within the work |
| `verse_lyrics` | Verse text (UTF-8) |
| `origin` | Modern country or historical era |
| `dialect` | Classical, MSA, Egyptian, Gulf, Levantine, Iraqi, Sudanese, Maghrebi |
| `type` | `song` or `poem` |
| `corpus_version` | Source lineage (e.g., Habibi vs new crawl / poetry source) |
| `word_count` | Tokens per verse (precomputed) |

---

## Key Statistics

### Subset Breakdown

| Subset    | Works      | Verses        | Tokens         | Avg tokens/verse |
| --------- | ---------- | ------------- | -------------- | ---------------- |
| Songs     | 34,239     | 1,170,028     | 6,989,019      | 4.9              |
| Poems     | 54,927     | 1,387,283     | 6,520,317      | 5.6              |
| **Total** | **89,166** | **2,557,311** | **13,509,336** | **5.3**          |

---

## Dialect Distribution

| Dialect   | Verses  | Vocab size | Avg tokens/verse | % of corpus |
| --------- | ------- | ---------- | ---------------- | ----------- |
| Classical | 937,473 | 1,044,325  | 4.7              | 36.7        |
| MSA       | 449,810 | 577,073    | 4.6              | 17.6        |
| Egyptian  | 308,714 | 120,507    | 6.3              | 12.1        |
| Gulf      | 308,249 | 133,599    | 6.1              | 12.1        |
| Levantine | 250,276 | 119,455    | 5.9              | 9.8         |
| Iraqi     | 156,153 | 73,531     | 5.5              | 6.1         |
| Sudanese  | 89,226  | 58,092     | 5.7              | 3.5         |
| Maghrebi  | 57,410  | 33,762     | 6.0              | 2.2         |

---

## Geographic and Historical Coverage

| Origin                 | Works      | Tokens         | Verses        |
| ---------------------- | ---------- | -------------- | ------------- |
| Egypt                  | 11,182     | 2,429,198      | 414,914       |
| Abbasid Era            | 13,456     | 1,431,613      | 303,378       |
| Lebanon                | 7,390      | 1,390,369      | 253,143       |
| Saudi Arabia           | 6,575      | 1,193,549      | 197,384       |
| Iraq                   | 4,913      | 1,034,427      | 195,165       |
| Ayyubid Era            | 5,018      | 690,972        | 143,768       |
| Andalusian Era         | 4,410      | 616,022        | 130,040       |
| Ottoman Era            | 3,937      | 502,892        | 108,743       |
| Mamluk Era             | 6,095      | 490,866        | 102,999       |
| Syria                  | 2,820      | 517,833        | 99,693        |
| Sudan                  | 2,683      | 507,783        | 89,829        |
| Kuwait                 | 1,962      | 361,052        | 61,867        |
| Palestine              | 1,429      | 271,712        | 56,448        |
| United Arab Emirates   | 1,719      | 310,004        | 54,462        |
| Islamic Era            | 2,351      | 264,482        | 54,081        |
| Morocco                | 1,259      | 235,739        | 41,298        |
| Era of the Mukhadramun | 2,167      | 192,953        | 40,692        |
| Pre-Islamic Era        | 1,989      | 175,622        | 36,826        |
| Tunisia                | 1,072      | 168,709        | 31,671        |
| Yemen                  | 1,360      | 153,797        | 30,535        |
| Algeria                | 807        | 129,197        | 25,157        |
| Umayyad Era            | 2,360      | 124,200        | 24,817        |
| Jordan                 | 775        | 125,656        | 23,574        |
| Oman                   | 872        | 95,100         | 19,872        |
| Bahrain                | 207        | 35,515         | 5,863         |
| Qatar                  | 199        | 33,696         | 5,723         |
| Libya                  | 133        | 18,292         | 3,775         |
| Mauritania             | 27         | 8,086          | 1,594         |
| **Total**              | **89,166** | **13,509,336** | **2,557,311** |

---

## Splits

The repository includes `train.csv`, `validation.csv`, and `test.csv` created using a **70/15/15** split at the **work level** (`art_id`), stratified to preserve coverage across:

- `type` (song vs poem)
- `origin` (countries + historical eras)

This avoids leakage where verses from the same work appear in multiple splits.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

INPUT_CSV = "tarab_full.csv"
RANDOM_STATE = 42

# Output files
OUT_TRAIN = "train.csv"
OUT_VAL = "validation.csv"
OUT_TEST = "test.csv"

# Chunk settings (keeps memory stable)
CHUNK_SIZE = 250_000


def build_artwork_split_map(path: str) -> dict[int, str]:
    """
    Creates a mapping: art_id -> split_name, using stratified split on (type, origin).
    Split is done at artwork level to avoid leakage across splits.
    """
    # Read only the columns needed to define strata at artwork level
    usecols = ["art_id", "type", "origin"]
    meta = pd.read_csv(path, usecols=usecols)

    # Artwork-level metadata (one row per art_id)
    art = (
        meta.groupby("art_id", as_index=False)
        .agg({"type": "first", "origin": "first"})
    )

    # Stratum ensures coverage across songs/poems and countries/eras
    art["stratum"] = art["type"].astype(str) + "|" + art["origin"].astype(str)

    art_ids = art["art_id"].to_numpy()
    strata = art["stratum"].to_numpy()

    # 70% train, 30% temp
    train_ids, temp_ids = train_test_split(
        art_ids,
        test_size=0.30,
        random_state=RANDOM_STATE,
        stratify=strata
    )

    # Split temp into 15% val, 15% test (i.e., half/half of 30%)
    # Need strata for temp only
    temp_strata = art.set_index("art_id").loc[temp_ids, "stratum"].to_numpy()
    val_ids, test_ids = train_test_split(
        temp_ids,
        test_size=0.50,
        random_state=RANDOM_STATE,
        stratify=temp_strata
    )

    split_map = {int(a): "train" for a in train_ids}
    split_map.update({int(a): "validation" for a in val_ids})
    split_map.update({int(a): "test" for a in test_ids})
    return split_map


def write_splits_streaming(path: str, split_map: dict[int, str]) -> None:
    """
    Streams through the big CSV and writes out train/val/test without loading everything at once.
    """
    # Reset outputs
    for f in (OUT_TRAIN, OUT_VAL, OUT_TEST):
        open(f, "w", encoding="utf-8").close()

    header_written = {"train": False, "validation": False, "test": False}

    for chunk in pd.read_csv(path, chunksize=CHUNK_SIZE):
        # Assign split by art_id
        chunk["__split__"] = chunk["art_id"].map(split_map)

        # Drop any rows whose art_id isn't mapped (shouldn't happen, but safe)
        chunk = chunk.dropna(subset=["__split__"])

        for split_name, out_path in [
            ("train", OUT_TRAIN),
            ("validation", OUT_VAL),
            ("test", OUT_TEST),
        ]:
            part = chunk[chunk["__split__"] == split_name].drop(columns=["__split__"])
            if part.empty:
                continue
            part.to_csv(
                out_path,
                mode="a",
                index=False,
                header=not header_written[split_name],
                encoding="utf-8"
            )
            header_written[split_name] = True


if __name__ == "__main__":
    split_map = build_artwork_split_map(INPUT_CSV)
    write_splits_streaming(INPUT_CSV, split_map)
    print("Done.")
    print("Wrote:", OUT_TRAIN, OUT_VAL, OUT_TEST)
```

---

## Dialect-Specific Subsets

In addition to the standard train/validation/test splits, the repository provides dialect-specific CSV files, where the corpus is partitioned by the dialect label. Each file contains all verses belonging to a single dialect category:

- Classical
- MSA
- Egyptian
- Gulf
- Levantine
- Iraqi
- Sudanese
- Maghrebi

The dialect splits are derived directly from the master file and preserve full metadata, including origin, type, and art_id.

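Because the dialect files keep `art_id`, their rows can be lined up with the work-level train/validation/test assignment. The following is a minimal sketch of that idea, not part of the released scripts: the helper name `tag_with_split` and the `split` column are hypothetical, and it assumes the `art_id` values in a dialect file match those in the split files.

```python
import pandas as pd


def tag_with_split(
    dialect_df: pd.DataFrame, split_ids: dict[str, set[int]]
) -> pd.DataFrame:
    """Return a copy of dialect_df with a hypothetical `split` column.

    split_ids maps a split name ("train", "validation", "test") to the set
    of art_id values that belong to that split.
    """
    # Invert {split_name: {art_id, ...}} into {art_id: split_name}.
    art_to_split = {
        art_id: name for name, ids in split_ids.items() for art_id in ids
    }
    tagged = dialect_df.copy()
    # Works found in no split are left as NaN so they are easy to spot.
    tagged["split"] = tagged["art_id"].map(art_to_split)
    return tagged


if __name__ == "__main__":
    # Tiny made-up frame standing in for one dialect file.
    demo = pd.DataFrame(
        {"art_id": [1, 1, 2, 3], "verse_order": [1, 2, 1, 1]}
    )
    demo_ids = {"train": {1}, "validation": {2}, "test": {3}}
    print(tag_with_split(demo, demo_ids))
```

With the real files, `split_ids` could be built by reading only the `art_id` column of `train.csv`, `validation.csv`, and `test.csv` (for example with `pd.read_csv(path, usecols=["art_id"])`) and converting each to a set. Filtering the tagged frame on `split` then gives dialect-specific training and evaluation sets that respect the work-level boundaries.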
These subsets support:

- Dialect-specific modelling and evaluation
- Controlled experiments on regional linguistic variation
- Cross-dialect transfer learning
- Vocabulary and stylistic analysis within dialect boundaries

```python
import os
import pandas as pd

# ====== CONFIG ======
INPUT_FILE = "tarab_full.csv"
OUTPUT_DIR = "tarab_by_dialect"
ENCODING = "utf-8"
# ====================

# Create output directory if it doesn't exist
os.makedirs(OUTPUT_DIR, exist_ok=True)

# Load dataset
df = pd.read_csv(INPUT_FILE, encoding=ENCODING)

# Basic sanity check
print(f"Total rows: {len(df):,}")
print(f"Unique dialects: {df['dialect'].nunique()}")

# Clean dialect labels (optional but safer)
df["dialect"] = df["dialect"].astype(str).str.strip()

# Get unique dialects
dialects = sorted(df["dialect"].unique())

print("\nCreating files per dialect...\n")

for d in dialects:
    dialect_df = df[df["dialect"] == d]

    # Safe filename
    safe_name = d.replace(" ", "_").replace("/", "_")
    output_path = os.path.join(OUTPUT_DIR, f"tarab_{safe_name}.csv")

    dialect_df.to_csv(output_path, index=False, encoding="utf-8")

    print(f"{d}:")
    print(f"  Verses: {len(dialect_df):,}")
    print(f"  Works: {dialect_df['art_id'].nunique():,}")
    print(f"  File: {output_path}\n")

print("Done.")
```

---

## Tarab Miscellaneous: Additional Thematic and Web-Derived Split

We compiled a supplementary split based on thematic categories collected from publicly available Arabic song websites. These sources are informal and not officially curated, so their categorisation cannot be independently verified.

- **Tarab_love_songs.csv** Songs labelled under romantic or love-related themes.
- **Tarab_hiphop_songs.csv** Arabic hip hop tracks.
- **Tarab_deeni_songs.csv** Religious songs.
- **Tarab_khaleeji_songs.csv** Songs categorised as Gulf (Khaleeji). This reflects dialect or stylistic classification rather than artist nationality. For example, an Egyptian singer may perform in Gulf dialect.
- **Tarab_maghribi_songs.csv** Songs labelled as Maghrebi. As above, this reflects dialectal or stylistic features, not necessarily the artist’s country of origin. A Saudi singer, for instance, may perform in Moroccan dialect.
- **Tarab_video_songs.csv** Songs associated with video-clip releases, as identified by the source websites.
- **Tarab_poetry.csv** Poetry entries collected from Kaggle (see the Tarab paper for the reference).
- **artists_details.csv** A partially completed metadata file derived from Wikidata, containing finer-grained information about artists: nationality, dominant dialect, birth and death years, active period, and brief biographical notes. Due to resource constraints, this metadata enrichment was not completed. In principle, it could be extended using a robust large language model to assist with structured biographical completion and validation.

This split should be treated as weakly supervised metadata derived from web categorisation rather than authoritative genre or dialect annotation.

---

## Citation

If you use Tarab, please cite:

```bibtex
@inproceedings{elhaj2026tarab,
  title={Tarab: A Multi-Dialect Corpus of Arabic Lyrics and Poetry},
  author={El-Haj, Mo},
  booktitle={Proceedings of the 2nd Workshop on NLP for Languages Using Arabic Script (AbjadNLP 2026) at the 19th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2026)},
  pages={37--46},
  address={Rabat, Morocco},
  month={March},
  year={2026}
}
```
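
As a usage note on the verse-level layout described in the Dataset Overview: tasks that need whole songs or poems (for instance text generation) may want the verses reassembled into one text per work. The sketch below shows one way to do that with pandas by grouping on `art_id` and ordering by `verse_order`. The function name `rebuild_works` and the output columns `n_verses` and `text` are illustrative assumptions, not part of the dataset.

```python
import pandas as pd


def rebuild_works(verses: pd.DataFrame, sep: str = "\n") -> pd.DataFrame:
    """Collapse verse-level rows into one row per work (art_id).

    Expects the columns art_id, art_title, dialect, verse_order and
    verse_lyrics; returns one row per art_id with the verses joined by sep.
    """
    # Order verses inside each work before joining them.
    ordered = verses.sort_values(["art_id", "verse_order"])
    works = (
        ordered.groupby("art_id", sort=True)
        .agg(
            art_title=("art_title", "first"),
            dialect=("dialect", "first"),
            n_verses=("verse_lyrics", "size"),
            # Skip missing verses rather than joining the string "nan".
            text=("verse_lyrics", lambda s: sep.join(s.dropna().astype(str))),
        )
        .reset_index()
    )
    return works


if __name__ == "__main__":
    # Made-up rows that mimic the schema; not real corpus content.
    demo = pd.DataFrame(
        {
            "art_id": [2, 1, 1],
            "art_title": ["B", "A", "A"],
            "dialect": ["MSA", "Gulf", "Gulf"],
            "verse_order": [1, 2, 1],
            "verse_lyrics": ["b1", "a2", "a1"],
        }
    )
    print(rebuild_works(demo))
```

Applying this separately to each split file keeps the work-level separation intact, since every `art_id` belongs to exactly one split.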
Provider:
drelhaj