
amine-khelif/Algerian-Darija

Source: Hugging Face. Updated 2025-12-05; indexed 2025-12-20.
Download link:
https://hf-mirror.com/datasets/amine-khelif/Algerian-Darija
Resource description:

---
language:
- ar
license: cc-by-4.0
size_categories:
- 100K<n<1M
task_categories:
- text-generation
- text2text-generation
pretty_name: Algerian Darija
dataset_info:
  features:
  - name: Text
    dtype: string
  splits:
  - name: train
    num_bytes: 30499704
    num_examples: 2324
  - name: v1
    num_bytes: 23477688
    num_examples: 168655
  download_size: 44762377
  dataset_size: 53977392
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: v1
    path: data/v1-*
tags:
- Darija
- Algeria
---

## Overview

This dataset contains text in `Algerian Darija`, collected from a variety of sources including **existing datasets on Hugging Face**, **web scraping**, and **YouTube transcript APIs**.

- The **`train`** split consists of more than **2k rows** (2,324 per the metadata above) of uncleaned text data.
- The **`v1`** split consists of roughly **170k rows** (168,655 per the metadata above) of split and partially cleaned text.

## Sources

The text data was gathered from:

- **Hugging Face Datasets**: Pre-existing datasets relevant to Algerian Darija.
- **Web Scraping**: Content from various online sources.
- **YouTube API**: Transcriptions of Algerian Darija videos, and comments on YouTube.

## Data Cleaning

Initial data cleaning steps included:

- Removing duplicate emojis and characters.
- Removing URLs, email addresses, and phone numbers.

**Note**: Some text data from the YouTube Transcript API may contain imperfections due to limitations in speech-to-text technology for Algerian Darija. Additionally, the dataset still requires further cleaning to improve its quality for more advanced NLP tasks.
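
The cleaning steps listed above could be approximated as follows. This is a minimal sketch, not the author's actual pipeline: the function name `clean_text`, every regular expression, and the choice to collapse runs (two or more identical symbols, three or more identical letters) are assumptions made for illustration.

```python
import re

# All patterns below are illustrative assumptions, not the dataset author's rules.
URL_RE = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
# Loose phone pattern: a digit, 7+ digits/separators, a digit. It can also
# catch other long digit groups, so tune it before real use.
PHONE_RE = re.compile(r"(?<!\w)\+?\d[\d\s().-]{7,}\d(?!\w)")
# Runs of the same emoji or punctuation mark (anything that is neither a
# word character nor whitespace). Multi-code-point emoji are not handled.
SYMBOL_RUN_RE = re.compile(r"([^\w\s])\1+")
# Runs of 3+ identical letters; digits are excluded so numbers stay intact.
LETTER_RUN_RE = re.compile(r"([^\W\d_])\1{2,}")
SPACE_RE = re.compile(r"\s+")


def clean_text(text: str) -> str:
    """Strip URLs, emails, and phone numbers, then collapse repeated characters."""
    text = URL_RE.sub(" ", text)
    text = EMAIL_RE.sub(" ", text)
    text = PHONE_RE.sub(" ", text)
    text = SYMBOL_RUN_RE.sub(r"\1", text)
    text = LETTER_RUN_RE.sub(r"\1", text)
    return SPACE_RE.sub(" ", text).strip()
```

To apply this to the raw `train` split, one option is to load the data with the Hugging Face `datasets` library and map `clean_text` over the `Text` column named in the metadata. Collapsing letter runs only at three or more repeats is a deliberate trade-off: it keeps legitimately doubled letters but leaves some elongated words only partly normalized.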
Provided by:
amine-khelif