
Farah21/Tunisia_Tech

Hugging Face · Updated 2026-03-27 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/Farah21/Tunisia_Tech
Resource description:
---
language:
- ar
- fr
- en
tags:
- tunisian-arabic
- arabic-dialect
- tech-qa
- alpaca
- instruction-tuning
- nlp
license: cc-by-4.0
size_categories:
- 1K<n<10K
---

# Reddit Tunisia Tech QA Dataset

The pipeline employs a multi-layer scraping strategy to collect Tunisian Arabic tech QA pairs from Reddit. It combines three data sources: the Reddit JSON API for real-time posts and comment threads, RSS feeds for post discovery with JSON comment fetching, and Arctic Shift for historical data. The system builds both direct QA pairs (post → top comment) and conversation chains (comment → reply) to capture nuanced technical discussions. Each pair is scored using a quality function that considers length, tech relevance, and upvotes, then filtered and ranked to produce the top 5,000 high-quality instruction pairs. The final dataset is formatted in Alpaca standard and uploaded to Hugging Face with automatic train/test splitting.

**Results**: **5,500 instruction-following QA pairs** scraped from Reddit communities with strong Tunisian presence, filtered and ranked for technical relevance.

Built as part of the **TunisIA-Co-Lab** initiative to create Tunisian Arabic NLP resources.
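The score-filter-rank step described above can be sketched as follows. The card names the three inputs (length, tech relevance, upvotes) but does not give the formula, so everything here is an assumption for illustration: the keyword list `TECH_KEYWORDS`, the weights, the saturation points, and the helper names `tech_density`, `quality_score`, and `top_pairs` are all hypothetical and are not the pipeline's actual code.

```python
import math

# Hypothetical keyword list; the pipeline's real tech vocabulary is not published.
TECH_KEYWORDS = {"python", "api", "linux", "docker", "sql", "javascript"}


def tech_density(text: str) -> float:
    """Fraction of whitespace-separated tokens that are tech keywords (0-1)."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t.strip(".,!?()") in TECH_KEYWORDS)
    return hits / len(tokens)


def quality_score(answer: str, upvotes: int) -> float:
    """Combine length, tech density, and upvotes into a score in [0, 1].

    The weights (0.4 / 0.3 / 0.3) and the saturation points (500 characters,
    10% keyword density, 100 upvotes) are illustrative assumptions.
    """
    length_part = min(len(answer) / 500, 1.0)
    tech_part = min(tech_density(answer) * 10, 1.0)
    upvote_part = min(math.log1p(max(upvotes, 0)) / math.log1p(100), 1.0)
    return round(0.4 * length_part + 0.3 * tech_part + 0.3 * upvote_part, 4)


def top_pairs(pairs, k, min_score=0.0):
    """Score each Alpaca-style pair, drop low scorers, return the best k."""
    scored = [
        dict(p, quality_score=quality_score(p["output"], p["comment_score"]))
        for p in pairs
    ]
    kept = [p for p in scored if p["quality_score"] >= min_score]
    return sorted(kept, key=lambda p: p["quality_score"], reverse=True)[:k]


# Made-up sample pairs, for demonstration only.
sample = [
    {"instruction": "q1", "input": "",
     "output": "Use docker and python for the api", "comment_score": 50},
    {"instruction": "q2", "input": "", "output": "ok", "comment_score": 0},
]
best = top_pairs(sample, k=1)
```

A weighted sum is used here rather than a literal product so that a single near-zero factor does not collapse the whole score; that is a design choice of this sketch, not a statement about how the dataset was built.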
## Quality summary

| Metric | Value |
|--------|-------|
| Total pairs | 5,500 |
| Avg quality score | 0.634 |
| Avg tech confidence | 0.0275 |

## Dialect distribution

| Dialect | Count |
|---------|-------|
| mixed | 5104 |
| tunisian_arabizi | 165 |
| french | 117 |
| arabic | 110 |
| tunisian_arabic | 4 |

## Dataset structure (Alpaca format)

| Column | Description |
|--------|-------------|
| `instruction` | Reddit post title (+ body when present) |
| `input` | Always `""` (Alpaca standard) |
| `output` | Top/best Reddit comment as the answer |
| `source` | Subreddit origin |
| `dialect` | `tunisian_arabic`, `tunisian_arabizi`, `french`, `arabic`, `mixed` |
| `quality_score` | Holistic quality 0–1 (length × tech density × upvotes) |
| `comment_score` | Reddit upvotes on the answer comment |
| `chain_depth` | 0 = direct reply to post; >0 = reply-to-reply conversation chain |
| `tech_confidence` | Tech-density confidence 0–1 |
| `layer` | Data source layer: `json_api`, `json_api_chain`, `rss`, `arctic_shift` |

## Sources

Scraped from: r/Tunisia, r/learnprogramming, r/cscareerquestions, r/datascience, r/MachineLearning, r/Python, r/webdev, r/freelance, r/digitalnomad.

## Usage

```python
from datasets import load_dataset

ds = load_dataset("Farah21/Tunisia_Tech")
print(ds["train"][0])
```

## Citation

```bibtex
@dataset{tunisian_tech_qa_2026,
  title={Reddit Tunisia Tech QA},
  author={TunisIA-Co-Lab},
  year={2026},
  url={https://huggingface.co/datasets/Farah21/Tunisia_Tech}
}
```
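Since most rows are tagged `mixed`, a common first step is to subset by the metadata columns listed in the dataset structure table. The sketch below works on plain dictionaries so it runs without downloading anything; the rows in `rows` are made-up placeholders shaped like the schema, and `keep_row` and `select` are hypothetical helper names.

```python
# Made-up rows that mirror the documented columns; not real dataset records.
rows = [
    {"instruction": "q1", "input": "", "output": "a1", "source": "Tunisia",
     "dialect": "tunisian_arabizi", "quality_score": 0.81, "comment_score": 12,
     "chain_depth": 0, "tech_confidence": 0.05, "layer": "json_api"},
    {"instruction": "q2", "input": "", "output": "a2", "source": "Python",
     "dialect": "mixed", "quality_score": 0.42, "comment_score": 3,
     "chain_depth": 1, "tech_confidence": 0.01, "layer": "json_api_chain"},
    {"instruction": "q3", "input": "", "output": "a3", "source": "Tunisia",
     "dialect": "french", "quality_score": 0.70, "comment_score": 7,
     "chain_depth": 0, "tech_confidence": 0.03, "layer": "rss"},
]


def keep_row(row, dialects, min_quality, direct_only):
    """Return True if the row matches the dialect, quality, and depth filters."""
    if row["dialect"] not in dialects:
        return False
    if row["quality_score"] < min_quality:
        return False
    # chain_depth == 0 marks a direct reply to the post.
    if direct_only and row["chain_depth"] != 0:
        return False
    return True


def select(rows, dialects, min_quality=0.0, direct_only=True):
    """Filter an iterable of row dicts using keep_row."""
    return [r for r in rows if keep_row(r, dialects, min_quality, direct_only)]


subset = select(rows, dialects={"tunisian_arabizi", "french"}, min_quality=0.6)
```

With the real data loaded, the same predicate could be passed to the `filter` method of a split, for example `ds["train"].filter(lambda r: keep_row(r, {"tunisian_arabizi"}, 0.6, True))`; the thresholds shown are arbitrary.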
Provider:
Farah21