
TajikNLPWorld/tajik-lora-qlora-benchmark

Source: Hugging Face. Updated 2026-04-06. Indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/TajikNLPWorld/tajik-lora-qlora-benchmark
Resource description:
---
language: tg
license: apache-2.0
tags:
- tajik
- lora
- qlora
- mistral
- qwen
- phi
- gpt2
- low-resource-language
- benchmark
pretty_name: Tajik LoRA/QLoRA Benchmark
size_categories: 1K<n<10K
task_categories:
- text-generation
---

# Tajik LoRA/QLoRA Benchmark

## 📊 Description

This benchmark contains the complete results of fine-tuning **15+ language models** (from 124M to 7B parameters) on a **Tajik-language** subset of 1000 sentences drawn from the [TajikNLPWorld/tajik-web-corpus](https://huggingface.co/datasets/TajikNLPWorld/tajik-web-corpus). The study compares full fine-tuning with LoRA/QLoRA, evaluating model quality (perplexity), GPU memory usage, and training time.

### Key Findings

- **GPT‑2 medium (full fine-tuning)** achieves the lowest perplexity (3.48), but fails to generate coherent Tajik text when tested on real prompts.
- **Mistral‑7B with QLoRA (r=16)** shows the best trade‑off: perplexity 5.03, and it generates meaningful Tajik sentences.
- **mT5‑small with QLoRA (r=8)** reaches perplexity 6.34 after fixing fp16 issues, a strong multilingual baseline.
- **LoRA drastically reduces GPU memory** (e.g., GPT‑2 medium from 7.1 GB to 1.1 GB) with a modest quality drop.

## 🏆 Best Performing Models (by Perplexity)

| Model | Perplexity (mean ± std) | GPU Memory (GB) | Training Time (s) |
|-------|-------------------------|-----------------|-------------------|
| GPT‑2 medium (full) | 3.48 ± 0.00 | n/a | 136.0 |
| GPT‑2 (full) | 4.48 ± 0.02 | n/a | 40.5 |
| Mistral‑7B + QLoRA (r=16) | 5.03 ± 0.03 | 15.28 | 1987 |
| DistilGPT‑2 (full) | 5.03 ± 0.02 | n/a | 25.5 |
| Mistral‑7B + QLoRA (r=8) | 5.11 ± 0.03 | 14.21 | 1991 |
| mT5‑small + QLoRA (r=8) | 6.34 ± 0.44 | 25.05 | 376.8 |
| Qwen2.5‑7B + QLoRA (r=16) | 7.35 ± 0.02 | n/a | 1531.0 |

For the full list see `metrics.csv`.
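
The figures quoted in the Key Findings and the table can be re-derived with a few lines of plain Python. The sketch below re-enters the table rows by hand; the tuple layout and the helper names `memory_reduction` and `rank_by_perplexity` are illustrative assumptions and do not reflect the column names in `metrics.csv`.

```python
# Rows copied by hand from the README table:
# (model, mean perplexity, training time in seconds).
RESULTS = [
    ("GPT-2 medium (full)", 3.48, 136.0),
    ("GPT-2 (full)", 4.48, 40.5),
    ("Mistral-7B + QLoRA (r=16)", 5.03, 1987.0),
    ("DistilGPT-2 (full)", 5.03, 25.5),
    ("Mistral-7B + QLoRA (r=8)", 5.11, 1991.0),
    ("mT5-small + QLoRA (r=8)", 6.34, 376.8),
    ("Qwen2.5-7B + QLoRA (r=16)", 7.35, 1531.0),
]


def memory_reduction(full_gb: float, lora_gb: float) -> float:
    """Fraction of GPU memory saved relative to full fine-tuning."""
    return (full_gb - lora_gb) / full_gb


def rank_by_perplexity(rows):
    """Return rows sorted from lowest (best) to highest perplexity."""
    return sorted(rows, key=lambda row: row[1])


# GPT-2 medium, 7.1 GB -> 1.1 GB, as quoted in the Key Findings.
saving = memory_reduction(7.1, 1.1)
print(f"GPU memory saved with LoRA: {saving:.1%}")

best = rank_by_perplexity(RESULTS)[0]
fastest = min(RESULTS, key=lambda row: row[2])
print(f"Lowest perplexity: {best[0]} ({best[1]})")
print(f"Shortest training time: {fastest[0]} ({fastest[2]} s)")
```

Note that ranking by perplexity alone puts GPT‑2 medium first, even though the Key Findings report that it fails to produce coherent Tajik text, so the ranking is best read alongside the qualitative generations.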

## 📁 Repository Content

| File | Description |
|------|-------------|
| `metrics.csv` | Full metrics table (perplexity, loss, GPU, time, seeds) |
| `generations.csv` | Model generations for 10 Tajik prompts (if available) |
| `paper_table.tex` | LaTeX table ready for academic papers (optional) |
| `analysis_report.html` | Complete interactive HTML report with all plots |
| `perplexity_comparison.png` | Bar chart of perplexity with error bars |
| `time_vs_perplexity.png` | Training time vs. model quality |
| `gpu_vs_perplexity.png` | GPU memory usage vs. model quality |
| `params_vs_perplexity.png` | Trainable parameters vs. quality (LoRA models) |

## 🔬 Qualitative Comparison (Tajik Prompts)

| Prompt | Mistral-7B (r=16) | GPT‑2 medium (full) |
|--------|-------------------|----------------------|
| Салом, шумо кӣ ҳастед? | Ҳамчун бештар аз сол пеш дар Тоҷикистон... | I am not sure whether or when this post... |
| Тоҷикистон | / Қонун ва тартибот / Афзудани тоҷик... | ая хелом грумы, 2.2k I'm a guy who likes... |

See full comparison in `generations.csv`.

## 🚀 Usage

Load the benchmark data directly from Hugging Face:

```python
import pandas as pd

# Load metrics
metrics = pd.read_csv("https://huggingface.co/datasets/TajikNLPWorld/tajik-lora-qlora-benchmark/resolve/main/metrics.csv")
print(metrics.head())

# Load generations (if available)
generations = pd.read_csv("https://huggingface.co/datasets/TajikNLPWorld/tajik-lora-qlora-benchmark/resolve/main/generations.csv")
print(generations.head())
```

## 🤝 Part of TajikNLPWorld

This benchmark is part of the [TajikNLPWorld](https://huggingface.co/TajikNLPWorld) initiative, a collaborative research hub for Tajik and Persian low‑resource languages.
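
Because the Usage section marks `generations.csv` as "if available", a plain `pd.read_csv` call would raise if the file is absent. The following is a minimal sketch of a tolerant loader; the helper name `load_optional_csv` and the `BASE_URL` constant are assumptions introduced here for illustration and are not part of the repository.

```python
from typing import Optional

import pandas as pd

# Base URL taken from the Usage section; the constant name is hypothetical.
BASE_URL = (
    "https://huggingface.co/datasets/TajikNLPWorld/"
    "tajik-lora-qlora-benchmark/resolve/main"
)


def load_optional_csv(source: str) -> Optional[pd.DataFrame]:
    """Return the CSV at `source` as a DataFrame, or None if it cannot be opened."""
    try:
        return pd.read_csv(source)
    except OSError as err:
        # OSError covers a missing local file as well as urllib's HTTP errors
        # (for example a 404 when the optional file was never uploaded).
        print(f"Skipping {source}: {err}")
        return None
```

Calling `load_optional_csv(f"{BASE_URL}/generations.csv")` then yields either a DataFrame or `None`, so downstream code can branch on availability instead of crashing.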

## 📝 Citation

If you use this benchmark in your research, please cite it as:

```bibtex
@misc{tajik-lora-benchmark,
  author = {Tajik NLP Community},
  title = {Tajik LoRA/QLoRA Benchmark},
  year = {2026},
  publisher = {Hugging Face},
  url = {https://huggingface.co/datasets/TajikNLPWorld/tajik-lora-qlora-benchmark}
}
```

## 📧 Contact

- **Organization:** [TajikNLPWorld](https://huggingface.co/TajikNLPWorld)
- **Email:** (enter your email or leave blank)
Provider:
TajikNLPWorld