TajikNLPWorld/tajik-lora-qlora-benchmark
Source: Hugging Face · Updated 2026-04-06 · Indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/TajikNLPWorld/tajik-lora-qlora-benchmark
Resource description:
---
language: tg
license: apache-2.0
tags:
- tajik
- lora
- qlora
- mistral
- qwen
- phi
- gpt2
- low-resource-language
- benchmark
pretty_name: Tajik LoRA/QLoRA Benchmark
size_categories: 1K<n<10K
task_categories:
- text-generation
---
# Tajik LoRA/QLoRA Benchmark
## 📊 Description
This benchmark contains the complete results of fine-tuning **15+ language models** (124M to 7B parameters) on a **Tajik-language** subset: 1,000 sentences from the [TajikNLPWorld/tajik-web-corpus](https://huggingface.co/datasets/TajikNLPWorld/tajik-web-corpus).
The study compares full fine-tuning with LoRA and QLoRA on three axes: model quality (perplexity), GPU memory usage, and training time.
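Perplexity is conventionally the exponential of the mean per-token cross-entropy loss (in nats). The sketch below illustrates that relationship only. The function name and the example loss values are assumptions made for illustration, not values taken from `metrics.csv`, and the card does not state exactly how its perplexities were computed.

```python
import math

def perplexity_from_losses(token_losses):
    """Return exp(mean loss) for per-token cross-entropy losses in nats."""
    if not token_losses:
        raise ValueError("need at least one token loss")
    return math.exp(sum(token_losses) / len(token_losses))

# Hypothetical per-token losses, for illustration only
example_losses = [1.2, 1.3, 1.25]
print(round(perplexity_from_losses(example_losses), 2))
```

Under this definition, a perplexity of 3.48 corresponds to a mean loss of about 1.25 nats per token. Because the value is per token, perplexities from models with different tokenizers are not directly comparable, which may be worth keeping in mind when reading the results.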
### Key Findings
- **GPT‑2 medium (full fine-tuning)** achieves the lowest perplexity (3.48) but fails to generate coherent Tajik text on real prompts, so low perplexity alone does not guarantee usable output.
- **Mistral‑7B with QLoRA (r=16)** offers the best trade‑off: a perplexity of 5.03 together with meaningful Tajik sentences.
- **mT5‑small with QLoRA (r=8)** reaches a perplexity of 6.34 once its fp16 issues are fixed, making it a strong multilingual baseline.
- **LoRA sharply reduces GPU memory** (e.g., GPT‑2 medium drops from 7.1 GB to 1.1 GB, roughly 85% less) at the cost of a modest drop in quality.
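To give a sense of why LoRA trains so few parameters, the sketch below counts the weights a single rank-`r` adapter adds to one frozen weight matrix, and recomputes the memory reduction from the two GPT‑2 medium figures quoted above. The 4096 × 4096 matrix size is an assumed example rather than a dimension reported by the benchmark, and both helper names are hypothetical.

```python
def lora_trainable_params(d_out, d_in, rank):
    """Parameters added by one LoRA adapter: B is d_out x rank, A is rank x d_in."""
    return rank * (d_out + d_in)

def relative_reduction(before, after):
    """Fractional reduction going from `before` to `after`."""
    return (before - after) / before

# Hypothetical 4096 x 4096 projection matrix (assumed size, not from the benchmark)
d = 4096
full = d * d
for r in (8, 16):
    added = lora_trainable_params(d, d, r)
    print(f"r={r}: {added} trainable vs {full} frozen ({added / full:.2%})")

# GPU memory figures quoted above for GPT-2 medium: 7.1 GB full vs 1.1 GB with LoRA
print(f"memory reduction: {relative_reduction(7.1, 1.1):.1%}")
```

Doubling the rank doubles the adapter size, yet in this example the adapter stays below 1% of the frozen matrix at both ranks.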
## 🏆 Best Performing Models (by Perplexity)
| Model | Perplexity (mean±std) | GPU Memory (GB) | Training Time (s) |
|-------|----------------------|-----------------|--------------------|
| GPT‑2 medium (full) | 3.48 ± 0.00 | n/a | 136.0 |
| GPT‑2 (full) | 4.48 ± 0.02 | n/a | 40.5 |
| Mistral‑7B + QLoRA (r=16) | 5.03 ± 0.03 | 15.28 | 1987 |
| DistilGPT‑2 (full) | 5.03 ± 0.02 | n/a | 25.5 |
| Mistral‑7B + QLoRA (r=8) | 5.11 ± 0.03 | 14.21 | 1991 |
| mT5‑small + QLoRA (r=8) | 6.34 ± 0.44 | 25.05 | 376.8 |
| Qwen2.5‑7B + QLoRA (r=16) | 7.35 ± 0.02 | n/a | 1531.0 |
For the full list, see `metrics.csv`.
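The mean ± std columns imply aggregation over several seeds. A minimal sketch of that kind of aggregation follows, using made-up rows and hypothetical model names. The actual column layout of `metrics.csv` is not documented here, and whether the authors used the sample or the population standard deviation is unknown; this sketch assumes the sample version.

```python
import statistics
from collections import defaultdict

# Hypothetical per-seed rows: (model, seed, perplexity). Values are made up.
rows = [
    ("model-a", 0, 5.00), ("model-a", 1, 5.06),
    ("model-b", 0, 4.47), ("model-b", 1, 4.49),
]

def summarize(rows):
    """Group perplexities by model and return (model, mean, std) sorted by mean."""
    by_model = defaultdict(list)
    for model, _seed, ppl in rows:
        by_model[model].append(ppl)
    summary = [
        (model, statistics.mean(vals),
         statistics.stdev(vals) if len(vals) > 1 else 0.0)
        for model, vals in by_model.items()
    ]
    return sorted(summary, key=lambda item: item[1])

for model, mean, std in summarize(rows):
    print(f"{model}: {mean:.2f} ± {std:.2f}")
```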
## 📁 Repository Content
| File | Description |
|------|-------------|
| `metrics.csv` | Full metrics table (perplexity, loss, GPU, time, seeds) |
| `generations.csv` | Model generations for 10 Tajik prompts (if available) |
| `paper_table.tex` | LaTeX table ready for academic papers (optional) |
| `analysis_report.html` | Complete interactive HTML report with all plots |
| `perplexity_comparison.png` | Bar chart of perplexity with error bars |
| `time_vs_perplexity.png` | Training time vs. model quality |
| `gpu_vs_perplexity.png` | GPU memory usage vs. model quality |
| `params_vs_perplexity.png` | Trainable parameters vs. quality (LoRA models) |
## 🔬 Qualitative Comparison (Tajik Prompts)
| Prompt | Mistral-7B (r=16) | GPT‑2 medium (full) |
|--------|-------------------|----------------------|
| Салом, шумо кӣ ҳастед? | Ҳамчун бештар аз сол пеш дар Тоҷикистон... | I am not sure whether or when this post... |
| Тоҷикистон | / Қонун ва тартибот / Афзудани тоҷик... | ая хелом грумы, 2.2k I'm a guy who likes... |
See the full comparison in `generations.csv`.
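One coarse way to screen generations like those in the table is to measure how much of the output is written in Cyrillic script. The heuristic below is an illustrative assumption, not the evaluation method used in this benchmark. It cannot judge coherence, and it cannot tell Tajik apart from other Cyrillic-script languages such as Russian.

```python
def cyrillic_ratio(text):
    """Fraction of alphabetic characters in the Unicode Cyrillic block (U+0400-U+04FF)."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    cyrillic = sum(1 for ch in letters if "\u0400" <= ch <= "\u04ff")
    return cyrillic / len(letters)

# Output excerpts copied from the first row of the table above
samples = {
    "Mistral-7B (r=16)": "Ҳамчун бештар аз сол пеш дар Тоҷикистон",
    "GPT-2 medium (full)": "I am not sure whether or when this post",
}
for name, text in samples.items():
    print(f"{name}: {cyrillic_ratio(text):.2f}")
```

A ratio near zero flags output that has drifted into another script, as with the English continuation from GPT‑2 medium. A high ratio is only a necessary condition for usable Tajik text, not a sufficient one.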
## 🚀 Usage
Load the benchmark data directly from Hugging Face:
```python
import pandas as pd

# Base URL for raw files in the dataset repository
BASE_URL = "https://huggingface.co/datasets/TajikNLPWorld/tajik-lora-qlora-benchmark/resolve/main"

# Load the full metrics table
metrics = pd.read_csv(f"{BASE_URL}/metrics.csv")
print(metrics.head())

# Load model generations (this file may be absent, in which case the request fails)
generations = pd.read_csv(f"{BASE_URL}/generations.csv")
print(generations.head())
```
## 🤝 Part of TajikNLPWorld
This benchmark is part of the [TajikNLPWorld](https://huggingface.co/TajikNLPWorld) initiative, a collaborative research hub for Tajik and Persian low‑resource languages.
## 📝 Citation
If you use this benchmark in your research, please cite it as:
```bibtex
@misc{tajik-lora-benchmark,
author = {Tajik NLP Community},
title = {Tajik LoRA/QLoRA Benchmark},
year = {2026},
publisher = {Hugging Face},
url = {https://huggingface.co/datasets/TajikNLPWorld/tajik-lora-qlora-benchmark}
}
```
## 📧 Contact
- **Organization:** [TajikNLPWorld](https://huggingface.co/TajikNLPWorld)
- **Email:** not provided
Provider:
TajikNLPWorld



