omneity-labs/lid-benchmark
Source: Hugging Face. Updated 2026-04-05. Indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/omneity-labs/lid-benchmark
Resource description:
---
license: apache-2.0
task_categories:
- text-classification
language:
- multilingual
tags:
- language-identification
- lid
- benchmark
- evaluation
size_categories:
- 10M<n<100M
configs:
- config_name: results_summary
  data_files: results_summary/train.parquet
- config_name: results_aggregate
  data_files: results_aggregate/train.parquet
- config_name: results_per_language
  data_files: results_per_language/train.parquet
- config_name: results_speed
  data_files: results_speed/train.parquet
- config_name: model_languages
  data_files: model_languages/train.parquet
- config_name: results_individual
  data_files: results_individual/train.parquet
---
# LID Benchmark
Comprehensive evaluation of **17 language identification models** across **8 diverse benchmarks**.
Built by [Omneity Labs](https://www.omneitylabs.com).
## Subsets
| Config | Description | Rows |
|--------|-------------|------|
| `results_summary` | One row per model × benchmark × scope with aggregate metrics | ~136 |
| `results_aggregate` | Detailed aggregate metrics per model × benchmark × scope | ~816 |
| `results_per_language` | Per-language accuracy for every model × benchmark × scope | ~57k |
| `results_speed` | Inference speed (samples/sec) per model × benchmark | ~136 |
| `model_languages` | Supported language codes declared by each model | ~4.7k |
| `results_individual` | Every individual prediction (model × benchmark × sample) | ~28M |
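
The subsets above can be loaded one config at a time. The following is a minimal sketch, assuming the Hugging Face `datasets` library is installed and that each config exposes a single `train` split (inferred from the `train.parquet` file names in the card's YAML header, not confirmed by the authors). The helper names `parquet_path` and `load_config` are illustrative and not part of the dataset.

```python
REPO_ID = "omneity-labs/lid-benchmark"

# Config names as listed in the Subsets table.
CONFIGS = (
    "results_summary",
    "results_aggregate",
    "results_per_language",
    "results_speed",
    "model_languages",
    "results_individual",
)


def parquet_path(config: str) -> str:
    """Return the repo-relative Parquet file for a config, per the YAML header."""
    if config not in CONFIGS:
        raise ValueError(f"unknown config: {config!r}")
    return f"{config}/train.parquet"


def load_config(config: str, streaming: bool = False):
    """Load one config from the Hub (needs network access and `datasets`)."""
    # Imported lazily so the helpers above work without the library installed.
    from datasets import load_dataset

    parquet_path(config)  # validates the config name before any download
    # Assumption: the only split is "train".
    return load_dataset(REPO_ID, config, split="train", streaming=streaming)
```

For the large `results_individual` config (~28M rows), calling `load_config("results_individual", streaming=True)` iterates over rows without downloading the whole file first.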
## Models
gherbal-v1, gherbal-v2, gherbal-v3, gherbal-v4, nllb-lid, openlid-v1, openlid-v2,
hplt-openlid-v3, fastlid-176, glotlid, franc, franc-all, franc-min, cld2, langdetect,
langid, py3langid
## Benchmarks
| Benchmark | Source |
|-----------|--------|
| flores-devtest | [openlanguagedata/flores_plus](https://huggingface.co/datasets/openlanguagedata/flores_plus) (devtest split) |
| flores-dev | [openlanguagedata/flores_plus](https://huggingface.co/datasets/openlanguagedata/flores_plus) (dev split) |
| madar | [Madar](https://camel.abudhabi.nyu.edu/madar-parallel-corpus) |
| gherbal-multi | [sawalni-ai/gherbal-multi](https://huggingface.co/datasets/sawalni-ai/gherbal-multi) |
| atlasia-lid | [atlasia/Arabic-LID-Leaderboard](https://huggingface.co/datasets/atlasia/Arabic-LID-Leaderboard) |
| wili-2018 | [wili_2018](https://huggingface.co/datasets/wili_2018) |
| commonlid | [commoncrawl/CommonLID](https://huggingface.co/datasets/commoncrawl/CommonLID) |
| bouquet | [facebook/bouquet](https://huggingface.co/datasets/facebook/bouquet) |
## Methodology
All predictions are normalized to **ISO 639-3 + Script** (ISO 15924) codes using [babelcode](https://github.com/omneity-labs/babelcode).
Metrics: accuracy, macro-F1, weighted-F1, precision, and recall, each computed under multiple scopes (full, self, v1–v4).
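
To make the metric definitions concrete, here is a small pure-Python sketch of accuracy, macro-F1, and weighted-F1 over normalized labels. It is not the benchmark's own evaluation code. It assumes labels are written as `eng_Latn`-style strings (the exact separator is an assumption) and that per-label averages run over every label seen in either the gold or the predicted list, which matches scikit-learn's default behavior. The function name `lid_metrics` is hypothetical, and precision and recall are omitted for brevity.

```python
from collections import Counter


def lid_metrics(gold, pred):
    """Compute accuracy, macro-F1 and weighted-F1 for two equal-length label lists."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and the same length")

    support = Counter(gold)    # gold count per label
    predicted = Counter(pred)  # predicted count per label
    hits = Counter(g for g, p in zip(gold, pred) if g == p)  # true positives

    # Assumption: average over labels seen in gold or pred.
    labels = sorted(set(support) | set(predicted))

    f1 = {}
    for label in labels:
        tp = hits[label]
        fp = predicted[label] - tp
        fn = support[label] - tp
        denom = 2 * tp + fp + fn
        f1[label] = 2 * tp / denom if denom else 0.0

    n = len(gold)
    return {
        "accuracy": sum(hits.values()) / n,
        "macro_f1": sum(f1.values()) / len(labels),
        "weighted_f1": sum(f1[label] * support[label] for label in labels) / n,
    }


# Illustrative labels only; these are not rows from the dataset.
gold = ["eng_Latn", "eng_Latn", "fra_Latn", "ary_Arab"]
pred = ["eng_Latn", "fra_Latn", "fra_Latn", "arb_Arab"]
scores = lid_metrics(gold, pred)
```

Macro-F1 weights every label equally, so a model that misses low-resource languages is penalized more than under weighted-F1, which scales each label's F1 by its number of gold samples.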
## Interactive App
Explore results interactively: [LID Benchmark Leaderboard](https://huggingface.co/spaces/omneity-labs/lid-benchmark)
## Citation
If you use this benchmark data in your research, please reference:
- Omneity Labs LID Benchmark: https://huggingface.co/datasets/omneity-labs/lid-benchmark
- Gherbal model: https://www.omneitylabs.com/models/gherbal
- Evaluation benchmarks: See individual benchmark datasets linked above.
## Author
- [Omar Kamali](https://omarkamali.com)
- [Omneity Labs](https://omneitylabs.com)
## License
The evaluation results in this dataset are released under Apache 2.0. The underlying benchmark datasets retain their original licenses.
Provided by:
omneity-labs



