
omneity-labs/lid-benchmark

Source: Hugging Face | Updated: 2026-04-05 | Indexed: 2026-03-29
Download link: https://hf-mirror.com/datasets/omneity-labs/lid-benchmark
Resource description:
---
license: apache-2.0
task_categories:
- text-classification
language:
- multilingual
tags:
- language-identification
- lid
- benchmark
- evaluation
size_categories:
- 10M<n<100M
configs:
- config_name: results_summary
  data_files: results_summary/train.parquet
- config_name: results_aggregate
  data_files: results_aggregate/train.parquet
- config_name: results_per_language
  data_files: results_per_language/train.parquet
- config_name: results_speed
  data_files: results_speed/train.parquet
- config_name: model_languages
  data_files: model_languages/train.parquet
- config_name: results_individual
  data_files: results_individual/train.parquet
---

# LID Benchmark

Comprehensive evaluation of **17 language identification models** across **8 diverse benchmarks**. Built by [Omneity Labs](https://www.omneitylabs.com).

## Subsets

| Config | Description | Rows |
|--------|-------------|------|
| `results_summary` | One row per model × benchmark × scope with aggregate metrics | ~136 |
| `results_aggregate` | Detailed aggregate metrics per model × benchmark × scope | ~816 |
| `results_per_language` | Per-language accuracy for every model × benchmark × scope | ~57k |
| `results_speed` | Inference speed (samples/sec) per model × benchmark | ~136 |
| `model_languages` | Supported language codes declared by each model | ~4.7k |
| `results_individual` | Every individual prediction (model × benchmark × sample) | ~28M |

## Models

gherbal-v1, gherbal-v2, gherbal-v3, gherbal-v4, nllb-lid, openlid-v1, openlid-v2, hplt-openlid-v3, fastlid-176, glotlid, franc, franc-all, franc-min, cld2, langdetect, langid, py3langid

## Benchmarks

| Benchmark | Source |
|-----------|--------|
| flores-devtest | [openlanguagedata/flores_plus](https://huggingface.co/datasets/openlanguagedata/flores_plus) (devtest split) |
| flores-dev | [openlanguagedata/flores_plus](https://huggingface.co/datasets/openlanguagedata/flores_plus) (dev split) |
| madar | [Madar](https://camel.abudhabi.nyu.edu/madar-parallel-corpus) |
| gherbal-multi | [sawalni-ai/gherbal-multi](https://huggingface.co/datasets/sawalni-ai/gherbal-multi) |
| atlasia-lid | [atlasia/Arabic-LID-Leaderboard](https://huggingface.co/datasets/atlasia/Arabic-LID-Leaderboard) |
| wili-2018 | [wili_2018](https://huggingface.co/datasets/wili_2018) |
| commonlid | [commoncrawl/CommonLID](https://huggingface.co/datasets/commoncrawl/CommonLID) |
| bouquet | [facebook/bouquet](https://huggingface.co/datasets/facebook/bouquet) |

## Methodology

All predictions are normalized to **ISO 639-3 + Script** (ISO 15924) codes using [babelcode](https://github.com/omneity-labs/babelcode).

Metrics: accuracy, macro-F1, weighted-F1, precision, and recall, computed under multiple scopes (full, self, v1–v4).

## Interactive App

Explore results interactively: [LID Benchmark Leaderboard](https://huggingface.co/spaces/omneity-labs/lid-benchmark)

## Citation

If you use this benchmark data in your research, please reference:

- Omneity Labs LID Benchmark: https://huggingface.co/datasets/omneity-labs/lid-benchmark
- Gherbal model: https://www.omneitylabs.com/models/gherbal
- Evaluation benchmarks: See individual benchmark datasets linked above.

## Author

- [Omar Kamali](https://omarkamali.com)
- [Omneity Labs](https://omneitylabs.com)

## License

The evaluation results in this dataset are released under Apache 2.0. The underlying benchmark datasets retain their original licenses.
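As an illustration of the metrics named under Methodology, the following is a minimal sketch of how accuracy, macro-F1, and weighted-F1 could be computed from gold and predicted labels. It is not the benchmark's own evaluation code. The function name `lid_metrics`, the `lang_Script` label format (for example `arb_Arab`), and the choice to average F1 over gold labels only are assumptions made for this example.

```python
from collections import Counter


def lid_metrics(gold, pred):
    """Return accuracy, macro-F1 and weighted-F1 for paired label lists.

    Hypothetical helper: labels are assumed to be already normalized
    to a common code format such as "arb_Arab".
    """
    if not gold or len(gold) != len(pred):
        raise ValueError("gold and pred must be non-empty and equal in length")

    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label p was wrong
            fn[g] += 1  # gold label g was missed

    support = Counter(gold)
    f1 = {}
    # Assumption: average over labels that occur in the gold data only.
    for label in support:
        denom = 2 * tp[label] + fp[label] + fn[label]
        f1[label] = 2 * tp[label] / denom if denom else 0.0

    n = len(gold)
    return {
        "accuracy": sum(tp.values()) / n,
        "macro_f1": sum(f1.values()) / len(f1),
        "weighted_f1": sum(f1[label] * support[label] for label in f1) / n,
    }


if __name__ == "__main__":
    gold = ["arb_Arab", "arb_Arab", "ary_Arab", "fra_Latn"]
    pred = ["arb_Arab", "ary_Arab", "ary_Arab", "fra_Latn"]
    print(lid_metrics(gold, pred))
```

Macro-F1 weights every language equally, so it is sensitive to errors on low-resource languages, while weighted-F1 scales each language's F1 by its number of gold samples. Libraries that average over the union of gold and predicted labels can report a different macro-F1 than this sketch.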
Provided by: omneity-labs