mteb/AfriXNLI

Source: Hugging Face. Updated 2026-04-19; indexed 2026-04-26.
Download link: https://hf-mirror.com/datasets/mteb/AfriXNLI
---
annotations_creators:
- human-annotated
language:
- amh
- eng
- ewe
- fra
- hau
- ibo
- kin
- lin
- lug
- orm
- sna
- sot
- swa
- twi
- wol
- xho
- yor
- zul
license: cc-by-4.0
multilinguality: multilingual
source_datasets:
- masakhane/afrixnli
task_categories:
- text-classification
task_ids:
- natural-language-inference
dataset_info:
- config_name: amh
  features:
  - name: sentence1
    dtype: string
  - name: sentence2
    dtype: string
  - name: labels
    dtype: int64
  splits:
  - name: validation
    num_bytes: 82220.0
    num_examples: 300
  - name: test
    num_bytes: 108543.0
    num_examples: 400
  download_size: 74147
  dataset_size: 190763.0
- config_name: eng
  features:
  - name: sentence1
    dtype: string
  - name: sentence2
    dtype: string
  - name: labels
    dtype: int64
  splits:
  - name: validation
    num_bytes: 52146.0
    num_examples: 300
  - name: test
    num_bytes: 66098.0
    num_examples: 400
  download_size: 57599
  dataset_size: 118244.0
- config_name: ewe
  features:
  - name: sentence1
    dtype: string
  - name: sentence2
    dtype: string
  - name: labels
    dtype: int64
  splits:
  - name: validation
    num_bytes: 50618.0
    num_examples: 300
  - name: test
    num_bytes: 65034.0075
    num_examples: 399
  download_size: 54043
  dataset_size: 115652.0075
- config_name: fra
  features:
  - name: sentence1
    dtype: string
  - name: sentence2
    dtype: string
  - name: labels
    dtype: int64
  splits:
  - name: validation
    num_bytes: 61306.0
    num_examples: 300
  - name: test
    num_bytes: 77694.0
    num_examples: 400
  download_size: 65749
  dataset_size: 139000.0
- config_name: gaz
  features:
  - name: sentence1
    dtype: string
  - name: sentence2
    dtype: string
  - name: labels
    dtype: int64
  splits:
  - name: validation
    num_bytes: 60860.0
    num_examples: 300
  - name: test
    num_bytes: 77965.0
    num_examples: 400
  download_size: 62309
  dataset_size: 138825.0
- config_name: hau
  features:
  - name: sentence1
    dtype: string
  - name: sentence2
    dtype: string
  - name: labels
    dtype: int64
  splits:
  - name: validation
    num_bytes: 52198.0
    num_examples: 300
  - name: test
    num_bytes: 68341.0
    num_examples: 400
  download_size: 56601
  dataset_size: 120539.0
- config_name: ibo
  features:
  - name: sentence1
    dtype: string
  - name: sentence2
    dtype: string
  - name: labels
    dtype: int64
  splits:
  - name: validation
    num_bytes: 62998.0
    num_examples: 300
  - name: test
    num_bytes: 80525.0
    num_examples: 400
  download_size: 59789
  dataset_size: 143523.0
- config_name: kin
  features:
  - name: sentence1
    dtype: string
  - name: sentence2
    dtype: string
  - name: labels
    dtype: int64
  splits:
  - name: validation
    num_bytes: 50611.0
    num_examples: 300
  - name: test
    num_bytes: 64702.0
    num_examples: 400
  download_size: 56419
  dataset_size: 115313.0
- config_name: lin
  features:
  - name: sentence1
    dtype: string
  - name: sentence2
    dtype: string
  - name: labels
    dtype: int64
  splits:
  - name: validation
    num_bytes: 55425.0
    num_examples: 300
  - name: test
    num_bytes: 60450.495
    num_examples: 399
  download_size: 56219
  dataset_size: 115875.495
- config_name: lug
  features:
  - name: sentence1
    dtype: string
  - name: sentence2
    dtype: string
  - name: labels
    dtype: int64
  splits:
  - name: validation
    num_bytes: 59827.0
    num_examples: 300
  - name: test
    num_bytes: 78305.0
    num_examples: 400
  download_size: 64424
  dataset_size: 138132.0
- config_name: sna
  features:
  - name: sentence1
    dtype: string
  - name: sentence2
    dtype: string
  - name: labels
    dtype: int64
  splits:
  - name: validation
    num_bytes: 55677.0
    num_examples: 300
  - name: test
    num_bytes: 71533.0
    num_examples: 400
  download_size: 59859
  dataset_size: 127210.0
- config_name: sot
  features:
  - name: sentence1
    dtype: string
  - name: sentence2
    dtype: string
  - name: labels
    dtype: int64
  splits:
  - name: validation
    num_bytes: 55290.0
    num_examples: 300
  - name: test
    num_bytes: 70199.0
    num_examples: 400
  download_size: 56651
  dataset_size: 125489.0
- config_name: swh
  features:
  - name: sentence1
    dtype: string
  - name: sentence2
    dtype: string
  - name: labels
    dtype: int64
  splits:
  - name: validation
    num_bytes: 51679.0
    num_examples: 300
  - name: test
    num_bytes: 66611.0
    num_examples: 400
  download_size: 56866
  dataset_size: 118290.0
- config_name: twi
  features:
  - name: sentence1
    dtype: string
  - name: sentence2
    dtype: string
  - name: labels
    dtype: int64
  splits:
  - name: validation
    num_bytes: 56804.0
    num_examples: 300
  - name: test
    num_bytes: 69996.0
    num_examples: 400
  download_size: 56630
  dataset_size: 126800.0
- config_name: wol
  features:
  - name: sentence1
    dtype: string
  - name: sentence2
    dtype: string
  - name: labels
    dtype: int64
  splits:
  - name: validation
    num_bytes: 55466.0
    num_examples: 300
  - name: test
    num_bytes: 70362.0
    num_examples: 400
  download_size: 61561
  dataset_size: 125828.0
- config_name: xho
  features:
  - name: sentence1
    dtype: string
  - name: sentence2
    dtype: string
  - name: labels
    dtype: int64
  splits:
  - name: validation
    num_bytes: 55524.0
    num_examples: 300
  - name: test
    num_bytes: 71000.0
    num_examples: 400
  download_size: 60993
  dataset_size: 126524.0
- config_name: yor
  features:
  - name: sentence1
    dtype: string
  - name: sentence2
    dtype: string
  - name: labels
    dtype: int64
  splits:
  - name: validation
    num_bytes: 68968.0
    num_examples: 300
  - name: test
    num_bytes: 88506.0
    num_examples: 400
  download_size: 62544
  dataset_size: 157474.0
- config_name: zul
  features:
  - name: sentence1
    dtype: string
  - name: sentence2
    dtype: string
  - name: labels
    dtype: int64
  splits:
  - name: validation
    num_bytes: 54169.83
    num_examples: 299
  - name: test
    num_bytes: 68975.13
    num_examples: 399
  download_size: 59598
  dataset_size: 123144.96
configs:
- config_name: amh
  data_files:
  - split: validation
    path: amh/validation-*
  - split: test
    path: amh/test-*
- config_name: eng
  data_files:
  - split: validation
    path: eng/validation-*
  - split: test
    path: eng/test-*
- config_name: ewe
  data_files:
  - split: validation
    path: ewe/validation-*
  - split: test
    path: ewe/test-*
- config_name: fra
  data_files:
  - split: validation
    path: fra/validation-*
  - split: test
    path: fra/test-*
- config_name: gaz
  data_files:
  - split: validation
    path: gaz/validation-*
  - split: test
    path: gaz/test-*
- config_name: hau
  data_files:
  - split: validation
    path: hau/validation-*
  - split: test
    path: hau/test-*
- config_name: ibo
  data_files:
  - split: validation
    path: ibo/validation-*
  - split: test
    path: ibo/test-*
- config_name: kin
  data_files:
  - split: validation
    path: kin/validation-*
  - split: test
    path: kin/test-*
- config_name: lin
  data_files:
  - split: validation
    path: lin/validation-*
  - split: test
    path: lin/test-*
- config_name: lug
  data_files:
  - split: validation
    path: lug/validation-*
  - split: test
    path: lug/test-*
- config_name: sna
  data_files:
  - split: validation
    path: sna/validation-*
  - split: test
    path: sna/test-*
- config_name: sot
  data_files:
  - split: validation
    path: sot/validation-*
  - split: test
    path: sot/test-*
- config_name: swh
  data_files:
  - split: validation
    path: swh/validation-*
  - split: test
    path: swh/test-*
- config_name: twi
  data_files:
  - split: validation
    path: twi/validation-*
  - split: test
    path: twi/test-*
- config_name: wol
  data_files:
  - split: validation
    path: wol/validation-*
  - split: test
    path: wol/test-*
- config_name: xho
  data_files:
  - split: validation
    path: xho/validation-*
  - split: test
    path: xho/test-*
- config_name: yor
  data_files:
  - split: validation
    path: yor/validation-*
  - split: test
    path: yor/test-*
- config_name: zul
  data_files:
  - split: validation
    path: zul/validation-*
  - split: test
    path: zul/test-*
tags:
- mteb
- text
---
<!-- adapted from https://github.com/huggingface/huggingface_hub/blob/v0.30.2/src/huggingface_hub/templates/datasetcard_template.md -->

<div align="center" style="padding: 40px 20px; background-color: white; border-radius: 12px; box-shadow: 0 2px 10px rgba(0, 0, 0, 0.05); max-width: 600px; margin: 0 auto;">
  <h1 style="font-size: 3.5rem; color: #1a1a1a; margin: 0 0 20px 0; letter-spacing: 2px; font-weight: 700;">AfriXNLI</h1>
  <div style="font-size: 1.5rem; color: #4a4a4a; margin-bottom: 5px; font-weight: 300;">An <a href="https://github.com/embeddings-benchmark/mteb" style="color: #2c5282; font-weight: 600; text-decoration: none;" onmouseover="this.style.textDecoration='underline'"
onmouseout="this.style.textDecoration='none'">MTEB</a> dataset</div>
  <div style="font-size: 0.9rem; color: #2c5282; margin-top: 10px;">Massive Text Embedding Benchmark</div>
</div>

Cross-lingual natural language inference dataset focusing on African languages.

|               |                                             |
|---------------|---------------------------------------------|
| Task category | t2t                                         |
| Domains       | News, Written                               |
| Reference     | https://github.com/masakhane-io/afri-xnli   |

Source datasets:
- [masakhane/afrixnli](https://huggingface.co/datasets/masakhane/afrixnli)

## Dataset Preparation in MTEB

This repository is a staging copy of `masakhane/afrixnli` for MTEB. The intended long-term canonical benchmark copy is `mteb/AfriXNLI`.

### Transformations

- Filtered the source labels to contradiction and entailment only (`label in {0, 2}`)
- Renamed `premise` -> `sentence1` and `hypothesis` -> `sentence2`
- Mapped labels to the binary pair-classification convention used by MTEB
- Preserved the MTEB-facing subset names, including `gaz` and `swh`, while sourcing from the original Hub configs
- Removed empty pairs, duplicate pairs, and pair-level label conflicts in the staging copy

### Label Schema

- `0`: contradiction
- `1`: entailment

### Splits and subsets

- Language-specific configs are preserved from the benchmark task
- Each config contains the transformed pair-classification validation/test splits used by MTEB

## How to evaluate on this task

You can evaluate an embedding model on this dataset using the following code:

```python
import mteb

task = mteb.get_task("AfriXNLI")
evaluator = mteb.MTEB([task])

# YOUR_MODEL is a placeholder: replace it with the name of the model to evaluate.
model = mteb.get_model(YOUR_MODEL)
evaluator.run(model)
```

<!-- Datasets want link to arxiv in readme to autolink dataset with paper -->
To learn more about how to run models on `mteb` tasks, check out the [GitHub repository](https://github.com/embeddings-benchmark/mteb).

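The steps listed under "Transformations" can be approximated with a small pure-Python function. This is a minimal sketch, not the actual MTEB preparation script: the function name `to_pair_classification` and the example rows are hypothetical, and the code assumes the source uses the XNLI-style label encoding (0 = entailment, 1 = neutral, 2 = contradiction). How the real pipeline handles whitespace or picks among duplicates may differ.

```python
from collections import defaultdict

# Assumption: XNLI-style source labels (0 = entailment, 1 = neutral, 2 = contradiction).
# Target schema from this card: 0 = contradiction, 1 = entailment.
SOURCE_TO_BINARY = {0: 1, 2: 0}


def to_pair_classification(rows):
    """Convert NLI rows (premise, hypothesis, label) into binary sentence pairs."""
    labels_by_pair = defaultdict(set)
    for row in rows:
        if row["label"] not in SOURCE_TO_BINARY:
            continue  # drop neutral examples
        s1, s2 = row["premise"], row["hypothesis"]
        if not s1.strip() or not s2.strip():
            continue  # drop pairs with an empty side
        # Grouping by pair removes duplicates and exposes label conflicts.
        labels_by_pair[(s1, s2)].add(SOURCE_TO_BINARY[row["label"]])

    # Keep one row per pair and skip pairs that carry conflicting labels.
    return [
        {"sentence1": s1, "sentence2": s2, "labels": next(iter(labels))}
        for (s1, s2), labels in labels_by_pair.items()
        if len(labels) == 1
    ]


# Hypothetical rows covering each filtering rule.
example_rows = [
    {"premise": "A man is eating.", "hypothesis": "Someone is eating.", "label": 0},
    {"premise": "A man is eating.", "hypothesis": "He likes rice.", "label": 1},
    {"premise": "A man is eating.", "hypothesis": "Nobody is eating.", "label": 2},
    {"premise": "A man is eating.", "hypothesis": "Someone is eating.", "label": 0},
    {"premise": "", "hypothesis": "Nothing here.", "label": 0},
    {"premise": "It rains.", "hypothesis": "It is wet.", "label": 0},
    {"premise": "It rains.", "hypothesis": "It is wet.", "label": 2},
]

converted = to_pair_classification(example_rows)
```

In this sketch the neutral row, the repeated pair, the empty pair, and the pair with conflicting labels are all dropped, which leaves one entailment pair and one contradiction pair.
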
## Citation

If you use this dataset, please cite the dataset as well as [mteb](https://github.com/embeddings-benchmark/mteb), as this dataset likely includes additional processing as a part of the [MMTEB Contribution](https://github.com/embeddings-benchmark/mteb/tree/main/docs/mmteb).

```bibtex
@article{enevoldsen2025mmtebmassivemultilingualtext,
  title={MMTEB: Massive Multilingual Text Embedding Benchmark},
  author={Kenneth Enevoldsen and Isaac Chung and Imene Kerboua and Márton Kardos and Ashwin Mathur and David Stap and Jay Gala and Wissam Siblini and Dominik Krzemiński and Genta Indra Winata and Saba Sturua and Saiteja Utpala and Mathieu Ciancone and Marion Schaeffer and Gabriel Sequeira and Diganta Misra and Shreeya Dhakal and Jonathan Rystrøm and Roman Solomatin and Ömer Çağatan and Akash Kundu and Martin Bernstorff and Shitao Xiao and Akshita Sukhlecha and Bhavish Pahwa and Rafał Poświata and Kranthi Kiran GV and Shawon Ashraf and Daniel Auras and Björn Plüster and Jan Philipp Harries and Loïc Magne and Isabelle Mohr and Mariya Hendriksen and Dawei Zhu and Hippolyte Gisserot-Boukhlef and Tom Aarsen and Jan Kostkan and Konrad Wojtasik and Taemin Lee and Marek Šuppa and Crystina Zhang and Roberta Rocca and Mohammed Hamdy and Andrianos Michail and John Yang and Manuel Faysse and Aleksei Vatolin and Nandan Thakur and Manan Dey and Dipam Vasani and Pranjal Chitale and Simone Tedeschi and Nguyen Tai and Artem Snegirev and Michael Günther and Mengzhou Xia and Weijia Shi and Xing Han Lù and Jordan Clive and Gayatri Krishnakumar and Anna Maksimova and Silvan Wehrli and Maria Tikhonova and Henil Panchal and Aleksandr Abramov and Malte Ostendorff and Zheng Liu and Simon Clematide and Lester James Miranda and Alena Fenogenova and Guangyu Song and Ruqiya Bin Safi and Wen-Ding Li and Alessia Borghini and Federico Cassano and Hongjin Su and Jimmy Lin and Howard Yen and Lasse Hansen and Sara Hooker and Chenghao Xiao and Vaibhav Adlakha and Orion Weller and Siva Reddy and Niklas Muennighoff},
  publisher = {arXiv},
  journal={arXiv preprint arXiv:2502.13595},
  year={2025},
  url={https://arxiv.org/abs/2502.13595},
  doi = {10.48550/arXiv.2502.13595},
}

@article{muennighoff2022mteb,
  author = {Muennighoff, Niklas and Tazi, Nouamane and Magne, Loïc and Reimers, Nils},
  title = {MTEB: Massive Text Embedding Benchmark},
  publisher = {arXiv},
  journal={arXiv preprint arXiv:2210.07316},
  year = {2022},
  url = {https://arxiv.org/abs/2210.07316},
  doi = {10.48550/ARXIV.2210.07316},
}
```

# Dataset Statistics
<details>
  <summary> Dataset Statistics</summary>

The following code contains the descriptive statistics from the task. These can also be obtained using:

```python
import mteb

task = mteb.get_task("AfriXNLI")

desc_stats = task.metadata.descriptive_stats
```

```json
{
    "test": {
        "num_samples": 7200,
        "unique_pairs": 7197,
        "number_of_characters": 1092291,
        "text1_statistics": {
            "total_text_length": 707366,
            "min_text_length": 14,
            "average_text_length": 98.24527777777777,
            "max_text_length": 324,
            "unique_texts": 3600
        },
        "image1_statistics": null,
        "audio1_statistics": null,
        "text2_statistics": {
            "total_text_length": 384925,
            "min_text_length": 10,
            "average_text_length": 53.46180555555556,
            "max_text_length": 230,
            "unique_texts": 7196
        },
        "image2_statistics": null,
        "audio2_statistics": null,
        "labels_statistics": {
            "min_labels_per_text": 1,
            "average_label_per_text": 1.0,
            "max_labels_per_text": 1,
            "unique_labels": 2,
            "labels": {
                "0": {
                    "count": 3600
                },
                "1": {
                    "count": 3600
                }
            }
        }
    }
}
```

</details>

---

*This dataset card was automatically generated using [MTEB](https://github.com/embeddings-benchmark/mteb)*
Provider: mteb