
INK-USC/xcsr

Hugging Face · Updated 2024-01-04 · Indexed 2024-05-25
Download link:
https://hf-mirror.com/datasets/INK-USC/xcsr
Resource overview:
--- annotations_creators: - crowdsourced language_creators: - crowdsourced - machine-generated language: - ar - de - en - es - fr - hi - it - ja - nl - pl - pt - ru - sw - ur - vi - zh license: - mit multilinguality: - multilingual size_categories: - 1K<n<10K source_datasets: - extended|codah - extended|commonsense_qa task_categories: - question-answering task_ids: - multiple-choice-qa pretty_name: X-CSR dataset_info: - config_name: X-CODAH-ar features: - name: id dtype: string - name: lang dtype: string - name: question_tag dtype: string - name: question struct: - name: stem dtype: string - name: choices sequence: - name: label dtype: string - name: text dtype: string - name: answerKey dtype: string splits: - name: test num_bytes: 568026 num_examples: 1000 - name: validation num_bytes: 165022 num_examples: 300 download_size: 265474 dataset_size: 733048 - config_name: X-CODAH-de features: - name: id dtype: string - name: lang dtype: string - name: question_tag dtype: string - name: question struct: - name: stem dtype: string - name: choices sequence: - name: label dtype: string - name: text dtype: string - name: answerKey dtype: string splits: - name: test num_bytes: 476087 num_examples: 1000 - name: validation num_bytes: 138764 num_examples: 300 download_size: 259705 dataset_size: 614851 - config_name: X-CODAH-en features: - name: id dtype: string - name: lang dtype: string - name: question_tag dtype: string - name: question struct: - name: stem dtype: string - name: choices sequence: - name: label dtype: string - name: text dtype: string - name: answerKey dtype: string splits: - name: test num_bytes: 417000 num_examples: 1000 - name: validation num_bytes: 121811 num_examples: 300 download_size: 217262 dataset_size: 538811 - config_name: X-CODAH-es features: - name: id dtype: string - name: lang dtype: string - name: question_tag dtype: string - name: question struct: - name: stem dtype: string - name: choices sequence: - name: label dtype: string - name: text 
dtype: string - name: answerKey dtype: string splits: - name: test num_bytes: 450954 num_examples: 1000 - name: validation num_bytes: 130678 num_examples: 300 download_size: 242647 dataset_size: 581632 - config_name: X-CODAH-fr features: - name: id dtype: string - name: lang dtype: string - name: question_tag dtype: string - name: question struct: - name: stem dtype: string - name: choices sequence: - name: label dtype: string - name: text dtype: string - name: answerKey dtype: string splits: - name: test num_bytes: 477525 num_examples: 1000 - name: validation num_bytes: 137889 num_examples: 300 download_size: 244998 dataset_size: 615414 - config_name: X-CODAH-hi features: - name: id dtype: string - name: lang dtype: string - name: question_tag dtype: string - name: question struct: - name: stem dtype: string - name: choices sequence: - name: label dtype: string - name: text dtype: string - name: answerKey dtype: string splits: - name: test num_bytes: 973733 num_examples: 1000 - name: validation num_bytes: 283004 num_examples: 300 download_size: 336862 dataset_size: 1256737 - config_name: X-CODAH-it features: - name: id dtype: string - name: lang dtype: string - name: question_tag dtype: string - name: question struct: - name: stem dtype: string - name: choices sequence: - name: label dtype: string - name: text dtype: string - name: answerKey dtype: string splits: - name: test num_bytes: 457055 num_examples: 1000 - name: validation num_bytes: 133504 num_examples: 300 download_size: 241780 dataset_size: 590559 - config_name: X-CODAH-jap features: - name: id dtype: string - name: lang dtype: string - name: question_tag dtype: string - name: question struct: - name: stem dtype: string - name: choices sequence: - name: label dtype: string - name: text dtype: string - name: answerKey dtype: string splits: - name: test num_bytes: 538415 num_examples: 1000 - name: validation num_bytes: 157392 num_examples: 300 download_size: 264995 dataset_size: 695807 - config_name: 
X-CODAH-nl features: - name: id dtype: string - name: lang dtype: string - name: question_tag dtype: string - name: question struct: - name: stem dtype: string - name: choices sequence: - name: label dtype: string - name: text dtype: string - name: answerKey dtype: string splits: - name: test num_bytes: 448728 num_examples: 1000 - name: validation num_bytes: 130018 num_examples: 300 download_size: 237855 dataset_size: 578746 - config_name: X-CODAH-pl features: - name: id dtype: string - name: lang dtype: string - name: question_tag dtype: string - name: question struct: - name: stem dtype: string - name: choices sequence: - name: label dtype: string - name: text dtype: string - name: answerKey dtype: string splits: - name: test num_bytes: 438538 num_examples: 1000 - name: validation num_bytes: 127750 num_examples: 300 download_size: 254894 dataset_size: 566288 - config_name: X-CODAH-pt features: - name: id dtype: string - name: lang dtype: string - name: question_tag dtype: string - name: question struct: - name: stem dtype: string - name: choices sequence: - name: label dtype: string - name: text dtype: string - name: answerKey dtype: string splits: - name: test num_bytes: 455583 num_examples: 1000 - name: validation num_bytes: 131933 num_examples: 300 download_size: 238858 dataset_size: 587516 - config_name: X-CODAH-ru features: - name: id dtype: string - name: lang dtype: string - name: question_tag dtype: string - name: question struct: - name: stem dtype: string - name: choices sequence: - name: label dtype: string - name: text dtype: string - name: answerKey dtype: string splits: - name: test num_bytes: 674567 num_examples: 1000 - name: validation num_bytes: 193713 num_examples: 300 download_size: 314200 dataset_size: 868280 - config_name: X-CODAH-sw features: - name: id dtype: string - name: lang dtype: string - name: question_tag dtype: string - name: question struct: - name: stem dtype: string - name: choices sequence: - name: label dtype: string - name: 
text dtype: string - name: answerKey dtype: string splits: - name: test num_bytes: 423421 num_examples: 1000 - name: validation num_bytes: 124770 num_examples: 300 download_size: 214100 dataset_size: 548191 - config_name: X-CODAH-ur features: - name: id dtype: string - name: lang dtype: string - name: question_tag dtype: string - name: question struct: - name: stem dtype: string - name: choices sequence: - name: label dtype: string - name: text dtype: string - name: answerKey dtype: string splits: - name: test num_bytes: 687123 num_examples: 1000 - name: validation num_bytes: 199737 num_examples: 300 download_size: 294475 dataset_size: 886860 - config_name: X-CODAH-vi features: - name: id dtype: string - name: lang dtype: string - name: question_tag dtype: string - name: question struct: - name: stem dtype: string - name: choices sequence: - name: label dtype: string - name: text dtype: string - name: answerKey dtype: string splits: - name: test num_bytes: 543089 num_examples: 1000 - name: validation num_bytes: 156888 num_examples: 300 download_size: 251390 dataset_size: 699977 - config_name: X-CODAH-zh features: - name: id dtype: string - name: lang dtype: string - name: question_tag dtype: string - name: question struct: - name: stem dtype: string - name: choices sequence: - name: label dtype: string - name: text dtype: string - name: answerKey dtype: string splits: - name: test num_bytes: 394660 num_examples: 1000 - name: validation num_bytes: 115025 num_examples: 300 download_size: 237827 dataset_size: 509685 - config_name: X-CSQA-ar features: - name: id dtype: string - name: lang dtype: string - name: question struct: - name: stem dtype: string - name: choices sequence: - name: label dtype: string - name: text dtype: string - name: answerKey dtype: string splits: - name: test num_bytes: 288645 num_examples: 1074 - name: validation num_bytes: 273580 num_examples: 1000 download_size: 255626 dataset_size: 562225 - config_name: X-CSQA-de features: - name: id 
dtype: string - name: lang dtype: string - name: question struct: - name: stem dtype: string - name: choices sequence: - name: label dtype: string - name: text dtype: string - name: answerKey dtype: string splits: - name: test num_bytes: 234170 num_examples: 1074 - name: validation num_bytes: 222840 num_examples: 1000 download_size: 242762 dataset_size: 457010 - config_name: X-CSQA-en features: - name: id dtype: string - name: lang dtype: string - name: question struct: - name: stem dtype: string - name: choices sequence: - name: label dtype: string - name: text dtype: string - name: answerKey dtype: string splits: - name: test num_bytes: 215617 num_examples: 1074 - name: validation num_bytes: 205079 num_examples: 1000 download_size: 222677 dataset_size: 420696 - config_name: X-CSQA-es features: - name: id dtype: string - name: lang dtype: string - name: question struct: - name: stem dtype: string - name: choices sequence: - name: label dtype: string - name: text dtype: string - name: answerKey dtype: string splits: - name: test num_bytes: 236817 num_examples: 1074 - name: validation num_bytes: 224497 num_examples: 1000 download_size: 238810 dataset_size: 461314 - config_name: X-CSQA-fr features: - name: id dtype: string - name: lang dtype: string - name: question struct: - name: stem dtype: string - name: choices sequence: - name: label dtype: string - name: text dtype: string - name: answerKey dtype: string splits: - name: test num_bytes: 243952 num_examples: 1074 - name: validation num_bytes: 231396 num_examples: 1000 download_size: 244676 dataset_size: 475348 - config_name: X-CSQA-hi features: - name: id dtype: string - name: lang dtype: string - name: question struct: - name: stem dtype: string - name: choices sequence: - name: label dtype: string - name: text dtype: string - name: answerKey dtype: string splits: - name: test num_bytes: 415011 num_examples: 1074 - name: validation num_bytes: 396318 num_examples: 1000 download_size: 304090 dataset_size: 811329 
- config_name: X-CSQA-it features: - name: id dtype: string - name: lang dtype: string - name: question struct: - name: stem dtype: string - name: choices sequence: - name: label dtype: string - name: text dtype: string - name: answerKey dtype: string splits: - name: test num_bytes: 232604 num_examples: 1074 - name: validation num_bytes: 220902 num_examples: 1000 download_size: 236130 dataset_size: 453506 - config_name: X-CSQA-jap features: - name: id dtype: string - name: lang dtype: string - name: question struct: - name: stem dtype: string - name: choices sequence: - name: label dtype: string - name: text dtype: string - name: answerKey dtype: string splits: - name: test num_bytes: 250846 num_examples: 1074 - name: validation num_bytes: 240404 num_examples: 1000 download_size: 249420 dataset_size: 491250 - config_name: X-CSQA-nl features: - name: id dtype: string - name: lang dtype: string - name: question struct: - name: stem dtype: string - name: choices sequence: - name: label dtype: string - name: text dtype: string - name: answerKey dtype: string splits: - name: test num_bytes: 226949 num_examples: 1074 - name: validation num_bytes: 216194 num_examples: 1000 download_size: 231078 dataset_size: 443143 - config_name: X-CSQA-pl features: - name: id dtype: string - name: lang dtype: string - name: question struct: - name: stem dtype: string - name: choices sequence: - name: label dtype: string - name: text dtype: string - name: answerKey dtype: string splits: - name: test num_bytes: 231479 num_examples: 1074 - name: validation num_bytes: 219814 num_examples: 1000 download_size: 245829 dataset_size: 451293 - config_name: X-CSQA-pt features: - name: id dtype: string - name: lang dtype: string - name: question struct: - name: stem dtype: string - name: choices sequence: - name: label dtype: string - name: text dtype: string - name: answerKey dtype: string splits: - name: test num_bytes: 235469 num_examples: 1074 - name: validation num_bytes: 222785 num_examples: 
1000 download_size: 238902 dataset_size: 458254 - config_name: X-CSQA-ru features: - name: id dtype: string - name: lang dtype: string - name: question struct: - name: stem dtype: string - name: choices sequence: - name: label dtype: string - name: text dtype: string - name: answerKey dtype: string splits: - name: test num_bytes: 341749 num_examples: 1074 - name: validation num_bytes: 323724 num_examples: 1000 download_size: 296252 dataset_size: 665473 - config_name: X-CSQA-sw features: - name: id dtype: string - name: lang dtype: string - name: question struct: - name: stem dtype: string - name: choices sequence: - name: label dtype: string - name: text dtype: string - name: answerKey dtype: string splits: - name: test num_bytes: 222215 num_examples: 1074 - name: validation num_bytes: 211426 num_examples: 1000 download_size: 214954 dataset_size: 433641 - config_name: X-CSQA-ur features: - name: id dtype: string - name: lang dtype: string - name: question struct: - name: stem dtype: string - name: choices sequence: - name: label dtype: string - name: text dtype: string - name: answerKey dtype: string splits: - name: test num_bytes: 306129 num_examples: 1074 - name: validation num_bytes: 292001 num_examples: 1000 download_size: 267789 dataset_size: 598130 - config_name: X-CSQA-vi features: - name: id dtype: string - name: lang dtype: string - name: question struct: - name: stem dtype: string - name: choices sequence: - name: label dtype: string - name: text dtype: string - name: answerKey dtype: string splits: - name: test num_bytes: 265210 num_examples: 1074 - name: validation num_bytes: 253502 num_examples: 1000 download_size: 244641 dataset_size: 518712 - config_name: X-CSQA-zh features: - name: id dtype: string - name: lang dtype: string - name: question struct: - name: stem dtype: string - name: choices sequence: - name: label dtype: string - name: text dtype: string - name: answerKey dtype: string splits: - name: test num_bytes: 197444 num_examples: 1074 - 
name: validation num_bytes: 188273 num_examples: 1000 download_size: 207379 dataset_size: 385717 configs: - config_name: X-CODAH-ar data_files: - split: test path: X-CODAH-ar/test-* - split: validation path: X-CODAH-ar/validation-* - config_name: X-CODAH-de data_files: - split: test path: X-CODAH-de/test-* - split: validation path: X-CODAH-de/validation-* - config_name: X-CODAH-en data_files: - split: test path: X-CODAH-en/test-* - split: validation path: X-CODAH-en/validation-* - config_name: X-CODAH-es data_files: - split: test path: X-CODAH-es/test-* - split: validation path: X-CODAH-es/validation-* - config_name: X-CODAH-fr data_files: - split: test path: X-CODAH-fr/test-* - split: validation path: X-CODAH-fr/validation-* - config_name: X-CODAH-hi data_files: - split: test path: X-CODAH-hi/test-* - split: validation path: X-CODAH-hi/validation-* - config_name: X-CODAH-it data_files: - split: test path: X-CODAH-it/test-* - split: validation path: X-CODAH-it/validation-* - config_name: X-CODAH-jap data_files: - split: test path: X-CODAH-jap/test-* - split: validation path: X-CODAH-jap/validation-* - config_name: X-CODAH-nl data_files: - split: test path: X-CODAH-nl/test-* - split: validation path: X-CODAH-nl/validation-* - config_name: X-CODAH-pl data_files: - split: test path: X-CODAH-pl/test-* - split: validation path: X-CODAH-pl/validation-* - config_name: X-CODAH-pt data_files: - split: test path: X-CODAH-pt/test-* - split: validation path: X-CODAH-pt/validation-* - config_name: X-CODAH-ru data_files: - split: test path: X-CODAH-ru/test-* - split: validation path: X-CODAH-ru/validation-* - config_name: X-CODAH-sw data_files: - split: test path: X-CODAH-sw/test-* - split: validation path: X-CODAH-sw/validation-* - config_name: X-CODAH-ur data_files: - split: test path: X-CODAH-ur/test-* - split: validation path: X-CODAH-ur/validation-* - config_name: X-CODAH-vi data_files: - split: test path: X-CODAH-vi/test-* - split: validation path: X-CODAH-vi/validation-* 
- config_name: X-CODAH-zh data_files: - split: test path: X-CODAH-zh/test-* - split: validation path: X-CODAH-zh/validation-* - config_name: X-CSQA-ar data_files: - split: test path: X-CSQA-ar/test-* - split: validation path: X-CSQA-ar/validation-* - config_name: X-CSQA-de data_files: - split: test path: X-CSQA-de/test-* - split: validation path: X-CSQA-de/validation-* - config_name: X-CSQA-en data_files: - split: test path: X-CSQA-en/test-* - split: validation path: X-CSQA-en/validation-* - config_name: X-CSQA-es data_files: - split: test path: X-CSQA-es/test-* - split: validation path: X-CSQA-es/validation-* - config_name: X-CSQA-fr data_files: - split: test path: X-CSQA-fr/test-* - split: validation path: X-CSQA-fr/validation-* - config_name: X-CSQA-hi data_files: - split: test path: X-CSQA-hi/test-* - split: validation path: X-CSQA-hi/validation-* - config_name: X-CSQA-it data_files: - split: test path: X-CSQA-it/test-* - split: validation path: X-CSQA-it/validation-* - config_name: X-CSQA-jap data_files: - split: test path: X-CSQA-jap/test-* - split: validation path: X-CSQA-jap/validation-* - config_name: X-CSQA-nl data_files: - split: test path: X-CSQA-nl/test-* - split: validation path: X-CSQA-nl/validation-* - config_name: X-CSQA-pl data_files: - split: test path: X-CSQA-pl/test-* - split: validation path: X-CSQA-pl/validation-* - config_name: X-CSQA-pt data_files: - split: test path: X-CSQA-pt/test-* - split: validation path: X-CSQA-pt/validation-* - config_name: X-CSQA-ru data_files: - split: test path: X-CSQA-ru/test-* - split: validation path: X-CSQA-ru/validation-* - config_name: X-CSQA-sw data_files: - split: test path: X-CSQA-sw/test-* - split: validation path: X-CSQA-sw/validation-* - config_name: X-CSQA-ur data_files: - split: test path: X-CSQA-ur/test-* - split: validation path: X-CSQA-ur/validation-* - config_name: X-CSQA-vi data_files: - split: test path: X-CSQA-vi/test-* - split: validation path: X-CSQA-vi/validation-* - config_name: X-CSQA-zh 
data_files: - split: test path: X-CSQA-zh/test-* - split: validation path: X-CSQA-zh/validation-* ---

# Dataset Card for X-CSR

## Table of Contents
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-instances)
  - [Data Splits](#data-instances)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)

## Dataset Description

- **Homepage:** https://inklab.usc.edu//XCSR/
- **Repository:** https://github.com/INK-USC/XCSR
- **Paper:** https://arxiv.org/abs/2106.06937
- **Leaderboard:** https://inklab.usc.edu//XCSR/leaderboard
- **Point of Contact:** https://yuchenlin.xyz/

### Dataset Summary

To evaluate multi-lingual language models (ML-LMs) for commonsense reasoning in a cross-lingual zero-shot transfer setting (X-CSR), i.e., training in English and test in other languages, we create two benchmark datasets, namely X-CSQA and X-CODAH. Specifically, we automatically translate the original CSQA and CODAH datasets, which only have English versions, to 15 other languages, forming development and test sets for studying X-CSR. As our goal is to evaluate different ML-LMs in a unified evaluation protocol for X-CSR, we argue that such translated examples, although might contain noise, can serve as a starting benchmark for us to obtain meaningful analysis, before more human-translated datasets will be available in the future.

### Supported Tasks and Leaderboards

https://inklab.usc.edu//XCSR/leaderboard

### Languages

The total 16 languages for X-CSR: {en, zh, de, es, fr, it, jap, nl, pl, pt, ru, ar, vi, hi, sw, ur}.

## Dataset Structure

### Data Instances

An example of the X-CSQA dataset:

```
{
  "id": "be1920f7ba5454ad",  # an id shared by all languages
  "lang": "en",  # one of the 16 language codes.
  "question": {
    "stem": "What will happen to your knowledge with more learning?",  # question text
    "choices": [
      {"label": "A", "text": "headaches" },
      {"label": "B", "text": "bigger brain" },
      {"label": "C", "text": "education" },
      {"label": "D", "text": "growth" },
      {"label": "E", "text": "knowing more" }
    ]
  },
  "answerKey": "D"  # hidden for test data.
}
```

An example of the X-CODAH dataset:

```
{
  "id": "b8eeef4a823fcd4b",  # an id shared by all languages
  "lang": "en",  # one of the 16 language codes.
  "question_tag": "o",  # one of 6 question types
  "question": {
    "stem": " ",  # always a blank as a dummy question
    "choices": [
      {"label": "A", "text": "Jennifer loves her school very much, she plans to drop every courses."},
      {"label": "B", "text": "Jennifer loves her school very much, she is never absent even when she's sick."},
      {"label": "C", "text": "Jennifer loves her school very much, she wants to get a part-time job."},
      {"label": "D", "text": "Jennifer loves her school very much, she quits school happily."}
    ]
  },
  "answerKey": "B"  # hidden for test data.
}
```

### Data Fields

- id: an id shared by all languages
- lang: one of the 16 language codes.
- question_tag: one of 6 question types
- stem: always a blank as a dummy question
- choices: a list of answers, each answer has:
  - label: a string answer identifier for each answer
  - text: the answer text

### Data Splits

- X-CSQA: There are 8,888 examples for training in English, 1,000 for development in each language, and 1,074 examples for testing in each language.
- X-CODAH: There are 8,476 examples for training in English, 300 for development in each language, and 1,000 examples for testing in each language.

## Dataset Creation

### Curation Rationale

To evaluate multi-lingual language models (ML-LMs) for commonsense reasoning in a cross-lingual zero-shot transfer setting (X-CSR), i.e., training in English and test in other languages, we create two benchmark datasets, namely X-CSQA and X-CODAH.

The details of the dataset construction, especially the translation procedures, can be found in section A of the appendix of the [paper](https://inklab.usc.edu//XCSR/XCSR_paper.pdf).

### Source Data

#### Initial Data Collection and Normalization

[Needs More Information]

#### Who are the source language producers?

[Needs More Information]

### Annotations

#### Annotation process

[Needs More Information]

#### Who are the annotators?

[Needs More Information]

### Personal and Sensitive Information

[Needs More Information]

## Considerations for Using the Data

### Social Impact of Dataset

[Needs More Information]

### Discussion of Biases

[Needs More Information]

### Other Known Limitations

[Needs More Information]

## Additional Information

### Dataset Curators

[Needs More Information]

### Licensing Information

[Needs More Information]

### Citation Information

```
# X-CSR
@inproceedings{lin-etal-2021-common,
    title = "Common Sense Beyond {E}nglish: Evaluating and Improving Multilingual Language Models for Commonsense Reasoning",
    author = "Lin, Bill Yuchen and Lee, Seyeon and Qiao, Xiaoyang and Ren, Xiang",
    booktitle = "Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)",
    month = aug,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.acl-long.102",
    doi = "10.18653/v1/2021.acl-long.102",
    pages = "1274--1287",
    abstract = "Commonsense reasoning research has so far been limited to English. We aim to evaluate and improve popular multilingual language models (ML-LMs) to help advance commonsense reasoning (CSR) beyond English. We collect the Mickey corpus, consisting of 561k sentences in 11 different languages, which can be used for analyzing and improving ML-LMs. We propose Mickey Probe, a language-general probing task for fairly evaluating the common sense of popular ML-LMs across different languages. In addition, we also create two new datasets, X-CSQA and X-CODAH, by translating their English versions to 14 other languages, so that we can evaluate popular ML-LMs for cross-lingual commonsense reasoning. To improve the performance beyond English, we propose a simple yet effective method {---} multilingual contrastive pretraining (MCP). It significantly enhances sentence representations, yielding a large performance gain on both benchmarks (e.g., +2.7{\%} accuracy for X-CSQA over XLM-R{\_}L).",
}

# CSQA
@inproceedings{Talmor2019commonsenseqaaq,
    address = {Minneapolis, Minnesota},
    author = {Talmor, Alon and Herzig, Jonathan and Lourie, Nicholas and Berant, Jonathan},
    booktitle = {Proceedings of the 2019 Conference of the North {A}merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)},
    doi = {10.18653/v1/N19-1421},
    pages = {4149--4158},
    publisher = {Association for Computational Linguistics},
    title = {CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge},
    url = {https://www.aclweb.org/anthology/N19-1421},
    year = {2019}
}

# CODAH
@inproceedings{Chen2019CODAHAA,
    address = {Minneapolis, USA},
    author = {Chen, Michael and D{'}Arcy, Mike and Liu, Alisa and Fernandez, Jared and Downey, Doug},
    booktitle = {Proceedings of the 3rd Workshop on Evaluating Vector Space Representations for {NLP}},
    doi = {10.18653/v1/W19-2008},
    pages = {63--69},
    publisher = {Association for Computational Linguistics},
    title = {CODAH: An Adversarially-Authored Question Answering Dataset for Common Sense},
    url = {https://www.aclweb.org/anthology/W19-2008},
    year = {2019}
}
```

### Contributions

Thanks to [Bill Yuchen Lin](https://yuchenlin.xyz/), [Seyeon Lee](https://seyeon-lee.github.io/), [Xiaoyang Qiao](https://www.linkedin.com/in/xiaoyang-qiao/), [Xiang Ren](http://www-bcf.usc.edu/~xiangren/) for adding this dataset.
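The record layout shown under Data Instances in the card lends itself to a simple multiple-choice evaluation loop. The following is a minimal sketch, assuming the list-of-dicts `choices` layout from the card's JSON example; the helper names `render_prompt` and `accuracy` are hypothetical and are not part of the X-CSR repository.

```python
from typing import Dict, List

# Record copied from the X-CSQA example in the dataset card.
EXAMPLE = {
    "id": "be1920f7ba5454ad",
    "lang": "en",
    "question": {
        "stem": "What will happen to your knowledge with more learning?",
        "choices": [
            {"label": "A", "text": "headaches"},
            {"label": "B", "text": "bigger brain"},
            {"label": "C", "text": "education"},
            {"label": "D", "text": "growth"},
            {"label": "E", "text": "knowing more"},
        ],
    },
    "answerKey": "D",
}


def render_prompt(record: Dict) -> str:
    """Render the stem (if non-blank) and the labelled choices as plain text.

    The "A. text" layout is an illustrative choice, not a format defined by X-CSR.
    """
    question = record["question"]
    stem = question["stem"].strip()
    lines = [stem] if stem else []  # X-CODAH stems are blank, so they are skipped
    lines += [f'{c["label"]}. {c["text"]}' for c in question["choices"]]
    return "\n".join(lines)


def accuracy(records: List[Dict], predictions: List[str]) -> float:
    """Fraction of records whose predicted label equals the record's answerKey."""
    if len(records) != len(predictions):
        raise ValueError("records and predictions must have the same length")
    if not records:
        return 0.0
    correct = sum(r["answerKey"] == p for r, p in zip(records, predictions))
    return correct / len(records)
```

Because the card states that `answerKey` is hidden for test data, a function like `accuracy` is only meaningful on the validation split; test predictions would presumably be scored through the leaderboard instead.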
Provider:
INK-USC
Summary of original information

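Dataset overview

Basic information

  • Name: X-CSR
  • Task type: Question answering (question-answering)
  • Task ID: Multiple-choice question answering (multiple-choice-qa)
  • Languages: Multilingual: Arabic (ar), German (de), English (en), Spanish (es), French (fr), Hindi (hi), Italian (it), Japanese (ja; configuration names use the suffix jap), Dutch (nl), Polish (pl), Portuguese (pt), Russian (ru), Swahili (sw), Urdu (ur), Vietnamese (vi), and Chinese (zh).
  • License: MIT
  • Dataset size: each language configuration falls in the 1K to 10K examples category.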

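Dataset structure

The dataset contains multiple configurations, one per task and language, for example X-CODAH-ar and X-CODAH-de. Each configuration has the following features:

  • id: string.
  • lang: string.
  • question_tag: string (present in X-CODAH configurations only).
  • question: a struct containing:
    • stem: string.
    • choices: a sequence containing:
      • label: string.
      • text: string.
  • answerKey: string.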
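One practical wrinkle with the `choices` feature: the card's JSON example shows a list of `{label, text}` objects, while the Hugging Face `datasets` library commonly returns a sequence of named sub-fields as a dictionary of parallel lists. Which layout a loaded record actually has is an assumption worth checking on real data. The sketch below, with the hypothetical helper `normalize_choices`, accepts either layout.

```python
from typing import Dict, List, Union

Choice = Dict[str, str]


def normalize_choices(
    choices: Union[List[Choice], Dict[str, List[str]]]
) -> List[Choice]:
    """Return choices as a list of {"label": ..., "text": ...} dicts.

    Accepts either the list-of-dicts layout from the card's JSON example or a
    dict of parallel lists such as {"label": [...], "text": [...]}.
    """
    if isinstance(choices, dict):
        labels, texts = choices["label"], choices["text"]
        if len(labels) != len(texts):
            raise ValueError("label and text lists differ in length")
        return [{"label": lab, "text": txt} for lab, txt in zip(labels, texts)]
    # Already a list of dicts: copy only the two documented keys.
    return [{"label": c["label"], "text": c["text"]} for c in choices]
```

Normalizing once at load time keeps downstream code (prompt rendering, scoring) independent of how the records were obtained.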

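Dataset splits

Each language configuration is split into a test set and a validation set:

  • Test set: 1,000 examples per X-CODAH configuration and 1,074 per X-CSQA configuration; the size in bytes varies by language.
  • Validation set: 300 examples per X-CODAH configuration and 1,000 per X-CSQA configuration; the size in bytes varies by language.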
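The per-configuration counts in the metadata above (X-CODAH: 1,000 test and 300 validation; X-CSQA: 1,074 test and 1,000 validation, across 16 language suffixes) imply the overall totals worked out below. This is only arithmetic over the listed numbers; the constant and function names are illustrative.

```python
# Language suffixes as they appear in the configuration names (note "jap").
LANGS = ["ar", "de", "en", "es", "fr", "hi", "it", "jap",
         "nl", "pl", "pt", "ru", "sw", "ur", "vi", "zh"]

# Examples per split for one language configuration, from the card metadata.
SPLIT_SIZES = {
    "X-CODAH": {"test": 1000, "validation": 300},
    "X-CSQA": {"test": 1074, "validation": 1000},
}


def total_examples(task: str) -> int:
    """Total examples for one task across all language configurations."""
    return len(LANGS) * sum(SPLIT_SIZES[task].values())


print(total_examples("X-CODAH"))                            # 16 * 1300 = 20800
print(total_examples("X-CSQA"))                             # 16 * 2074 = 33184
print(total_examples("X-CODAH") + total_examples("X-CSQA")) # 53984
```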

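Dataset size details

Each language configuration reports a download size and a dataset size, both of which vary by language. For example:

  • X-CODAH-ar: download size 265474 bytes, dataset size 733048 bytes.
  • X-CODAH-de: download size 259705 bytes, dataset size 614851 bytes.
  • X-CODAH-en: download size 217262 bytes, dataset size 538811 bytes.
  • X-CODAH-es: download size 242647 bytes, dataset size 581632 bytes.
  • X-CODAH-fr: download size 244998 bytes, dataset size 615414 bytes.
  • X-CODAH-hi: download size 336862 bytes, dataset size 1256737 bytes.
  • X-CODAH-it: download size 241780 bytes, dataset size 590559 bytes.
  • X-CODAH-jap: download size 264995 bytes, dataset size 695807 bytes.
  • X-CODAH-nl: download size 237855 bytes, dataset size 578746 bytes.
  • X-CODAH-pl: download size 254894 bytes, dataset size 566288 bytes.
  • X-CODAH-pt: download size 238858 bytes, dataset size 587516 bytes.
  • X-CODAH-ru: download size 314200 bytes, dataset size 868280 bytes.
  • X-CODAH-sw: download size 214100 bytes, dataset size 548191 bytes.
  • X-CODAH-ur: download size 294475 bytes, dataset size 886860 bytes.
  • X-CODAH-vi: download size 251390 bytes, dataset size 699977 bytes.
  • X-CODAH-zh: download size 237827 bytes, dataset size 509685 bytes.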

Data sources

  • Source datasets: extended from CODAH and CommonsenseQA (commonsense_qa).
Collected summary
Dataset introduction
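Construction
Cross-lingual datasets for commonsense reasoning are scarce. X-CSR extends two English benchmarks, CODAH and CommonsenseQA, to 16 languages. According to the dataset card, the original English examples were translated automatically into the 15 other languages; the card acknowledges that these translations may contain noise and does not describe a human proofreading step. Each language configuration provides a validation set and a test set, between 1,300 and 2,074 examples in total, all in a shared multiple-choice format.
Characteristics
The dataset's main features are broad language coverage and commonsense reasoning content drawn from established benchmarks. It covers 16 languages, including Arabic, German, and Chinese, which makes it a resource for evaluating how well a model's reasoning transfers across languages. Each record has a clear structure: a question stem, several answer choices, and an answer key; X-CODAH configurations also carry a question tag to support analysis. The dataset is medium-sized, and because its translations were produced automatically, results are best read as a starting benchmark, as the dataset card itself suggests.
Usage
Researchers can use the dataset to evaluate and benchmark multilingual or cross-lingual models. With the Hugging Face datasets library, a subset is loaded by configuration name (for example 'X-CSQA-zh'), which gives access to the validation and test splits; per the dataset card, answer keys are hidden in the test data. The data suits multiple-choice question answering and can be used for model fine-tuning or as an evaluation benchmark in zero-shot and few-shot settings. Its standardized format is compatible with mainstream machine learning frameworks and provides a common experimental platform for studying commonsense reasoning across languages.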
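Loading a subset by configuration name can be sketched as follows. The repository ID `INK-USC/xcsr` and the configuration names come from this page; the helper names `config_name` and `load_split` are hypothetical, and the loader assumes the third-party `datasets` package is installed and that the Hub is reachable.

```python
TASKS = ("X-CSQA", "X-CODAH")

# Language suffixes as used in the configuration names; Japanese is "jap", not "ja".
LANGS = ("ar", "de", "en", "es", "fr", "hi", "it", "jap",
         "nl", "pl", "pt", "ru", "sw", "ur", "vi", "zh")


def config_name(task: str, lang: str) -> str:
    """Build a configuration name such as "X-CSQA-zh", validating both parts."""
    if task not in TASKS:
        raise ValueError(f"unknown task {task!r}; expected one of {TASKS}")
    if lang not in LANGS:
        raise ValueError(f"unknown language suffix {lang!r}; expected one of {LANGS}")
    return f"{task}-{lang}"


def load_split(task: str, lang: str, split: str = "validation"):
    """Load one split ("validation" or "test") of one configuration.

    The import is deferred so that config_name stays usable without the
    third-party dependency or network access.
    """
    from datasets import load_dataset  # pip install datasets

    return load_dataset("INK-USC/xcsr", config_name(task, lang), split=split)
```

A call such as `load_split("X-CSQA", "zh")` would then return the Chinese X-CSQA validation split. Validating the name up front gives a clearer error than a failed download when, for instance, `ja` is passed instead of `jap`.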
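Background and challenges
Background overview
In natural language processing, cross-lingual commonsense reasoning is a key test of how well a model generalizes. X-CSR was built by the INK-USC team to study how well models understand and transfer commonsense knowledge in a multilingual setting. The dataset extends CODAH and CommonsenseQA and covers sixteen languages, including Arabic, German, English, and Chinese. It aims to support fair comparison of multilingual models on commonsense reasoning tasks and to help improve their performance.
Current challenges
Cross-lingual commonsense reasoning must cope with semantic alignment across diverse languages, and models need to handle ambiguity in how commonsense is expressed in different cultures. On the construction side, the accuracy and consistency of multilingual annotation are hard to guarantee: machine translation can introduce noise, and crowdsourced annotation has to balance cost against quality. The scarcity of corpora for low-resource languages further complicates data collection and validation.
Common scenarios
Typical use cases
With its sixteen languages, X-CSR serves as a benchmark for evaluating the commonsense understanding of multilingual models. The multiple-choice format requires a model to draw inferences from everyday knowledge. Typical uses include testing zero-shot or few-shot cross-lingual transfer and examining what commonsense representations share, and where they differ, across languages. Because every language version shares the same example IDs, researchers also use the parallel data to probe how far multilingual pretrained models generalize their knowledge.
Academic problems addressed
The dataset addresses a core challenge in multilingual NLP, the transfer of commonsense knowledge, and the limitation that single-language commonsense datasets cannot measure cross-lingual generalization. It fills a gap in commonsense reasoning benchmarks for languages other than English and draws attention to biases in the knowledge representations of multilingual models. By providing a standardized evaluation framework, X-CSR supports research on topics such as cross-lingual semantic similarity and commonsense modeling for low-resource languages.
Related derivative work
Work related to X-CSR spans several directions. At the model level, the dataset has been used for systematic evaluation of multilingual pretrained models such as XLM-R and mT5. At the method level, it has motivated work on transfer learning techniques such as cross-lingual adversarial training and knowledge distillation. The summary also names multilingual commonsense graph alignment and causal analysis of cross-lingual knowledge transfer as representative directions, without citing specific papers.
The content above was collected and summarized by 遇见数据集.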