
nthakur/indic-swim-ir-cross-lingual

Hugging Face · Updated 2024-04-28 · Indexed 2024-06-12
Download link:
https://hf-mirror.com/datasets/nthakur/indic-swim-ir-cross-lingual
Resource description:
---
dataset_info:
- config_name: as
  features:
  - name: _id
    dtype: string
  - name: lang
    dtype: string
  - name: code
    dtype: string
  - name: query
    dtype: string
  - name: title
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 3233005
    num_examples: 5899
  download_size: 1803172
  dataset_size: 3233005
- config_name: bho
  features:
  - name: _id
    dtype: string
  - name: lang
    dtype: string
  - name: code
    dtype: string
  - name: query
    dtype: string
  - name: title
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 3132621
    num_examples: 5763
  download_size: 1745932
  dataset_size: 3132621
- config_name: gom
  features:
  - name: _id
    dtype: string
  - name: lang
    dtype: string
  - name: code
    dtype: string
  - name: query
    dtype: string
  - name: title
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 3241395
    num_examples: 5755
  download_size: 1772947
  dataset_size: 3241395
- config_name: gu
  features:
  - name: _id
    dtype: string
  - name: lang
    dtype: string
  - name: code
    dtype: string
  - name: query
    dtype: string
  - name: title
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 3171432
    num_examples: 5870
  download_size: 1786644
  dataset_size: 3171432
- config_name: hi
  features:
  - name: _id
    dtype: string
  - name: lang
    dtype: string
  - name: code
    dtype: string
  - name: query
    dtype: string
  - name: title
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 3140921
    num_examples: 5752
  download_size: 1761474
  dataset_size: 3140921
- config_name: kn
  features:
  - name: _id
    dtype: string
  - name: lang
    dtype: string
  - name: code
    dtype: string
  - name: query
    dtype: string
  - name: title
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 3222300
    num_examples: 5763
  download_size: 1781977
  dataset_size: 3222300
- config_name: mai
  features:
  - name: _id
    dtype: string
  - name: lang
    dtype: string
  - name: code
    dtype: string
  - name: query
    dtype: string
  - name: title
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 3106563
    num_examples: 5768
  download_size: 1732399
  dataset_size: 3106563
- config_name: ml
  features:
  - name: _id
    dtype: string
  - name: lang
    dtype: string
  - name: code
    dtype: string
  - name: query
    dtype: string
  - name: title
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 3386716
    num_examples: 5907
  download_size: 1853611
  dataset_size: 3386716
- config_name: mni
  features:
  - name: _id
    dtype: string
  - name: lang
    dtype: string
  - name: code
    dtype: string
  - name: query
    dtype: string
  - name: title
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 2699051
    num_examples: 5604
  download_size: 1430986
  dataset_size: 2699051
- config_name: mr
  features:
  - name: _id
    dtype: string
  - name: lang
    dtype: string
  - name: code
    dtype: string
  - name: query
    dtype: string
  - name: title
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 3301413
    num_examples: 5977
  download_size: 1839741
  dataset_size: 3301413
- config_name: or
  features:
  - name: _id
    dtype: string
  - name: lang
    dtype: string
  - name: code
    dtype: string
  - name: query
    dtype: string
  - name: title
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 3124722
    num_examples: 5837
  download_size: 1753854
  dataset_size: 3124722
- config_name: pa
  features:
  - name: _id
    dtype: string
  - name: lang
    dtype: string
  - name: code
    dtype: string
  - name: query
    dtype: string
  - name: title
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 3174739
    num_examples: 5840
  download_size: 1792406
  dataset_size: 3174739
- config_name: ps
  features:
  - name: _id
    dtype: string
  - name: lang
    dtype: string
  - name: code
    dtype: string
  - name: query
    dtype: string
  - name: title
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 2813503
    num_examples: 5694
  download_size: 1669583
  dataset_size: 2813503
- config_name: sa
  features:
  - name: _id
    dtype: string
  - name: lang
    dtype: string
  - name: code
    dtype: string
  - name: query
    dtype: string
  - name: title
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 3110486
    num_examples: 5779
  download_size: 1722194
  dataset_size: 3110486
- config_name: ta
  features:
  - name: _id
    dtype: string
  - name: lang
    dtype: string
  - name: code
    dtype: string
  - name: query
    dtype: string
  - name: title
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 3334815
    num_examples: 5930
  download_size: 1819387
  dataset_size: 3334815
- config_name: ur
  features:
  - name: _id
    dtype: string
  - name: lang
    dtype: string
  - name: code
    dtype: string
  - name: query
    dtype: string
  - name: title
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 2854099
    num_examples: 5816
  download_size: 1715776
  dataset_size: 2854099
configs:
- config_name: as
  data_files:
  - split: train
    path: as/train-*
- config_name: bho
  data_files:
  - split: train
    path: bho/train-*
- config_name: gom
  data_files:
  - split: train
    path: gom/train-*
- config_name: gu
  data_files:
  - split: train
    path: gu/train-*
- config_name: hi
  data_files:
  - split: train
    path: hi/train-*
- config_name: kn
  data_files:
  - split: train
    path: kn/train-*
- config_name: mai
  data_files:
  - split: train
    path: mai/train-*
- config_name: ml
  data_files:
  - split: train
    path: ml/train-*
- config_name: mni
  data_files:
  - split: train
    path: mni/train-*
- config_name: mr
  data_files:
  - split: train
    path: mr/train-*
- config_name: or
  data_files:
  - split: train
    path: or/train-*
- config_name: pa
  data_files:
  - split: train
    path: pa/train-*
- config_name: ps
  data_files:
  - split: train
    path: ps/train-*
- config_name: sa
  data_files:
  - split: train
    path: sa/train-*
- config_name: ta
  data_files:
  - split: train
    path: ta/train-*
- config_name: ur
  data_files:
  - split: train
    path: ur/train-*
license: cc-by-sa-4.0
task_categories:
- text-retrieval
- question-answering
language:
- as
- bho
- gom
- gu
- hi
- kn
- mai
- ml
- mni
- mr
- or
- pa
- ps
- sa
- ta
- ur
pretty_name: Indic SWIM-IR (Cross-lingual)
language_creators:
- machine-generated
multilinguality:
- multilingual
source_datasets:
- original
size_categories:
- 100K<n<1M
---

# Dataset Card for Indic SWIM-IR (Cross-lingual)

![SWIM-IR Logo](./swimir_header.png)

This is the cross-lingual Indic subset of the SWIM-IR dataset, in which each generated query is in an Indic language and the passage is in English. The SWIM-IR dataset is available under CC-BY-SA 4.0. The SWIM-IR cross-lingual dataset covers 18 languages (including English); this subset provides the 16 language configurations listed in the metadata above.

For full details of the dataset, please read our upcoming [NAACL 2024 paper](https://arxiv.org/abs/2311.05800) and check out our [website](https://github.com/google-research-datasets/swim-ir).

# What is SWIM-IR?

SWIM-IR is a synthetic multilingual retrieval dataset spanning around 28 million retrieval training pairs across 33 languages. Each question has been automatically generated with the Summarize-then-Ask (STA) prompting technique using PaLM-2 as the question generator.

**Note**: Because the questions are synthetically generated, hallucinations can occur during query generation. The hallucinated queries do not affect retrieval effectiveness.

If you are using SWIM-IR in your research, please cite the following paper:

```
@article{thakur:2023,
  author     = {Nandan Thakur and Jianmo Ni and Gustavo Hern{\'{a}}ndez {\'{A}}brego and John Wieting and Jimmy Lin and Daniel Cer},
  title      = {Leveraging LLMs for Synthesizing Training Data Across Many Languages in Multilingual Dense Retrieval},
  journal    = {CoRR},
  volume     = {abs/2311.05800},
  year       = {2023},
  url        = {https://doi.org/10.48550/arXiv.2311.05800},
  doi        = {10.48550/ARXIV.2311.05800},
  eprinttype = {arXiv},
  eprint     = {2311.05800},
  timestamp  = {Tue, 14 Nov 2023 14:47:55 +0100},
  biburl     = {https://dblp.org/rec/journals/corr/abs-2311-05800.bib},
  bibsource  = {dblp computer science bibliography, https://dblp.org}
}
```

## Dataset Details

### Dataset Description

- **Homepage:** [SWIM-IR homepage](https://github.com/google-research-datasets/swim-ir)
- **Repository:** [SWIM-IR repository](https://github.com/google-research-datasets/swim-ir)
- **Paper:** [Leveraging LLMs for Synthesizing Training Data Across Many Languages in Multilingual Dense Retrieval](https://arxiv.org/abs/2311.05800)
- **Leaderboard:** [Needs More Information]
- **Point of Contact:** [Nandan Thakur](mailto:nandan.thakur@uwaterloo.ca)

#### Dataset Link

SWIM-IR v1.0: http://storage.googleapis.com/gresearch/swim-ir/swim_ir_v1.tar.gz

#### Data Card Author(s)

- **Nandan Thakur, University of Waterloo:** Owner
- **Daniel Cer, Google Research:** Owner
- **Jianmo Ni, Google DeepMind:** Contributor
- **John Wieting, Google DeepMind:** Contributor
- **Gustavo Hernandez Abrego, Google Research:** Contributor
- **Jimmy Lin, University of Waterloo:** Contributor

## Authorship

### Publishers

#### Publishing Organization(s)

University of Waterloo, Google Research, Google DeepMind

#### Industry Type(s)

- Corporate - Tech
- Academic - Tech

### Dataset Owners

#### Team(s)

SWIM-IR Team

#### Contact Detail(s)

- **Dataset Owner(s):** Nandan Thakur, Daniel Cer
- **Affiliation:** University of Waterloo, Google Research
- **Contact:** [nandan.thakur@uwaterloo.ca](mailto:nandan.thakur@uwaterloo.ca)

## Dataset Overview

#### Data Subject(s)

- Synthetically generated data

#### Dataset Snapshot

SWIM-IR is a synthetic multilingual retrieval training dataset. It contains training pairs for both settings: monolingual, i.e. within the same language, and cross-lingual, i.e. across languages. The dataset is useful for fine-tuning state-of-the-art (SoTA) monolingual and cross-lingual neural retrievers on synthetic data across diverse languages.

Category | Data
--- | ---
Size of Dataset | ~6-7 GB
Number of Instances | 28,265,848
Number of Fields | 6
Labeled Classes | 33*
Number of Labels | 1

**Above:** Dataset statistics cover both the in-language and cross-language settings. Each class above denotes a language.

**Additional Notes:** (*) Classes denote the languages we cover in the SWIM-IR dataset. Here is a list of the 18 languages and their ISO codes, listed in alphabetical order by code: Arabic (ar), Bengali (bn), German (de), English (en), Spanish (es), Persian (fa), Finnish (fi), French (fr), Hindi (hi), Indonesian (id), Japanese (ja), Korean (ko), Russian (ru), Swahili (sw), Telugu (te), Thai (th), Yoruba (yo), Chinese (zh), and the remaining 15 Indic languages: Assamese (as), Bhojpuri (bho), Konkani (gom), Gujarati (gu), Kannada (kn), Maithili (mai), Malayalam (ml), Manipuri (mni), Marathi (mr), Odia (or), Punjabi (pa), Pashto (ps), Sanskrit (sa), Tamil (ta), Urdu (ur).

#### Content Description

A paragraph describing an entity is sampled from the Wikipedia corpus. A question arising from that paragraph is then generated using a large language model (LLM). In our work, we used the PaLM 2-S (small) model to generate synthetic queries across **33 languages**, covering 11 distinct scripts and 10 language families comprising over 3 billion speakers in the world.

The SWIM-IR dataset contains about **28 million** synthetic Wikipedia query-paragraph training pairs, with a multilingual query for each passage generated using PaLM 2 (small), for both cross-lingual and monolingual retrieval settings.

**Additional Notes:**

- The dataset creation follows a specific procedure that involves a `summarize-then-ask` prompting technique inspired by chain-of-thought prompting.
- PaLM 2 uses **summarize-then-ask prompting** containing 5-shot exemplars for cross-lingual and 3-shot exemplars for monolingual query generation.
- For cross-lingual generation, the prompt includes the original paragraph, a human-generated summary, and a question translated from English using Machine Translation (MT), whereas for monolingual generation it uses randomly sampled training dataset pairs and summaries generated using Google BARD.
- PaLM 2 generates an extractive summary which is used as a proxy to help understand the document and highlight relevant sections within the document.
- Finally, the model generates a question in the target language (different from the passage language in the cross-lingual setting, the same in the monolingual setting) which can be answered using the input paragraph.

### Sensitivity of Data

#### Sensitivity Type(s)

- None

#### Field(s) with Sensitive Data

**Intentionally Collected Sensitive Data**

No sensitive data was intentionally collected.

**Unintentionally Collected Sensitive Data**

S/PII, violent, abusive, or toxic text containing racial slurs was not explicitly collected as a part of the dataset creation process. Sensitive subject and adult content was automatically filtered using the method described in (Thakur et al. 2023).

#### Security and Privacy Handling

We used algorithmic methods and relied on other classifiers for data filtration. Specifically, (1) we manually inspected text samples, with the questions automatically translated to English, and (2) based on those observations, we used a classifier to filter text containing sensitive subjects and adult content.

## Example of Data Points

#### Primary Data Modality

- Text Data

#### Data Fields

| Field name | Type | Description |
| --------- | -------- | -------- |
| `lang` | String | The language of the generated question |
| `code` | String | The ISO code for the language |
| `query` | String | The query generated using PaLM 2 |
| `_id` | String | Unique ID denoting the training pair |
| `title` | String | Title of the Wikipedia article |
| `text` | String | Paragraph of the Wikipedia article |

#### Typical Data Point

Example of an (English -> Japanese) datapoint from our cross-lingual dataset on the topic of “The Roki Tunnel” from the English Wikipedia.

```bash
{
  '_id': '1234',
  'lang': 'Japanese',
  'code': 'ja',
  'query': 'The Roki Tunnel は、北オセチア自治共和国と南オセチア共和国の間を通る唯一の道路ですか?',
  'title': 'The Roki Tunnel',
  'text': "The Roki Tunnel (also called Roksky Tunnel, ; Ossetic: Ручъы тъунел; ) is a mountain tunnel of the Transkam road through the Greater Caucasus Mountains, north of the village Upper Roka. It is the only road joining North Ossetia–Alania in the Russian Federation into South Ossetia, a breakaway republic of Georgia. The road is manned at the town of Nizhny Zaramag in North Ossetia and is sometimes referred to as the Roki-Nizhny Zaramag border crossing. The tunnel, completed by the Soviet government in 1984, is one of only a handful of routes that cross the North Caucasus Range."
}
```

Example of a Hindi (hi) datapoint from our monolingual dataset on the topic of “Aryabhata” from the Hindi Wikipedia.

```bash
{
  '_id': 'hindi_8987#4',
  'lang': 'Hindi',
  'code': 'hi',
  'query': 'आर्यभट केरल के किस स्थान के निवासी थे?',
  'title': 'आर्यभट',
  'text': "एक ताजा अध्ययन के अनुसार आर्यभट, केरल के चाम्रवत्तम (१०उत्तर५१, ७५पूर्व४५) के निवासी थे। अध्ययन के अनुसार अस्मका एक जैन प्रदेश था जो कि श्रवणबेलगोल के चारों तरफ फैला हुआ था और यहाँ के पत्थर के खम्बों के कारण इसका नाम अस्मका पड़ा। चाम्रवत्तम इस जैन बस्ती का हिस्सा था, इसका प्रमाण है भारतापुझा नदी जिसका नाम जैनों के पौराणिक राजा भारता के नाम पर रखा गया है। आर्यभट ने भी युगों को परिभाषित करते वक्त राजा भारता का जिक्र किया है- दसगीतिका के पांचवें छंद में राजा भारत के समय तक बीत चुके काल का वर्णन आता है। उन दिनों में कुसुमपुरा में एक प्रसिद्ध विश्वविद्यालय था जहाँ जैनों का निर्णायक प्रभाव था और आर्यभट का काम इस प्रकार कुसुमपुरा पहुँच सका और उसे पसंद भी किया गया।"
}
```

#### Atypical Data Point

The dataset does not contain atypical data points as far as we know.

## Motivations & Intentions

### Motivations

#### Purpose(s)

- Research

#### Domain(s) of Application

`Multilingual Dense Retrieval`, `Synthetic Dataset`

## Provenance

### Collection

#### Method(s) Used

- Artificially Generated
- Taken from other existing datasets

#### Methodology Detail(s)

**Collection Type**

**Source:** The TyDi-QA dataset provided the English Wikipedia dataset for the SWIM cross-lingual IR dataset. MIRACL provided the language-specific Wikipedia datasets for the monolingual SWIM-IR datasets.

**Is this source considered sensitive or high-risk?** No

**Dates of Collection:** TyDi-QA [unknown - 01/02/2019], MIRACL [unknown - 01/02/2023], XTREME-UP [unknown - 01/02/2023]

**Primary modality of collection data:**

- Text Data

**Update Frequency for collected data:**

- Static

#### Source Description(s)

- **TyDi-QA:** TyDi-QA [(Clark et al. 2020)](https://aclanthology.org/2020.tacl-1.30/) provided the English Wikipedia passages, which have been split into 100-word long paragraphs. It contains around 18.2M passages from the complete English Wikipedia. We selected passages with a maximum of 1M pairs for each language pair (for 17 languages) at random for the preparation of our cross-lingual SWIM-IR dataset.
- **MIRACL:** MIRACL [(Zhang et al. 2023)](https://direct.mit.edu/tacl/article/doi/10.1162/tacl_a_00595/117438/MIRACL-A-Multilingual-Retrieval-Dataset-Covering) provides language-specific paragraphs from the Wikipedia corpus. The paragraphs were generated by splitting on the “\n\n” delimiter. The MIRACL dataset provides corpora for 18 languages. We selected passages with a maximum of 1M pairs for each language at random for the preparation of our monolingual SWIM-IR dataset.
- **XTREME-UP:** XTREME-UP [(Ruder et al. 2023)](https://aclanthology.org/2023.findings-emnlp.125/) provides a 120K sample of the TyDi-QA (Clark et al. 2020) English Wikipedia passages, which have been split into 100-word long paragraphs. This sample has been used in the original dataset for cross-language question answering.

#### Collection Cadence

**Static:** Data was collected once from single or multiple sources.

#### Data Integration

**TyDi-QA (XOR-Retrieve and XTREME-UP)**

**Included Fields**

The English Wikipedia title, text, and `_id` fields were taken from the TyDi-QA dataset, originally provided as a TSV file containing all fields.

**Excluded Fields**

The rest of the metadata apart from the fields mentioned above was excluded from our SWIM-IR dataset. We do not use any training data provided by the TyDi-QA dataset.

**MIRACL**

**Included Fields**

The language-specific Wikipedia title, text, and `_id` fields were taken from the MIRACL dataset, originally provided as a JSON-lines file containing all fields.

**Excluded Fields**

The rest of the metadata apart from the fields mentioned above was excluded from our SWIM-IR dataset. We do not use any training data provided by the MIRACL dataset.

#### Data Processing

All data comes directly from the TyDi-QA and MIRACL datasets without any preprocessing.

### Collection Criteria

#### Data Selection

For the cross-lingual SWIM-IR dataset, we use a stratified sampling technique to select a subset of passages from the English Wikipedia corpus. We use it to generate questions for SWIM-IR. We ensure all languages have a relatively equal number of training samples, wherever possible. Our Wikipedia corpus contains entities that are sorted alphabetically (A-Z). We then compute the inclusion threshold $I_{th}$, defined as $I_{th} = D_{sample} / D_{total}$, where $D_{sample}$ is the number of passages to sample and $D_{total}$ is the total number of passages in the corpus. Next, for each passage $p_i$ in the corpus, we randomly generate an inclusion probability $\hat{p_i} \in [0,1]$. We select the passage $p_i$ if $\hat{p_i} \leq I_{th}$. This ensures uniform sampling of passages with Wikipedia entities between all letters (A-Z).

For the monolingual SWIM-IR dataset, the language selection criteria depended on the availability of Wikipedia corpora for the monolingual task. Hence, we chose to fix on the 18 languages provided in MIRACL. To complete the dataset, we included the same languages for the cross-lingual task.

#### Data Inclusion

We include all data available in the TyDi-QA English Wikipedia corpus (maximum of 1M training pairs per language pair), which we use to generate our cross-lingual SWIM-IR dataset. We use the language-specific MIRACL Wikipedia corpora to generate our monolingual queries in SWIM-IR.

#### Data Exclusion

We removed data classified as containing sensitive subjects and adult content using the method described in our paper. No additional filters were applied for data exclusion from MIRACL or TyDi-QA. The TyDi-QA English paragraphs were split into segments of up to 100 tokens, whereas MIRACL used the “\n\n” delimiter to segment paragraphs from the Wikipedia articles.
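
The inclusion-threshold sampling described under Data Selection can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' code: the function name `sample_passages`, its parameters, and the use of Python's `random` module are all hypothetical. Each passage draws a uniform random value and is kept when that value is at most $I_{th} = D_{sample} / D_{total}$, so the number of selected passages is approximately, not exactly, $D_{sample}$.

```python
import random
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")


def sample_passages(passages: Sequence[T], num_to_sample: int,
                    seed: Optional[int] = None) -> List[T]:
    """Keep each passage independently with probability D_sample / D_total."""
    if not passages:
        return []
    rng = random.Random(seed)  # seeded generator for reproducible runs
    threshold = num_to_sample / len(passages)  # inclusion threshold I_th
    # Draw one inclusion probability per passage; corpus order is preserved,
    # so an alphabetically sorted corpus is sampled evenly across A-Z.
    return [p for p in passages if rng.random() <= threshold]
```

Because every passage is tested independently, the selection is spread evenly over the alphabetically sorted corpus instead of being concentrated at its start.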
Provider:
nthakur
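Summary of original information

Dataset overview

Dataset description

Dataset name

Indic SWIM-IR (Cross-lingual)

Dataset version

v1.0

Dataset size

Approx. 6-7 GB (full SWIM-IR dataset)

Number of instances

28,265,848 (full SWIM-IR dataset)

Number of fields

6

Languages

This subset includes the following 16 languages: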

  • as (Assamese)
  • bho (Bhojpuri)
  • gom (Konkani)
  • gu (Gujarati)
  • hi (Hindi)
  • kn (Kannada)
  • mai (Maithili)
  • ml (Malayalam)
  • mni (Manipuri)
  • mr (Marathi)
  • or (Odia)
  • pa (Punjabi)
  • ps (Pashto)
  • sa (Sanskrit)
  • ta (Tamil)
  • ur (Urdu)

Dataset type

Multilingual retrieval training dataset

Dataset structure

Configuration names and features

  Every configuration has the same features, all of type string: _id, lang, code, query, title, text. Each configuration has a single train split, and its dataset size equals the num_bytes of that split. Sizes are in bytes.

  • as: train (num_bytes: 3233005, num_examples: 5899); download size: 1803172
  • bho: train (num_bytes: 3132621, num_examples: 5763); download size: 1745932
  • gom: train (num_bytes: 3241395, num_examples: 5755); download size: 1772947
  • gu: train (num_bytes: 3171432, num_examples: 5870); download size: 1786644
  • hi: train (num_bytes: 3140921, num_examples: 5752); download size: 1761474
  • kn: train (num_bytes: 3222300, num_examples: 5763); download size: 1781977
  • mai: train (num_bytes: 3106563, num_examples: 5768); download size: 1732399
  • ml: train (num_bytes: 3386716, num_examples: 5907); download size: 1853611
  • mni: train (num_bytes: 2699051, num_examples: 5604); download size: 1430986
  • mr: train (num_bytes: 3301413, num_examples: 5977); download size: 1839741
  • or: train (num_bytes: 3124722, num_examples: 5837); download size: 1753854
  • pa: train (num_bytes: 3174739, num_examples: 5840); download size: 1792406
  • ps: train (num_bytes: 2813503, num_examples: 5694); download size: 1669583
  • sa: train (num_bytes: 3110486, num_examples: 5779); download size: 1722194
  • ta: train (num_bytes: 3334815, num_examples: 5930); download size: 1819387
  • ur: train (num_bytes: 2854099, num_examples: 5816); download size: 1715776
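
As a quick consistency check on the per-configuration figures listed above, the following sketch adds up the num_examples values across all 16 configurations. The dictionary simply restates the counts from this page; the variable names are illustrative.

```python
# num_examples of the train split for each configuration, as listed above.
NUM_EXAMPLES = {
    "as": 5899, "bho": 5763, "gom": 5755, "gu": 5870,
    "hi": 5752, "kn": 5763, "mai": 5768, "ml": 5907,
    "mni": 5604, "mr": 5977, "or": 5837, "pa": 5840,
    "ps": 5694, "sa": 5779, "ta": 5930, "ur": 5816,
}

total_examples = sum(NUM_EXAMPLES.values())
smallest = min(NUM_EXAMPLES, key=NUM_EXAMPLES.get)  # config with the fewest pairs
largest = max(NUM_EXAMPLES, key=NUM_EXAMPLES.get)   # config with the most pairs

print(f"configs: {len(NUM_EXAMPLES)}, total training pairs: {total_examples}")
print(f"smallest: {smallest} ({NUM_EXAMPLES[smallest]}), "
      f"largest: {largest} ({NUM_EXAMPLES[largest]})")
```

With these counts the subset holds 92,954 training pairs in total, far fewer than the 28,265,848 instances reported for the full SWIM-IR dataset.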

Data file paths

  • as: train, path: as/train-*
  • bho: train, path: bho/train-*
  • gom: train, path: gom/train-*
  • gu: train, path: gu/train-*
  • hi: train, path: hi/train-*
  • kn: train, path: kn/train-*
  • mai: train, path: mai/train-*
  • ml: train, path: ml/train-*
  • mni: train, path: mni/train-*
  • mr: train, path: mr/train-*
  • or: train, path: or/train-*
  • pa: train, path: pa/train-*
  • ps: train, path: ps/train-*
  • sa: train, path: sa/train-*
  • ta: train, path: ta/train-*
  • ur: train, path: ur/train-*
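
A minimal sketch of loading one configuration follows. It assumes the Hugging Face `datasets` library is installed and that the repository ID is the `nthakur/indic-swim-ir-cross-lingual` name shown at the top of this page; the helper names `data_file_pattern` and `load_config` are hypothetical.

```python
from typing import List

REPO_ID = "nthakur/indic-swim-ir-cross-lingual"  # taken from the page title
CONFIGS: List[str] = [
    "as", "bho", "gom", "gu", "hi", "kn", "mai", "ml",
    "mni", "mr", "or", "pa", "ps", "sa", "ta", "ur",
]


def data_file_pattern(config: str) -> str:
    """Return the train-split file pattern listed above for a configuration."""
    return f"{config}/train-*"


def load_config(config: str):
    """Load the train split of one language configuration (needs network access)."""
    if config not in CONFIGS:
        raise ValueError(f"unknown configuration: {config!r}")
    # Imported lazily so the helpers above work without the library installed.
    from datasets import load_dataset
    return load_dataset(REPO_ID, config, split="train")
```

For example, `load_config("hi")` would fetch the Hindi configuration, whose files match `hi/train-*`.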

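Dataset content

Data point description

Each data point contains the following fields:

  • _id: unique ID denoting the training pair
  • lang: language of the generated question
  • code: ISO code of the language
  • query: query generated using PaLM 2
  • title: title of the Wikipedia article
  • text: paragraph of the Wikipedia article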
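
The field list above can be turned into a simple record check. This is a sketch only: the function name `validate_record` and the wording of its messages are hypothetical, and it tests nothing beyond the presence of the six fields and their string type.

```python
from typing import Any, Dict, List

# The six fields every data point is expected to carry, all strings.
FIELDS = ("_id", "lang", "code", "query", "title", "text")


def validate_record(record: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in one data point (empty if none)."""
    problems = []
    for name in FIELDS:
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], str):
            problems.append(f"field {name} is not a string")
    # Report anything that is not part of the documented schema.
    for name in sorted(set(record) - set(FIELDS)):
        problems.append(f"unexpected field: {name}")
    return problems
```

An empty return value means the record matches the documented schema.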

Data point examples

  • English -> Japanese (JSON):
    {
      "_id": "1234",
      "lang": "Japanese",
      "code": "ja",
      "query": "The Roki Tunnel は、北オセチア自治共和国と南オセチア共和国の間を通る唯一の道路ですか?",
      "title": "The Roki Tunnel",
      "text": "The Roki Tunnel (also called Roksky Tunnel, ; Ossetic: Ручъы тъунел; ) is a mountain tunnel of the Transkam road through the Greater Caucasus Mountains, north of the village Upper Roka. It is the only road joining North Ossetia–Alania in the Russian Federation into South Ossetia, a breakaway republic of Georgia. The road is manned at the town of Nizhny Zaramag in North Ossetia and is sometimes referred to as the Roki-Nizhny Zaramag border crossing. The tunnel, completed by the Soviet government in 1984, is one of only a handful of routes that cross the North Caucasus Range."
    }

  • Hindi (JSON):
    {
      "_id": "hindi_8987#4",
      "lang": "Hindi",
      "code": "hi",
      "query": "आर्यभट केरल के किस स्थान के निवासी थे?",
      "title": "आर्यभट",
      "text": "एक ताजा अध्ययन के अनुसार आर्यभट, केरल के चाम्रवत्तम (१०उत्तर५१, ७५पूर्व४५) के निवासी थे। अध्ययन के अनुसार अस्मका एक जैन प्रदेश था जो कि श्रवणबेलगोल के चारों तरफ फैला हुआ था और यहाँ के पत्थर के खम्बों के कारण इसका नाम अस्मका पड़ा। चाम्रवत्तम इस जैन बस्ती का हिस्सा था, इसका प्रमाण है भारतापुझा नदी जिसका नाम जैनों के पौराणिक राजा भारता के नाम पर रखा गया है। आर्यभट ने भी युगों को परिभाषित करते वक्त राजा भारता का जिक्र किया है- दसगीतिका के पांचवें छंद में राजा भारत के समय तक बीत चुके काल का वर्णन आता है। उन दिनों में कुसुमपुरा में एक प्रसिद्ध विश्वविद्यालय था जहाँ जैनों का निर्णायक प्रभाव था और आर्यभट का काम इस प्रकार कुसुमपुरा पहुँच सका और उसे पसंद भी किया गया।"
    }

Dataset usage

Purpose

  • Research

Application domains

  • Multilingual dense retrieval
  • Synthetic dataset