nthakur/indic-swim-ir-cross-lingual

Name: nthakur/indic-swim-ir-cross-lingual
Creator: nthakur
Published: 2024-04-28 05:10:02
License: CC-BY-SA 4.0

Hugging Face | Updated 2024-04-28 | Indexed 2024-06-12

Download link:

https://hf-mirror.com/datasets/nthakur/indic-swim-ir-cross-lingual

Resource description:

---
dataset_info:
- config_name: as
  features:
  - name: _id
    dtype: string
  - name: lang
    dtype: string
  - name: code
    dtype: string
  - name: query
    dtype: string
  - name: title
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 3233005
    num_examples: 5899
  download_size: 1803172
  dataset_size: 3233005
- config_name: bho
  features:
  - name: _id
    dtype: string
  - name: lang
    dtype: string
  - name: code
    dtype: string
  - name: query
    dtype: string
  - name: title
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 3132621
    num_examples: 5763
  download_size: 1745932
  dataset_size: 3132621
- config_name: gom
  features:
  - name: _id
    dtype: string
  - name: lang
    dtype: string
  - name: code
    dtype: string
  - name: query
    dtype: string
  - name: title
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 3241395
    num_examples: 5755
  download_size: 1772947
  dataset_size: 3241395
- config_name: gu
  features:
  - name: _id
    dtype: string
  - name: lang
    dtype: string
  - name: code
    dtype: string
  - name: query
    dtype: string
  - name: title
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 3171432
    num_examples: 5870
  download_size: 1786644
  dataset_size: 3171432
- config_name: hi
  features:
  - name: _id
    dtype: string
  - name: lang
    dtype: string
  - name: code
    dtype: string
  - name: query
    dtype: string
  - name: title
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 3140921
    num_examples: 5752
  download_size: 1761474
  dataset_size: 3140921
- config_name: kn
  features:
  - name: _id
    dtype: string
  - name: lang
    dtype: string
  - name: code
    dtype: string
  - name: query
    dtype: string
  - name: title
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 3222300
    num_examples: 5763
  download_size: 1781977
  dataset_size: 3222300
- config_name: mai
  features:
  - name: _id
    dtype: string
  - name: lang
    dtype: string
  - name: code
    dtype: string
  - name: query
    dtype: string
  - name: title
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 3106563
    num_examples: 5768
  download_size: 1732399
  dataset_size: 3106563
- config_name: ml
  features:
  - name: _id
    dtype: string
  - name: lang
    dtype: string
  - name: code
    dtype: string
  - name: query
    dtype: string
  - name: title
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 3386716
    num_examples: 5907
  download_size: 1853611
  dataset_size: 3386716
- config_name: mni
  features:
  - name: _id
    dtype: string
  - name: lang
    dtype: string
  - name: code
    dtype: string
  - name: query
    dtype: string
  - name: title
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 2699051
    num_examples: 5604
  download_size: 1430986
  dataset_size: 2699051
- config_name: mr
  features:
  - name: _id
    dtype: string
  - name: lang
    dtype: string
  - name: code
    dtype: string
  - name: query
    dtype: string
  - name: title
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 3301413
    num_examples: 5977
  download_size: 1839741
  dataset_size: 3301413
- config_name: or
  features:
  - name: _id
    dtype: string
  - name: lang
    dtype: string
  - name: code
    dtype: string
  - name: query
    dtype: string
  - name: title
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 3124722
    num_examples: 5837
  download_size: 1753854
  dataset_size: 3124722
- config_name: pa
  features:
  - name: _id
    dtype: string
  - name: lang
    dtype: string
  - name: code
    dtype: string
  - name: query
    dtype: string
  - name: title
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 3174739
    num_examples: 5840
  download_size: 1792406
  dataset_size: 3174739
- config_name: ps
  features:
  - name: _id
    dtype: string
  - name: lang
    dtype: string
  - name: code
    dtype: string
  - name: query
    dtype: string
  - name: title
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 2813503
    num_examples: 5694
  download_size: 1669583
  dataset_size: 2813503
- config_name: sa
  features:
  - name: _id
    dtype: string
  - name: lang
    dtype: string
  - name: code
    dtype: string
  - name: query
    dtype: string
  - name: title
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 3110486
    num_examples: 5779
  download_size: 1722194
  dataset_size: 3110486
- config_name: ta
  features:
  - name: _id
    dtype: string
  - name: lang
    dtype: string
  - name: code
    dtype: string
  - name: query
    dtype: string
  - name: title
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 3334815
    num_examples: 5930
  download_size: 1819387
  dataset_size: 3334815
- config_name: ur
  features:
  - name: _id
    dtype: string
  - name: lang
    dtype: string
  - name: code
    dtype: string
  - name: query
    dtype: string
  - name: title
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 2854099
    num_examples: 5816
  download_size: 1715776
  dataset_size: 2854099
configs:
- config_name: as
  data_files:
  - split: train
    path: as/train-*
- config_name: bho
  data_files:
  - split: train
    path: bho/train-*
- config_name: gom
  data_files:
  - split: train
    path: gom/train-*
- config_name: gu
  data_files:
  - split: train
    path: gu/train-*
- config_name: hi
  data_files:
  - split: train
    path: hi/train-*
- config_name: kn
  data_files:
  - split: train
    path: kn/train-*
- config_name: mai
  data_files:
  - split: train
    path: mai/train-*
- config_name: ml
  data_files:
  - split: train
    path: ml/train-*
- config_name: mni
  data_files:
  - split: train
    path: mni/train-*
- config_name: mr
  data_files:
  - split: train
    path: mr/train-*
- config_name: or
  data_files:
  - split: train
    path: or/train-*
- config_name: pa
  data_files:
  - split: train
    path: pa/train-*
- config_name: ps
  data_files:
  - split: train
    path: ps/train-*
- config_name: sa
  data_files:
  - split: train
    path: sa/train-*
- config_name: ta
  data_files:
  - split: train
    path: ta/train-*
- config_name: ur
  data_files:
  - split: train
    path: ur/train-*
license: cc-by-sa-4.0
task_categories:
- text-retrieval
- question-answering
language:
- as
- bho
- gom
- gu
- hi
- kn
- mai
- ml
- mni
- mr
- or
- pa
- ps
- sa
- ta
- ur
pretty_name: Indic SWIM-IR (Cross-lingual)
language_creators:
- machine-generated
multilinguality:
- multilingual
source_datasets:
- original
size_categories:
- 100K<n<1M
---

# Dataset Card for Indic SWIM-IR (Cross-lingual)

![SWIM-IR Logo](./swimir_header.png)

This is the cross-lingual Indic subset of the SWIM-IR dataset, where the generated query is in the target Indic language and the passage is in English. The SWIM-IR dataset is available under CC-BY-SA 4.0. 18 languages (including English) are available in the cross-lingual dataset; this subset provides the 16 language configurations listed in the metadata above.

For full details of the dataset, please read our upcoming [NAACL 2024 paper](https://arxiv.org/abs/2311.05800) and check out our [website](https://github.com/google-research-datasets/swim-ir).

# What is SWIM-IR?

The SWIM-IR dataset is a synthetic multilingual retrieval dataset spanning around 29 million retrieval training pairs across 27 languages. Each question has been automatically generated with the Summarize-then-Ask (STA) prompting technique using PaLM-2 as the question generator.

**Note**: As the question is synthetically generated, there is scope for hallucinations during query generation. The hallucinated queries do not affect retrieval effectiveness.

If you are using SWIM-IR in your research, please cite the following paper:

```
@article{thakur:2023,
  author     = {Nandan Thakur and Jianmo Ni and Gustavo Hern{\'{a}}ndez {\'{A}}brego and John Wieting and Jimmy Lin and Daniel Cer},
  title      = {Leveraging LLMs for Synthesizing Training Data Across Many Languages in Multilingual Dense Retrieval},
  journal    = {CoRR},
  volume     = {abs/2311.05800},
  year       = {2023},
  url        = {https://doi.org/10.48550/arXiv.2311.05800},
  doi        = {10.48550/ARXIV.2311.05800},
  eprinttype = {arXiv},
  eprint     = {2311.05800},
  timestamp  = {Tue, 14 Nov 2023 14:47:55 +0100},
  biburl     = {https://dblp.org/rec/journals/corr/abs-2311-05800.bib},
  bibsource  = {dblp computer science bibliography, https://dblp.org}
}
```

## Dataset Details

### Dataset Description

- **Homepage:** [SWIM-IR homepage](https://github.com/google-research-datasets/swim-ir)
- **Repository:** [SWIM-IR repository](https://github.com/google-research-datasets/swim-ir)
- **Paper:** [Leveraging LLMs for Synthesizing Training Data Across Many Languages in Multilingual Dense Retrieval](https://arxiv.org/abs/2311.05800)
- **Leaderboard:** [Needs More Information]
- **Point of Contact:** [Nandan Thakur](mailto:nandan.thakur@uwaterloo.ca)

#### Dataset Link

SWIM-IR v1.0: http://storage.googleapis.com/gresearch/swim-ir/swim_ir_v1.tar.gz

#### Data Card Author(s)

- **Nandan Thakur, University of Waterloo:** Owner
- **Daniel Cer, Google Research:** Owner
- **Jianmo Ni, Google DeepMind:** Contributor
- **John Wieting, Google DeepMind:** Contributor
- **Gustavo Hernandez Abrego, Google Research:** Contributor
- **Jimmy Lin, University of Waterloo:** Contributor

## Authorship

### Publishers

#### Publishing Organization(s)

University of Waterloo, Google Research, Google DeepMind

#### Industry Type(s)

- Corporate - Tech
- Academic - Tech

### Dataset Owners

#### Team(s)

SWIM-IR Team

#### Contact Detail(s)

- **Dataset Owner(s):** Nandan Thakur, Daniel Cer
- **Affiliation:** University of Waterloo, Google Research
- **Contact:** [nandan.thakur@uwaterloo.ca](mailto:nandan.thakur@uwaterloo.ca)

## Dataset Overview

#### Data Subject(s)

- Synthetically generated data

#### Dataset Snapshot

SWIM-IR is a synthetic multilingual retrieval training dataset. It contains training pairs for both settings: monolingual, i.e. within the same language, and cross-lingual, i.e. across languages. The dataset is useful to fine-tune state-of-the-art (SoTA) synthetic monolingual and cross-lingual neural retrievers across diverse languages.

Category | Data
--- | ---
Size of Dataset | ~6-7 GB
Number of Instances | 28,265,848
Number of Fields | 6
Labeled Classes | 33*
Number of Labels | 1

**Above:** Dataset statistics comprise both in-language and cross-language settings. The classes above denote a language.

**Additional Notes:** (*) Classes denote the languages we cover in the SWIM-IR dataset. Here is a list of the 18 languages and their ISO codes listed in alphabetical order: Arabic (ar), Bengali (bn), German (de), English (en), Spanish (es), Persian (fa), Finnish (fi), French (fr), Hindi (hi), Indonesian (id), Japanese (ja), Korean (ko), Russian (ru), Swahili (sw), Thai (th), Yoruba (yo), Chinese (zh), and the remaining 15 Indic languages: Assamese (as), Bhojpuri (bho), Konkani (gom), Gujarati (gu), Kannada (kn), Maithili (mai), Malayalam (ml), Manipuri (mni), Marathi (mr), Odia (or), Punjabi (pa), Pashto (ps), Sanskrit (sa), Tamil (ta), Urdu (ur).

#### Content Description

A paragraph describing an entity is sampled from the Wikipedia corpus. A question arising from that paragraph is then generated using a large language model (LLM). In our work, we used the PaLM 2-S (small) model to generate synthetic queries across **33 languages**, covering 11 distinct scripts and 10 language families comprising over 3 billion speakers in the world.

The SWIM-IR dataset contains about **28 million** synthetic Wikipedia query-paragraph training pairs, with a multilingual query generated for each passage using PaLM 2 (small), for both cross-lingual and monolingual retrieval settings.

**Additional Notes:**

- The dataset creation follows a specific procedure that involves a `summarize-then-ask` prompting technique inspired by chain-of-thought prompting.
- PaLM 2 uses **summarize-then-ask prompting** containing 5-shot exemplars for cross-lingual and 3-shot exemplars for monolingual query generation.
- For cross-lingual generation, the prompt includes the original paragraph, a human-generated summary, and a question translated from English using machine translation (MT), whereas for monolingual generation it uses randomly sampled training dataset pairs and summaries generated using Google Bard.
- PaLM 2 generates an extractive summary, which is used as a proxy to help understand the document and highlight relevant sections within the document.
- Finally, the model generates a question in the target language (different from the passage language in the cross-lingual setting, the same in the monolingual setting) which can be answered using the input paragraph.

### Sensitivity of Data

#### Sensitivity Type(s)

- None

#### Field(s) with Sensitive Data

**Intentionally Collected Sensitive Data**

No sensitive data was intentionally collected.

**Unintentionally Collected Sensitive Data**

S/PII, violent, abusive, or toxic text containing racial slurs was not explicitly collected as a part of the dataset creation process. Sensitive subjects and adult content were automatically filtered using the method described in (Thakur et al. 2023).

#### Security and Privacy Handling

We used algorithmic methods and relied on other classifiers for data filtration. Specifically, we (1) did a human inspection of text samples, with the questions automatically translated to English; (2) used a classifier, motivated by our observations, to filter text containing sensitive subjects and adult content.

## Example of Data Points

#### Primary Data Modality

- Text Data

#### Data Fields

| Field name | Datapoint Example | Description |
| --------- | -------- | -------- |
| `lang` | String | The language of the generated question |
| `code` | String | The ISO code for the language |
| `query` | String | The query generated using PaLM 2 |
| `_id` | String | Unique ID denoting the training pair |
| `title` | String | Title of the Wikipedia article |
| `text` | String | Paragraph of the Wikipedia article |

#### Typical Data Point

Example of an (English -> Japanese) datapoint from our cross-lingual dataset on the topic of “The Roki Tunnel” from the English Wikipedia.

```json
{
  "_id": "1234",
  "lang": "Japanese",
  "code": "ja",
  "query": "The Roki Tunnel は、北オセチア自治共和国と南オセチア共和国の間を通る唯一の道路ですか?",
  "title": "The Roki Tunnel",
  "text": "The Roki Tunnel (also called Roksky Tunnel, ; Ossetic: Ручъы тъунел; ) is a mountain tunnel of the Transkam road through the Greater Caucasus Mountains, north of the village Upper Roka. It is the only road joining North Ossetia–Alania in the Russian Federation into South Ossetia, a breakaway republic of Georgia. The road is manned at the town of Nizhny Zaramag in North Ossetia and is sometimes referred to as the Roki-Nizhny Zaramag border crossing. The tunnel, completed by the Soviet government in 1984, is one of only a handful of routes that cross the North Caucasus Range."
}
```

Example of a Hindi (hi) datapoint from our monolingual dataset on the topic of “Aryabhata” from the Hindi Wikipedia.

```json
{
  "_id": "hindi_8987#4",
  "lang": "Hindi",
  "code": "hi",
  "query": "आर्यभट केरल के किस स्थान के निवासी थे?",
  "title": "आर्यभट",
  "text": "एक ताजा अध्ययन के अनुसार आर्यभट, केरल के चाम्रवत्तम (१०उत्तर५१, ७५पूर्व४५) के निवासी थे। अध्ययन के अनुसार अस्मका एक जैन प्रदेश था जो कि श्रवणबेलगोल के चारों तरफ फैला हुआ था और यहाँ के पत्थर के खम्बों के कारण इसका नाम अस्मका पड़ा। चाम्रवत्तम इस जैन बस्ती का हिस्सा था, इसका प्रमाण है भारतापुझा नदी जिसका नाम जैनों के पौराणिक राजा भारता के नाम पर रखा गया है। आर्यभट ने भी युगों को परिभाषित करते वक्त राजा भारता का जिक्र किया है- दसगीतिका के पांचवें छंद में राजा भारत के समय तक बीत चुके काल का वर्णन आता है। उन दिनों में कुसुमपुरा में एक प्रसिद्ध विश्वविद्यालय था जहाँ जैनों का निर्णायक प्रभाव था और आर्यभट का काम इस प्रकार कुसुमपुरा पहुँच सका और उसे पसंद भी किया गया।"
}
```

#### Atypical Data Point

The dataset does not contain atypical data points as far as we know.

## Motivations & Intentions

### Motivations

#### Purpose(s)

- Research

#### Domain(s) of Application

`Multilingual Dense Retrieval`, `Synthetic Dataset`

## Provenance

### Collection

#### Method(s) Used

- Artificially Generated
- Taken from other existing datasets

#### Methodology Detail(s)

**Collection Type**

**Source:** The TyDi-QA dataset provided the English Wikipedia corpus for the cross-lingual SWIM-IR dataset. MIRACL provided the language-specific Wikipedia corpora for the monolingual SWIM-IR datasets.

**Is this source considered sensitive or high-risk?** [Yes/**No**]

**Dates of Collection:** TyDi-QA [unknown - 01/02/2019], MIRACL [unknown - 01/02/2023], XTREME-UP [unknown - 01/02/2023]

**Primary modality of collection data:**

- Text Data

**Update Frequency for collected data:**

- Static

#### Source Description(s)

- **TyDi-QA:** TyDi-QA [(Clark et al. 2020)](https://aclanthology.org/2020.tacl-1.30/) provided the English Wikipedia passages, which have been split into 100-word long paragraphs. It contains around 18.2M passages from the complete English Wikipedia. We selected passages at random, with a maximum of 1M pairs for each language pair (for 17 languages), for the preparation of our cross-lingual SWIM-IR dataset.
- **MIRACL:** MIRACL [(Zhang et al. 2023)](https://direct.mit.edu/tacl/article/doi/10.1162/tacl_a_00595/117438/MIRACL-A-Multilingual-Retrieval-Dataset-Covering) provides language-specific paragraphs from the Wikipedia corpus. The paragraphs were generated by splitting on the “\n\n” delimiter. The MIRACL dataset provides corpora for 18 languages. We selected passages at random, with a maximum of 1M pairs for each language, for the preparation of our monolingual SWIM-IR dataset.
- **XTREME-UP:** XTREME-UP [(Ruder et al. 2023)](https://aclanthology.org/2023.findings-emnlp.125/) provides a 120K sample of the TyDi-QA (Clark et al. 2020) English Wikipedia passages, which have been split into 100-word long paragraphs. This sample has been used in the original dataset for cross-language question answering.

#### Collection Cadence

**Static:** Data was collected once from single or multiple sources.

#### Data Integration

**TyDi-QA (XOR-Retrieve and XTREME-UP)**

**Included Fields**

The English Wikipedia title, text, and `_id` fields were taken from the TyDi-QA dataset, originally provided as a TSV file containing all fields.

**Excluded Fields**

The rest of the metadata apart from the fields mentioned above were excluded from our SWIM-IR dataset. We do not use any training data provided from the TyDi-QA dataset.

**MIRACL**

**Included Fields**

The language-specific Wikipedia title, text, and `_id` fields were taken from the MIRACL dataset, originally provided as a JSON-lines file containing all fields.

**Excluded Fields**

The rest of the metadata apart from the fields mentioned above were excluded from our SWIM-IR dataset. We do not use any training data provided from the MIRACL dataset.

#### Data Processing

All data comes directly from the TyDi-QA and MIRACL datasets without any preprocessing.

### Collection Criteria

#### Data Selection

For the cross-lingual SWIM-IR dataset, we use a stratified sampling technique to select a subset of passages from the English Wikipedia corpus, which we then use to generate questions for SWIM-IR. We ensure that all languages have a roughly equal number of training samples, wherever possible. Our Wikipedia corpus contains entities that are sorted alphabetically (A-Z). We first compute the inclusion threshold $I_{th}$, defined as $I_{th} = D_{sample} / D_{total}$, where $D_{sample}$ is the number of passages to sample and $D_{total}$ is the total number of passages in the corpus. Next, for each passage $p_i$ in the corpus, we randomly generate an inclusion probability $\hat{p_i} \in [0,1]$. We select the passage $p_i$ if $\hat{p_i} \leq I_{th}$. This ensures uniform sampling of passages with Wikipedia entities across all letters (A-Z).

For the monolingual SWIM-IR dataset, the language selection criteria depended on the availability of Wikipedia corpora for the monolingual task. Hence, we chose to fix on the 18 languages provided in MIRACL. To complete the dataset, we included the same languages for the cross-lingual task.

#### Data Inclusion

We include all data available in the TyDi-QA English Wikipedia corpus (maximum of 1M training pairs per language pair), which we use to generate our cross-lingual SWIM-IR dataset. We use the language-specific MIRACL Wikipedia corpora to generate our monolingual queries in SWIM-IR.

#### Data Exclusion

We removed data classified as containing sensitive subjects and adult content using the method described in our paper. No additional filters were applied for data exclusion from MIRACL or TyDi-QA. The TyDi-QA English paragraph data has been split into paragraphs of at most 100 tokens, whereas MIRACL used the “\n\n” delimiter to segment paragraphs from the Wikipedia articles.
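
The inclusion-threshold sampling described under Data Selection can be sketched roughly as follows. This is a minimal illustration and not the authors' code: the function name `sample_passages`, its parameters, and the use of Python's `random` module are assumptions made for the example.

```python
import random
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")


def sample_passages(
    passages: Sequence[T], d_sample: int, seed: Optional[int] = None
) -> List[T]:
    """Keep each passage independently with probability I_th = D_sample / D_total.

    Hypothetical helper: the name and signature are illustrative only.
    """
    if d_sample < 0:
        raise ValueError("d_sample must be non-negative")
    d_total = len(passages)
    if d_total == 0:
        return []
    i_th = d_sample / d_total      # inclusion threshold I_th
    rng = random.Random(seed)      # seeded generator for reproducibility
    # Draw one inclusion probability per passage; keep it if the draw <= I_th.
    return [p for p in passages if rng.random() <= i_th]
```

Because every passage is kept or dropped independently, the corpus order (A-Z) is preserved and the number of selected passages equals `d_sample` only in expectation, not exactly.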

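Provider:

nthakur

Summary of original information

Dataset overview

Dataset description

Dataset name

Indic SWIM-IR (Cross-lingual)

Dataset version

v1.0

Dataset size

Approx. 6-7 GB

Number of instances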

28,265,848

Number of fields

6

Languages

Includes the following 16 languages:

as (Assamese)
bho (Bhojpuri)
gom (Konkani)
gu (Gujarati)
hi (Hindi)
kn (Kannada)
mai (Maithili)
ml (Malayalam)
mni (Manipuri)
mr (Marathi)
or (Odia)
pa (Punjabi)
ps (Pashto)
sa (Sanskrit)
ta (Tamil)
ur (Urdu)

Dataset type

Multilingual retrieval training dataset

Dataset structure

Config names and features

as:
- Features: _id (string), lang (string), code (string), query (string), title (string), text (string)
- Split: train (num_bytes: 3233005, num_examples: 5899)
- Download size: 1803172
- Dataset size: 3233005
bho:
- Features: _id (string), lang (string), code (string), query (string), title (string), text (string)
- Split: train (num_bytes: 3132621, num_examples: 5763)
- Download size: 1745932
- Dataset size: 3132621
gom:
- Features: _id (string), lang (string), code (string), query (string), title (string), text (string)
- Split: train (num_bytes: 3241395, num_examples: 5755)
- Download size: 1772947
- Dataset size: 3241395
gu:
- Features: _id (string), lang (string), code (string), query (string), title (string), text (string)
- Split: train (num_bytes: 3171432, num_examples: 5870)
- Download size: 1786644
- Dataset size: 3171432
hi:
- Features: _id (string), lang (string), code (string), query (string), title (string), text (string)
- Split: train (num_bytes: 3140921, num_examples: 5752)
- Download size: 1761474
- Dataset size: 3140921
kn:
- Features: _id (string), lang (string), code (string), query (string), title (string), text (string)
- Split: train (num_bytes: 3222300, num_examples: 5763)
- Download size: 1781977
- Dataset size: 3222300
mai:
- Features: _id (string), lang (string), code (string), query (string), title (string), text (string)
- Split: train (num_bytes: 3106563, num_examples: 5768)
- Download size: 1732399
- Dataset size: 3106563
ml:
- Features: _id (string), lang (string), code (string), query (string), title (string), text (string)
- Split: train (num_bytes: 3386716, num_examples: 5907)
- Download size: 1853611
- Dataset size: 3386716
mni:
- Features: _id (string), lang (string), code (string), query (string), title (string), text (string)
- Split: train (num_bytes: 2699051, num_examples: 5604)
- Download size: 1430986
- Dataset size: 2699051
mr:
- Features: _id (string), lang (string), code (string), query (string), title (string), text (string)
- Split: train (num_bytes: 3301413, num_examples: 5977)
- Download size: 1839741
- Dataset size: 3301413
or:
- Features: _id (string), lang (string), code (string), query (string), title (string), text (string)
- Split: train (num_bytes: 3124722, num_examples: 5837)
- Download size: 1753854
- Dataset size: 3124722
pa:
- Features: _id (string), lang (string), code (string), query (string), title (string), text (string)
- Split: train (num_bytes: 3174739, num_examples: 5840)
- Download size: 1792406
- Dataset size: 3174739
ps:
- Features: _id (string), lang (string), code (string), query (string), title (string), text (string)
- Split: train (num_bytes: 2813503, num_examples: 5694)
- Download size: 1669583
- Dataset size: 2813503
sa:
- Features: _id (string), lang (string), code (string), query (string), title (string), text (string)
- Split: train (num_bytes: 3110486, num_examples: 5779)
- Download size: 1722194
- Dataset size: 3110486
ta:
- Features: _id (string), lang (string), code (string), query (string), title (string), text (string)
- Split: train (num_bytes: 3334815, num_examples: 5930)
- Download size: 1819387
- Dataset size: 3334815
ur:
- Features: _id (string), lang (string), code (string), query (string), title (string), text (string)
- Split: train (num_bytes: 2854099, num_examples: 5816)
- Download size: 1715776
- Dataset size: 2854099
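
As a quick sanity check on the per-config figures above, the train split sizes can be totalled with a few lines of Python. The dictionary name `NUM_EXAMPLES` is introduced here only for illustration; the values are copied from the listing.

```python
# Per-config train split sizes (num_examples), copied from the listing above.
NUM_EXAMPLES = {
    "as": 5899, "bho": 5763, "gom": 5755, "gu": 5870,
    "hi": 5752, "kn": 5763, "mai": 5768, "ml": 5907,
    "mni": 5604, "mr": 5977, "or": 5837, "pa": 5840,
    "ps": 5694, "sa": 5779, "ta": 5930, "ur": 5816,
}

total = sum(NUM_EXAMPLES.values())
smallest = min(NUM_EXAMPLES, key=NUM_EXAMPLES.get)
largest = max(NUM_EXAMPLES, key=NUM_EXAMPLES.get)

print(len(NUM_EXAMPLES), total)  # 16 92954
print(smallest, largest)         # mni mr
```

The 16 configs hold 92,954 training pairs in total, so the 28,265,848 instances reported elsewhere on this page describe the full SWIM-IR dataset rather than this subset.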

Data file paths

as: train, path: as/train-*
bho: train, path: bho/train-*
gom: train, path: gom/train-*
gu: train, path: gu/train-*
hi: train, path: hi/train-*
kn: train, path: kn/train-*
mai: train, path: mai/train-*
ml: train, path: ml/train-*
mni: train, path: mni/train-*
mr: train, path: mr/train-*
or: train, path: or/train-*
pa: train, path: pa/train-*
ps: train, path: ps/train-*
sa: train, path: sa/train-*
ta: train, path: ta/train-*
ur: train, path: ur/train-*
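
Each config name above corresponds to one language subset with a single `train` split. A minimal loading sketch, assuming the Hugging Face `datasets` library is installed and the repository is reachable, might look like this; the helper `load_config` and the constants are hypothetical names introduced for the example.

```python
REPO_ID = "nthakur/indic-swim-ir-cross-lingual"

# Config names as listed above (one per language).
CONFIGS = [
    "as", "bho", "gom", "gu", "hi", "kn", "mai", "ml",
    "mni", "mr", "or", "pa", "ps", "sa", "ta", "ur",
]


def load_config(code: str):
    """Load the train split of one language config (hypothetical helper)."""
    if code not in CONFIGS:
        raise ValueError(f"unknown config {code!r}; expected one of {CONFIGS}")
    # Imported lazily so the module can be inspected without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(REPO_ID, code, split="train")


# Usage (requires network access):
#   hindi = load_config("hi")
#   print(hindi[0]["query"])
```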

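Dataset contents

Data point description

Each data point contains the following fields:

_id: Unique ID denoting the training pair
lang: Language of the generated question
code: ISO code of the language
query: Query generated using PaLM 2
title: Title of the Wikipedia article
text: Paragraph of the Wikipedia article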
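
To make the record layout concrete, the six fields can be typed and turned into a (query, passage) training pair as sketched below. The type `SwimIrRecord`, the helper `to_training_pair`, and the choice to join title and text with a newline are assumptions for illustration; the dataset does not prescribe a passage format.

```python
from typing import Tuple, TypedDict


class SwimIrRecord(TypedDict):
    """One data point with the six string fields listed above."""
    _id: str
    lang: str
    code: str
    query: str
    title: str
    text: str


def to_training_pair(record: SwimIrRecord) -> Tuple[str, str]:
    """Return (query, passage); joining title and text is an assumed convention."""
    passage = f"{record['title']}\n{record['text']}".strip()
    return record["query"].strip(), passage


# Hypothetical placeholder record, not taken from the dataset.
example: SwimIrRecord = {
    "_id": "example-0",
    "lang": "Hindi",
    "code": "hi",
    "query": "placeholder query",
    "title": "Placeholder title",
    "text": "Placeholder paragraph.",
}
query, passage = to_training_pair(example)
```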

Data point examples

English -> Japanese:
{
  "_id": "1234",
  "lang": "Japanese",
  "code": "ja",
  "query": "The Roki Tunnel は、北オセチア自治共和国と南オセチア共和国の間を通る唯一の道路ですか?",
  "title": "The Roki Tunnel",
  "text": "The Roki Tunnel (also called Roksky Tunnel, ; Ossetic: Ручъы тъунел; ) is a mountain tunnel of the Transkam road through the Greater Caucasus Mountains, north of the village Upper Roka. It is the only road joining North Ossetia–Alania in the Russian Federation into South Ossetia, a breakaway republic of Georgia. The road is manned at the town of Nizhny Zaramag in North Ossetia and is sometimes referred to as the Roki-Nizhny Zaramag border crossing. The tunnel, completed by the Soviet government in 1984, is one of only a handful of routes that cross the North Caucasus Range."
}
Hindi:
{
  "_id": "hindi_8987#4",
  "lang": "Hindi",
  "code": "hi",
  "query": "आर्यभट केरल के किस स्थान के निवासी थे?",
  "title": "आर्यभट",
  "text": "एक ताजा अध्ययन के अनुसार आर्यभट, केरल के चाम्रवत्तम (१०उत्तर५१, ७५पूर्व४५) के निवासी थे। अध्ययन के अनुसार अस्मका एक जैन प्रदेश था जो कि श्रवणबेलगोल के चारों तरफ फैला हुआ था और यहाँ के पत्थर के खम्बों के कारण इसका नाम अस्मका पड़ा। चाम्रवत्तम इस जैन बस्ती का हिस्सा था, इसका प्रमाण है भारतापुझा नदी जिसका नाम जैनों के पौराणिक राजा भारता के नाम पर रखा गया है। आर्यभट ने भी युगों को परिभाषित करते वक्त राजा भारता का जिक्र किया है- दसगीतिका के पांचवें छंद में राजा भारत के समय तक बीत चुके काल का वर्णन आता है। उन दिनों में कुसुमपुरा में एक प्रसिद्ध विश्वविद्यालय था जहाँ जैनों का निर्णायक प्रभाव था और आर्यभट का काम इस प्रकार कुसुमपुरा पहुँच सका और उसे पसंद भी किया गया।"
}

Dataset usage

Purpose

Research

Application domains

Multilingual dense retrieval
Synthetic dataset
