nanakonoda/xnli_cm
Source: Hugging Face. Updated 2023-04-18; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/nanakonoda/xnli_cm
Resource description:
---
annotations_creators:
- expert-generated
language:
- en
- de
- fr
language_creators:
- found
license: []
multilinguality:
- multilingual
pretty_name: XNLI Code-Mixed Corpus
size_categories:
- 1M<n<10M
source_datasets:
- extended|xnli
tags:
- mode classification
- aligned
- code-mixed
task_categories:
- text-classification
task_ids: []
dataset_info:
- config_name: de_ec
  features:
  - name: text
    dtype: string
  - name: label
    dtype: int64
    # class_label:
    #   names:
    #     '0': spoken
    #     '1': written
  splits:
  - name: train
    num_bytes: 576
    num_examples: 2490
  - name: test
    num_bytes: 194139776
    num_examples: 1610549
- config_name: de_ml
  features:
  - name: text
    dtype: string
  - name: label
    dtype: int64
    # class_label:
    #   names:
    #     '0': spoken
    #     '1': written
  splits:
  - name: train
    num_bytes: 576
    num_examples: 2490
  - name: test
    num_bytes: 87040
    num_examples: 332326
- config_name: fr_ec
  features:
  - name: text
    dtype: string
  - name: label
    dtype: int64
    # class_label:
    #   names:
    #     '0': spoken
    #     '1': written
  splits:
  - name: train
    num_bytes: 576
    num_examples: 2490
  - name: test
    num_bytes: 564416
    num_examples: 2562631
- config_name: fr_ml
  features:
  - name: text
    dtype: string
  - name: label
    dtype: int64
    # class_label:
    #   names:
    #     '0': spoken
    #     '1': written
  splits:
  - name: train
    num_bytes: 576
    num_examples: 2490
  - name: test
    num_bytes: 361472
    num_examples: 1259159
  download_size: 1376728
  dataset_size: 1376704
---
# Dataset Card for XNLI Code-Mixed Corpus
## Dataset Description
- **Homepage:**
- **Repository:**
- **Paper:**
- **Leaderboard:**
- **Point of Contact:**
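### Dataset Summary
The XNLI Code-Mixed Corpus is a multilingual dataset for binary mode classification (spoken vs. written). It contains monolingual English, German, and French text for training and German-English and French-English code-mixed text for testing.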
### Supported Tasks and Leaderboards
Binary mode classification (spoken vs. written)
### Languages
- English
- German
- French
- German-English code-mixed by Equivalence Constraint Theory
- German-English code-mixed by Matrix Language Theory
- French-English code-mixed by Equivalence Constraint Theory
- French-English code-mixed by Matrix Language Theory
## Dataset Structure
### Data Instances
{
'text': "And he said , Mama , I 'm home",
'label': 0
}
### Data Fields
- text: sentence
- label: binary label of the text (0: spoken, 1: written)
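
As an illustration of the two fields above, here is a minimal sketch that maps the integer label back to its mode name. The helper names `ID2LABEL`, `decode_label`, and `load_config` are hypothetical and not part of the dataset. The loading call assumes the Hub repository id `nanakonoda/xnli_cm` and the config names from the card metadata, and it needs the `datasets` package and network access.

```python
from typing import Dict

# Label ids as documented in the Data Fields section of this card.
ID2LABEL: Dict[int, str] = {0: "spoken", 1: "written"}


def decode_label(example: Dict[str, object]) -> str:
    """Return the mode name ("spoken" or "written") for one example."""
    return ID2LABEL[int(example["label"])]


def load_config(config_name: str = "de_ec"):
    """Load one configuration from the Hub (requires network access).

    Assumption: the repository id and config names match the card metadata.
    """
    # Imported lazily so the rest of this sketch runs without `datasets`.
    from datasets import load_dataset

    return load_dataset("nanakonoda/xnli_cm", config_name)


# The instance shown in the Data Instances section.
sample = {"text": "And he said , Mama , I 'm home", "label": 0}
print(decode_label(sample))  # spoken
```

Calling `load_config("fr_ml")` would, under the assumptions above, return a `DatasetDict` with `train` and `test` splits.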
### Data Splits
- de-ec:
  - train (English, German, French monolingual):
  - test (German-English code-mixed by Equivalence Constraint Theory):
- de-ml:
  - train (English, German, French monolingual):
  - test (German-English code-mixed by Matrix Language Theory):
- fr-ec:
  - train (English, German, French monolingual):
  - test (French-English code-mixed by Equivalence Constraint Theory):
- fr-ml:
  - train (English, German, French monolingual):
  - test (French-English code-mixed by Matrix Language Theory):
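
The four configurations listed above differ only in their test split. The sketch below records that layout as a small lookup table. The names `ConfigInfo`, `CONFIGS`, and `describe` are hypothetical, and the underscore spelling of the config ids (`de_ec` rather than `de-ec`) is taken from the card metadata on the assumption that this is what a loader expects.

```python
from typing import Dict, NamedTuple


class ConfigInfo(NamedTuple):
    """What the test split of one configuration contains."""
    language_pair: str
    theory: str


# Config ids use the underscore spelling from the card metadata (assumption).
CONFIGS: Dict[str, ConfigInfo] = {
    "de_ec": ConfigInfo("German-English", "Equivalence Constraint Theory"),
    "de_ml": ConfigInfo("German-English", "Matrix Language Theory"),
    "fr_ec": ConfigInfo("French-English", "Equivalence Constraint Theory"),
    "fr_ml": ConfigInfo("French-English", "Matrix Language Theory"),
}


def describe(config_name: str) -> str:
    """Summarise the train/test composition of one configuration."""
    info = CONFIGS[config_name]
    return (
        f"{config_name}: train on monolingual English, German, French; "
        f"test on {info.language_pair} code-mixed text ({info.theory})"
    )


for name in CONFIGS:
    print(describe(name))
```

A table like this makes it easy to loop over all four configurations when evaluating a classifier trained on the shared monolingual data.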
### Other Statistics
#### Average Sentence Length
- German
  - train:
  - test:
- French
  - train:
  - test:
#### Label Split
- train:
  - 0:
  - 1:
- test:
  - 0:
  - 1:
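
The card leaves these statistics blank. One way they could be computed for any split is sketched below. It assumes sentence length is counted in whitespace-separated tokens, which the card does not specify. The function names and the two toy rows are made up for illustration and are not dataset values.

```python
from collections import Counter
from typing import Dict, Iterable


def average_sentence_length(texts: Iterable[str]) -> float:
    """Mean number of whitespace-separated tokens per sentence."""
    lengths = [len(text.split()) for text in texts]
    return sum(lengths) / len(lengths) if lengths else 0.0


def label_split(labels: Iterable[int]) -> Dict[int, int]:
    """Count examples per label (0: spoken, 1: written)."""
    counts = Counter(labels)
    return {0: counts.get(0, 0), 1: counts.get(1, 0)}


# Toy rows for illustration only; these are not real dataset statistics.
toy_rows = [
    {"text": "And he said Mama I 'm home", "label": 0},
    {"text": "The committee approved the report", "label": 1},
]

print(average_sentence_length(row["text"] for row in toy_rows))
print(label_split(row["label"] for row in toy_rows))
```

The same two functions can be applied to the `text` and `label` columns of a loaded split.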
## Dataset Creation
### Curation Rationale
We generated this code-mixed corpus from the XNLI Parallel Corpus using the CodeMixed Text Generator.
The XNLI Parallel Corpus is available here:
https://huggingface.co/datasets/nanakonoda/xnli_parallel
It was created from the XNLI corpus.
More information is available in the dataset card for the XNLI Parallel Corpus.
The CodeMixed Text Generator repository and the citation for its paper are given below.
https://github.com/microsoft/CodeMixed-Text-Generator
```
@inproceedings{rizvi-etal-2021-gcm,
title = "{GCM}: A Toolkit for Generating Synthetic Code-mixed Text",
author = "Rizvi, Mohd Sanad Zaki and
Srinivasan, Anirudh and
Ganu, Tanuja and
Choudhury, Monojit and
Sitaram, Sunayana",
booktitle = "Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations",
month = apr,
year = "2021",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2021.eacl-demos.24",
pages = "205--211",
abstract = "Code-mixing is common in multilingual communities around the world, and processing it is challenging due to the lack of labeled and unlabeled data. We describe a tool that can automatically generate code-mixed data given parallel data in two languages. We implement two linguistic theories of code-mixing, the Equivalence Constraint theory and the Matrix Language theory to generate all possible code-mixed sentences in the language-pair, followed by sampling of the generated data to generate natural code-mixed sentences. The toolkit provides three modes: a batch mode, an interactive library mode and a web-interface to address the needs of researchers, linguists and language experts. The toolkit can be used to generate unlabeled text data for pre-trained models, as well as visualize linguistic theories of code-mixing. We plan to release the toolkit as open source and extend it by adding more implementations of linguistic theories, visualization techniques and better sampling techniques. We expect that the release of this toolkit will help facilitate more research in code-mixing in diverse language pairs.",
}
```
### Source Data
XNLI Parallel Corpus
https://huggingface.co/datasets/nanakonoda/xnli_parallel
#### Original Source Data
XNLI Parallel Corpus was created using the XNLI Corpus.
https://github.com/facebookresearch/XNLI
Here is the citation for the original XNLI paper.
```
@InProceedings{conneau2018xnli,
author = "Conneau, Alexis
and Rinott, Ruty
and Lample, Guillaume
and Williams, Adina
and Bowman, Samuel R.
and Schwenk, Holger
and Stoyanov, Veselin",
title = "XNLI: Evaluating Cross-lingual Sentence Representations",
booktitle = "Proceedings of the 2018 Conference on Empirical Methods
in Natural Language Processing",
year = "2018",
publisher = "Association for Computational Linguistics",
location = "Brussels, Belgium",
}
```
#### Initial Data Collection and Normalization
We removed all punctuation from the XNLI Parallel Corpus except apostrophes.
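
The exact normalization script is not given in this card. The following is a minimal sketch of a step matching the description, assuming that "punctuation" means every character that is neither a word character, whitespace, nor the ASCII apostrophe. The function name `strip_punctuation` is hypothetical.

```python
import re

# Matches anything that is not a word character, whitespace, or an ASCII apostrophe.
# Note: "\w" also matches underscores and digits, and a typographic apostrophe
# (U+2019) would be removed; both are simplifications of this sketch.
_PUNCT_EXCEPT_APOSTROPHE = re.compile(r"[^\w\s']")


def strip_punctuation(text: str) -> str:
    """Remove punctuation except apostrophes, then collapse extra spaces."""
    no_punct = _PUNCT_EXCEPT_APOSTROPHE.sub("", text)
    return " ".join(no_punct.split())


print(strip_punctuation("Hello, world! It's fine."))  # Hello world It's fine
```

Because `\w` is Unicode-aware for Python `str` patterns, accented letters in German and French text are preserved.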
#### Who are the source language producers?
N/A
### Annotations
#### Annotation process
N/A
#### Who are the annotators?
N/A
### Personal and Sensitive Information
N/A
## Considerations for Using the Data
### Social Impact of Dataset
N/A
### Discussion of Biases
N/A
### Other Known Limitations
N/A
## Additional Information
### Dataset Curators
N/A
### Licensing Information
N/A
### Citation Information
### Contributions
N/A
Provided by:
nanakonoda