nanakonoda/xnli_cm

Name: nanakonoda/xnli_cm
Creator: nanakonoda
Published: 2023-04-18 13:58:12
License: No description available

Hugging Face · Updated 2023-04-18 · Indexed 2024-03-04

Download link:

https://hf-mirror.com/datasets/nanakonoda/xnli_cm

Resource description:

---
annotations_creators:
- expert-generated
language:
- en
- de
- fr
language_creators:
- found
license: []
multilinguality:
- multilingual
pretty_name: XNLI Code-Mixed Corpus
size_categories:
- 1M<n<10M
source_datasets:
- extended|xnli
tags:
- mode classification
- aligned
- code-mixed
task_categories:
- text-classification
task_ids: []
dataset_info:
- config_name: de_ec
  features:
  - name: text
    dtype: string
  - name: label
    dtype: int64
    # class_label:
    #   names:
    #     '0': spoken
    #     '1': written
  splits:
  - name: train
    num_bytes: 576
    num_examples: 2490
  - name: test
    num_bytes: 194139776
    num_examples: 1610549
- config_name: de_ml
  features:
  - name: text
    dtype: string
  - name: label
    dtype: int64
    # class_label:
    #   names:
    #     '0': spoken
    #     '1': written
  splits:
  - name: train
    num_bytes: 576
    num_examples: 2490
  - name: test
    num_bytes: 87040
    num_examples: 332326
- config_name: fr_ec
  features:
  - name: text
    dtype: string
  - name: label
    dtype: int64
    # class_label:
    #   names:
    #     '0': spoken
    #     '1': written
  splits:
  - name: train
    num_bytes: 576
    num_examples: 2490
  - name: test
    num_bytes: 564416
    num_examples: 2562631
- config_name: fr_ml
  features:
  - name: text
    dtype: string
  - name: label
    dtype: int64
    # class_label:
    #   names:
    #     '0': spoken
    #     '1': written
  splits:
  - name: train
    num_bytes: 576
    num_examples: 2490
  - name: test
    num_bytes: 361472
    num_examples: 1259159
  download_size: 1376728
  dataset_size: 1376704
---

# Dataset Card for XNLI Code-Mixed Corpus

## Dataset Description

- **Homepage:**
- **Repository:**
- **Paper:**
- **Leaderboard:**
- **Point of Contact:**

### Dataset Summary

### Supported Tasks and Leaderboards

Binary mode classification (spoken vs. written)

### Languages

- English
- German
- French
- German-English code-mixed by Equivalence Constraint Theory
- German-English code-mixed by Matrix Language Theory
- French-English code-mixed by Equivalence Constraint Theory
- French-English code-mixed by Matrix Language Theory

## Dataset Structure

### Data Instances

{
  'text': "And he said , Mama , I 'm home",
  'label': 0
}

### Data Fields

- text: sentence
- label: binary label of text (0: spoken, 1: written)

### Data Splits

- de-ec:
  - train (English, German, French monolingual):
  - test (German-English code-mixed by Equivalence Constraint Theory):
- de-ml:
  - train (English, German, French monolingual):
  - test (German-English code-mixed by Matrix Language Theory):
- fr-ec:
  - train (English, German, French monolingual):
  - test (French-English code-mixed by Equivalence Constraint Theory):
- fr-ml:
  - train (English, German, French monolingual):
  - test (French-English code-mixed by Matrix Language Theory):

### Other Statistics

#### Average Sentence Length

- German
  - train:
  - test:
- French
  - train:
  - test:

#### Label Split

- train:
  - 0:
  - 1:
- test:
  - 0:
  - 1:

## Dataset Creation

### Curation Rationale

We generated this code-mixed corpus from the XNLI Parallel Corpus using the CodeMixed Text Generator.

The XNLI Parallel Corpus is available here:
https://huggingface.co/datasets/nanakonoda/xnli_parallel

It was created from the XNLI corpus. More information is available in the datacard for the XNLI Parallel Corpus.

Here is the link and citation for the original CodeMixed Text Generator paper.
https://github.com/microsoft/CodeMixed-Text-Generator

```
@inproceedings{rizvi-etal-2021-gcm,
    title = "{GCM}: A Toolkit for Generating Synthetic Code-mixed Text",
    author = "Rizvi, Mohd Sanad Zaki and Srinivasan, Anirudh and Ganu, Tanuja and Choudhury, Monojit and Sitaram, Sunayana",
    booktitle = "Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations",
    month = apr,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.eacl-demos.24",
    pages = "205--211",
    abstract = "Code-mixing is common in multilingual communities around the world, and processing it is challenging due to the lack of labeled and unlabeled data. We describe a tool that can automatically generate code-mixed data given parallel data in two languages. We implement two linguistic theories of code-mixing, the Equivalence Constraint theory and the Matrix Language theory to generate all possible code-mixed sentences in the language-pair, followed by sampling of the generated data to generate natural code-mixed sentences. The toolkit provides three modes: a batch mode, an interactive library mode and a web-interface to address the needs of researchers, linguists and language experts. The toolkit can be used to generate unlabeled text data for pre-trained models, as well as visualize linguistic theories of code-mixing. We plan to release the toolkit as open source and extend it by adding more implementations of linguistic theories, visualization techniques and better sampling techniques. We expect that the release of this toolkit will help facilitate more research in code-mixing in diverse language pairs.",
}
```

### Source Data

XNLI Parallel Corpus
https://huggingface.co/datasets/nanakonoda/xnli_parallel

#### Original Source Data

XNLI Parallel Corpus was created using the XNLI Corpus.
https://github.com/facebookresearch/XNLI

Here is the citation for the original XNLI paper.

```
@InProceedings{conneau2018xnli,
  author = "Conneau, Alexis and Rinott, Ruty and Lample, Guillaume and Williams, Adina and Bowman, Samuel R. and Schwenk, Holger and Stoyanov, Veselin",
  title = "XNLI: Evaluating Cross-lingual Sentence Representations",
  booktitle = "Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing",
  year = "2018",
  publisher = "Association for Computational Linguistics",
  location = "Brussels, Belgium",
}
```

#### Initial Data Collection and Normalization

We removed all punctuation from the XNLI Parallel Corpus except apostrophes.

#### Who are the source language producers?

N/A

### Annotations

#### Annotation process

N/A

#### Who are the annotators?

N/A

### Personal and Sensitive Information

N/A

## Considerations for Using the Data

### Social Impact of Dataset

N/A

### Discussion of Biases

N/A

### Other Known Limitations

N/A

## Additional Information

### Dataset Curators

N/A

### Licensing Information

N/A

### Citation Information

### Contributions

N/A

Provider:

nanakonoda

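Summary of original information

XNLI Code-Mixed Corpus dataset overview

Dataset description

Dataset summary

The XNLI Code-Mixed Corpus is a multilingual dataset with English, German, and French monolingual training text and German-English and French-English code-mixed test text. It is intended for binary mode classification (spoken vs. written).

Supported tasks and leaderboards

Binary mode classification (spoken vs. written)

Languages

English
German
French
German-English code-mixed (by Equivalence Constraint Theory)
German-English code-mixed (by Matrix Language Theory)
French-English code-mixed (by Equivalence Constraint Theory)
French-English code-mixed (by Matrix Language Theory)

Dataset structure

Data instances

{ "text": "And he said , Mama , I 'm home", "label": 0 }

Data fields

text: sentence
label: binary label of the text (0: spoken, 1: written)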
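
As a minimal sketch of how the two fields fit together, the snippet below decodes the integer label of one record into its class name. The mapping comes from the field description above; the helper name `describe` is hypothetical and not part of the dataset.

```python
# Label names as stated in the dataset card (0: spoken, 1: written).
LABEL_NAMES = {0: "spoken", 1: "written"}


def describe(example: dict) -> str:
    """Return a readable summary of one {'text', 'label'} record.

    `describe` is an illustrative helper, not something shipped with the dataset.
    """
    return f"{LABEL_NAMES[example['label']]}: {example['text']}"


# The sample instance quoted in the dataset card.
sample = {"text": "And he said , Mama , I 'm home", "label": 0}
print(describe(sample))
```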

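Data splits

de-ec
- train (English, German, French monolingual)
- test (German-English code-mixed by Equivalence Constraint Theory)
de-ml
- train (English, German, French monolingual)
- test (German-English code-mixed by Matrix Language Theory)
fr-ec
- train (English, German, French monolingual)
- test (French-English code-mixed by Equivalence Constraint Theory)
fr-ml
- train (English, German, French monolingual)
- test (French-English code-mixed by Matrix Language Theory)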
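
The four configurations above could be selected roughly as follows. This is a sketch under assumptions: it assumes the Hugging Face `datasets` package is installed, that the hub copy is reachable, and that the configuration names use underscores (`de_ec`, `de_ml`, `fr_ec`, `fr_ml`) as in the card's metadata rather than the hyphenated names in the list. `CONFIGS`, `describe_test_split`, and `load_config` are hypothetical names introduced for illustration.

```python
# Config name -> (language pair, code-mixing theory) for the test split,
# transcribed from the split list in the dataset card.
CONFIGS = {
    "de_ec": ("German-English", "Equivalence Constraint Theory"),
    "de_ml": ("German-English", "Matrix Language Theory"),
    "fr_ec": ("French-English", "Equivalence Constraint Theory"),
    "fr_ml": ("French-English", "Matrix Language Theory"),
}


def describe_test_split(config: str) -> str:
    """Describe what the test split of a configuration contains."""
    pair, theory = CONFIGS[config]
    return f"{pair} code-mixed by {theory}"


def load_config(config: str, split: str = "test"):
    """Load one split of one configuration (needs `datasets` and network access)."""
    if config not in CONFIGS:
        raise ValueError(f"unknown config {config!r}; expected one of {sorted(CONFIGS)}")
    # Imported lazily so the helpers above work without the package installed.
    from datasets import load_dataset

    return load_dataset("nanakonoda/xnli_cm", config, split=split)
```

For example, `load_config("de_ml")` would fetch the German-English Matrix Language test set, while `load_config("de_ml", split="train")` would fetch the monolingual training data.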

Other statistics

Average sentence length
- German
  - train
  - test
- French
  - train
  - test
Label split
- train
  - 0: spoken
  - 1: written
- test
  - 0: spoken
  - 1: written
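
The card leaves these statistics blank. One way they might be computed from loaded records is sketched below; it assumes sentence length is counted in whitespace-separated tokens, which the card does not specify, and the function names are hypothetical.

```python
from collections import Counter
from typing import Iterable


def average_sentence_length(texts: Iterable[str]) -> float:
    """Mean number of whitespace-separated tokens per sentence (assumed definition)."""
    texts = list(texts)
    if not texts:
        return 0.0
    return sum(len(t.split()) for t in texts) / len(texts)


def label_split(labels: Iterable[int]) -> dict:
    """Count how many records carry each label (0: spoken, 1: written)."""
    return dict(Counter(labels))


# Toy records in the dataset's {'text', 'label'} shape; not real corpus data.
records = [
    {"text": "a b c", "label": 0},
    {"text": "d e", "label": 1},
    {"text": "f", "label": 0},
]
print(average_sentence_length(r["text"] for r in records))
print(label_split(r["label"] for r in records))
```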

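Dataset creation

Curation rationale

The code-mixed corpus was generated from the XNLI Parallel Corpus using the CodeMixed Text Generator.

Source data

XNLI Parallel Corpus
- Original source: XNLI Corpus

Initial data collection and normalization

All punctuation except apostrophes was removed from the XNLI Parallel Corpus.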
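
The normalization step can be approximated as below. This is a sketch, not the curator's actual script: it assumes "punctuation" means any character in a Unicode `P*` category, treats both the straight and the curly apostrophe as apostrophes, and collapses leftover whitespace. The sample instance in the card still shows commas, so the rule actually applied may differ.

```python
import unicodedata

# Characters treated as apostrophes and therefore kept (an assumption).
APOSTROPHES = {"'", "\u2019"}


def strip_punctuation(text: str) -> str:
    """Drop Unicode punctuation except apostrophes, then collapse whitespace."""
    kept = [
        ch
        for ch in text
        if ch in APOSTROPHES or not unicodedata.category(ch).startswith("P")
    ]
    return " ".join("".join(kept).split())


print(strip_punctuation('And he said, "Mama, I\'m home!"'))
```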

Additional information

Dataset curators

N/A

Licensing information

N/A

Citation information

N/A

Contributions

N/A
