nanakonoda/xnli_parallel

Name: nanakonoda/xnli_parallel
Creator: nanakonoda
Published: 2023-04-18 13:23:10
License: No description available

Source: Hugging Face. Updated 2023-04-18; indexed 2024-03-04.

Download link:

https://hf-mirror.com/datasets/nanakonoda/xnli_parallel

Resource description:

---
annotations_creators:
- expert-generated
language:
- en
- de
- fr
language_creators:
- found
license: []
multilinguality:
- multilingual
pretty_name: XNLI Parallel Corpus
size_categories:
- 100K<n<1M
source_datasets:
- extended|xnli
tags:
- mode classification
- aligned
task_categories:
- text-classification
task_ids: []
dataset_info:
- config_name: en
  features:
  - name: text
    dtype: string
  - name: label
    dtype:
      class_label:
        names:
          '0': spoken
          '1': written
  splits:
  - name: train
    num_bytes: 92288
    num_examples: 830
  - name: test
    num_bytes: 186853
    num_examples: 1669
- config_name: de
  features:
  - name: text
    dtype: string
  - name: label
    dtype:
      class_label:
        names:
          '0': spoken
          '1': written
  splits:
  - name: train
    num_bytes: 105681
    num_examples: 830
  - name: test
    num_bytes: 214008
    num_examples: 1669
- config_name: fr
  features:
  - name: text
    dtype: string
  - name: label
    dtype:
      class_label:
        names:
          '0': spoken
          '1': written
  splits:
  - name: train
    num_bytes: 109164
    num_examples: 830
  - name: test
    num_bytes: 221286
    num_examples: 1669
  download_size: 1864
  dataset_size: 1840
---

# Dataset Card for XNLI Parallel Corpus

## Dataset Description

- **Homepage:**
- **Repository:**
- **Paper:**
- **Leaderboard:**
- **Point of Contact:**

### Dataset Summary

### Supported Tasks and Leaderboards

Binary mode classification (spoken vs. written)

### Languages

- English
- German
- French

## Dataset Structure

### Data Instances

    {
      'text': "And he said , Mama , I 'm home .",
      'label': 0
    }

### Data Fields

- text: sentence
- label: binary label of text (0: spoken, 1: written)

### Data Splits

- train: 830
- test: 1669

### Other Statistics

#### Vocabulary Size

- English
  - train: 4363
  - test: 7128
- German
  - train: 5070
  - test: 8601
- French
  - train: 4881
  - test: 7935

#### Average Sentence Length

- English
  - train: 20.689156626506023
  - test: 20.75254643499101
- German
  - train: 20.367469879518072
  - test: 20.639904134212102
- French
  - train: 23.455421686746988
  - test: 23.731575793888556

#### Label Split

- train:
  - 0: 166
  - 1: 664
- test:
  - 0: 334
  - 1: 1335

#### Out-of-vocabulary words in model

- English
  - BERT (bert-base-uncased)
    - train: 800
    - test: 1638
  - mBERT (bert-base-multilingual-uncased)
    - train: 1347
    - test: 2693
  - German BERT (bert-base-german-dbmdz-uncased)
    - train: 3228
    - test: 5581
  - flauBERT (flaubert-base-uncased)
    - train: 4363
    - test: 7128
- German
  - BERT (bert-base-uncased)
    - train: 4285
    - test: 7387
  - mBERT (bert-base-multilingual-uncased)
    - train: 3126
    - test: 5863
  - German BERT (bert-base-german-dbmdz-uncased)
    - train: 2033
    - test: 3938
  - flauBERT (flaubert-base-uncased)
    - train: 5069
    - test: 8600
- French
  - BERT (bert-base-uncased)
    - train: 3784
    - test: 6289
  - mBERT (bert-base-multilingual-uncased)
    - train: 2847
    - test: 5084
  - German BERT (bert-base-german-dbmdz-uncased)
    - train: 4212
    - test: 6964
  - flauBERT (flaubert-base-uncased)
    - train: 4881
    - test: 7935

## Dataset Creation

### Curation Rationale

N/A

### Source Data

https://github.com/facebookresearch/XNLI

Here is the citation for the original XNLI paper.

```
@InProceedings{conneau2018xnli,
  author = "Conneau, Alexis and Rinott, Ruty and Lample, Guillaume and Williams, Adina and Bowman, Samuel R. and Schwenk, Holger and Stoyanov, Veselin",
  title = "XNLI: Evaluating Cross-lingual Sentence Representations",
  booktitle = "Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing",
  year = "2018",
  publisher = "Association for Computational Linguistics",
  location = "Brussels, Belgium",
}
```

#### Initial Data Collection and Normalization

N/A

#### Who are the source language producers?

N/A

### Annotations

#### Annotation process

N/A

#### Who are the annotators?

N/A

### Personal and Sensitive Information

N/A

## Considerations for Using the Data

### Social Impact of Dataset

N/A

### Discussion of Biases

N/A

### Other Known Limitations

N/A

## Additional Information

### Dataset Curators

N/A

### Licensing Information

N/A

### Citation Information

### Contributions

N/A
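
As a rough illustration of how the configs and splits described in the card might be accessed, the sketch below wraps `load_dataset` from the Hugging Face `datasets` library. It assumes that `datasets` is installed and that the Hub repository `nanakonoda/xnli_parallel` resolves with the config names `en`, `de`, and `fr` from the card metadata; neither has been verified here. The helper name `load_mode_split` is hypothetical.

```python
# Config, split, and label names are copied from the card metadata.
CONFIGS = ("en", "de", "fr")
SPLITS = ("train", "test")
LABEL_NAMES = {0: "spoken", 1: "written"}
REPO_ID = "nanakonoda/xnli_parallel"


def load_mode_split(lang: str, split: str):
    """Load one language/split pair (hypothetical helper; needs network access)."""
    if lang not in CONFIGS:
        raise ValueError(f"unknown config {lang!r}; expected one of {CONFIGS}")
    if split not in SPLITS:
        raise ValueError(f"unknown split {split!r}; expected one of {SPLITS}")
    # Imported lazily so the constants above are usable without the library.
    from datasets import load_dataset

    # Assumption: the repository exposes each language as a named config.
    return load_dataset(REPO_ID, lang, split=split)
```

If the card is accurate, `load_mode_split("de", "test")` would be expected to return 1669 rows with `text` and `label` columns.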

Provided by:

nanakonoda

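Summary of original information

Dataset overview

Dataset name: XNLI Parallel Corpus
Languages: English, German, French
Multilinguality: multilingual
Dataset size: 100K<n<1M
Source dataset: extended from XNLI
Annotation creators: expert-generated
Task category: text classification
Task: binary mode classification (spoken vs. written)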

Dataset structure

Data instances

Text field: sentence
Label field: binary label (0: spoken, 1: written)
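
The two fields can be modeled as a small record type. The sketch below is only an illustration: the class name `Example` and its helpers are hypothetical, and the whitespace-based token count is an assumption suggested by the pre-tokenized sample instance in the dataset card, not a documented property of the dataset.

```python
from dataclasses import dataclass

LABEL_NAMES = ("spoken", "written")  # index = label id, per the card


@dataclass(frozen=True)
class Example:
    text: str   # one sentence
    label: int  # 0 = spoken, 1 = written

    @property
    def mode(self) -> str:
        """Human-readable name of the label."""
        return LABEL_NAMES[self.label]

    def token_count(self) -> int:
        # Assumption: tokens are whitespace-separated, as in the sample instance.
        return len(self.text.split())


# Sample instance copied from the dataset card.
sample = Example(text="And he said , Mama , I 'm home .", label=0)
print(sample.mode, sample.token_count())  # spoken 10
```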

Data splits

Training set:
- English: 830 examples
- German: 830 examples
- French: 830 examples
Test set:
- English: 1669 examples
- German: 1669 examples
- French: 1669 examples

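Other statistics

Vocabulary size:
- English: train 4363, test 7128
- German: train 5070, test 8601
- French: train 4881, test 7935
Average sentence length:
- English: train 20.689, test 20.753
- German: train 20.367, test 20.640
- French: train 23.455, test 23.732
Label split:
- Training set: 166 spoken, 664 written
- Test set: 334 spoken, 1335 written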
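
The label counts above imply a class imbalance of roughly 1 to 4. The sketch below works through that arithmetic, computing the accuracy of always predicting the majority label and a set of inverse-frequency class weights. The weighting formula, total / (n_classes * class_count), is a common heuristic chosen here for illustration; nothing in the source says the dataset author used it, and the function names are hypothetical.

```python
# Label counts copied from the statistics above.
COUNTS = {
    "train": {"spoken": 166, "written": 664},
    "test": {"spoken": 334, "written": 1335},
}


def majority_baseline(counts: dict) -> float:
    """Accuracy of always predicting the most frequent label."""
    return max(counts.values()) / sum(counts.values())


def balanced_weights(counts: dict) -> dict:
    """Inverse-frequency weights: total / (n_classes * class_count)."""
    total = sum(counts.values())
    return {name: total / (len(counts) * n) for name, n in counts.items()}


for split, counts in COUNTS.items():
    print(split, sum(counts.values()), round(majority_baseline(counts), 4))

print(balanced_weights(COUNTS["train"]))
```

On both splits the majority baseline comes out at about 0.80, so a classifier's accuracy is only informative to the extent that it exceeds that figure.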

Dataset creation

Source data: extended from XNLI; the original dataset was published by Conneau et al. in 2018.
