nanakonoda/xnli_parallel
Source: Hugging Face. Updated 2023-04-18; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/nanakonoda/xnli_parallel
Resource description:
---
annotations_creators:
- expert-generated
language:
- en
- de
- fr
language_creators:
- found
license: []
multilinguality:
- multilingual
pretty_name: XNLI Parallel Corpus
size_categories:
- 100K<n<1M
source_datasets:
- extended|xnli
tags:
- mode classification
- aligned
task_categories:
- text-classification
task_ids: []
dataset_info:
- config_name: en
  features:
  - name: text
    dtype: string
  - name: label
    dtype:
      class_label:
        names:
          '0': spoken
          '1': written
  splits:
  - name: train
    num_bytes: 92288
    num_examples: 830
  - name: test
    num_bytes: 186853
    num_examples: 1669
- config_name: de
  features:
  - name: text
    dtype: string
  - name: label
    dtype:
      class_label:
        names:
          '0': spoken
          '1': written
  splits:
  - name: train
    num_bytes: 105681
    num_examples: 830
  - name: test
    num_bytes: 214008
    num_examples: 1669
- config_name: fr
  features:
  - name: text
    dtype: string
  - name: label
    dtype:
      class_label:
        names:
          '0': spoken
          '1': written
  splits:
  - name: train
    num_bytes: 109164
    num_examples: 830
  - name: test
    num_bytes: 221286
    num_examples: 1669
download_size: 1864
dataset_size: 1840
---
# Dataset Card for XNLI Parallel Corpus
## Dataset Description
- **Homepage:**
- **Repository:**
- **Paper:**
- **Leaderboard:**
- **Point of Contact:**
### Dataset Summary
### Supported Tasks and Leaderboards
Binary mode classification (spoken vs written)
### Languages
- English
- German
- French
## Dataset Structure
### Data Instances
{
'text': "And he said , Mama , I 'm home .",
'label': 0
}
### Data Fields
- text: a sentence
- label: binary mode label of the text (0: spoken, 1: written)
### Data Splits
- train: 830 examples per language
- test: 1669 examples per language
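
The splits can be loaded one language at a time. The following is a minimal sketch, assuming the Hugging Face `datasets` library is installed and that the repository loads under the config names `en`, `de`, and `fr` listed in the card metadata. The helper name `load_mode_split` is hypothetical and not part of the dataset.

```python
# Config and split names are taken from the card metadata.
CONFIGS = ("en", "de", "fr")
SPLITS = ("train", "test")


def load_mode_split(lang: str, split: str = "train"):
    """Load one split of one language config (hypothetical helper)."""
    if lang not in CONFIGS:
        raise ValueError(f"unknown config {lang!r}; expected one of {CONFIGS}")
    if split not in SPLITS:
        raise ValueError(f"unknown split {split!r}; expected one of {SPLITS}")
    # Imported lazily so the validation above works without the library.
    from datasets import load_dataset

    # Assumption: the repository resolves under this ID with these config names.
    return load_dataset("nanakonoda/xnli_parallel", lang, split=split)
```

Calling `load_mode_split("de", "test")` would request the German test split. Whether the repository still loads this way depends on the installed `datasets` version.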
### Other Statistics
#### Vocabulary Size
- English
  - train: 4363
  - test: 7128
- German
  - train: 5070
  - test: 8601
- French
  - train: 4881
  - test: 7935
#### Average Sentence Length
- English
  - train: 20.689156626506023
  - test: 20.75254643499101
- German
  - train: 20.367469879518072
  - test: 20.639904134212102
- French
  - train: 23.455421686746988
  - test: 23.731575793888556
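
The card does not say how vocabulary size and average sentence length were computed. The following sketch shows one plausible way to derive both, assuming whitespace tokenization (the example instance is already tokenized this way) and case-sensitive word types. The function name `corpus_stats` is hypothetical.

```python
def corpus_stats(sentences):
    """Return (vocabulary size, average sentence length in tokens)."""
    # Assumption: tokens are whitespace-separated, as in the pre-tokenized
    # example instance; the card does not document the actual procedure.
    tokenized = [s.split() for s in sentences]
    vocab = {tok for toks in tokenized for tok in toks}
    total = sum(len(toks) for toks in tokenized)
    avg_len = total / len(tokenized) if tokenized else 0.0
    return len(vocab), avg_len


vocab_size, avg_len = corpus_stats(["And he said , Mama , I 'm home ."])
print(vocab_size, avg_len)  # 9 10.0
```

The repeated comma is counted once in the vocabulary but twice in the sentence length, which is why the two numbers differ for a single sentence.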
#### Label Split
- train:
  - 0: 166
  - 1: 664
- test:
  - 0: 334
  - 1: 1335
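
The label counts show that the two classes are imbalanced, which affects how accuracy should be read. The short sketch below computes each class's share and the accuracy of always predicting the majority label, using only the counts listed in this section. The function names are illustrative.

```python
# Label counts copied from the Label Split section (0: spoken, 1: written).
LABEL_COUNTS = {
    "train": {0: 166, 1: 664},
    "test": {0: 334, 1: 1335},
}


def class_shares(counts):
    """Return each label's share of the split."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def majority_baseline(counts):
    """Accuracy of always predicting the most frequent label."""
    return max(counts.values()) / sum(counts.values())


print(class_shares(LABEL_COUNTS["train"]))  # {0: 0.2, 1: 0.8}
print(round(majority_baseline(LABEL_COUNTS["test"]), 4))  # 0.7999
```

A classifier therefore needs to exceed roughly 80% test accuracy to beat the trivial "always written" prediction.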
#### Out-of-vocabulary words in model
- English
  - BERT (bert-base-uncased)
    - train: 800
    - test: 1638
  - mBERT (bert-base-multilingual-uncased)
    - train: 1347
    - test: 2693
  - German BERT (bert-base-german-dbmdz-uncased)
    - train: 3228
    - test: 5581
  - flauBERT (flaubert-base-uncased)
    - train: 4363
    - test: 7128
- German
  - BERT (bert-base-uncased)
    - train: 4285
    - test: 7387
  - mBERT (bert-base-multilingual-uncased)
    - train: 3126
    - test: 5863
  - German BERT (bert-base-german-dbmdz-uncased)
    - train: 2033
    - test: 3938
  - flauBERT (flaubert-base-uncased)
    - train: 5069
    - test: 8600
- French
  - BERT (bert-base-uncased)
    - train: 3784
    - test: 6289
  - mBERT (bert-base-multilingual-uncased)
    - train: 2847
    - test: 5084
  - German BERT (bert-base-german-dbmdz-uncased)
    - train: 4212
    - test: 6964
  - flauBERT (flaubert-base-uncased)
    - train: 4881
    - test: 7935
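
The card does not describe how the out-of-vocabulary counts were obtained. One plausible reading is sketched below. It assumes a word type counts as out-of-vocabulary when it is not a whole entry in the model's tokenizer vocabulary, and that words are lowercased because the listed checkpoints are uncased. The function `count_oov` and the toy vocabulary are hypothetical. A real vocabulary would come from the corresponding tokenizer.

```python
def count_oov(word_types, model_vocab, lowercase=True):
    """Count distinct word types that are not entries in model_vocab."""
    # Assumption: a word is out-of-vocabulary when it is not a whole entry
    # in the tokenizer vocabulary; subword pieces are not considered.
    seen = set()
    for word in word_types:
        seen.add(word.lower() if lowercase else word)
    return sum(1 for word in seen if word not in model_vocab)


# Toy vocabulary for illustration only; a real one would come from a tokenizer.
toy_vocab = {"and", "he", "said", ",", "i", "home", "."}
words = "And he said , Mama , I 'm home .".split()
print(count_oov(words, toy_vocab))  # 2
```

Here `mama` and `'m` are the two types missing from the toy vocabulary.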
## Dataset Creation
### Curation Rationale
N/A
### Source Data
https://github.com/facebookresearch/XNLI
Here is the citation for the original XNLI paper.
```
@InProceedings{conneau2018xnli,
author = "Conneau, Alexis
and Rinott, Ruty
and Lample, Guillaume
and Williams, Adina
and Bowman, Samuel R.
and Schwenk, Holger
and Stoyanov, Veselin",
title = "XNLI: Evaluating Cross-lingual Sentence Representations",
booktitle = "Proceedings of the 2018 Conference on Empirical Methods
in Natural Language Processing",
year = "2018",
publisher = "Association for Computational Linguistics",
location = "Brussels, Belgium",
}
```
#### Initial Data Collection and Normalization
N/A
#### Who are the source language producers?
N/A
### Annotations
#### Annotation process
N/A
#### Who are the annotators?
N/A
### Personal and Sensitive Information
N/A
## Considerations for Using the Data
### Social Impact of Dataset
N/A
### Discussion of Biases
N/A
### Other Known Limitations
N/A
## Additional Information
### Dataset Curators
N/A
### Licensing Information
N/A
### Citation Information
### Contributions
N/A
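Provider:
nanakonoda
Summary of original information
Dataset overview
- Dataset name: XNLI Parallel Corpus
- Languages: English, German, French
- Multilinguality: multilingual
- Dataset size: 100K<n<1M
- Source dataset: extended from XNLI
- Annotation creators: expert-generated
- Task category: text classification
- Task: binary mode classification (spoken vs written)
Dataset structure
Data instances
- Text field: sentence
- Label field: binary label (0: spoken, 1: written)
Data splits
- Training set:
  - English: 830 examples
  - German: 830 examples
  - French: 830 examples
- Test set:
  - English: 1669 examples
  - German: 1669 examples
  - French: 1669 examples
Other statistics
- Vocabulary size:
  - English: train 4363, test 7128
  - German: train 5070, test 8601
  - French: train 4881, test 7935
- Average sentence length:
  - English: train 20.689, test 20.753
  - German: train 20.367, test 20.640
  - French: train 23.455, test 23.732
- Label split:
  - Training set: spoken 166, written 664
  - Test set: spoken 334, written 1335
Dataset creation
- Source data: extended from XNLI; the original dataset was published by Conneau et al. in 2018.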



