
nanakonoda/xnli_parallel

Source: Hugging Face · Updated: 2023-04-18 · Indexed: 2024-03-04
Download link:
https://hf-mirror.com/datasets/nanakonoda/xnli_parallel
Resource description:
---
annotations_creators:
- expert-generated
language:
- en
- de
- fr
language_creators:
- found
license: []
multilinguality:
- multilingual
pretty_name: XNLI Parallel Corpus
size_categories:
- 100K<n<1M
source_datasets:
- extended|xnli
tags:
- mode classification
- aligned
task_categories:
- text-classification
task_ids: []
dataset_info:
- config_name: en
  features:
  - name: text
    dtype: string
  - name: label
    dtype:
      class_label:
        names:
          '0': spoken
          '1': written
  splits:
  - name: train
    num_bytes: 92288
    num_examples: 830
  - name: test
    num_bytes: 186853
    num_examples: 1669
- config_name: de
  features:
  - name: text
    dtype: string
  - name: label
    dtype:
      class_label:
        names:
          '0': spoken
          '1': written
  splits:
  - name: train
    num_bytes: 105681
    num_examples: 830
  - name: test
    num_bytes: 214008
    num_examples: 1669
- config_name: fr
  features:
  - name: text
    dtype: string
  - name: label
    dtype:
      class_label:
        names:
          '0': spoken
          '1': written
  splits:
  - name: train
    num_bytes: 109164
    num_examples: 830
  - name: test
    num_bytes: 221286
    num_examples: 1669
  download_size: 1864
  dataset_size: 1840
---

# Dataset Card for XNLI Parallel Corpus

## Dataset Description

- **Homepage:**
- **Repository:**
- **Paper:**
- **Leaderboard:**
- **Point of Contact:**

### Dataset Summary

### Supported Tasks and Leaderboards

Binary mode classification (spoken vs. written)

### Languages

- English
- German
- French

## Dataset Structure

### Data Instances

    { 'text': "And he said , Mama , I 'm home .", 'label': 0 }

### Data Fields

- text: sentence
- label: binary label of the text (0: spoken, 1: written)

### Data Splits

- train: 830
- test: 1669

### Other Statistics

#### Vocabulary Size

- English
  - train: 4363
  - test: 7128
- German
  - train: 5070
  - test: 8601
- French
  - train: 4881
  - test: 7935

#### Average Sentence Length

- English
  - train: 20.689156626506023
  - test: 20.75254643499101
- German
  - train: 20.367469879518072
  - test: 20.639904134212102
- French
  - train: 23.455421686746988
  - test: 23.731575793888556

#### Label Split

- train:
  - 0: 166
  - 1: 664
- test:
  - 0: 334
  - 1: 1335

#### Out-of-vocabulary words in model

- English
  - BERT (bert-base-uncased)
    - train: 800
    - test: 1638
  - mBERT (bert-base-multilingual-uncased)
    - train: 1347
    - test: 2693
  - German BERT (bert-base-german-dbmdz-uncased)
    - train: 3228
    - test: 5581
  - flauBERT (flaubert-base-uncased)
    - train: 4363
    - test: 7128
- German
  - BERT (bert-base-uncased)
    - train: 4285
    - test: 7387
  - mBERT (bert-base-multilingual-uncased)
    - train: 3126
    - test: 5863
  - German BERT (bert-base-german-dbmdz-uncased)
    - train: 2033
    - test: 3938
  - flauBERT (flaubert-base-uncased)
    - train: 5069
    - test: 8600
- French
  - BERT (bert-base-uncased)
    - train: 3784
    - test: 6289
  - mBERT (bert-base-multilingual-uncased)
    - train: 2847
    - test: 5084
  - German BERT (bert-base-german-dbmdz-uncased)
    - train: 4212
    - test: 6964
  - flauBERT (flaubert-base-uncased)
    - train: 4881
    - test: 7935

## Dataset Creation

### Curation Rationale

N/A

### Source Data

https://github.com/facebookresearch/XNLI

Here is the citation for the original XNLI paper.

```
@InProceedings{conneau2018xnli,
  author = "Conneau, Alexis
            and Rinott, Ruty
            and Lample, Guillaume
            and Williams, Adina
            and Bowman, Samuel R.
            and Schwenk, Holger
            and Stoyanov, Veselin",
  title = "XNLI: Evaluating Cross-lingual Sentence Representations",
  booktitle = "Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing",
  year = "2018",
  publisher = "Association for Computational Linguistics",
  location = "Brussels, Belgium",
}
```

#### Initial Data Collection and Normalization

N/A

#### Who are the source language producers?

N/A

### Annotations

#### Annotation process

N/A

#### Who are the annotators?

N/A

### Personal and Sensitive Information

N/A

## Considerations for Using the Data

### Social Impact of Dataset

N/A

### Discussion of Biases

N/A

### Other Known Limitations

N/A

## Additional Information

### Dataset Curators

N/A

### Licensing Information

N/A

### Citation Information

### Contributions

N/A
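Referring back to the "Out-of-vocabulary words in model" figures in the card above, the raw counts are easier to compare as rates. The following is a minimal sketch that divides each train-split OOV count by the corresponding vocabulary size. It assumes the OOV counts are measured over the same word types as the "Vocabulary Size" figures; the card does not state this explicitly. The helper names `oov_rate` and `lowest_oov_model` are hypothetical and used only for illustration.

```python
# Train-split vocabulary sizes, copied from the card.
VOCAB_SIZE = {"en": 4363, "de": 5070, "fr": 4881}

# Train-split out-of-vocabulary counts per tokenizer, copied from the card.
OOV_TRAIN = {
    "en": {
        "bert-base-uncased": 800,
        "bert-base-multilingual-uncased": 1347,
        "bert-base-german-dbmdz-uncased": 3228,
        "flaubert-base-uncased": 4363,
    },
    "de": {
        "bert-base-uncased": 4285,
        "bert-base-multilingual-uncased": 3126,
        "bert-base-german-dbmdz-uncased": 2033,
        "flaubert-base-uncased": 5069,
    },
    "fr": {
        "bert-base-uncased": 3784,
        "bert-base-multilingual-uncased": 2847,
        "bert-base-german-dbmdz-uncased": 4212,
        "flaubert-base-uncased": 4881,
    },
}


def oov_rate(oov_count: int, vocab_size: int) -> float:
    """Fraction of vocabulary entries reported as out-of-vocabulary.

    Assumption: both numbers count the same unit (word types).
    """
    return oov_count / vocab_size


def lowest_oov_model(lang: str) -> str:
    """Return the model name with the smallest train OOV count for a language."""
    counts = OOV_TRAIN[lang]
    return min(counts, key=counts.get)


if __name__ == "__main__":
    for lang, counts in OOV_TRAIN.items():
        for model, count in counts.items():
            rate = oov_rate(count, VOCAB_SIZE[lang])
            print(f"{lang}  {model:<34} {rate:.1%}")
        print(f"{lang}  lowest OOV: {lowest_oov_model(lang)}")
```

Under that assumption, `bert-base-uncased` has the lowest train OOV count for English (800 of 4363, roughly 18%), the German BERT for German, and mBERT for French.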
Provided by:
nanakonoda
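Summary of original information

Dataset overview

  • Dataset name: XNLI Parallel Corpus
  • Languages: English, German, French
  • Multilinguality: multilingual
  • Dataset size: 100K<n<1M
  • Source dataset: extended from XNLI
  • Annotation creators: expert-generated
  • Task category: text classification
  • Task: binary mode classification (spoken vs. written)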

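Dataset structure

Data instances

  • Text field: sentence
  • Label field: binary label (0: spoken, 1: written)

Data splits

  • Training set:
    • English: 830 examples
    • German: 830 examples
    • French: 830 examples
  • Test set:
    • English: 1669 examples
    • German: 1669 examples
    • French: 1669 examples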

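Other statistics

  • Vocabulary size:
    • English: train 4363, test 7128
    • German: train 5070, test 8601
    • French: train 4881, test 7935
  • Average sentence length:
    • English: train 20.689, test 20.753
    • German: train 20.367, test 20.640
    • French: train 23.455, test 23.732
  • Label split:
    • Training set: spoken 166, written 664
    • Test set: spoken 334, written 1335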
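The label split above is imbalanced, with roughly one spoken example for every four written ones. As a minimal sketch of what that implies for training, the snippet below computes class proportions and inverse-frequency class weights from the reported counts. The weighting formula `total / (num_classes * count)` is a common "balanced" heuristic, not something the dataset card prescribes, and the function names are hypothetical.

```python
# Label counts per split, copied from the statistics above.
LABEL_COUNTS = {
    "train": {"spoken": 166, "written": 664},
    "test": {"spoken": 334, "written": 1335},
}


def proportions(counts: dict) -> dict:
    """Share of each label within one split."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def inverse_frequency_weights(counts: dict) -> dict:
    """Per-class weights: total / (num_classes * class_count).

    Assumption: this heuristic is chosen for illustration only.
    """
    total = sum(counts.values())
    num_classes = len(counts)
    return {label: total / (num_classes * n) for label, n in counts.items()}


if __name__ == "__main__":
    for split, counts in LABEL_COUNTS.items():
        print(split, "total:", sum(counts.values()))
        print(split, "proportions:", proportions(counts))
        print(split, "weights:", inverse_frequency_weights(counts))
```

On the training split, spoken examples make up 20% of the data (166 of 830), so this heuristic gives a weight of 2.5 for spoken and 0.625 for written. The per-split totals also match the reported split sizes of 830 and 1669.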

Dataset creation

  • Source data: extended from XNLI; the original dataset was published by Conneau et al. in 2018.