djstrong/ppc

Name: djstrong/ppc
Creator: djstrong
Published: 2024-01-18 22:07:42
License: no description available

Source: Hugging Face. Updated: 2024-01-18. Indexed: 2024-03-04.

Download link:

https://hf-mirror.com/datasets/djstrong/ppc

Resource description:

---
language:
- pl
license:
- cc-by-nc-sa-4.0
multilinguality:
- monolingual
size_categories:
- 1K<n<10K
task_categories:
- text-classification
task_ids:
- semantic-similarity-classification
pretty_name: Polish Paraphrase Corpus
dataset_info:
  features:
  - name: sentence_A
    dtype: string
  - name: sentence_B
    dtype: string
  - name: label
    dtype:
      class_label:
        names:
          0: not used
          1: exact paraphrases
          2: similar sentences
          3: non-paraphrases
  splits:
  - name: train
    num_bytes: 539121
    num_examples: 5000
  - name: validation
    num_bytes: 107010
    num_examples: 1000
  - name: test
    num_bytes: 106515
    num_examples: 1000
---

# PPC - Polish Paraphrase Corpus

### Dataset Summary

Polish Paraphrase Corpus contains 7000 manually labeled sentence pairs. The dataset is divided into training, validation and test splits. The training split contains 5000 examples, and the validation and test splits contain 1000 examples each. The dataset was created to verify how machine learning models perform on the challenging problem of paraphrase identification, where most records contain semantically overlapping parts. Technically, this is a three-class classification task, where each record is assigned to one of the following categories:

- Exact paraphrases - Sentence pairs that convey exactly the same information. Only the semantic meaning of the sentence matters, so this category also includes sentences that are semantically identical but, for example, have different emotional emphasis.
- Close paraphrases - Sentence pairs with similar semantic meaning. This category includes all pairs that contain the same information but may also contain other, semantically non-overlapping parts. It also contains context-dependent paraphrases: sentence pairs that may have the same meaning in some contexts but differ in others.
- Non-paraphrases - All other cases, including contradictory sentences and semantically unrelated sentences.

The corpus contains 2911, 1297, and 2792 examples for the above three categories, respectively. Annotation was preceded by automated generation of candidate pairs, which were then manually labeled. We experimented with two popular techniques for generating possible paraphrases: backtranslation with a set of neural machine translation models, and paraphrase mining using a pre-trained multilingual sentence encoder. The extracted sentence pairs are drawn from different data sources: Tatoeba, Polish news articles, Wikipedia and the Polish version of the SICK dataset. Since most of the sentence pairs obtained in this way fell into the first two categories, some of the examples were manually modified to convey different information in order to balance the dataset. As a result, even negative examples often have high semantic overlap, which makes the problem difficult for machine learning models.

### Data Instances

Example instance:

```
{
  "sentence_A": "Libia: lotnisko w w Trypolisie ostrzelane rakietami.",
  "sentence_B": "Jedyne lotnisko w stolicy Libii - Trypolisie zostało w nocy z wtorku na środę ostrzelane rakietami.",
  "label": "2"
}
```

### Data Fields

- sentence_A: first sentence text
- sentence_B: second sentence text
- label: label identifier corresponding to one of three categories

### Citation Information

```
@inproceedings{9945218,
  author={Dadas, S{\l}awomir},
  booktitle={2022 IEEE International Conference on Systems, Man, and Cybernetics (SMC)},
  title={Training Effective Neural Sentence Encoders from Automatically Mined Paraphrases},
  year={2022},
  volume={},
  number={},
  pages={371-378},
  doi={10.1109/SMC53654.2022.9945218}
}
```
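The label encoding declared in the card metadata can be made concrete with a small sketch. It assumes that records arrive as plain JSON objects shaped like the example instance, with the label stored as a string; the helper name `decode_label` is hypothetical and not part of the dataset or any library. Note that the metadata names label 2 "similar sentences", while the summary text calls the same category "close paraphrases".

```python
import json

# Label ids and names as declared in the card's class_label metadata.
# Id 0 is declared but marked "not used", so three classes remain.
LABEL_NAMES = {
    0: "not used",
    1: "exact paraphrases",
    2: "similar sentences",
    3: "non-paraphrases",
}

# The example instance from the card, kept verbatim (including the doubled "w").
RAW_INSTANCE = """
{
  "sentence_A": "Libia: lotnisko w w Trypolisie ostrzelane rakietami.",
  "sentence_B": "Jedyne lotnisko w stolicy Libii - Trypolisie zostało w nocy z wtorku na środę ostrzelane rakietami.",
  "label": "2"
}
"""


def decode_label(record: dict) -> str:
    """Return the class name for a record whose label is a string or an int."""
    label_id = int(record["label"])
    if label_id not in LABEL_NAMES:
        raise ValueError(f"unexpected label id: {label_id}")
    return LABEL_NAMES[label_id]


instance = json.loads(RAW_INSTANCE)
print(decode_label(instance))  # similar sentences
```

If the corpus is loaded with the Hugging Face `datasets` library instead, the label column may already be an integer, in which case the `int(...)` conversion is simply a no-op; that is an assumption about this repository rather than something the card states.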

Provider:

djstrong

Summary of the original information

PPC - Polish Paraphrase Corpus

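Dataset overview

Polish Paraphrase Corpus contains 7000 manually labeled sentence pairs, divided into training, validation and test splits. The training split contains 5000 examples, and the validation and test splits contain 1000 examples each. The dataset is intended to verify how machine learning models perform on the challenging task of paraphrase identification, where most records contain semantically overlapping parts. This is a three-class classification task in which each record is assigned to one of the following categories: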

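Exact paraphrases - sentence pairs that convey exactly the same information.
Close paraphrases - sentence pairs with similar semantic meaning.
Non-paraphrases - all other cases, including contradictory sentences and semantically unrelated sentences.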

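The corpus contains 2911, 1297, and 2792 examples for the above three categories, respectively. Annotation was preceded by automated generation of candidate pairs, which were then manually labeled. Two popular techniques were used to generate possible paraphrases: backtranslation with neural machine translation models, and paraphrase mining with a pre-trained multilingual sentence encoder. The extracted sentence pairs come from different data sources: Tatoeba, Polish news articles, Wikipedia and the Polish version of the SICK dataset.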
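As a quick arithmetic check on the reported figures, the sketch below confirms that the per-class counts and the split sizes both add up to 7000, and derives the share of each class. The majority-class figure is computed over the whole corpus only; the page does not report per-split class counts, so it is an approximation of a per-split baseline, not a reported result.

```python
# Per-class example counts as reported in the dataset summary.
CLASS_COUNTS = {
    "exact paraphrases": 2911,
    "close paraphrases": 1297,
    "non-paraphrases": 2792,
}

# Split sizes as reported in the dataset summary.
SPLIT_SIZES = {"train": 5000, "validation": 1000, "test": 1000}

total = sum(CLASS_COUNTS.values())

# Both breakdowns should describe the same 7000 sentence pairs.
assert total == sum(SPLIT_SIZES.values()) == 7000

# Share of each class in the whole corpus.
shares = {name: count / total for name, count in CLASS_COUNTS.items()}
for name, share in shares.items():
    print(f"{name}: {share:.1%}")

# Accuracy of always predicting the most frequent class, corpus-wide.
majority_class = max(CLASS_COUNTS, key=CLASS_COUNTS.get)
majority_baseline = CLASS_COUNTS[majority_class] / total
print(f"majority class: {majority_class} ({majority_baseline:.1%})")
```

The two paraphrase classes together cover about 60% of the pairs, and the "close paraphrases" class is less than half the size of either of the other two, which is worth keeping in mind when choosing an evaluation metric.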

Data instances

Example instance (JSON):

{ "sentence_A": "Libia: lotnisko w w Trypolisie ostrzelane rakietami.", "sentence_B": "Jedyne lotnisko w stolicy Libii - Trypolisie zostało w nocy z wtorku na środę ostrzelane rakietami.", "label": "2" }

Data fields

sentence_A: first sentence text
sentence_B: second sentence text
label: label identifier corresponding to one of the three categories

Citation information

@inproceedings{9945218, author={Dadas, S{\l}awomir}, booktitle={2022 IEEE International Conference on Systems, Man, and Cybernetics (SMC)}, title={Training Effective Neural Sentence Encoders from Automatically Mined Paraphrases}, year={2022}, volume={}, number={}, pages={371-378}, doi={10.1109/SMC53654.2022.9945218} }
