deepvk/ru-WANLI

Name: deepvk/ru-WANLI
Creator: deepvk
Published: 2024-06-10 11:59:27
License: apache-2.0

Source: Hugging Face. Updated 2024-06-10. Indexed 2024-06-15.

Download link:

https://hf-mirror.com/datasets/deepvk/ru-WANLI

Resource description:

---
license: apache-2.0
task_categories:
- feature-extraction
language:
- ru
size_categories:
- 100K<n<1M
---

# RuWANLI

RuWANLI (Russian-Worker-AI Collaboration for NLI) is a natural language inference dataset inspired by [Liu et al. (2022)](https://arxiv.org/pdf/2201.05955). We replicated the WANLI generation pipeline for Russian, with some changes to the labeling process.

See [Dataset Structure](#dataset-structure) for details about the dataset itself and [Dataset Creation](#dataset-creation) for details about the collection process.

## Supported Tasks and Leaderboards

> The dataset can be used to train natural language inference models which determine whether a premise entails (i.e., implies the truth of) a hypothesis, both expressed in natural language. Success on this task is typically measured by achieving a high accuracy.

Because we add an extra generation step, the dataset can also be used for sentence encoder training, using contradictions as hard negatives.

## Dataset Structure

Each data instance has the following fields:

- `premise`: a piece of text.
- `hypothesis`: a piece of text that may be true, false, or whose truth conditions may not be known when compared to the premise.
- `label`: either `entailment`, `contradiction`, or `neutral`.

For example:

```json
{
  "premise": "Мальчик бежит в детскую игровую комнату с разноцветными шарами.",
  "hypothesis": "Мальчик идет спать в свою кровать.",
  "label": "contradiction"
}
```

The dataset is split into train/val/test sets of 100000/2360/5000 examples. The distribution over classes is as follows:

<img src="images/pie.jpg" width="50%">

## Dataset Creation

First, we combine the translated ANLI, SNLI and MNLI datasets into ALLNLI. Following [Liu et al. (2022)](https://arxiv.org/pdf/2201.05955), we use dataset cartography to identify challenging data samples. To create the data maps, we train a classification model on the [deepvk/roberta-base](https://huggingface.co/deepvk/roberta-base) backbone.
The resulting data map is shown below. Compared to the original paper, we obtained a less structured figure with fewer ambiguous examples.

<img src="images/confidence-variability.jpg">

Then, we use ChatGPT (`gpt-3.5-turbo`) to generate new examples that are likely to follow the same pattern. Afterward, we validate the generated examples through human review, in which crowd workers assign a label or revise the example for quality.

Because we want to use RuWANLI for sentence encoder training, we add an extra step that uses ChatGPT (`gpt-4-turbo`) to generate the missing entailments and contradictions for text/entailment/contradiction triplets.

### Prompts

#### Prompt for initial generation:

```
<instruction>
Примеры:
<example 1 first sentence> <label>: <example 1 second sentence>
<example 2 first sentence> <label>: <example 2 second sentence>
<example 3 first sentence> <label>: <example 3 second sentence>
<example 4 first sentence> <label>: <example 4 second sentence>
<example 5 first sentence> <label>: <example 5 second sentence>
```

Possible values for instructions:

- **contradiction**: Написать 5 пар предложений, которые противоречат друг другу, как и предыдущие примеры.
- **entailment**: Написать 5 пар предложений, как и предыдущие примеры. Второе предложение должно логически следовать из первого.
- **neutral**: Написать 5 пар предложений, которые имеют такую же взаимосвязь, как и предыдущие примеры

#### Prompt for contradiction generation:

```
Я хочу, чтобы ты действовал в качестве генератора данных для NLI датасета. Я буду передавать тебе предложение, которое я назову Q. Для Q ты должен будешь сгенерировать C: противоречие (contradicton). C не должно просто быть отрицанием Q, используй более сложные связи. Все сгенерированные тексты должны быть на русском языке. В качестве ответа верни json с такой структурой: {"query": <текст Q>, "contradiction": <текст C>}.
Примеры пар (Q, C):
1. { "query": "Области, обслуживаемые дорогами, были застроены и, как правило, переполнены в разгар лета.", "contradiction": "Застроенные районы наиболее переполнены в мягкие зимние месяцы." };
2. { "query": "За человеком, привязанным к веревкам, наблюдает толпа.", "contradiction": "Человек не привязан к верёвкам." };
3. { "query": "Мужчина стоит, читает газету и курит сигару.", "contradiction": "Мужчина сидит на скамейке."};
Текст Q:
```

#### Prompt for entailment generation:

```
Я хочу, чтобы ты действовал в качестве генератора данных для NLI датасета. Я буду передавать тебе текст, который я назову Q. Для Q ты должен будешь сгенерировать E: логическое следствие (entailment). E не должно просто быть перефразированным текстом Q, используй более сложные связи. Все сгенерированные тексты должны быть на русском языке. В качестве ответа верни json с такой структурой: {"query": <текст Q>, "entailment": <текст E>}.
Примеры пар (Q, E):
1. { "query": "Я плачу сто двадцать один доллар в месяц у меня есть еще один год, чтобы заплатить за мой дом.", "entailment": "Я плачу чуть больше 120 долларов в месяц." };
2. { "query": "Прогресс Японии к парламентской демократии был остановлен в 1930-х годах растущим национализмом, навязанным правительству генералами и адмиралами.", "entailment": "Рост национализма остановил продвижение Японии к парламентской демократии." };
3. { "query": "Если мы решим остаться, многие люди умрут, но мы надеемся, что сможем укусить бандитов на их пути.", "entailment": "Мы не можем остаться, потому что в противном случае погибнет много мирных жителей." }.
Текст Q:
```

### Annotation Process

A total of 119 people participated in the annotation process. Each of the 74,258 texts was rated by 3 to 5 annotators. In addition to one of the three classes, annotators could label an example as a “bad example”. This label serves as a signal for potentially excluding the example from the final version of the dataset.
The exclusion occurs if the majority of annotators vote that a specific example is of poor quality.

Although the model was asked to generate text for specific classes, additional verification by annotators showed that the model's intended class matched the human rating for only about half of the examples.

A major challenge in data preparation was the absence of trusted annotators (“gold labels”), which complicated the filtering of results. To address this, we developed a methodology based on annotators' agreement with the majority. Removing 16 annotators proved optimal, as further removal led to significant data loss. At this point, Fleiss' kappa was 0.65 (0.56 without data removal), and 5.27% of the data was lost.

<img src="images/kappa.jpg">

### Limitations

For more than half of the metrics, quality on the test set increased, but we were unable to reach the level of generalization reported in the original WANLI study. One possible reason is that we create synthetic data from translations rather than from original texts.

In the table below, values that increased when part of the original dataset was replaced with RuWANLI are highlighted in green. Values not included in the average are highlighted in gray.

<img src="images/results.jpg"/>

### Personal and Sensitive Information

The dataset does not contain any personal information about the authors or the crowd workers.

## Citations

```
@misc{deepvk2024ru_wanli,
    title={RuWANLI},
    author={Malashenko, Boris and Zemerov, Anton and Spirin, Egor},
    url={https://huggingface.co/datasets/deepvk/ru-WANLI},
    publisher={Hugging Face},
    year={2024},
}
```

Provider:

deepvk

Summary of original information

RuWANLI

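RuWANLI (Russian-Worker-AI Collaboration for NLI) is a natural language inference dataset inspired by Liu et al. (2022).

Supported Tasks and Leaderboards

The dataset can be used to train natural language inference models, which determine whether a premise entails (i.e., implies the truth of) a hypothesis, both expressed in natural language. Success on this task is typically measured by high accuracy. The dataset can also be used for sentence encoder training, using contradictions as hard negatives.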
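As a rough illustration of the hard-negative use case described above, the sketch below groups NLI records by premise and pairs every entailment with every contradiction. The function name `build_triplets` and the toy records are hypothetical; how the authors actually assembled their triplets is not stated in the source.

```python
from collections import defaultdict


def build_triplets(records):
    """Build (anchor, positive, hard negative) triplets from NLI records.

    Each record is a dict with "premise", "hypothesis" and "label" keys.
    Entailments act as positives and contradictions as hard negatives;
    neutral pairs are ignored in this sketch.
    """
    by_premise = defaultdict(lambda: {"entailment": [], "contradiction": []})
    for rec in records:
        if rec["label"] in ("entailment", "contradiction"):
            by_premise[rec["premise"]][rec["label"]].append(rec["hypothesis"])

    triplets = []
    for premise, hyps in by_premise.items():
        for positive in hyps["entailment"]:
            for negative in hyps["contradiction"]:
                triplets.append((premise, positive, negative))
    return triplets


# Toy records (placeholders, not taken from the dataset).
records = [
    {"premise": "A", "hypothesis": "A1", "label": "entailment"},
    {"premise": "A", "hypothesis": "A2", "label": "contradiction"},
    {"premise": "A", "hypothesis": "A3", "label": "neutral"},
    {"premise": "B", "hypothesis": "B1", "label": "entailment"},
]
triplets = build_triplets(records)
```

Premises that lack either an entailment or a contradiction yield no triplet here, which is the gap the extra generation step in the dataset creation process is meant to fill.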

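Dataset Structure

Each data instance has the following fields:

premise: a piece of text.
hypothesis: a piece of text that may be true, false, or whose truth conditions may not be known when compared to the premise.
label: one of entailment, contradiction, or neutral.

For example:

```json
{
  "premise": "Мальчик бежит в детскую игровую комнату с разноцветными шарами.",
  "hypothesis": "Мальчик идет спать в свою кровать.",
  "label": "contradiction"
}
```

The dataset is split into train/validation/test sets of 100000/2360/5000 examples.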
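The record layout and split sizes above can be checked with a few lines of plain Python. This is a minimal sketch: the integer ids in `LABEL_TO_ID`, the helper name `encode_instance` and the split keys are illustrative assumptions, not something the dataset card specifies.

```python
LABELS = ("entailment", "contradiction", "neutral")
# The id order is an arbitrary assumption made for this example.
LABEL_TO_ID = {name: idx for idx, name in enumerate(LABELS)}


def encode_instance(instance):
    """Validate one record and return (premise, hypothesis, label_id)."""
    missing = {"premise", "hypothesis", "label"} - set(instance)
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    if instance["label"] not in LABEL_TO_ID:
        raise ValueError(f"unknown label: {instance['label']!r}")
    return instance["premise"], instance["hypothesis"], LABEL_TO_ID[instance["label"]]


example = {
    "premise": "Мальчик бежит в детскую игровую комнату с разноцветными шарами.",
    "hypothesis": "Мальчик идет спать в свою кровать.",
    "label": "contradiction",
}
premise, hypothesis, label_id = encode_instance(example)

# Split sizes as stated in the text; the key names are assumptions.
SPLIT_SIZES = {"train": 100_000, "validation": 2_360, "test": 5_000}
total = sum(SPLIT_SIZES.values())
shares = {name: size / total for name, size in SPLIT_SIZES.items()}
```

The `shares` dictionary makes explicit that the stated numbers are example counts, with the training split holding most of the data.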

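Dataset Creation

First, we combine the translated ANLI, SNLI, and MNLI datasets into ALLNLI. Following Liu et al. (2022), we use dataset cartography to identify challenging data samples. To create the data maps, we train a classification model on the deepvk/roberta-base backbone. We then use ChatGPT (gpt-3.5-turbo) to generate new examples that follow the same patterns, and validate the generated examples through human review.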
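The data-map step can be sketched as follows, assuming the usual dataset cartography definitions: confidence is the mean probability the model assigns to the gold label across training epochs, and variability is the standard deviation of that probability. Treating the highest-variability examples as the challenging ones is an assumption of this sketch, and the helper names and toy probabilities are hypothetical.

```python
from statistics import mean, pstdev


def data_map_coordinates(gold_probs_per_epoch):
    """Return (confidence, variability) for one training example.

    gold_probs_per_epoch holds the probability assigned to the gold label
    after each training epoch.
    """
    confidence = mean(gold_probs_per_epoch)
    variability = pstdev(gold_probs_per_epoch)  # population std over epochs
    return confidence, variability


def most_ambiguous(training_dynamics, top_k):
    """Rank example ids by variability, highest first, and keep top_k."""
    variability = {
        example_id: data_map_coordinates(probs)[1]
        for example_id, probs in training_dynamics.items()
    }
    return sorted(variability, key=variability.get, reverse=True)[:top_k]


# Toy training dynamics (made-up numbers for illustration only).
dynamics = {
    "easy": [0.90, 0.95, 0.97],
    "hard": [0.10, 0.05, 0.08],
    "ambiguous": [0.20, 0.80, 0.40],
}
seeds = most_ambiguous(dynamics, top_k=1)
```

Plotting confidence against variability for every training example gives the kind of data map the text refers to.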

To use RuWANLI for sentence encoder training, we add an extra step that uses ChatGPT (gpt-4-turbo) to generate the missing entailments and contradictions.

Prompts

Prompt for initial generation:

```
<instruction>
Примеры:
<example 1 first sentence> <label>: <example 1 second sentence>
<example 2 first sentence> <label>: <example 2 second sentence>
<example 3 first sentence> <label>: <example 3 second sentence>
<example 4 first sentence> <label>: <example 4 second sentence>
<example 5 first sentence> <label>: <example 5 second sentence>
```

Possible values for instructions:

contradiction: Написать 5 пар предложений, которые противоречат друг другу, как и предыдущие примеры.
entailment: Написать 5 пар предложений, как и предыдущие примеры. Второе предложение должно логически следовать из первого.
neutral: Написать 5 пар предложений, которые имеют такую же взаимосвязь, как и предыдущие примеры
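A minimal sketch of how the template and the instruction strings above could be assembled into a prompt. The function name `build_generation_prompt`, the `label_word` parameter (the text that fills the `<label>` slot) and the exact line breaks are assumptions; the source does not say what word was used for `<label>` or how the lines were separated.

```python
INSTRUCTIONS = {
    "contradiction": (
        "Написать 5 пар предложений, которые противоречат друг другу, "
        "как и предыдущие примеры."
    ),
    "entailment": (
        "Написать 5 пар предложений, как и предыдущие примеры. "
        "Второе предложение должно логически следовать из первого."
    ),
    "neutral": (
        "Написать 5 пар предложений, которые имеют такую же взаимосвязь, "
        "как и предыдущие примеры"
    ),
}


def build_generation_prompt(label, examples, label_word):
    """Fill the few-shot template with five (first, second) sentence pairs.

    label_word is the text placed in the <label> slot; its real value is
    not given in the source, so the caller has to supply it.
    """
    if len(examples) != 5:
        raise ValueError("the template expects exactly 5 in-context examples")
    lines = [INSTRUCTIONS[label], "Примеры:"]
    for first, second in examples:
        lines.append(f"{first} {label_word}: {second}")
    return "\n".join(lines)


# Placeholder sentence pairs; real seeds would come from the data map.
pairs = [(f"Первое предложение {i}.", f"Второе предложение {i}.") for i in range(1, 6)]
prompt = build_generation_prompt("contradiction", pairs, label_word="Противоречие")
```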

Prompt for contradiction generation:

Я хочу, чтобы ты действовал в качестве генератора данных для NLI датасета. Я буду передавать тебе предложение, которое я назову Q. Для Q ты должен будешь сгенерировать C: противоречие (contradicton). C не должно просто быть отрицанием Q, используй более сложные связи. Все сгенерированные тексты должны быть на русском языке. В качестве ответа верни json с такой структурой: {"query": <текст Q>, "contradiction": <текст C>}.

Prompt for entailment generation:

Я хочу, чтобы ты действовал в качестве генератора данных для NLI датасета. Я буду передавать тебе текст, который я назову Q. Для Q ты должен будешь сгенерировать E: логическое следствие (entailment). E не должно просто быть перефразированным текстом Q, используй более сложные связи. Все сгенерированные тексты должны быть на русском языке. В качестве ответа верни json с такой структурой: {"query": <текст Q>, "entailment": <текст E>}.
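Both prompts above ask the model to answer with a small JSON object, so the replies have to be parsed and sanity-checked before use. The sketch below shows one way to do that with the standard `json` module. The helper name `parse_generation` and the rejection rules (malformed JSON, missing keys, an output identical to the query) are assumptions, not the authors' documented procedure.

```python
import json


def parse_generation(raw_response, expected_key):
    """Parse a reply shaped like {"query": ..., "<expected_key>": ...}.

    Returns (query, generated_text), or None when the reply is unusable.
    expected_key is "contradiction" or "entailment", matching the prompts.
    """
    try:
        payload = json.loads(raw_response)
    except json.JSONDecodeError:
        return None
    if not isinstance(payload, dict):
        return None
    query = payload.get("query")
    generated = payload.get(expected_key)
    if not isinstance(query, str) or not isinstance(generated, str):
        return None
    # Reject empty outputs and outputs that merely repeat the query.
    if not generated.strip() or generated.strip() == query.strip():
        return None
    return query, generated


# Reply text built from one of the in-prompt examples.
reply = (
    '{"query": "Мужчина стоит, читает газету и курит сигару.", '
    '"contradiction": "Мужчина сидит на скамейке."}'
)
parsed = parse_generation(reply, "contradiction")
```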

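Annotation Process

A total of 119 people took part in the annotation process. Each text was rated by 3 to 5 annotators. In addition to one of the three classes, annotators could mark an example as a “bad example”. If the majority of annotators vote that an example is of poor quality, it is excluded from the final version of the dataset.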
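The exclusion rule above can be sketched as a simple vote aggregation. The source only states that an example is dropped when the majority marks it as bad; taking the most frequent remaining class as the final label and discarding ties are assumptions of this sketch, as are the names `aggregate_votes` and `BAD`.

```python
from collections import Counter

BAD = "bad example"


def aggregate_votes(votes):
    """Return the agreed label for one example, or None if it is dropped.

    votes is a list with one entry per annotator: a class name or BAD.
    """
    if not votes:
        raise ValueError("at least one vote is required")
    counts = Counter(votes)
    # Exclusion rule from the text: a majority of annotators voted "bad".
    if counts[BAD] > len(votes) / 2:
        return None
    label_counts = Counter(v for v in votes if v != BAD)
    ranked = label_counts.most_common()
    # Assumption: a tie between the top classes means no usable label.
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]


kept = aggregate_votes(["entailment", "entailment", "neutral"])
dropped = aggregate_votes([BAD, BAD, "neutral"])
```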

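Although the model was asked to generate text for specific classes, additional verification by annotators showed that the model's intended class matched the human rating for only about half of the examples. A major challenge in data preparation was the absence of trusted annotators (“gold labels”), which complicated the filtering of results. To address this, a method based on annotators' agreement with the majority was developed. Removing 16 annotators proved optimal, as further removal led to significant data loss. At this point, Fleiss' kappa was 0.65 (0.56 before removal), and 5.27% of the data was lost.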
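For reference, Fleiss' kappa can be computed from a table of per-item category counts as in the sketch below. It uses the textbook formulation, which requires the same number of ratings for every item. The dataset has 3 to 5 ratings per text, and the source does not say how that was handled, so this illustrates the statistic rather than reproducing the authors' computation.

```python
def fleiss_kappa(rating_counts):
    """Compute Fleiss' kappa.

    rating_counts has one row per item; row[j] is the number of annotators
    who assigned category j to that item. All rows must sum to the same n.
    """
    n_items = len(rating_counts)
    n_raters = sum(rating_counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in rating_counts):
        raise ValueError("every item needs the same number (>= 2) of ratings")
    n_categories = len(rating_counts[0])

    # Observed agreement: share of agreeing annotator pairs, averaged over items.
    per_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in rating_counts
    ]
    p_observed = sum(per_item) / n_items

    # Chance agreement from the overall category proportions.
    proportions = [
        sum(row[j] for row in rating_counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    p_expected = sum(p * p for p in proportions)

    if p_expected == 1.0:  # every rating fell into a single category
        return 1.0
    return (p_observed - p_expected) / (1 - p_expected)


# Toy table: 4 items, 3 annotators, columns = entailment/contradiction/neutral.
table = [[3, 0, 0], [0, 3, 0], [2, 1, 0], [0, 1, 2]]
kappa = fleiss_kappa(table)
```

Recomputing the statistic after dropping the annotators who agree least often with the majority is one way to observe the kind of increase reported above.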

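Limitations

For more than half of the metrics, quality on the test set increased, but we were unable to reach the level of generalization reported in the original WANLI study. One possible reason is that we create synthetic data from translations rather than from original texts. In the table below, values that increased when part of the original dataset was replaced with RuWANLI are highlighted in green. Values not included in the average are highlighted in gray.

Personal and Sensitive Information

The dataset does not contain any personal information about the authors or the crowd workers.

Citations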

```
@misc{deepvk2024ru_wanli,
    title={RuWANLI},
    author={Malashenko, Boris and Zemerov, Anton and Spirin, Egor},
    url={https://huggingface.co/datasets/deepvk/ru-WANLI},
    publisher={Hugging Face},
    year={2024},
}
```
