GeRaCl_synthethic_dataset

Name: GeRaCl_synthethic_dataset
Creator: maas
Published: 2025-12-05 16:44:17
License: no description provided

ModelScope community: updated 2025-12-05, indexed 2025-08-02

Download link:

https://modelscope.cn/datasets/deepvk/GeRaCl_synthethic_dataset


Description:

# CLAZER (CLAssification in a ZERo-shot scenario)

CLAZER is a freely available dataset of texts taken from [`allenai/c4`](https://huggingface.co/datasets/allenai/c4) and labeled with positive and hard negative classes. The texts were labeled using [`meta-llama/Llama-3.3-70B-Instruct`](https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct) with the aim of providing high-quality classification samples that improve sentence encoders' understanding of the zero-shot classification task.

## Dataset Structure

There are 4 subdatasets:

1. `synthetic_positives`. This subdataset contains:
   - `train` (93426 samples), `val` (3000 samples) and `test` (3000 samples)
     - `text`: a segment of a text from allenai/c4
     - `classes`: a list of 3-5 positive classes that describe the text
2. `synthetic_classes`. This subdataset contains:
   - `train` (92953 samples)
     - `text`: a segment of a text from allenai/c4
     - `classes_0` ... `classes_4`: lists of classes where the first class is positive and the other classes are hard negatives
     - `scenarios`: a list of classification scenarios corresponding to the `classes_0` ... `classes_4` columns
   - `val` (2960 samples) and `test` (2961 samples)
     - `text`: a segment of a text from allenai/c4
     - `classes`: a list of classes that contains one positive class and several hard negative classes
     - `label`: an integer that represents the index of the positive class in the `classes` list
     - `scenarios`: a string representing the classification scenario
3. `ru_mteb_classes`. This subdataset contains:
   - `train` (45907 samples), `val` (2936 samples) and `test` (2942 samples)
     - `text`: a segment of a text from allenai/c4
     - `classes`: a list of classes taken from RU-MTEB classification tasks that contains one positive class and several negative classes
     - `label`: an integer that represents the index of the positive class in the `classes` list
4. `ru_mteb_extended_classes`.
   This subdataset contains:
   - `train` (87103 samples), `val` (2800 samples) and `test` (2796 samples)
     - `text`: a segment of a text from allenai/c4
     - `classes`: a list of edited (augmented) classes based on RU-MTEB classification tasks that contains one positive class and several negative classes
     - `label`: an integer that represents the index of the positive class in the `classes` list

Example from the `synthetic_classes` validation set:

```
{
    'text': '"Стараемся выбрасывать мусор в специальные урны, отделять пластиковые урны от всего остального, бытового, органического. То есть элементарные вещи: экономия электричества, я лично готова полдня со свечой сидеть, чтобы только не строили дополнительные атомные станции, а может даже закрыли", - говорят девушки из группы SMS.',
    'classes': [
        'правительственное учреждение',
        'группа активистов',
        'частное предприятие',
    ],
    'label': 1,
    'scenarios': 'Сценарий классификации по источнику высказывания'
}
```

## Dataset Creation

Each subdataset was built using 100,000 segments of Russian text from [`allenai/c4`](https://huggingface.co/datasets/allenai/c4). There are four methods for mining positive and negative classes:

- **Positive classes mining**. We use `meta-llama/Llama-3.3-70B-Instruct` to generate 5 relevant classes that describe the given text. After filtering, some samples may contain fewer than 5 classes. The prompt used for generation is located in the `prompts/synthetic_positives_generation.txt` file.
- **Classification scenarios mining**. Following the *GLiNER* training strategy, we use `meta-llama/Llama-3.3-70B-Instruct` to generate relevant positive and negative classes for each text in the dataset. The LLM is prompted to generate 5 lists of relevant classes in a multiclass setup, under the condition that each list contains exactly one positive class. The remaining classes in each list are considered hard negatives. Every list is related to a specific aspect of the text, referred to as a *scenario*.
  Thus, for every text there are 5 distinct classification scenarios, each containing:
  - The name of the scenario
  - The list of generated classes related to that scenario, where one is positive and the others are negative.

  The prompt used for generation is located in the `prompts/synthetic_classes_generation.txt` file.
- **RU-MTEB Classification**. This method uses classes from six multiclass classification tasks in the RU-MTEB benchmark. For each text in the dataset, `meta-llama/Llama-3.3-70B-Instruct` is given a random list of classes from one of the benchmark's tasks and is prompted to classify the text into one class from the list. The prompt used for classification is located in the `prompts/llm_classification.txt` file.
- **RU-MTEB Classification extended**. This method is similar to the previous one. It also uses classes from the RU-MTEB benchmark and prompts `meta-llama/Llama-3.3-70B-Instruct` to classify the text into one of these classes. However, before classification, the original class lists are augmented. For each list of RU-MTEB classes, 5-8 augmented lists of classes are generated based on the original list. The augmented lists introduce different details into the original classes and were generated using OpenAI's o3 model. The prompt used for generation is located in the `prompts/ru_mteb_extended_classes.txt` file. After augmentation, Llama is given a random list of classes, either augmented or original from RU-MTEB, and is prompted to classify the text into one class from the list. The prompt used for classification is located in the `prompts/llm_classification.txt` file.

### Text segmentation

Texts from `allenai/c4` were segmented using the [`razdel`](https://github.com/natasha/razdel/) Python library. Segmentation was performed in 2 steps:

- **Sentence Splitting**: Each text was split into individual sentences using the `razdel` library.
- **Segment creation**: Texts were divided into segments of random length between 20 and 150 words, ensuring that no sentence was split across segment boundaries.

## Citations

```
@misc{deepvk2025clazer,
    title={CLAZER: CLAssification in a ZERo-shot scenario},
    author={Vyrodov, Mikhail and Spirin, Egor and Sokolov, Andrey},
    url={https://huggingface.co/datasets/deepvk/synthetic-classes},
    publisher={Hugging Face},
    year={2025},
}
```
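
The two-step segmentation described under "Text segmentation" can be illustrated with a minimal sketch. The card does not say how the random segment length is applied, so everything below is an assumption. The function name `make_segments`, the `seed` parameter, drawing a target length uniformly from the 20-150 range, and dropping a leftover tail shorter than 20 words are all hypothetical choices, not the authors' code. The input is a list of sentences that have already been split (step 1), so no sentence can be cut at a segment boundary.

```python
import random
from typing import Iterable, List, Optional


def make_segments(
    sentences: Iterable[str],
    min_words: int = 20,
    max_words: int = 150,
    seed: Optional[int] = None,
) -> List[str]:
    """Group whole sentences into segments of random length (in words).

    Assumed policy (not confirmed by the dataset card): draw a target length
    uniformly from [min_words, max_words], append sentences until the target
    is reached, then start a new segment with a fresh target.
    """
    rng = random.Random(seed)
    segments: List[str] = []
    current: List[str] = []
    count = 0
    target = rng.randint(min_words, max_words)

    def flush() -> None:
        nonlocal current, count, target
        # Keep the segment only if its length falls inside the allowed range.
        if min_words <= count <= max_words:
            segments.append(" ".join(current))
        current, count = [], 0
        target = rng.randint(min_words, max_words)

    for sentence in sentences:
        n_words = len(sentence.split())
        # Close the segment early if this sentence would exceed max_words.
        if current and count + n_words > max_words:
            flush()
        current.append(sentence)
        count += n_words
        if count >= target:
            flush()

    if current:
        flush()  # the tail is dropped when it is shorter than min_words
    return segments
```

For step 1, the sentence list could come from `razdel`, for example `[s.text for s in razdel.sentenize(text)]`. The sketch itself stays standard-library only, so it runs without that dependency.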


Provider: maas

Created: 2025-08-01
