proxectonos/parafrases_gl

Name: proxectonos/parafrases_gl
Creator: proxectonos
Published: 2026-04-24 12:26:38
License: CC BY 4.0 (as stated in the dataset card)

Source: Hugging Face. Updated 2026-04-24; indexed 2024-06-29.

Download link:

https://hf-mirror.com/datasets/proxectonos/parafrases_gl


Description:

---
language:
- gl
pretty_name: Paraphrases GL
license: cc-by-4.0
task_categories:
- text-classification
task_ids:
- semantic-similarity-classification
tags:
- galician
- paraphrase
- semantic-similarity
- evaluation
- nlp
size_categories:
- 1K<n<10K
configs:
- config_name: default
  data_files:
  - split: train
    path: train.tsv
  - split: validation
    path: validation.tsv
  - split: test
    path: test.tsv
---

# Paraphrases GL

## Dataset description

Paraphrases GL is a dataset for the evaluation of paraphrase resources in Galician. Because paraphrase is a complex notion that lacks a precise, universally accepted definition, this dataset adopts a three-way annotation scheme designed to capture both clear paraphrases and borderline cases. Each pair of sentences is labeled with one of the following values:

| Label | Category | Definition |
|------:|----------|------------|
| 0 | NON-PARAPHRASE | The two sentences do not clearly and strictly share the same meaning. |
| 1 | BORDERLINE PARAPHRASE | The two sentences share a very similar, nearly paraphrastic meaning, and it is difficult to classify them in the other categories. |
| 2 | PARAPHRASE | The two sentences clearly and strictly share the same meaning. |

## Data creation methodology

### Text selection

The source texts were selected from materials whose licenses allowed publication. The dataset was designed to include varied text types, with a preference for spoken or conversational language where possible.

The sources include:

- **Wikipedia**, providing varied encyclopedic texts
- **Parliament**, using transcripts of political sessions in the Parliament of Galicia
- **CRTVG**, using transcripts of programs from the Galician public broadcaster
- **Hugin e Munin novels**, using dialogues from novels made available by Hugin e Munin

### Creation of paraphrase pairs

Two types of paraphrase candidates were generated:

- **syntactic or full paraphrases**, created through backtranslation using a pivot language such as Czech, English, or Spanish
- **lexical paraphrases**, created through automatic term substitution with BERT

Two annotators assigned labels 0, 1, or 2 to each sentence pair according to the non-paraphrase / borderline paraphrase / paraphrase scheme.

## Dataset composition

- Number of syntactic paraphrases created through backtranslation: **1,619**
- Number of lexical paraphrases: **1,316**
- Total number of sentence pairs: **2,935**

## Data format

The dataset is distributed in **TSV format** and includes three splits:

- `train`
- `validation`
- `test`

Each row contains the following columns:

- `Dataset`: identifier of the paraphrase generation type or subset
- `ID`: identifier of the original source sentence or example
- `Frase`: original sentence in Galician
- `Paráfrase`: candidate paraphrase in Galician
- `Avaliación`: annotation label on the 0–2 scale

## Example

| Dataset | ID | Frase | Paráfrase | Avaliación |
|---------|----|-------|-----------|-----------:|
| LEX_12 | Parlamento_3572 | E entre eses exemplos de boas prácticas atopamos o Plan estratéxico de igualdade de oportunidades entre mulleres e homes da Universidade de Santiago de Compostela, do 25 de marzo de 2009. | E entre eses exemplos de boas prácticas atopamos o Plan integral de igualdade de oportunidades entre mulleres e homes da Universidade de Santiago de Compostela, do 25 de marzo de 2009. | 0 |
| COMP_1301 | Parlamento_5971 | Pero, entón, ¿de que vén falar aquí, señor Rueda? | Mais, entón, de que vén falar aquí, señor Roda? | 2 |

## Intended uses

This dataset can be used for:

- evaluation of paraphrase detection systems in Galician
- semantic similarity classification
- research on borderline paraphrase phenomena
- low-resource NLP research for Galician
- development and evaluation of paraphrase generation or filtering methods

## Limitations

- The notion of paraphrase is inherently fuzzy, which is why the dataset includes a borderline category.
- Some examples may be difficult even for human annotators to classify consistently.
- The dataset was created partly through automatic generation methods, so generated candidates may reflect artifacts of backtranslation or lexical substitution.
- The dataset is intended primarily for evaluation and analysis, rather than large-scale training.

## License

This dataset is released under the **Creative Commons Attribution 4.0 International (CC BY 4.0)** license. Users are free to share and adapt the material, provided that appropriate credit is given to the original source.

## Usage

Example with `datasets`:

```python
from datasets import load_dataset

# The paths below are relative, so the three TSV files are expected
# in the current working directory.
ds = load_dataset(
    "csv",
    data_files={
        "train": "train.tsv",
        "validation": "validation.tsv",
        "test": "test.tsv",
    },
    delimiter="\t",  # the files are tab-separated
)

# Print the first row of each split.
print(ds["train"][0])
print(ds["validation"][0])
print(ds["test"][0])
```

## Acknowledgements

This dataset was compiled within the Nós Project, funded by the Ministerio para la Transformación Digital y de la Función Pública and by the EU (NextGenerationEU), within the framework of the ILENIA project, reference 2022/TL22/00215336.
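As a complement to the usage example in the card, the sketch below shows one possible way to read rows in the documented column layout using only the Python standard library, map the numeric `Avaliación` values to the category names from the annotation table, and count labels per subset. Several things here are assumptions rather than documented behavior: the helper names (`read_pairs`, `subset_prefix`, `label_counts_by_subset`) are hypothetical, the fields are assumed to be unquoted, and the idea that the text before the underscore in `Dataset` (`LEX`, `COMP`) identifies the subset is inferred only from the two example rows. The sample data is copied from the Example table so that the block runs without the TSV files.

```python
import csv
import io
from collections import Counter

# Category names taken from the annotation scheme table in the card.
LABEL_NAMES = {0: "NON-PARAPHRASE", 1: "BORDERLINE PARAPHRASE", 2: "PARAPHRASE"}

# Column names as listed in the "Data format" section.
HEADER = ["Dataset", "ID", "Frase", "Paráfrase", "Avaliación"]

# The two rows shown in the card's Example table.
SAMPLE_ROWS = [
    [
        "LEX_12",
        "Parlamento_3572",
        "E entre eses exemplos de boas prácticas atopamos o Plan estratéxico de "
        "igualdade de oportunidades entre mulleres e homes da Universidade de "
        "Santiago de Compostela, do 25 de marzo de 2009.",
        "E entre eses exemplos de boas prácticas atopamos o Plan integral de "
        "igualdade de oportunidades entre mulleres e homes da Universidade de "
        "Santiago de Compostela, do 25 de marzo de 2009.",
        "0",
    ],
    [
        "COMP_1301",
        "Parlamento_5971",
        "Pero, entón, ¿de que vén falar aquí, señor Rueda?",
        "Mais, entón, de que vén falar aquí, señor Roda?",
        "2",
    ],
]
SAMPLE_TSV = "\n".join("\t".join(row) for row in [HEADER] + SAMPLE_ROWS) + "\n"


def read_pairs(handle):
    """Parse tab-separated rows into dicts and convert the label to int."""
    # Assumption: fields are not quoted, so quote characters are kept literally.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = []
    for row in reader:
        row["Avaliación"] = int(row["Avaliación"])
        rows.append(row)
    return rows


def subset_prefix(row):
    """Return the text before the first underscore in `Dataset`, e.g. 'LEX'."""
    # Assumption: this prefix marks the subset; the card does not document it.
    return row["Dataset"].split("_", 1)[0]


def label_counts_by_subset(rows):
    """Count category names per subset prefix."""
    counts = {}
    for row in rows:
        name = LABEL_NAMES[row["Avaliación"]]
        counts.setdefault(subset_prefix(row), Counter())[name] += 1
    return counts


rows = read_pairs(io.StringIO(SAMPLE_TSV))
summary = label_counts_by_subset(rows)
for prefix, counter in sorted(summary.items()):
    print(prefix, dict(counter))
```

To run the same summary on a real split, a locally downloaded file could be passed instead of the in-memory sample, for example `with open("test.tsv", encoding="utf-8", newline="") as f: rows = read_pairs(f)`, again assuming the file follows the column layout described in the card.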


Provider:

proxectonos
