
proxectonos/parafrases_gl

Hugging Face · Updated 2026-04-24 · Indexed 2024-06-29
Download link:
https://hf-mirror.com/datasets/proxectonos/parafrases_gl
Resource description:
---
language:
- gl
pretty_name: Paraphrases GL
license: cc-by-4.0
task_categories:
- text-classification
task_ids:
- semantic-similarity-classification
tags:
- galician
- paraphrase
- semantic-similarity
- evaluation
- nlp
size_categories:
- 1K<n<10K
configs:
- config_name: default
  data_files:
  - split: train
    path: train.tsv
  - split: validation
    path: validation.tsv
  - split: test
    path: test.tsv
---

# Paraphrases GL

## Dataset description

Paraphrases GL is a dataset for the evaluation of paraphrase resources in Galician.

Because paraphrase is a complex notion and lacks a fully precise and universally accepted definition, this dataset adopts a three-way annotation scheme designed to capture both clear paraphrases and borderline cases. Each pair of sentences is labeled with one of the following values:

| Label | Category | Definition |
|------:|----------|------------|
| 0 | NON-PARAPHRASE | The two sentences do not clearly and strictly share the same meaning. |
| 1 | BORDERLINE PARAPHRASE | The two sentences share a very similar, nearly paraphrastic meaning, and it is difficult to classify them in the other categories. |
| 2 | PARAPHRASE | The two sentences clearly and strictly share the same meaning. |

## Data creation methodology

### Text selection

The source texts were selected from materials whose licenses allowed publication. The dataset was designed to include varied text types, with a preference for spoken or conversational language where possible.

The sources include:

- **Wikipedia**, providing varied encyclopedic texts
- **Parliament**, using transcripts of political sessions in the Parliament of Galicia
- **CRTVG**, using transcripts of programs from the Galician public broadcaster
- **Hugin e Munin novels**, using dialogues from novels made available by Hugin e Munin

### Creation of paraphrase pairs

Two types of paraphrase candidates were generated:

- **syntactic or full paraphrases**, created through backtranslation using a pivot language such as Czech, English, or Spanish
- **lexical paraphrases**, created through automatic term substitution with BERT

Two annotators assigned labels 0, 1, or 2 to each sentence pair according to the non-paraphrase / borderline paraphrase / paraphrase scheme.

## Dataset composition

- Number of syntactic paraphrases created through backtranslation: **1,619**
- Number of lexical paraphrases: **1,316**
- Total number of sentence pairs: **2,935**

## Data format

The dataset is distributed in **TSV format** and includes three splits:

- `train`
- `validation`
- `test`

Each row contains the following columns:

- `Dataset`: identifier of the paraphrase generation type or subset
- `ID`: identifier of the original source sentence or example
- `Frase`: original sentence in Galician
- `Paráfrase`: candidate paraphrase in Galician
- `Avaliación`: annotation label on the 0–2 scale

## Example

| Dataset | ID | Frase | Paráfrase | Avaliación |
|---------|----|-------|-----------|-----------:|
| LEX_12 | Parlamento_3572 | E entre eses exemplos de boas prácticas atopamos o Plan estratéxico de igualdade de oportunidades entre mulleres e homes da Universidade de Santiago de Compostela, do 25 de marzo de 2009. | E entre eses exemplos de boas prácticas atopamos o Plan integral de igualdade de oportunidades entre mulleres e homes da Universidade de Santiago de Compostela, do 25 de marzo de 2009. | 0 |
| COMP_1301 | Parlamento_5971 | Pero, entón, ¿de que vén falar aquí, señor Rueda? | Mais, entón, de que vén falar aquí, señor Roda? | 2 |

## Intended uses

This dataset can be used for:

- evaluation of paraphrase detection systems in Galician
- semantic similarity classification
- research on borderline paraphrase phenomena
- low-resource NLP research for Galician
- development and evaluation of paraphrase generation or filtering methods

## Limitations

- The notion of paraphrase is inherently fuzzy, which is why the dataset includes a borderline category.
- Some examples may be difficult even for human annotators to classify consistently.
- The dataset was created partly through automatic generation methods, so generated candidates may reflect artifacts of backtranslation or lexical substitution.
- The dataset is intended primarily for evaluation and analysis, rather than large-scale training.

## License

This dataset is released under the **Creative Commons Attribution 4.0 International (CC BY 4.0)** license. Users are free to share and adapt the material, provided that appropriate credit is given to the original source.

## Usage

Example with `datasets`:

```python
from datasets import load_dataset

# Load the three TSV splits from local files using the generic CSV builder.
ds = load_dataset(
    "csv",
    data_files={
        "train": "train.tsv",
        "validation": "validation.tsv",
        "test": "test.tsv",
    },
    delimiter="\t",  # the files are tab-separated
)

# Print the first row of each split.
print(ds["train"][0])
print(ds["validation"][0])
print(ds["test"][0])
```

## Acknowledgements

This dataset was compiled within the Nós Project, funded by the Ministerio para la Transformación Digital y de la Función Pública and by the EU (NextGenerationEU), within the framework of the ILENIA project, reference 2022/TL22/00215336.
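The column layout described under Data format can also be read with the Python standard library alone. The following is a minimal sketch, not part of the original card: it parses the two rows from the Example section, maps each `Avaliación` value to its category name, and counts pairs per subset. The helper names (`read_pairs`, `generation_type`, `summarize`) are hypothetical, and treating the text before the first underscore in `Dataset` (`LEX`, `COMP`) as the subset name is an assumption based only on the example rows; the card does not define these prefixes.

```python
import csv
import io
from collections import Counter

# Label names taken from the annotation scheme table.
LABEL_NAMES = {0: "NON-PARAPHRASE", 1: "BORDERLINE PARAPHRASE", 2: "PARAPHRASE"}

# The two rows from the Example section, tab-joined to mimic the TSV files.
SAMPLE_ROWS = [
    ["Dataset", "ID", "Frase", "Paráfrase", "Avaliación"],
    [
        "LEX_12",
        "Parlamento_3572",
        "E entre eses exemplos de boas prácticas atopamos o Plan estratéxico de igualdade de oportunidades entre mulleres e homes da Universidade de Santiago de Compostela, do 25 de marzo de 2009.",
        "E entre eses exemplos de boas prácticas atopamos o Plan integral de igualdade de oportunidades entre mulleres e homes da Universidade de Santiago de Compostela, do 25 de marzo de 2009.",
        "0",
    ],
    [
        "COMP_1301",
        "Parlamento_5971",
        "Pero, entón, ¿de que vén falar aquí, señor Rueda?",
        "Mais, entón, de que vén falar aquí, señor Roda?",
        "2",
    ],
]
SAMPLE_TSV = "\n".join("\t".join(row) for row in SAMPLE_ROWS)


def read_pairs(handle):
    """Yield one dict per sentence pair, with the label converted to int."""
    for row in csv.DictReader(handle, delimiter="\t"):
        row["Avaliación"] = int(row["Avaliación"])
        yield row


def generation_type(dataset_id):
    """Assumption: the text before the first underscore names the subset."""
    return dataset_id.split("_", 1)[0]


def summarize(rows):
    """Count sentence pairs per (subset prefix, label name)."""
    counts = Counter()
    for row in rows:
        key = (generation_type(row["Dataset"]), LABEL_NAMES[row["Avaliación"]])
        counts[key] += 1
    return counts


counts = summarize(read_pairs(io.StringIO(SAMPLE_TSV)))
for (subset, label), n in sorted(counts.items()):
    print(subset, label, n)
```

To run the same summary on a real split, the in-memory sample could be replaced with a file handle such as `open("train.tsv", encoding="utf-8", newline="")`, assuming the file has been downloaded locally and uses the header row shown above.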

Provider: proxectonos