proxectonos/parafrases_gl
Hugging Face · updated 2026-04-24 · indexed 2024-06-29
Download link: https://hf-mirror.com/datasets/proxectonos/parafrases_gl
---
language:
- gl
pretty_name: Paraphrases GL
license: cc-by-4.0
task_categories:
- text-classification
task_ids:
- semantic-similarity-classification
tags:
- galician
- paraphrase
- semantic-similarity
- evaluation
- nlp
size_categories:
- 1K<n<10K
configs:
- config_name: default
  data_files:
  - split: train
    path: train.tsv
  - split: validation
    path: validation.tsv
  - split: test
    path: test.tsv
---
# Paraphrases GL
## Dataset description
Paraphrases GL is a dataset for the evaluation of paraphrase resources in Galician.
Because paraphrase is a complex notion and lacks a fully precise and universally accepted definition, this dataset adopts a three-way annotation scheme designed to capture both clear paraphrases and borderline cases.
Each pair of sentences is labeled with one of the following values:
| Label | Category | Definition |
|------:|----------|------------|
| 0 | NON-PARAPHRASE | The two sentences do not clearly and strictly share the same meaning. |
| 1 | BORDERLINE PARAPHRASE | The two sentences share a very similar, nearly paraphrastic meaning, and they are difficult to assign to either of the other categories. |
| 2 | PARAPHRASE | The two sentences clearly and strictly share the same meaning. |
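The label scheme above can be expressed as a small lookup. This is a minimal sketch: the category names are taken from the table, while `LABEL_NAMES` and `label_name` are hypothetical helper names that are not part of the dataset.

```python
# Label names copied from the table above; the helper itself is illustrative.
LABEL_NAMES = {
    0: "NON-PARAPHRASE",
    1: "BORDERLINE PARAPHRASE",
    2: "PARAPHRASE",
}


def label_name(value):
    """Return the category name for a label given as an int or a numeric string."""
    label = int(value)
    if label not in LABEL_NAMES:
        raise ValueError(f"unexpected label {value!r}; expected 0, 1, or 2")
    return LABEL_NAMES[label]


print(label_name("2"))
```

Accepting strings as well as integers is a convenience for values read directly from a TSV file, where every field arrives as text.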
## Data creation methodology
### Text selection
The source texts were selected from materials whose licenses allowed publication. The dataset was designed to include varied text types, with a preference for spoken or conversational language where possible.
The sources include:
- **Wikipedia**, providing varied encyclopedic texts
- **Parliament**, using transcripts of political sessions in the Parliament of Galicia
- **CRTVG**, using transcripts of programs from the Galician public broadcaster
- **Hugin e Munin novels**, using dialogues from novels made available by Hugin e Munin
### Creation of paraphrase pairs
Two types of paraphrase candidates were generated:
- **syntactic or full paraphrases**, created through backtranslation using a pivot language such as Czech, English, or Spanish
- **lexical paraphrases**, created through automatic term substitution with BERT
Two annotators assigned labels 0, 1, or 2 to each sentence pair according to the non-paraphrase / borderline paraphrase / paraphrase scheme.
## Dataset composition
- Number of syntactic paraphrases created through backtranslation: **1,619**
- Number of lexical paraphrases: **1,316**
- Total number of sentence pairs: **2,935**
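The figures above can be checked with a few lines of Python, and the same idea extends to counting rows per subset. In this sketch, grouping by the text before the first underscore of the `Dataset` column is an assumption based only on the `LEX_` and `COMP_` prefixes visible in the example rows of this card; the card does not document what the prefixes mean. `count_by_prefix` is a hypothetical helper name.

```python
from collections import Counter

# Counts reported in the dataset card.
SYNTACTIC_PAIRS = 1619
LEXICAL_PAIRS = 1316
TOTAL_PAIRS = 2935

# The two subsets add up to the reported total.
assert SYNTACTIC_PAIRS + LEXICAL_PAIRS == TOTAL_PAIRS


def count_by_prefix(dataset_ids):
    """Count rows per prefix of the `Dataset` column (text before the first underscore)."""
    return Counter(identifier.split("_", 1)[0] for identifier in dataset_ids)


# "LEX_12" and "COMP_1301" come from the example rows; "LEX_40" is made up.
counts = count_by_prefix(["LEX_12", "COMP_1301", "LEX_40"])
print(counts)
```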
## Data format
The dataset is distributed in **TSV format** and includes three splits:
- `train`
- `validation`
- `test`
Each row contains the following columns:
- `Dataset`: identifier of the paraphrase generation type or subset
- `ID`: identifier of the original source sentence or example
- `Frase`: original sentence in Galician
- `Paráfrase`: candidate paraphrase in Galician
- `Avaliación`: annotation label on the 0–2 scale
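The column layout above can be read with the standard library alone. The following is a minimal sketch that assumes each TSV file starts with a header row carrying exactly these column names; `ParaphrasePair` and `read_pairs` are hypothetical names introduced for illustration.

```python
import csv
import io
from dataclasses import dataclass


@dataclass(frozen=True)
class ParaphrasePair:
    dataset: str      # `Dataset` column
    source_id: str    # `ID` column
    sentence: str     # `Frase` column
    paraphrase: str   # `Paráfrase` column
    label: int        # `Avaliación` column, 0 to 2


def read_pairs(handle):
    """Parse tab-separated rows from an open text handle (header row assumed)."""
    pairs = []
    for row in csv.DictReader(handle, delimiter="\t"):
        label = int(row["Avaliación"])
        if label not in (0, 1, 2):
            raise ValueError(f"unexpected label {label} for ID {row['ID']}")
        pairs.append(
            ParaphrasePair(
                row["Dataset"], row["ID"], row["Frase"], row["Paráfrase"], label
            )
        )
    return pairs


# In-memory sample built from one of the example rows of this card.
sample = (
    "Dataset\tID\tFrase\tParáfrase\tAvaliación\n"
    "COMP_1301\tParlamento_5971\t"
    "Pero, entón, ¿de que vén falar aquí, señor Rueda?\t"
    "Mais, entón, de que vén falar aquí, señor Roda?\t2\n"
)
pairs = read_pairs(io.StringIO(sample))
print(pairs[0].source_id, pairs[0].label)
```

To read a real split, open the file with `encoding="utf-8"` and `newline=""` and pass the handle to `read_pairs`.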
## Example
| Dataset | ID | Frase | Paráfrase | Avaliación |
|---------|----|-------|-----------|-----------:|
| LEX_12 | Parlamento_3572 | E entre eses exemplos de boas prácticas atopamos o Plan estratéxico de igualdade de oportunidades entre mulleres e homes da Universidade de Santiago de Compostela, do 25 de marzo de 2009. | E entre eses exemplos de boas prácticas atopamos o Plan integral de igualdade de oportunidades entre mulleres e homes da Universidade de Santiago de Compostela, do 25 de marzo de 2009. | 0 |
| COMP_1301 | Parlamento_5971 | Pero, entón, ¿de que vén falar aquí, señor Rueda? | Mais, entón, de que vén falar aquí, señor Roda? | 2 |
## Intended uses
This dataset can be used for:
- evaluation of paraphrase detection systems in Galician
- semantic similarity classification
- research on borderline paraphrase phenomena
- low-resource NLP research for Galician
- development and evaluation of paraphrase generation or filtering methods
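For the evaluation use cases listed above, a three-way classifier can be scored against the `Avaliación` labels. The sketch below computes accuracy and macro-averaged F1 in plain Python. The card does not prescribe a metric, so this choice is an assumption, and the gold and predicted lists are made-up values for illustration only.

```python
def accuracy(gold, predicted):
    """Fraction of sentence pairs whose predicted label matches the gold label."""
    if len(gold) != len(predicted) or not gold:
        raise ValueError("gold and predicted must be non-empty and the same length")
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


def macro_f1(gold, predicted, labels=(0, 1, 2)):
    """Unweighted mean of per-label F1 scores over the three categories."""
    if len(gold) != len(predicted) or not gold:
        raise ValueError("gold and predicted must be non-empty and the same length")
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, predicted))
        fp = sum(g != label and p == label for g, p in zip(gold, predicted))
        fn = sum(g == label and p != label for g, p in zip(gold, predicted))
        denominator = 2 * tp + fp + fn
        # A label absent from both gold and predictions scores 0.0 here.
        scores.append(2 * tp / denominator if denominator else 0.0)
    return sum(scores) / len(scores)


# Made-up labels, not taken from the dataset.
gold = [0, 1, 2, 2, 0, 1]
predicted = [0, 2, 2, 2, 0, 1]
print(accuracy(gold, predicted), macro_f1(gold, predicted))
```

Macro averaging gives the borderline category the same weight as the other two, which may matter if that category is rare; the card does not report the label distribution.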
## Limitations
- The notion of paraphrase is inherently fuzzy, which is why the dataset includes a borderline category.
- Some examples may be difficult even for human annotators to classify consistently.
- The dataset was created partly through automatic generation methods, so generated candidates may reflect artifacts of backtranslation or lexical substitution.
- The dataset is intended primarily for evaluation and analysis, rather than large-scale training.
## License
This dataset is released under the **Creative Commons Attribution 4.0 International (CC BY 4.0)** license.
Users are free to share and adapt the material, provided that appropriate credit is given to the original source.
## Usage
Example with the `datasets` library, assuming the three TSV files are in the working directory:
```python
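from datasets import load_dataset

# Load the three TSV splits with the generic CSV loader, using tab as the delimiter.
ds = load_dataset(
    "csv",
    data_files={
        "train": "train.tsv",
        "validation": "validation.tsv",
        "test": "test.tsv",
    },
    delimiter="\t",
)

# Print the first example of each split.
print(ds["train"][0])
print(ds["validation"][0])
print(ds["test"][0])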
```
## Acknowledgements
This dataset was compiled within the Nós Project, funded by the Ministerio para la Transformación Digital y de la Función Pública and by the EU (NextGenerationEU) within the framework of the ILENIA project, reference 2022/TL22/00215336.
Provider: proxectonos



