taln-ls2n/silk
Source: Hugging Face. Updated 2024-09-22; indexed 2025-04-12.
Download link: https://hf-mirror.com/datasets/taln-ls2n/silk
Description:
---
license: cc
task_categories:
- text-generation
- text2text-generation
language:
- en
tags:
- keyphrase-generation
- domain-adaptation
- paleontology
- astrophysics
- natural-language-processing
size_categories:
- 1K<n<10K
configs:
- config_name: paleo
  data_files:
  - split: train
    path:
    - "paleo/train.jsonl"
  - split: test
    path: "paleo/test.jsonl"
- config_name: nlp
  data_files:
  - split: train
    path:
    - "nlp/train.jsonl"
  - split: test
    path: "nlp/test.jsonl"
- config_name: astro
  data_files:
  - split: train
    path:
    - "astro/train.jsonl"
  - split: test
    path: "astro/test.jsonl"
---
# `silk` synthetic training samples and human-labeled test sets for domain adaptation in keyphrase generation
This dataset contains the synthetic samples generated by 🧵 `silk`, a method that leverages citation contexts to create synthetic samples of documents paired with silver-standard keyphrases for adapting keyphrase generation models to new domains.
We applied `silk` to three domains: Natural Language Processing (`nlp`), Astrophysics (`astro`) and Paleontology (`paleo`).
This dataset also includes three human-labeled test sets to assess the performance of keyphrase generation across these domains.
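## Loading the data

The snippet below is a minimal sketch of how one domain could be loaded, assuming the Hugging Face `datasets` library is installed and the repository is reachable as `taln-ls2n/silk`. The configuration and split names come from the card metadata above; the helper names `data_files` and `load_silk` are hypothetical and only serve the illustration. The record fields are not documented in this card, so the sketch does not assume any.

```python
# Configurations and splits as declared in the dataset card metadata.
CONFIGS = ("paleo", "nlp", "astro")
SPLITS = ("train", "test")


def data_files(config: str) -> dict[str, str]:
    """Return the JSONL path of each split for one domain (hypothetical helper)."""
    if config not in CONFIGS:
        raise ValueError(f"unknown config {config!r}; expected one of {CONFIGS}")
    return {split: f"{config}/{split}.jsonl" for split in SPLITS}


def load_silk(config: str, split: str):
    """Load one split of one domain from the Hub (hypothetical helper).

    Assumes `pip install datasets` and network access to the Hub.
    """
    if config not in CONFIGS:
        raise ValueError(f"unknown config {config!r}; expected one of {CONFIGS}")
    if split not in SPLITS:
        raise ValueError(f"unknown split {split!r}; expected one of {SPLITS}")
    # Imported lazily so the path helper above works without the library.
    from datasets import load_dataset

    return load_dataset("taln-ls2n/silk", config, split=split)
```

For example, `load_silk("astro", "train")` would fetch the synthetic Astrophysics training samples and `load_silk("astro", "test")` the matching human-labeled test set; inspecting the first record is the safest way to discover the field names.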
## Citation
If you use this dataset, please cite the following paper:
```
Florian Boudin and Akiko Aizawa.
Unsupervised Domain Adaptation for Keyphrase Generation using Citation Context,
Proceedings of EMNLP 2024 (Findings).
```
Provider:
taln-ls2n



