
ParaSCI

Source: arXiv | Updated: 2021-02-05 | Indexed: 2024-06-21
Download link:
https://github.com/dqxiu/ParaSCI
Description:
ParaSCI is the first large-scale paraphrase dataset for the scientific domain, created by the Wangxuan Institute of Computer Technology at Peking University. It contains 350,044 paraphrase sentence pairs, divided into two subsets: ParaSCI-ACL and ParaSCI-arXiv. The pairs are collected with both intra-paper and inter-paper methods, for example by gathering citations to the same paper or by aggregating definitions of the same scientific term. ParaSCI is characterized by notable sentence length and textual diversity. It is suitable for training paraphrase generation models and can also be used to augment training data for other NLP tasks in the scientific domain.
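
To make the "citations to the same paper" idea concrete, here is a minimal sketch of how candidate pairs could be formed by grouping citing sentences by the paper they cite. This is an illustration only, not the authors' pipeline: the `(cited_paper_id, sentence)` input format, the `candidate_pairs` helper, and the example sentences are all hypothetical, and a real pipeline would also need to filter out pairs that are not true paraphrases.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(citing_sentences):
    """Pair up sentences that cite the same paper.

    citing_sentences: iterable of (cited_paper_id, sentence) tuples.
    This input format is an assumption made for the sketch.
    """
    groups = defaultdict(list)
    for paper_id, sentence in citing_sentences:
        groups[paper_id].append(sentence)

    pairs = []
    for sentences in groups.values():
        # Every unordered pair within a group is a paraphrase candidate.
        pairs.extend(combinations(sentences, 2))
    return pairs


# Made-up example sentences; "[A]" and "[B]" stand for citation markers.
examples = [
    ("A", "[A] proposed an attention-based encoder for translation."),
    ("A", "An attention-based translation encoder was introduced in [A]."),
    ("A", "We follow the attention encoder of [A]."),
    ("B", "[B] released a benchmark for question answering."),
]

pairs = candidate_pairs(examples)
print(len(pairs))
```

Sentences citing paper "A" yield candidate pairs with each other, while the single sentence citing "B" yields none.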
Provider:
Wangxuan Institute of Computer Technology, Peking University
Created:
2021-01-21