ParaSCI
arXiv | Updated 2021-02-05 | Indexed 2024-06-21
Download link:
https://github.com/dqxiu/ParaSCI
Resource overview:
ParaSCI is the first large-scale paraphrase dataset for the scientific domain, created by the Wangxuan Institute of Computer Technology at Peking University. It contains 350,044 paraphrase sentence pairs, split into two subsets: ParaSCI-ACL and ParaSCI-arXiv. The pairs are built with both intra-paper and inter-paper methods, for example by collecting different citations of the same paper or by aggregating definitions of the same scientific term. The paraphrases are notable for their sentence length and textual diversity, which makes ParaSCI suitable for training paraphrase generation models and for augmenting the training data of other NLP tasks in the scientific domain.
Provided by:
Wangxuan Institute of Computer Technology, Peking University
Created:
2021-01-21



