
Citation-Context Dataset (C2D)

DataCite Commons | Updated 2025-04-01 | Indexed 2025-04-16
Download link:
https://ordo.open.ac.uk/articles/Citation-Context_Dataset_C2D_/6865298/2
Description:
We have released the first version of C2D, a citation-context dataset created during an experiment for work that will be published as a short paper at RecSys 2018.

C2D was built from 2 million full-text open-source research publications obtained from CORE and contains 53 million unique citation records. To construct it, we extracted citation information from each publication: the cited document's title, author(s), publication date, and citation context. The assumptions behind citation-context extraction are described below.

First, we extracted the position of each citation mention together with its citation context, that is, the text around the mention of the cited document. For our purposes, a citation context consists of three sentences: the sentence in which the reference is cited, the preceding sentence, and the following sentence. When the citing sentence is at the start or end of a paragraph, the preceding or following sentence, respectively, is not extracted.

The dataset therefore contains the following attributes:
ReferenceID - unique identifier of a cited reference within a citing document.
SourceID - unique identifier of a citing document.
ChapterNumber - chapter number in the citing document where ReferenceID is mentioned.
ParagraphNumber - paragraph number in the citing document where ReferenceID is mentioned.
SentenceNumber - sentence number in the citing document where ReferenceID is mentioned.
Title - title of the reference ReferenceID.
PublishedDate - publication date of the reference ReferenceID.
Authors - author(s) of the reference ReferenceID.
TextBeforeRefMention - sentence immediately before the sentence in which ReferenceID is cited.
TextWhereRefMention - sentence in which ReferenceID is cited.
TextAfterRefMention - sentence immediately after the sentence in which ReferenceID is cited.

Please cite our paper if you use this dataset.

Note:
The uncompressed dataset is ~40 GB; the compressed size is ~6.7 GB. Because requirements differ between users, we have released the raw version of the dataset. No data cleansing (such as special-character or stop-word removal) has been performed.
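The three-sentence context rule can be illustrated with a minimal Python sketch. This is not the authors' extraction code: the `CitationContext` class, the `build_context` function, and the choice of `None` for a missing neighbouring sentence are assumptions made for illustration. The description does not state how the raw files encode an absent sentence, nor how sentences were segmented.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CitationContext:
    # Hypothetical container; fields loosely mirror TextBeforeRefMention,
    # TextWhereRefMention and TextAfterRefMention from the attribute list.
    text_before: Optional[str]
    text_where: str
    text_after: Optional[str]


def build_context(paragraph_sentences: List[str], idx: int) -> CitationContext:
    """Return the citing sentence and its neighbours within one paragraph.

    Assumption: a neighbour that would fall outside the paragraph is
    represented as None rather than borrowed from an adjacent paragraph.
    """
    if not 0 <= idx < len(paragraph_sentences):
        raise IndexError("sentence index outside paragraph")
    before = paragraph_sentences[idx - 1] if idx > 0 else None
    after = (
        paragraph_sentences[idx + 1]
        if idx + 1 < len(paragraph_sentences)
        else None
    )
    return CitationContext(before, paragraph_sentences[idx], after)


# Illustrative paragraph, already split into sentences (made-up text).
paragraph = [
    "Recommender systems rank items for users.",
    "Citation contexts have been used as features [1].",
    "We follow that idea here.",
]

middle = build_context(paragraph, 1)  # both neighbours available
first = build_context(paragraph, 0)   # paragraph start: no preceding sentence
last = build_context(paragraph, 2)    # paragraph end: no following sentence
```

A citation in the middle of a paragraph yields all three sentences, while one in the first or last sentence yields only the neighbour that exists inside the same paragraph.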

Provider:
The Open University
Created:
2018-08-15
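Background overview:
Citation-Context Dataset (C2D) is a scholarly dataset of 53 million citation records built from 2 million open-source papers. It focuses on extracted citation contexts and bibliographic metadata, and is suited to information retrieval and recommender systems research. The data is kept in its raw form with no data cleansing applied; the compressed size is 6.33 GB.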
The summary above was compiled and generated by 遇见数据集.