MLPA-400

Mendeley Data · Updated 2019-04-20 · Indexed 2026-04-09

Download link:

https://data.mendeley.com/datasets/mvkcpkx9ww


Description:

We considered a realistic problem of multi-label authorship attribution (AA) in the realm of scientific publications by creating a publicly available dataset of 400 Machine Learning papers, Machine Learning Papers’ Authorship 400 (MLPA-400). To the best of our knowledge, multi-label AA of scientific publications has not received much attention. It deserves more, because automatic resolution of authorship issues in papers has a variety of downstream applications in intellectual property management, citation analysis, archival systems, and author disambiguation. The task is challenging: papers have many authors, whose writing style can evolve or be influenced by colleagues; papers contain direct quotes from other works; each author's contribution to a paper, in terms of the amount of text written, is unknown; and the number of papers and authors is large.

Considerations

Many approaches to creating a suitable corpus exist. For example, papers can be chosen across domains. However, even within one domain the stylistic differences between venues are significant enough to make individual style hard to detect. A random sample of authors can be taken, but few of the resulting documents would be multi-labeled. Another possibility is to take the transitive closure of the set of co-authors and extract at least k papers per author. However, creating such a dataset for any reasonable k results in a very large transitive set.

Design

Using Google Scholar as a source, we created a list of the top 20 authors in Machine Learning, ranked by number of citations. We ensured that a reasonable number of papers had overlapping authors (i.e., we also included papers jointly authored by members of this set). For each author, 20 papers were downloaded, for a total of 400 publications in the entire dataset. Each work is assigned 20 binary labels, which indicate which of the 20 authors contributed to the paper. 100 of the 400 papers have more than one author from the 20 listed. The number of listed authors per paper ranges from 1 to 3, with an average of 1.2925.

The text was extracted from the PDF files using pdfminer (Shinyama, 2017) and pre-processed. The title, authorship information, and bibliography fields were removed from each paper so that the classifier abides by the rules of blind review instead of simply using the author list while learning authorship. Formulas and table and figure captions were retained, as they may contain valuable author-specific style and topic information.

The dataset is available here: URL to repo

If you find this dataset useful, please cite as follows: Dainis Boumber, Yifan Zhang and A. Mukherjee. “Experiments with convolutional neural networks for multi-label authorship attribution.” Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Paris, France, 2018. European Language Resources Association (ELRA).
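The labeling scheme in the description (one binary label per listed author for each paper) can be illustrated with a minimal Python sketch. The author names, paper IDs, and function names below are hypothetical placeholders, not taken from the dataset or its file layout; the toy example uses 3 authors rather than 20.

```python
from typing import Dict, List


def encode_labels(
    paper_authors: Dict[str, List[str]], authors: List[str]
) -> Dict[str, List[int]]:
    """Turn each paper's author list into a multi-hot vector over `authors`."""
    index = {name: i for i, name in enumerate(authors)}
    encoded = {}
    for paper_id, names in paper_authors.items():
        row = [0] * len(authors)
        for name in names:
            row[index[name]] = 1  # mark this listed author as a contributor
        encoded[paper_id] = row
    return encoded


def label_cardinality(encoded: Dict[str, List[int]]) -> float:
    """Average number of positive labels per paper."""
    return sum(sum(row) for row in encoded.values()) / len(encoded)


# Hypothetical toy data: 3 authors and 4 papers (MLPA-400 has 20 and 400).
authors = ["author_a", "author_b", "author_c"]
papers = {
    "p1": ["author_a"],
    "p2": ["author_a", "author_c"],
    "p3": ["author_b"],
    "p4": ["author_b", "author_c", "author_a"],
}

labels = encode_labels(papers, authors)
print(labels["p2"])
print(label_cardinality(labels))
```

Applied to the full dataset, the same average would be computed over 400 vectors of length 20; the reported value of 1.2925 corresponds to 517 positive labels in total (400 × 1.2925).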


Created:

2019-04-20
