CASIMIR
Source: arXiv | Updated: 2024-03-19 | Indexed: 2024-06-21
Download link:
https://huggingface.co/datasets/taln-ls2n/CASIMIR
Description:
CASIMIR is a large-scale dataset of revised versions of 15,646 scientific articles, created by Nantes Université and partner institutions. The articles were collected from the OpenReview platform. The dataset contains 3.7 million automatically aligned pairs of edited sentences, amounting to 5.2 million individual edits, and each edit carries an automatically assigned revision-intent label. It was built in three steps: extracting the text content from the PDF files, aligning sentences across versions and extracting the edits, and automatically labeling each edit by type. CASIMIR is intended mainly for training and evaluating scientific writing assistance tools, in particular text revision models that help researchers improve the writing quality of their articles.
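The edit-extraction step described above can be illustrated with a minimal sketch. It assumes two versions of a sentence have already been aligned and uses Python's standard `difflib` for a word-level diff; this is a stand-in for illustration, not the tooling used to build CASIMIR, and the function name `extract_edits` and the example sentences are hypothetical.

```python
from difflib import SequenceMatcher


def extract_edits(old_sentence: str, new_sentence: str) -> list[dict]:
    """Return word-level edits between two aligned versions of a sentence.

    Illustrative only: whitespace tokenization and difflib opcodes are
    assumptions, not the procedure used for the dataset itself.
    """
    old_tokens = old_sentence.split()
    new_tokens = new_sentence.split()
    matcher = SequenceMatcher(a=old_tokens, b=new_tokens, autojunk=False)

    edits = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue  # unchanged spans are not edits
        edits.append(
            {
                "type": tag,  # "replace", "insert", or "delete"
                "old": " ".join(old_tokens[i1:i2]),
                "new": " ".join(new_tokens[j1:j2]),
            }
        )
    return edits


# Hypothetical sentence pair, written for this example.
old = "We propose a new method for text revision ."
new = "We present a novel method for scientific text revision ."
for edit in extract_edits(old, new):
    print(edit)
```

One sentence pair can yield several edits, which is consistent with the dataset reporting more individual edits (5.2 million) than sentence pairs (3.7 million). Assigning a revision-intent label to each edit would be a separate classification step that this sketch does not cover.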
Provided by:
Nantes Université, École Centrale Nantes, CNRS, LS2N, UMR 6004, F-44000 Nantes, France
Created:
2024-03-01



