taln-ls2n/CASIMIR

Name: taln-ls2n/CASIMIR
Creator: taln-ls2n
Published: 2025-10-21 17:17:34
License: MIT

Hugging Face. Updated: 2025-10-21. Indexed: 2024-03-04.

Download link:

https://hf-mirror.com/datasets/taln-ls2n/CASIMIR


Resource description:

---
license: mit
language:
- en
configs:
- config_name: article_pairs
  data_files:
  - split: train
    path: "article_pairs_train.jsonl"
  - split: validation
    path: "article_pairs_dev.jsonl"
  - split: test
    path: "article_pairs_test.jsonl"
- config_name: mapping
  data_files:
  - split: train
    path: "mapping_train.jsonl"
  - split: validation
    path: "mapping_dev.jsonl"
  - split: test
    path: "mapping_test.jsonl"
- config_name: mapping_full_corpus
  data_files: "mapping_full_corpus.jsonl"
- config_name: article_pairs_small_test
  data_files: "article_pairs_small_test.jsonl"
- config_name: mapping_small_test
  data_files: "mapping_small_test.jsonl"
- config_name: metadata
  data_files: "metadata.jsonl"
- config_name: reviews
  data_files: "reviews.jsonl"
---

# CASIMIR: A Corpus of Scientific Articles enhanced with Multiple Author-Integrated Revisions

## About

This repository contains the CASIMIR dataset, which gathers the multiple revised versions of 15,646 scientific articles from OpenReview, along with their peer reviews. Pairs of consecutive versions of an article are aligned at sentence level, and paragraph location information is kept as metadata to support future revision studies at the discourse level. Each pair of revised sentences is enriched with automatically extracted edits and an associated revision intention.

The creation process of this dataset is described in the following references:

- [CASIMIR: A Corpus of Scientific Articles Enhanced with Multiple Author-Integrated Revisions](https://aclanthology.org/2024.lrec-main.257/) (Jourdan et al., LREC-COLING 2024)
- [CASIMIR : un Corpus d’Articles Scientifiques Intégrant les ModIfications et Révisions des auteurs](https://aclanthology.org/2023.jeptalnrecital-arts.10/) (Jourdan et al., JEP/TALN/RECITAL 2023)

## Content

The dataset is composed of different subsets:

- **article_pairs**: Main data: pairs of articles aligned at sentence level, with their extracted edits. Divided into *train*, *validation*, *test*.
- **mapping**: Mapping from each OpenReview forum id to the ids of the associated article versions. Divided into *train*, *validation*, *test*.
- **mapping_full_corpus**: Same data as **mapping**, but for the whole corpus at once.
- **article_pairs_small_test** and **mapping_small_test**: As running inference with large models over the full test set is computationally expensive and time-consuming, we also provide a smaller test subset, representing 30% of the original test set (approximately 3% of the full dataset, i.e. 468 papers).
- **reviews**: All comments posted on the article's forum on OpenReview.
- **metadata**: dates, authors, keywords, venue, ids, ...

The **article_pairs** subset is divided into the following three splits:

| Split      | # articles | # versions/article (avg) | # edits/pair (avg) | Edit length (avg) | % (Label) Content | % (Label) Grammar-Typo | % (Label) Format | % (Label) Language |
| :--------- | ---------: | -----------------------: | -----------------: | ----------------: | ----------------: | ---------------------: | ---------------: | -----------------: |
| Train      |     12,488 |                     3.49 |             141.85 |             34.90 |             41.99 |                  22.69 |            20.44 |              14.88 |
| Test       |      1,561 |                     3.51 |             142.98 |             35.24 |             42.70 |                  21.43 |            20.47 |              15.40 |
| Validation |      1,597 |                     3.49 |             143.38 |             34.35 |             41.13 |                  24.35 |            19.79 |              14.73 |

The following data fields are available:

- **WIP**

### Please cite this work as:

```
@inproceedings{jourdan-etal-2024-casimir,
    title = "{CASIMIR}: A Corpus of Scientific Articles Enhanced with Multiple Author-Integrated Revisions",
    author = "Jourdan, L{\'e}ane and
      Boudin, Florian and
      Hernandez, Nicolas and
      Dufour, Richard",
    editor = "Calzolari, Nicoletta and
      Kan, Min-Yen and
      Hoste, Veronique and
      Lenci, Alessandro and
      Sakti, Sakriani and
      Xue, Nianwen",
    booktitle = "Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)",
    month = may,
    year = "2024",
    address = "Torino, Italia",
    publisher = "ELRA and ICCL",
    url = "https://aclanthology.org/2024.lrec-main.257/",
    pages = "2883--2892",
    abstract = "Writing a scientific article is a challenging task as it is a highly codified and specific genre, consequently proficiency in written communication is essential for effectively conveying research findings and ideas. In this article, we propose an original textual resource on the revision step of the writing process of scientific articles. This new dataset, called CASIMIR, contains the multiple revised versions of 15,646 scientific articles from OpenReview, along with their peer reviews. Pairs of consecutive versions of an article are aligned at sentence-level while keeping paragraph location information as metadata for supporting future revision studies at the discourse level. Each pair of revised sentences is enriched with automatically extracted edits and associated revision intention. To assess the initial quality on the dataset, we conducted a qualitative study of several state-of-the-art text revision approaches and compared various evaluation metrics. Our experiments led us to question the relevance of the current evaluation methods for the text revision task."
}
```

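As a quick consistency check on the figures quoted in the dataset card, the sketch below recomputes the corpus size and the small test subset size from the per-split article counts in the split table. The names `SPLIT_ARTICLES` and `small_test_size` are illustrative and not part of the dataset, and rounding to the nearest integer is an assumption: the card does not say how the 30% subset was sampled.

```python
# Article counts per split, copied from the article_pairs split table.
SPLIT_ARTICLES = {"train": 12_488, "validation": 1_597, "test": 1_561}

# Figures stated in the prose of the dataset card.
STATED_TOTAL = 15_646
STATED_SMALL_TEST = 468


def small_test_size(test_articles: int, fraction: float = 0.30) -> int:
    """Size of a subset covering `fraction` of the test split.

    Rounding to the nearest integer is an assumption made for this sketch.
    """
    return round(test_articles * fraction)


total = sum(SPLIT_ARTICLES.values())
small = small_test_size(SPLIT_ARTICLES["test"])
share_of_corpus = small / total

print(f"total articles: {total} (stated: {STATED_TOTAL})")
print(f"small test set: {small} papers (stated: {STATED_SMALL_TEST})")
print(f"share of full corpus: {share_of_corpus:.1%}")
```

Under these assumptions the three splits add up to the stated 15,646 articles, and 30% of the test split gives the stated 468 papers, which is about 3% of the corpus.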
Provider:

taln-ls2n

Original information summary

Dataset license

License type: MIT
