
PAN21 Authorship Analysis: Style Change Detection

Indexed from NIAID Data Ecosystem on 2026-03-12
Download link:
https://zenodo.org/record/4589144
Description:
This is the dataset for the Style Change Detection task of PAN 2021. The goal of the style change detection task is to identify the text positions within a given multi-author document at which the author switches.

Tasks

Given a document, we ask participants to answer the following three questions:

- Single vs. Multiple (task 1). Given a text, find out whether the text is written by a single author or by multiple authors.
- Style Change Basic (task 2). Given a text that is written by two or more authors and contains a number of style changes, find the positions of the changes.
- Style Change Real-World (task 3). Given a text written by two or more authors, find all positions of writing style change, i.e., assign all paragraphs of the text uniquely to some author out of the number of authors you assume for the multi-author document.

All documents are provided in English and may contain an arbitrary number of style changes, resulting from at most five different authors. However, style changes may only occur between paragraphs (i.e., a single paragraph is always authored by a single author and does not contain any style changes).

Data

The dataset is split into three parts:

- training set: Contains 70% of the whole data set and includes ground truth data. Use this set to develop and train your models.
- validation set: Contains 15% of the whole data set and includes ground truth data. Use this set to evaluate and optimize your models.
- test set: Contains 15% of the whole data set. For the documents in the test set, you are not given ground truth data. This set is used for evaluation.

The dataset is based on user posts from various sites of the StackExchange network, covering different topics. We refer to each input problem (i.e., the document for which to detect style changes) by an ID, which is subsequently also used to identify the submitted solution to this input problem. We provide one folder each for train, validation, and test data.
For each problem instance X (i.e., each input document), two files are provided:

- problem-X.txt contains the actual text, where paragraphs are separated by \n\n.
- truth-problem-X.json contains the ground truth, i.e., the correct solution in JSON format:

    {
      "authors": NUMBER_OF_AUTHORS,
      "site": SOURCE_SITE,
      "multi-author": RESULT_TASK1,
      "changes": RESULT_ARRAY_TASK2,
      "paragraph-authors": RESULT_ARRAY_TASK3
    }

The result for task 1 (key "multi-author") is a binary value (1 if the document is multi-authored, 0 if the document is single-authored).

The result for task 2 (key "changes") is an array holding a binary value for each pair of consecutive paragraphs within the document (0 if there was no style change, 1 if there was a style change). If the document is single-authored, the solution to task 2 is an array filled with 0s.

For task 3 (key "paragraph-authors"), the result is the sequence of authors in the document, one entry per paragraph (e.g., [1, 2, 1] for a two-author document), where the first author is "1", the second author appearing in the document is referred to as "2", etc.

Furthermore, we provide the total number of authors (key "authors") and the StackExchange site the texts were extracted from (key "site", i.e., the topic).

An example of a multi-author document with a style change between the third and fourth paragraphs could look as follows (we only list the relevant key/value pairs here):

    {
      "multi-author": 1,
      "changes": [0,0,1,...],
      "paragraph-authors": [1,1,1,2,...]
    }

A single-author document would have the following form (again, only listing the relevant key/value pairs):

    {
      "multi-author": 0,
      "changes": [0,0,0,...],
      "paragraph-authors": [1,1,1,...]
    }
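The relationship between the three ground-truth keys can be made concrete with a short Python sketch. It is not part of the official PAN tooling: the helper names (split_paragraphs, changes_from_authors, multi_author_from_changes, load_problem) are hypothetical, and the code assumes the file naming and the \n\n paragraph separator exactly as described above.

```python
import json
from pathlib import Path


def split_paragraphs(text):
    """Split a problem text into paragraphs, assuming '\\n\\n' separators."""
    return text.strip().split("\n\n")


def changes_from_authors(paragraph_authors):
    """Derive the task 2 array from a task 3 array.

    One entry per pair of consecutive paragraphs:
    1 if the author differs, 0 otherwise.
    """
    return [int(a != b) for a, b in zip(paragraph_authors, paragraph_authors[1:])]


def multi_author_from_changes(changes):
    """Derive the task 1 value: 1 if any style change occurs, else 0."""
    return int(any(changes))


def load_problem(folder, problem_id):
    """Read problem-X.txt and, if present, truth-problem-X.json.

    'folder' is a hypothetical local path to one of the provided folders.
    The truth is None when no ground truth file exists (as in the test set).
    """
    folder = Path(folder)
    text = (folder / f"problem-{problem_id}.txt").read_text(encoding="utf-8")
    truth_path = folder / f"truth-problem-{problem_id}.json"
    truth = None
    if truth_path.exists():
        truth = json.loads(truth_path.read_text(encoding="utf-8"))
    return split_paragraphs(text), truth
```

For the multi-author example above, changes_from_authors([1, 1, 1, 2]) gives [0, 0, 1], and multi_author_from_changes([0, 0, 1]) gives 1. A document with n paragraphs always yields n - 1 entries in "changes", which makes for a simple sanity check on a submission.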
Created:
2021-08-10