peterkirby/pan2020_dict_author_fandom_doc
Source: Hugging Face. Updated 2026-04-10; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/peterkirby/pan2020_dict_author_fandom_doc
Resource description:
---
license: other
license_name: pan-2020-fanfiction-dataset
license_link: https://zenodo.org/records/5106099
task_categories:
- text-classification
- text-retrieval
language:
- en
pretty_name: PAN2020 Fanfiction Author-Fandom-Disjoint Train/Validation Split
configs:
- config_name: default
  default: true
  data_files:
  - split: train
    path: train.parquet
- config_name: pan21
  data_files:
  - split: validation
    path: validation.parquet
  - split: test
    path: test.parquet
- config_name: pan20
  data_files:
  - split: test
    path: pan20_av.parquet
---
# PAN2020 Fanfiction Author-Fandom-Disjoint Train/Validation Split
PAN 2020 / PAN 2021 fanfiction authorship verification data with a Train/Validation split. The training data has been pre-split into Train and Validation under Author-Fandom-Disjoint constraints, as is appropriate for the PAN21 test data.
The training data has one row per document so that documents can be recombined easily. The PAN21 validation and test splits consist of fixed document pairs for consistent scoring. The string fields are the original data. The integer fields are derived by sorting the unique strings.
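
As a rough illustration of how integer fields can be derived from sorted unique strings, here is a minimal sketch. It assumes zero-based IDs and Python's default string ordering, neither of which the card states; `encode_sorted` is a hypothetical helper and the author names are made up.

```python
def encode_sorted(values):
    """Map each distinct string to its rank in sorted order (assumed zero-based)."""
    mapping = {s: i for i, s in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping

# Made-up author strings, not taken from the dataset.
authors = ["u_zed", "u_amy", "u_zed", "u_bob"]
author_int, author_map = encode_sorted(authors)
print(author_int)  # [2, 0, 2, 1]
```

If the exact IDs matter, compare against the published `author_int` and `fandom_int` columns rather than relying on this sketch.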
## Usage
```python
from datasets import load_dataset
train_data = load_dataset("peterkirby/pan2020_dict_author_fandom_doc", "default", split="train")
pan21_val = load_dataset("peterkirby/pan2020_dict_author_fandom_doc", "pan21", split="validation")
pan21_test = load_dataset("peterkirby/pan2020_dict_author_fandom_doc", "pan21", split="test")
pan20_test = load_dataset("peterkirby/pan2020_dict_author_fandom_doc", "pan20", split="test")
```
## Configs
- `default`: only `train`
- `pan21`: `validation` and the PAN21 `test` set
- `pan20`: only `test` with the PAN20 authorship verification test set
## Splits
### `train`
Columns:
- `author_str`
- `fandom_str`
- `author_int`
- `fandom_int`
- `text`
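
Because each training row is a single document, pairs have to be assembled by the user. The sketch below shows one possible way to do this on toy rows that reuse the column names above. The helper names, the sampling strategy, and the seed are assumptions for illustration, not the procedure used to build the validation set.

```python
import random
from collections import defaultdict
from itertools import combinations

def same_author_cross_fandom_pairs(rows):
    """Yield (i, j) index pairs written by one author in two different fandoms."""
    by_author = defaultdict(list)
    for idx, row in enumerate(rows):
        by_author[row["author_int"]].append(idx)
    for indices in by_author.values():
        for i, j in combinations(indices, 2):
            if rows[i]["fandom_int"] != rows[j]["fandom_int"]:
                yield i, j

def different_author_pairs(rows, n, seed=0):
    """Sample up to n (i, j) index pairs whose authors differ."""
    rng = random.Random(seed)
    candidates = [(i, j) for i, j in combinations(range(len(rows)), 2)
                  if rows[i]["author_int"] != rows[j]["author_int"]]
    return rng.sample(candidates, min(n, len(candidates)))

# Toy rows standing in for the train split.
rows = [
    {"author_int": 0, "fandom_int": 0, "text": "doc a"},
    {"author_int": 0, "fandom_int": 1, "text": "doc b"},
    {"author_int": 1, "fandom_int": 0, "text": "doc c"},
    {"author_int": 1, "fandom_int": 0, "text": "doc d"},
]
positives = list(same_author_cross_fandom_pairs(rows))
negatives = different_author_pairs(rows, n=2)
print(positives)  # [(0, 1)]
```

Enumerating every candidate pair is quadratic in the number of rows, so on the full training split it would be more practical to sample authors and documents directly.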
### `validation`
Columns:
- `same`
- `author1_str`
- `fandom1_str`
- `author1_int`
- `fandom1_int`
- `text1`
- `author2_str`
- `fandom2_str`
- `author2_int`
- `fandom2_int`
- `text2`
Balanced validation set:
- 10,000 Same Author / Different Fandom pairs
- 10,000 Different Author pairs
Same Author / Different Fandom:
- `same = true`
- same author, different fandoms
- document usage histogram: {1: 16900, 2: 1034, 3: 206, 4: 86, 5: 14} (mostly single use)
- unordered `(fandom1, fandom2)` is not repeated within an author
Different Author:
- `same = false`
- different authors
- no document is repeated
- unordered `(author1, author2)` is not repeated
- author usage histogram: {2: 882, 3: 819, 4: 716, 5: 2583} (mostly 5 uses per author)
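
The two histograms above can be sanity-checked with simple arithmetic: 10,000 pairs have 20,000 sides, so the uses in each histogram should sum to 20,000. The sketch below performs that check and shows one way to recompute such a histogram from pair rows. The helper names and the toy pairs are assumptions for illustration.

```python
from collections import Counter

def usage_histogram(keys):
    """Return {times_used: number_of_keys_used_that_often}."""
    return dict(sorted(Counter(Counter(keys).values()).items()))

def total_slots(hist):
    """Total number of pair sides accounted for by a usage histogram."""
    return sum(uses * count for uses, count in hist.items())

# Histograms as listed in the card.
doc_hist = {1: 16900, 2: 1034, 3: 206, 4: 86, 5: 14}
author_hist = {2: 882, 3: 819, 4: 716, 5: 2583}
print(total_slots(doc_hist))       # 20000
print(total_slots(author_hist))    # 20000
print(sum(author_hist.values()))   # 5000

# Toy different-author pairs (made up) to show the same checks on rows.
toy = [
    {"same": False, "author1_int": 0, "author2_int": 1},
    {"same": False, "author1_int": 0, "author2_int": 2},
    {"same": False, "author1_int": 1, "author2_int": 2},
    {"same": False, "author1_int": 3, "author2_int": 0},
]
unordered = [frozenset((p["author1_int"], p["author2_int"])) for p in toy]
no_repeated_author_pair = len(unordered) == len(set(unordered))
toy_hist = usage_histogram(a for p in toy for a in (p["author1_int"], p["author2_int"]))
print(toy_hist)  # {1: 1, 2: 2, 3: 1}
```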
The validation dataset in the `pan21` config is intended to be similar in construction to the official PAN21 test dataset. A greedy approach with random but weighted author/fandom selection balanced data efficiency against random selection, reducing the number of documents that fall outside both Train and Validation. Note that a document is in Train or Validation if and only if both its author and its fandom are assigned to that set; the two sets share no authors and no fandoms. The 20k pairs were constructed from 30,670 eligible documents in the validation set, which contains 5,000 authors and 438 of 1,600 fandoms.
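
The membership rule can be written down directly. The following is a minimal sketch that assumes every author and every fandom has already been assigned to exactly one side; `assign_split` and the toy values are hypothetical, and the greedy weighted selection itself is not reproduced here.

```python
def assign_split(doc, val_authors, val_fandoms):
    """Return 'validation', 'train', or None (document unused) for one document."""
    author_in_val = doc["author_str"] in val_authors
    fandom_in_val = doc["fandom_str"] in val_fandoms
    if author_in_val and fandom_in_val:
        return "validation"
    if not author_in_val and not fandom_in_val:
        return "train"
    # Author and fandom sit on different sides, so the document is in neither set.
    return None

# Made-up assignment: one validation author and one validation fandom.
val_authors = {"bob"}
val_fandoms = {"Naruto"}
docs = [
    {"author_str": "amy", "fandom_str": "Harry Potter"},
    {"author_str": "amy", "fandom_str": "Naruto"},
    {"author_str": "bob", "fandom_str": "Naruto"},
]
splits = [assign_split(d, val_authors, val_fandoms) for d in docs]
print(splits)  # ['train', None, 'validation']
```

The middle document shows why some data is lost under a doubly disjoint split: its author belongs to Train but its fandom belongs to Validation.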
### `test`
In the `pan21` config, this is the original PAN21 test set (converted to Parquet), an open-set authorship verification problem on unseen authors and fandoms.
In the `pan20` config, this is the original PAN20 test set (converted to Parquet), a closed-set authorship verification problem on authors and fandoms already seen in the training data.
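
One way to see the open-set versus closed-set distinction is to measure how many test-side authors or fandoms also occur in the training rows. The sketch below assumes the test sets expose the same column names as `validation` (for example `author1_str`), which the card does not state; `seen_fraction` and the toy rows are hypothetical.

```python
def seen_fraction(pairs, train_rows, field="author"):
    """Fraction of pair-side values for `field` that also occur in the training rows."""
    seen = {row[f"{field}_str"] for row in train_rows}
    values = [p[f"{field}{side}_str"] for p in pairs for side in (1, 2)]
    return sum(v in seen for v in values) / len(values)

# Made-up rows for illustration only.
train_rows = [{"author_str": "amy"}, {"author_str": "bob"}]
closed_pairs = [{"author1_str": "amy", "author2_str": "bob"}]
open_pairs = [{"author1_str": "cat", "author2_str": "dan"}]

print(seen_fraction(closed_pairs, train_rows))  # 1.0
print(seen_fraction(open_pairs, train_rows))    # 0.0
```

Under this assumption, a closed-set test should give a value near 1.0 and an open-set test a value near 0.0 when compared against the corresponding training data.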
## Preprocessing
Document text has been very lightly normalized (on top of PAN20's existing normalization) to fix contractions that looked like this: n"t. This helps tokenizers and pre-trained models.
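```python
import re

# Map apostrophe-like characters to a double quote first.
APOS_TO_QUOTE = str.maketrans({
    "'": '"', "’": '"', "‘": '"', "`": '"', "´": '"'
})
# Match a double quote that sits directly between two letters.
BETWEEN_ALPHA_QUOTE = re.compile(r'(?<=[^\W\d_])"(?=[^\W\d_])')

def fix_text(s: str) -> str:
    # Example: don"t becomes don't.
    s = str(s).translate(APOS_TO_QUOTE)
    return BETWEEN_ALPHA_QUOTE.sub("'", s)
```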
## Citation
If you use this dataset for your research, please cite:
Sebastian Bischoff, Niklas Deckers, Marcel Schliebs, Ben Thies, Matthias Hagen, Efstathios Stamatatos, Benno Stein, and Martin Potthast.
*The Importance of Suppressing Domain Style in Authorship Analysis.*
CoRR, abs/2005.14714, May 2020.
### BibTeX
```bibtex
@Article{stein:2020k,
author = {Sebastian Bischoff and Niklas Deckers and Marcel Schliebs and Ben Thies and Matthias Hagen and Efstathios Stamatatos and Benno Stein and Martin Potthast},
journal = {CoRR},
month = may,
title = {{The Importance of Suppressing Domain Style in Authorship Analysis}},
url = {https://arxiv.org/abs/2005.14714},
volume = {abs/2005.14714},
year = 2020
}
```
Provider:
peterkirby
Collected summary
Dataset introduction

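Construction
In text analysis, authorship verification is often confounded by domain style. This dataset builds on the PAN 2020 and PAN 2021 fanfiction authorship verification tasks, and its core is an author-fandom-disjoint split. The training and validation sets share neither authors nor fandoms, so that models learn author style that carries across domains rather than the writing conventions of a particular fandom. The training data is organized one document per row for easy recombination. The validation set contains 20,000 document pairs: 10,000 by the same author in different fandoms and 10,000 by different authors, constructed to limit repetition (for example, no unordered author pair is repeated among the different-author pairs). This construction is meant to separate author style from domain style and to provide a reliable basis for cross-domain authorship verification research.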
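Features
The dataset has a clear structure for authorship verification research. Its validation set consists of balanced document pairs covering two cases, same author across fandoms and different authors, which helps models distinguish an author's inherent style from domain-specific expression. The columns include both the original strings and integer encodings, preserving the original labels while simplifying computation. Preprocessing applies a light normalization that corrects quotation marks in contractions, improving compatibility with tokenizers and pre-trained models. The dataset also provides several configs, corresponding to the training set, the PAN21 validation and test sets, and the PAN20 test set, covering both open-set and closed-set verification.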
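Usage
The dataset can be loaded with the Hugging Face `datasets` library. Choose the config that matches the research goal: the `default` config contains only the training set, the `pan21` config provides the validation set and the PAN21 test set, and the `pan20` config corresponds to the PAN20 test set. Once loaded, the training data appears as single documents with author, fandom, and text fields, while the validation and test data are organized as document pairs labeled by whether both documents share an author. The data can be used directly for model training and evaluation, and is particularly suited to studying the consistency of author style across fandoms. The disjoint split encourages models to focus on an author's intrinsic writing characteristics, supporting domain-independent authorship analysis.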
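Background and challenges
Background overview
The PAN 2020 fanfiction data underlying this author-fandom-disjoint split dates from 2020 and was built by researchers associated with PAN, the shared-task series on digital text forensics and stylometry, including Sebastian Bischoff and Martin Potthast. The dataset addresses authorship verification, a classic problem in natural language processing, specifically on fanfiction texts, and aims to examine how stable and recognizable an author's writing style remains across fandoms. Its design follows a doubly disjoint principle: the training and validation sets share neither authors nor fandoms. This suppresses the interference of domain style in authorship analysis, supports research on cross-domain authorship identification, and provides a benchmark for applications such as text attribution and digital forensics.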
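Current challenges
The dataset targets the domain adaptation challenge in authorship verification: how to strip away the fandom-specific style attached to a text in order to capture the author's inherent writing characteristics. The main challenge in construction is achieving a strict author-fandom-disjoint split, which requires that the training and validation sets have no authors or fandoms in common across tens of thousands of documents, while maintaining balance and statistical representativeness. In addition, fanfiction is highly derivative and stylistically mixed, which makes style features harder to extract and requires models to distinguish an author's individual expression from the narrative conventions shared within a fandom.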
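Common scenarios
Classic use cases
In digital text analysis, fanfiction is a literary form rich in individual authorial style and offers a distinctive data basis for authorship verification research. Using an author-fandom-disjoint split, the dataset provides training, validation, and test sets specifically for evaluating a model's ability to recognize texts by the same author across fandoms. Its classic use case is open-set authorship verification, in which a model must decide whether a pair of texts from unseen authors or fandoms was written by the same person. This mirrors the real-world challenge of recognizing an author's style across different subject areas.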
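Practical applications
In practice, the techniques supported by this dataset can extend to several real-world settings. In cybersecurity, they can be used to attribute anonymous text, helping to identify the source of attacks or disinformation. In digital forensics and judicial examination, they can help analyze the authorship of anonymous letters or disputed documents. In digital copyright protection and literary studies, they can help detect plagiarism or attribute anonymous historical documents, providing a methodological tool for verifying the authenticity of content.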
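Related derived work
A body of research has grown around this data. Its design stems directly from the PAN 2020/2021 international authorship verification shared tasks, which prompted many models focused on domain-invariant feature learning and open-set verification. Later research has drawn on the author-fandom-disjoint paradigm and explored Transformer-based pre-trained language models, graph neural networks, and contrastive learning for style disentanglement. This work has deepened the understanding of writing style and improved the performance of cross-domain authorship identification.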
The content above was collected and summarized by 遇见数据集.



