PAN15 Author Identification: Verification
Source: NIAID Data Ecosystem (indexed 2026-05-01)
Download link:
https://zenodo.org/record/3737562
Description:
We provide you with a training corpus that comprises a set of author verification problems in several languages/genres. Each problem consists of some (up to five) known documents by a single person and exactly one questioned document. All documents within a single problem instance will be in the same language. However, their genre and/or topic may differ significantly. The document lengths vary from a few hundred to a few thousand words.
The documents of each problem are located in a separate folder, the name of which (problem ID) encodes the language of the documents. The following list shows the available sub-corpora, including their language, type (cross-genre or cross-topic), code, and examples of problem IDs:
Language; Type; Code; Problem IDs
Dutch; Cross-genre; DU; DU001, DU002, DU003, etc.
English; Cross-topic; EN; EN001, EN002, EN003, etc.
Greek; Cross-topic; GR; GR001, GR002, GR003, etc.
Spanish; Cross-genre; SP; SP001, SP002, SP003, etc.
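As an illustration of how a problem ID encodes its sub-corpus, the following minimal Python sketch maps the two-letter prefix to the language and type listed above. The names `SUBCORPORA` and `describe_problem` are hypothetical helpers introduced here and are not part of the dataset; the sketch also assumes the code is always the first two characters of the problem ID, as in the examples.

```python
# Sub-corpora as listed in the table above: code -> (language, type).
SUBCORPORA = {
    "DU": ("Dutch", "cross-genre"),
    "EN": ("English", "cross-topic"),
    "GR": ("Greek", "cross-topic"),
    "SP": ("Spanish", "cross-genre"),
}


def describe_problem(problem_id):
    """Return (language, type) for a problem ID such as 'EN001'.

    Assumption: the sub-corpus code is the first two characters of the ID.
    """
    code = problem_id[:2].upper()
    if code not in SUBCORPORA:
        raise ValueError(f"Unknown sub-corpus code in problem ID: {problem_id!r}")
    return SUBCORPORA[code]


if __name__ == "__main__":
    for pid in ("DU001", "EN002", "GR003", "SP001"):
        language, corpus_type = describe_problem(pid)
        print(pid, language, corpus_type)
```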
The ground truth for the training corpus is in the file truth.txt, which contains one line per problem with the problem ID and the correct binary answer (Y means the known and the questioned documents are by the same author; N means they are not). For example:
EN001 N
EN002 Y
EN003 N
...
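A ground-truth file in this format could be read along the following lines. This is only a sketch: `parse_truth` is a hypothetical helper, and it assumes whitespace-separated fields, skipping any line that does not have exactly two of them (such as the "..." placeholder in the example above).

```python
def parse_truth(lines):
    """Map each problem ID to True (same author, 'Y') or False ('N').

    Assumption: fields are separated by whitespace; lines without exactly
    two fields (blank lines, '...') are skipped rather than rejected.
    """
    truth = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue
        problem_id, answer = parts
        if answer not in ("Y", "N"):
            raise ValueError(f"Unexpected answer {answer!r} for {problem_id}")
        truth[problem_id] = (answer == "Y")
    return truth


if __name__ == "__main__":
    sample = ["EN001 N", "EN002 Y", "EN003 N", "..."]
    print(parse_truth(sample))
```

To read the real file, the same function can be given an open file object, for example `with open("truth.txt", encoding="utf-8") as f: truth = parse_truth(f)`; the encoding is an assumption and may need adjusting.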
Created:
2023-11-30



