ForumFree and Diaries
Source: arXiv | Updated: 2020-11-16 | Indexed: 2024-06-21
Download link:
https://github.com/garuggiero/Italian-Datasets-for-AV
Resource description:
This study introduces two new datasets, ForumFree and Diaries, for authorship attribution research in Italian. ForumFree consists of web forum comments taken from the ForumFree platform, while Diaries collects diary fragments written by Italians living abroad. The datasets were created to explore whether authorship attribution is feasible on short texts, particularly when little data is available. Building them involved preprocessing and reformatting the texts to fit the authorship verification task. Applications include plagiarism detection, multi-account detection, and online safety, with the aim of identifying authors by analyzing their individual writing style.
Provided by:
Institute of Linguistics and Language Technology, University of Malta
Created:
2020-11-16



