Table_2.csv

Source: frontiersin.figshare.com; updated 2023-05-31; indexed 2025-01-15
Download link:
https://frontiersin.figshare.com/articles/dataset/Table_2_csv/6094844/1
Description:
This article presents the results of a multidisciplinary project aimed at better understanding the impact of different digitization strategies on computational text analysis. More specifically, it describes an effort to automatically distinguish the authorship of Jacob and Wilhelm Grimm in a body of uncorrected correspondence processed by HTR (Handwritten Text Recognition) and OCR (Optical Character Recognition), and reports on the effect this noise has on the analyses needed to computationally identify the distinct writing styles of the two brothers. In summary, our findings show that OCR digitization serves as a reliable proxy for the more painstaking process of manual digitization, at least when it comes to authorship attribution. Our results suggest that attribution is viable even when the training and test sets come from different digitization pipelines. With regard to HTR, this research demonstrates that although automated transcription significantly increases the risk of text misclassification compared to OCR, a cleanliness above ≈ 20% is already sufficient to achieve a higher-than-chance probability of correct binary attribution.
Provider:
Frontiers