
Replication Data for: Detecting Formatted Text: Data Collection Using Computer Vision

Indexed by NIAID Data Ecosystem on 2026-05-02
Download link:
https://doi.org/10.7910/DVN/8BE6M9
Description:
Research in political science has begun to explore how to use large language models and object detection models to analyze text and visual data. However, few studies have explored how to use these tools for data extraction. Instead, researchers interested in extracting text from poorly formatted sources typically rely on optical character recognition (OCR) and regular expressions, or extract each item by hand. This letter describes a workflow for structured text extraction using free models and software. I discuss the types of data best suited to this method, its usefulness within political science, and the steps required to convert the text into a usable dataset. Finally, I demonstrate the method by extracting agenda items from city council meeting minutes. I find that the method can accurately extract sub-sections of text from a document and requires only a few hand-labeled documents to train adequately.
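
For context, the OCR-plus-regular-expressions baseline that the description contrasts with can be sketched as follows. This is a minimal illustration and not the author's workflow: the sample text, the "number followed by a period" item convention, and the names SAMPLE_MINUTES, ITEM_START, and extract_agenda_items are all assumptions made for the example.

```python
import re

# Hypothetical OCR output from council meeting minutes (invented for illustration).
SAMPLE_MINUTES = """\
CITY COUNCIL REGULAR MEETING
1. Call to order
2. Approval of the minutes from the
   previous meeting
3. Resolution 24-17: street repaving contract
Adjournment
"""

# Assumed convention: an agenda item starts with a number followed by a period.
ITEM_START = re.compile(r"^\s*(\d+)\.\s+(.*)$")


def extract_agenda_items(text: str) -> list[dict]:
    """Return one record per numbered item, joining indented continuation lines."""
    items = []
    for line in text.splitlines():
        match = ITEM_START.match(line)
        if match:
            # A new numbered item begins on this line.
            items.append({"number": int(match.group(1)), "text": match.group(2).strip()})
        elif items and line.startswith((" ", "\t")) and line.strip():
            # Indented line: treat it as a wrapped continuation of the previous item.
            items[-1]["text"] += " " + line.strip()
    return items


if __name__ == "__main__":
    for item in extract_agenda_items(SAMPLE_MINUTES):
        print(item["number"], item["text"])
```

A pattern like this works only while every document follows the assumed numbering and indentation, which is the fragility on poorly formatted sources that motivates the detection-based workflow described above.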

Created:
2025-05-15