Replication Data for: Detecting Formatted Text: Data Collection Using Computer Vision
Source: DataONE. Updated: 2025-05-15. Indexed: 2025-12-06.
Download link:
https://search.dataone.org/view/sha256:4e5c3064ff2817514ac7b917152e8aa03a612997e94e2ad1eb0ff8d9f4a6f9c9
Description:
Research in political science has begun to explore how to use large language models and object detection models to analyze text and visual data. However, few studies have explored how to use these tools for data extraction. Instead, researchers interested in extracting text from poorly formatted sources typically rely on optical character recognition and regular expressions, or extract each item by hand. This letter describes a workflow for structured text extraction using free models and software. I discuss the type of data best suited to this method, its usefulness within political science, and the steps required to convert the text into a usable dataset. Finally, I demonstrate the method by extracting agenda items from city council meeting minutes. I find that the method can accurately extract sub-sections of text from a document and requires only a few hand-labeled documents for adequate training.
Created:
2025-10-29



