
Replication Data for: Detecting Formatted Text: Data Collection Using Computer Vision

Source: DataONE | Updated: 2025-05-15 | Indexed: 2025-12-06
Download link:
https://search.dataone.org/view/sha256:4e5c3064ff2817514ac7b917152e8aa03a612997e94e2ad1eb0ff8d9f4a6f9c9
Description:
Research in political science has begun to explore how to use large language models and object detection models to analyze text and visual data. However, few studies have explored how to use these tools for data extraction. Instead, researchers who need to extract text from poorly formatted sources typically rely on optical character recognition and regular expressions, or extract each item by hand. This letter describes a workflow for structured text extraction using free models and software. I discuss the types of data best suited to this method, its usefulness within political science, and the steps required to convert the text into a usable dataset. Finally, I demonstrate the method by extracting agenda items from city council meeting minutes. I find that the method can accurately extract sub-sections of text from a document and requires only a few hand-labeled documents to train adequately.
Created:
2025-10-29