
biglam/doab-metadata-extraction

Hugging Face · Updated 2025-10-16 · Indexed 2025-10-18
Download link:
https://hf-mirror.com/datasets/biglam/doab-metadata-extraction
Overview:
This dataset contains 9,363 open access books, each with page images and rich bibliographic metadata extracted from MARC21 records. It is curated specifically for training and evaluating Vision Language Models (VLMs) on automatic metadata extraction from scholarly monographs. Most of the books are in English, with some in German, Italian, French, and Spanish. Records can be filtered by license type and language, and the metadata fields include title, authors, publisher, publication year, ISBN, subjects, abstracts, and more. The dataset can be downloaded from Hugging Face and loaded with the Python scripts provided in its README file.
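The filtering by license type and language described above could look roughly like the sketch below. It is not the script from the README: the column names `language` and `license`, the `train` split, and the helper names `filter_records` and `load_doab` are all assumptions made for illustration, so check the dataset card for the real schema before relying on them.

```python
from typing import Any, Dict, Iterable, Iterator, Optional

# ASSUMPTION: hypothetical column names; the real dataset may name them differently.
LANGUAGE_FIELD = "language"
LICENSE_FIELD = "license"


def filter_records(
    records: Iterable[Dict[str, Any]],
    language: Optional[str] = None,
    license_contains: Optional[str] = None,
) -> Iterator[Dict[str, Any]]:
    """Yield records matching an exact language and/or a license substring."""
    for record in records:
        # Exact match on the language value, if a language was requested.
        if language is not None and record.get(LANGUAGE_FIELD) != language:
            continue
        # Case-insensitive substring match on the license value.
        if license_contains is not None:
            value = str(record.get(LICENSE_FIELD) or "")
            if license_contains.lower() not in value.lower():
                continue
        yield record


def load_doab(split: str = "train"):
    """Load the dataset from the Hugging Face Hub.

    ASSUMPTION: a split named "train" exists. Needs `pip install datasets`
    and network access, so the import is kept local to this function.
    """
    from datasets import load_dataset

    return load_dataset("biglam/doab-metadata-extraction", split=split)
```

With those assumptions, `filter_records(load_doab(), language="en", license_contains="CC BY")` would iterate over the matching rows. Iterating rows this way may also decode the page images, so for large-scale filtering the `datasets` library's own `Dataset.filter` method is likely the better fit.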
Provider:
biglam