
ScienceMetaBench

ModelScope Community · Updated 2026-01-09 · Indexed 2026-01-10
Download link:
https://modelscope.cn/datasets/OpenDataLab/ScienceMetaBench
Resource description:
# ScienceMetaBench

[English](README.md) | [中文](README_ZH.md)

🤗 [HuggingFace Dataset](https://huggingface.co/datasets/opendatalab/ScienceMetaBench) | 💻 [GitHub Repository](https://github.com/DataEval/ScienceMetaBench)

**Acknowledgements**: 🔍 [Dingo](https://github.com/MigoXLab/dingo)

ScienceMetaBench is a benchmark dataset for evaluating the accuracy of metadata extraction from scientific literature PDF files. The dataset covers three major categories (academic papers, textbooks, and ebooks) and can be used to assess the performance of Large Language Models (LLMs) or other information extraction systems.

## 📊 Dataset Overview

### Data Types

This benchmark includes three types of scientific literature:

1. **Papers**
   - Mainly from academic journals and conferences
   - Contains academic metadata such as DOI, keywords, etc.

2. **Textbooks**
   - Formally published textbooks
   - Includes ISBN, publisher, and other publication information

3. **Ebooks**
   - Digitized historical documents and books
   - Covers multiple languages and disciplines

### Data Batches

This benchmark has undergone two rounds of data expansion, with each round adding new sample data:

```
data/
├── 20250806/          # First batch (August 6, 2025)
│   ├── ebook_0806.jsonl
│   ├── paper_0806.jsonl
│   └── textbook_0806.jsonl
└── 20251022/          # Second batch (October 22, 2025)
    ├── ebook_1022.jsonl
    ├── paper_1022.jsonl
    └── textbook_1022.jsonl
```

**Note**: The two batches of data complement each other to form a complete benchmark dataset. You can choose to use a single batch or merge them as needed.

### PDF Files

The `pdf/` directory contains the original PDF files corresponding to the benchmark data, with a directory structure consistent with the `data/` directory.

**File Naming Convention**: All PDF files are named using their SHA256 hash values, in the format `{sha256}.pdf`.
This naming scheme ensures file uniqueness and traceability, making it easy to locate the corresponding source file using the `sha256` field in the JSONL data.

## 📝 Data Format

All data files are in JSONL format (one JSON object per line).

### Academic Paper Fields

```json
{
  "sha256": "SHA256 hash of the file",
  "doi": "Digital Object Identifier",
  "title": "Paper title",
  "author": "Author name",
  "keyword": "Keywords (comma-separated)",
  "abstract": "Abstract content",
  "pub_time": "Publication year"
}
```

### Textbook/Ebook Fields

```json
{
  "sha256": "SHA256 hash of the file",
  "isbn": "International Standard Book Number",
  "title": "Book title",
  "author": "Author name",
  "abstract": "Introduction/abstract",
  "category": "Classification number (e.g., Chinese Library Classification)",
  "pub_time": "Publication year",
  "publisher": "Publisher"
}
```

## 📖 Data Examples

### Academic Paper Example

The following image shows an example of metadata fields extracted from an academic paper PDF:

![Academic Paper Example](images/paper_example.png)

As shown in the image, the following key information needs to be extracted from the paper's first page:

- **DOI**: Digital Object Identifier (e.g., `10.1186/s41038-017-0090-z`)
- **Title**: Paper title
- **Author**: Author name
- **Keyword**: List of keywords
- **Abstract**: Paper abstract
- **pub_time**: Publication time (usually the year)

### Textbook/Ebook Example

The following image shows an example of metadata fields extracted from the copyright page of a Chinese ebook PDF:

![Textbook Example](images/ebook_example.png)

As shown in the image, the following key information needs to be extracted from the book's copyright page:

- **ISBN**: International Standard Book Number (e.g., `978-7-5385-8594-0`)
- **Title**: Book title
- **Author**: Author/editor name
- **Publisher**: Publisher name
- **pub_time**: Publication time (year)
- **Category**: Book classification number
- **Abstract**: Content introduction (if available)

These examples
demonstrate the core task of the benchmark: accurately extracting structured metadata from PDF documents in various formats and languages.

## 📊 Evaluation Metrics

### Core Evaluation Metrics

This benchmark uses a string similarity-based evaluation method and reports two core metrics: field-level accuracy and overall accuracy.

### Similarity Calculation Rules

This benchmark uses a string similarity algorithm based on `SequenceMatcher`, with the following specific rules:

1. **Empty Value Handling**: One is empty and the other is not → similarity is 0
2. **Complete Match**: Both are identical (including both being empty) → similarity is 1
3. **Case Insensitive**: Convert to lowercase before comparison
4. **Sequence Matching**: Otherwise, compute the similarity with `SequenceMatcher`, which scores two strings by their longest matching contiguous subsequences (range: 0-1)

**Similarity Score Interpretation**:

- `1.0`: Perfect match
- `0.8-0.99`: Highly similar (may have minor formatting differences)
- `0.5-0.79`: Partial match (extracted main information but incomplete)
- `0.0-0.49`: Low similarity (extraction result differs significantly from ground truth)

#### 1. Field-level Accuracy

**Definition**: The average similarity score for each metadata field.

**Calculation Method**:

```
Field-level Accuracy = Σ(similarity of that field across all samples) / total number of samples
```

**Example**: When evaluating the `title` field on 100 samples, the sum of the per-sample title similarities divided by 100 gives the accuracy for that field.

**Use Cases**:

- Identify which fields the model performs well or poorly on
- Optimize extraction capabilities for specific fields
- For example: If `doi` accuracy is 0.95 and `abstract` accuracy is 0.75, the model needs improvement in extracting abstracts

#### 2. Overall Accuracy

**Definition**: The average of all evaluated field accuracies, reflecting the model's overall performance.
**Calculation Method**:

```
Overall Accuracy = Σ(field-level accuracies) / total number of fields
```

**Example**: Evaluating 7 fields (isbn, title, author, abstract, category, pub_time, publisher), sum these 7 field accuracies and divide by 7.

**Use Cases**:

- Provide a single quantitative metric for overall model performance
- Facilitate horizontal comparison between different models or methods
- Serve as an overall objective for model optimization

### Using the Evaluation Script

`compare.py` provides a convenient evaluation interface:

```python
from compare import main, write_similarity_data_to_excel

# Define file paths and fields to compare
file_llm = 'data/llm-label_textbook.jsonl'      # LLM extraction results
file_bench = 'data/benchmark_textbook.jsonl'    # Benchmark data

# For textbooks/ebooks
key_list = ['isbn', 'title', 'author', 'abstract', 'category', 'pub_time', 'publisher']

# For academic papers
# key_list = ['doi', 'title', 'author', 'keyword', 'abstract', 'pub_time']

# Run evaluation and get metrics
accuracy, key_accuracy, detail_data = main(file_llm, file_bench, key_list)

# Output results to Excel (optional)
write_similarity_data_to_excel(key_list, detail_data, "similarity_analysis.xlsx")

# View evaluation metrics
print("Field-level Accuracy:", key_accuracy)
print("Overall Accuracy:", accuracy)
```

### Output Files

The script generates an Excel file containing detailed sample-by-sample analysis:

- `sha256`: File identifier
- For each field (e.g., `title`):
  - `llm_title`: LLM extraction result
  - `benchmark_title`: Benchmark data
  - `similarity_title`: Similarity score (0-1)

## 📈 Statistics

### Data Scale

**First Batch (20250806)**:

- **Ebooks**: 70 records
- **Academic Papers**: 70 records
- **Textbooks**: 71 records
- **Subtotal**: 211 records

**Second Batch (20251022)**:

- **Ebooks**: 354 records
- **Academic Papers**: 399 records
- **Textbooks**: 46 records
- **Subtotal**: 799 records

**Total**: 1010 benchmark test records

The data covers multiple
languages (English, Chinese, German, Greek, etc.) and multiple disciplines, with both batches together providing a rich and diverse set of test samples.

## 🎯 Application Scenarios

1. **LLM Performance Evaluation**: Assess the ability of large language models to extract metadata from PDFs
2. **Information Extraction System Testing**: Test the accuracy of OCR, document parsing, and other systems
3. **Model Fine-tuning**: Use as training or fine-tuning data to improve model information extraction capabilities
4. **Cross-lingual Capability Evaluation**: Evaluate the model's ability to process multilingual literature

## 🔬 Data Characteristics

- ✅ **Real Data**: Real metadata extracted from actual PDF files
- ✅ **Diversity**: Covers literature from different eras, languages, and disciplines
- ✅ **Challenging**: Includes ancient texts, non-English literature, complex layouts, and other difficult cases
- ✅ **Traceable**: Each record includes SHA256 hash and original path

## 📋 Dependencies

```
pandas>=1.3.0
openpyxl>=3.0.0
```

Install dependencies:

```bash
pip install pandas openpyxl
```

## 🤝 Contributing

If you would like to:

- Report data errors
- Add new evaluation dimensions
- Expand the dataset

Please submit an Issue or Pull Request.

## 📧 Contact

If you have questions or suggestions, please contact us through Issues.

---

**Last Updated**: December 26, 2025
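The similarity rules and the two accuracy metrics described under Evaluation Metrics can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the actual `compare.py`: the function names `similarity` and `evaluate`, treating `None` as an empty value, joining records on `sha256`, and scoring a missing prediction as empty are all assumptions, and the sample records are made up for illustration.

```python
from difflib import SequenceMatcher


def _normalize(value):
    """Assumption: None counts as empty; everything else is lowercased text."""
    return "" if value is None else str(value).lower()


def similarity(predicted, expected):
    """Score one extracted value against the benchmark value (0.0 to 1.0)."""
    a, b = _normalize(predicted), _normalize(expected)
    if a == b:          # complete match, including both sides empty
        return 1.0
    if not a or not b:  # exactly one side is empty
        return 0.0
    return SequenceMatcher(None, a, b).ratio()


def evaluate(predictions, benchmark, key_list):
    """Return (overall_accuracy, field_accuracy) for records joined on sha256."""
    if not benchmark or not key_list:
        raise ValueError("benchmark and key_list must be non-empty")
    predicted_by_hash = {rec["sha256"]: rec for rec in predictions}
    totals = {key: 0.0 for key in key_list}
    for truth in benchmark:
        # Assumption: a record with no prediction is scored as all-empty.
        pred = predicted_by_hash.get(truth["sha256"], {})
        for key in key_list:
            totals[key] += similarity(pred.get(key), truth.get(key))
    field_accuracy = {key: totals[key] / len(benchmark) for key in key_list}
    overall = sum(field_accuracy.values()) / len(key_list)
    return overall, field_accuracy


# Made-up records, for illustration only.
benchmark = [
    {"sha256": "a1", "title": "Example Title", "pub_time": "2016"},
    {"sha256": "b2", "title": "Another Title", "pub_time": "2017"},
]
predictions = [
    {"sha256": "a1", "title": "example title", "pub_time": "2016"},
    {"sha256": "b2", "title": "Another Title", "pub_time": ""},
]
overall, per_field = evaluate(predictions, benchmark, ["title", "pub_time"])
print(per_field)  # {'title': 1.0, 'pub_time': 0.5}
print(overall)    # 0.75
```

In this toy run the missing `pub_time` drags that field down to 0.5 while `title` stays at 1.0, which is the kind of per-field diagnosis the field-level metric is meant to support.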

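Because each record carries a `sha256` field and each PDF is named `{sha256}.pdf`, a record can be mapped back to its source file. The following is a minimal sketch of that lookup; the helper names `load_jsonl`, `pdf_path_for`, and `sha256_of` are hypothetical, and the exact sub-directory under `pdf/` as well as the lowercase-hex form of the stored hash are assumptions rather than documented facts.

```python
import hashlib
import json
from pathlib import Path


def load_jsonl(path):
    """Read a JSONL file into a list of dicts, skipping blank lines."""
    records = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


def pdf_path_for(record, pdf_dir):
    """Build the expected PDF path from a record's sha256 field."""
    return Path(pdf_dir) / f"{record['sha256']}.pdf"


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large PDFs need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Usage sketch (paths assume pdf/ mirrors data/, as the README states):
# records = load_jsonl("data/20250806/paper_0806.jsonl")
# pdf = pdf_path_for(records[0], "pdf/20250806")
# print(pdf.exists(), sha256_of(pdf) == records[0]["sha256"])
```

Re-hashing the file is optional, but it is a cheap way to confirm that a downloaded PDF really is the one a benchmark record refers to.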
Provider: maas
Created: 2025-12-30