ScienceMetaBench

Name: ScienceMetaBench
Creator: maas
Published: 2026-01-09 18:38:13
License: Not specified

Source: ModelScope community (updated 2026-01-09, indexed 2026-01-10)

Download link:

https://modelscope.cn/datasets/OpenDataLab/ScienceMetaBench

Resource description:

# ScienceMetaBench

[English](README.md) | [中文](README_ZH.md)

🤗 [HuggingFace Dataset](https://huggingface.co/datasets/opendatalab/ScienceMetaBench) | 💻 [GitHub Repository](https://github.com/DataEval/ScienceMetaBench)

**Acknowledgements**: 🔍 [Dingo](https://github.com/MigoXLab/dingo)

ScienceMetaBench is a benchmark dataset for evaluating how accurately metadata can be extracted from scientific literature PDF files. The dataset covers three major categories (academic papers, textbooks, and ebooks) and can be used to assess Large Language Models (LLMs) or other information extraction systems.

## 📊 Dataset Overview

### Data Types

This benchmark includes three types of scientific literature:

1. **Papers**
   - Mainly from academic journals and conferences
   - Contains academic metadata such as DOI, keywords, etc.
2. **Textbooks**
   - Formally published textbooks
   - Includes ISBN, publisher, and other publication information
3. **Ebooks**
   - Digitized historical documents and books
   - Covers multiple languages and disciplines

### Data Batches

This benchmark has undergone two rounds of data expansion, with each round adding new sample data:

```
data/
├── 20250806/          # First batch (August 6, 2025)
│   ├── ebook_0806.jsonl
│   ├── paper_0806.jsonl
│   └── textbook_0806.jsonl
└── 20251022/          # Second batch (October 22, 2025)
    ├── ebook_1022.jsonl
    ├── paper_1022.jsonl
    └── textbook_1022.jsonl
```

**Note**: The two batches complement each other and together form the complete benchmark dataset. You can use a single batch or merge them as needed.

### PDF Files

The `pdf/` directory contains the original PDF files corresponding to the benchmark data, with a directory structure consistent with the `data/` directory.

**File Naming Convention**: All PDF files are named using their SHA256 hash values, in the format `{sha256}.pdf`. This naming scheme ensures file uniqueness and traceability, making it easy to locate the corresponding source file using the `sha256` field in the JSONL data.

## 📝 Data Format

All data files are in JSONL format (one JSON object per line).

### Academic Paper Fields

```json
{
  "sha256": "SHA256 hash of the file",
  "doi": "Digital Object Identifier",
  "title": "Paper title",
  "author": "Author name",
  "keyword": "Keywords (comma-separated)",
  "abstract": "Abstract content",
  "pub_time": "Publication year"
}
```

### Textbook/Ebook Fields

```json
{
  "sha256": "SHA256 hash of the file",
  "isbn": "International Standard Book Number",
  "title": "Book title",
  "author": "Author name",
  "abstract": "Introduction/abstract",
  "category": "Classification number (e.g., Chinese Library Classification)",
  "pub_time": "Publication year",
  "publisher": "Publisher"
}
```

## 📖 Data Examples

### Academic Paper Example

The following image shows an example of metadata fields extracted from an academic paper PDF:

![Academic Paper Example](images/paper_example.png)

As shown in the image, the following key information needs to be extracted from the paper's first page:

- **DOI**: Digital Object Identifier (e.g., `10.1186/s41038-017-0090-z`)
- **Title**: Paper title
- **Author**: Author name
- **Keyword**: List of keywords
- **Abstract**: Paper abstract
- **pub_time**: Publication time (usually the year)

### Textbook/Ebook Example

The following image shows an example of metadata fields extracted from the copyright page of a Chinese ebook PDF:

![Textbook Example](images/ebook_example.png)

As shown in the image, the following key information needs to be extracted from the book's copyright page:

- **ISBN**: International Standard Book Number (e.g., `978-7-5385-8594-0`)
- **Title**: Book title
- **Author**: Author/editor name
- **Publisher**: Publisher name
- **pub_time**: Publication time (year)
- **Category**: Book classification number
- **Abstract**: Content introduction (if available)

These examples demonstrate the core task of the benchmark: accurately extracting structured metadata from PDF documents in various formats and languages.

## 📊 Evaluation Metrics

### Core Evaluation Metrics

This benchmark uses a string similarity-based evaluation method and reports two core metrics: field-level accuracy and overall accuracy. Both are built on the per-field similarity score defined next.

### Similarity Calculation Rules

This benchmark uses a string similarity algorithm based on Python's `SequenceMatcher`, with the following rules:

1. **Empty Value Handling**: One value is empty and the other is not → similarity is 0
2. **Complete Match**: Both values are identical (including both being empty) → similarity is 1
3. **Case Insensitive**: Both values are converted to lowercase before comparison
4. **Sequence Matching**: Otherwise the pair is scored with `SequenceMatcher`, which works from the longest contiguous matching blocks shared by the two strings (range: 0-1)

**Similarity Score Interpretation**:

- `1.0`: Perfect match
- `0.8-0.99`: Highly similar (may have minor formatting differences)
- `0.5-0.79`: Partial match (main information extracted but incomplete)
- `0.0-0.49`: Low similarity (extraction result differs significantly from ground truth)

#### 1. Field-level Accuracy

**Definition**: The average similarity score for each metadata field.

**Calculation Method**:

```
Field-level Accuracy = Σ(similarity of that field across all samples) / total number of samples
```

**Example**: When evaluating the `title` field on 100 samples, sum the title similarity of every sample and divide by 100 to get the accuracy for that field.

**Use Cases**:

- Identify which fields the model performs well or poorly on
- Optimize extraction capabilities for specific fields
- For example: if `doi` accuracy is 0.95 and `abstract` accuracy is 0.75, the model needs improvement in extracting abstracts

#### 2. Overall Accuracy

**Definition**: The average of all evaluated field accuracies, reflecting the model's overall performance.

**Calculation Method**:

```
Overall Accuracy = Σ(field-level accuracies) / total number of fields
```

**Example**: When evaluating 7 fields (isbn, title, author, abstract, category, pub_time, publisher), sum these 7 field accuracies and divide by 7.

**Use Cases**:

- Provide a single quantitative metric for overall model performance
- Facilitate side-by-side comparison between different models or methods
- Serve as an overall objective for model optimization

### Using the Evaluation Script

`compare.py` provides a convenient evaluation interface:

```python
from compare import main, write_similarity_data_to_excel

# Define file paths and fields to compare
file_llm = 'data/llm-label_textbook.jsonl'      # LLM extraction results
file_bench = 'data/benchmark_textbook.jsonl'    # Benchmark data

# For textbooks/ebooks
key_list = ['isbn', 'title', 'author', 'abstract', 'category', 'pub_time', 'publisher']

# For academic papers
# key_list = ['doi', 'title', 'author', 'keyword', 'abstract', 'pub_time']

# Run evaluation and get metrics
accuracy, key_accuracy, detail_data = main(file_llm, file_bench, key_list)

# Output results to Excel (optional)
write_similarity_data_to_excel(key_list, detail_data, "similarity_analysis.xlsx")

# View evaluation metrics
print("Field-level Accuracy:", key_accuracy)
print("Overall Accuracy:", accuracy)
```

### Output Files

The script generates an Excel file containing detailed sample-by-sample analysis:

- `sha256`: File identifier
- For each field (e.g., `title`):
  - `llm_title`: LLM extraction result
  - `benchmark_title`: Benchmark data
  - `similarity_title`: Similarity score (0-1)

## 📈 Statistics

### Data Scale

**First Batch (20250806)**:

- **Ebooks**: 70 records
- **Academic Papers**: 70 records
- **Textbooks**: 71 records
- **Subtotal**: 211 records

**Second Batch (20251022)**:

- **Ebooks**: 354 records
- **Academic Papers**: 399 records
- **Textbooks**: 46 records
- **Subtotal**: 799 records

**Total**: 1010 benchmark test records

The data covers multiple languages (English, Chinese, German, Greek, etc.) and multiple disciplines, with both batches together providing a rich and diverse set of test samples.

## 🎯 Application Scenarios

1. **LLM Performance Evaluation**: Assess the ability of large language models to extract metadata from PDFs
2. **Information Extraction System Testing**: Test the accuracy of OCR, document parsing, and other systems
3. **Model Fine-tuning**: Use as training or fine-tuning data to improve model information extraction capabilities
4. **Cross-lingual Capability Evaluation**: Evaluate the model's ability to process multilingual literature

## 🔬 Data Characteristics

- ✅ **Real Data**: Real metadata extracted from actual PDF files
- ✅ **Diversity**: Covers literature from different eras, languages, and disciplines
- ✅ **Challenging**: Includes ancient texts, non-English literature, complex layouts, and other difficult cases
- ✅ **Traceable**: Each record includes SHA256 hash and original path

## 📋 Dependencies

```text
pandas>=1.3.0
openpyxl>=3.0.0
```

Install dependencies:

```bash
pip install pandas openpyxl
```

## 🤝 Contributing

If you would like to:

- Report data errors
- Add new evaluation dimensions
- Expand the dataset

Please submit an Issue or Pull Request.

## 📧 Contact

If you have questions or suggestions, please contact us through Issues.

---

**Last Updated**: December 26, 2025
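The similarity rules and the two accuracy metrics described under Evaluation Metrics can be sketched in a few lines of Python. This is a minimal illustration, not the project's `compare.py`: the function names `field_similarity` and `score`, the `a1`/`b2` record keys, and the choice to treat `None` or a missing prediction as an empty value are all assumptions made for the example.

```python
from difflib import SequenceMatcher


def field_similarity(predicted, reference):
    """Score one extracted value against its benchmark value (0.0 to 1.0)."""
    # Assumption: None counts as an empty value; everything else is compared as text.
    pred = "" if predicted is None else str(predicted)
    ref = "" if reference is None else str(reference)
    if pred == ref:                 # complete match, including both empty
        return 1.0
    if not pred or not ref:         # exactly one side is empty
        return 0.0
    # Case-insensitive sequence matching via difflib.
    return SequenceMatcher(None, pred.lower(), ref.lower()).ratio()


def score(predictions, benchmark, fields):
    """Return (overall_accuracy, field_accuracy) for dicts keyed by sha256."""
    field_accuracy = {}
    for field in fields:
        total = sum(
            # Assumption: a sample with no prediction is scored as all-empty.
            field_similarity(predictions.get(sha, {}).get(field), record.get(field))
            for sha, record in benchmark.items()
        )
        field_accuracy[field] = total / len(benchmark)
    overall = sum(field_accuracy.values()) / len(field_accuracy)
    return overall, field_accuracy


# Hypothetical two-sample run; "a1" and "b2" stand in for real SHA256 hashes.
benchmark = {
    "a1": {"title": "Deep Learning", "isbn": "978-7-5385-8594-0"},
    "b2": {"title": "Linear Algebra", "isbn": ""},
}
predictions = {
    "a1": {"title": "deep learning", "isbn": "978-7-5385-8594-0"},
    "b2": {"title": "Linear Algebra", "isbn": "123"},
}
overall, field_accuracy = score(predictions, benchmark, ["title", "isbn"])
print(field_accuracy)  # {'title': 1.0, 'isbn': 0.5}
print(overall)         # 0.75
```

In this run the `isbn` field drops to 0.5 because the second sample has an empty benchmark value but a non-empty prediction, which the empty-value rule scores as 0.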

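Because the batches are meant to be merged and each PDF is named after its `sha256`, loading the benchmark amounts to reading every JSONL file of one document type and deriving the PDF path per record. The sketch below assumes that `pdf/` mirrors `data/` one batch folder at a time (`pdf/<batch>/<sha256>.pdf`); the helper names and the added `pdf_path` key are hypothetical and not part of the dataset.

```python
import json
from pathlib import Path


def load_jsonl(path):
    """Read one JSON object per non-blank line."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


def load_benchmark(root, doc_type):
    """Merge all batches of one document type ('paper', 'textbook' or 'ebook').

    Each record gains a hypothetical 'pdf_path' key. The layout
    pdf/<batch>/<sha256>.pdf is an assumption about how pdf/ mirrors data/.
    """
    root = Path(root)
    records = []
    for batch_dir in sorted((root / "data").iterdir()):
        if not batch_dir.is_dir():
            continue
        for jsonl_file in sorted(batch_dir.glob(f"{doc_type}_*.jsonl")):
            for record in load_jsonl(jsonl_file):
                pdf = root / "pdf" / batch_dir.name / f"{record['sha256']}.pdf"
                record["pdf_path"] = str(pdf)
                records.append(record)
    return records
```

Calling `load_benchmark(".", "paper")` from the repository root would then return the paper records of both batches in batch order, ready to be keyed by `sha256` for evaluation.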

Provider: maas

Created: 2025-12-30
