ScienceMetaBench
Source: ModelScope community (魔搭社区) · Updated 2026-01-09 · Indexed 2026-01-10
Download link: https://modelscope.cn/datasets/OpenDataLab/ScienceMetaBench
# ScienceMetaBench
[English](README.md) | [中文](README_ZH.md)
🤗 [HuggingFace Dataset](https://huggingface.co/datasets/opendatalab/ScienceMetaBench) | 💻 [GitHub Repository](https://github.com/DataEval/ScienceMetaBench)
**Acknowledgements**: 🔍 [Dingo](https://github.com/MigoXLab/dingo)
ScienceMetaBench is a benchmark dataset for evaluating the accuracy of metadata extraction from scientific literature PDF files. The dataset covers three major categories: academic papers, textbooks, and ebooks, and can be used to assess the performance of Large Language Models (LLMs) or other information extraction systems.
## 📊 Dataset Overview
### Data Types
This benchmark includes three types of scientific literature:
1. **Papers**
   - Mainly from academic journals and conferences
   - Contains academic metadata such as DOI, keywords, etc.
2. **Textbooks**
   - Formally published textbooks
   - Includes ISBN, publisher, and other publication information
3. **Ebooks**
   - Digitized historical documents and books
   - Covers multiple languages and disciplines
### Data Batches
This benchmark was built in two batches, each adding new samples:
```
data/
├── 20250806/              # First batch (August 6, 2025)
│   ├── ebook_0806.jsonl
│   ├── paper_0806.jsonl
│   └── textbook_0806.jsonl
└── 20251022/              # Second batch (October 22, 2025)
    ├── ebook_1022.jsonl
    ├── paper_1022.jsonl
    └── textbook_1022.jsonl
```
**Note**: The two batches of data complement each other to form a complete benchmark dataset. You can choose to use a single batch or merge them as needed.
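As a minimal sketch of merging the two batches, the snippet below reads every `{category}_*.jsonl` file under `data/` and de-duplicates records by `sha256`. The helper names `load_jsonl` and `load_category` are hypothetical and not part of the repository, and letting later batches overwrite earlier ones on a duplicate hash is an assumption.

```python
import json
from pathlib import Path


def load_jsonl(path):
    """Read a JSONL file into a list of dicts, skipping blank lines."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


def load_category(data_dir, category):
    """Merge all batch files for one category (e.g. 'paper'), keyed by sha256."""
    merged = {}
    # Batch directories are named by date, so sorted() yields chronological order.
    for path in sorted(Path(data_dir).glob(f"*/{category}_*.jsonl")):
        for record in load_jsonl(path):
            # Assumption: if a hash appears in both batches, the later batch wins.
            merged[record["sha256"]] = record
    return list(merged.values())
```

For example, `load_category("data", "paper")` would return the paper records from both batches as one list.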
### PDF Files
The `pdf/` directory contains the original PDF files corresponding to the benchmark data, with a directory structure consistent with the `data/` directory.
**File Naming Convention**: All PDF files are named using their SHA256 hash values, in the format `{sha256}.pdf`. This naming scheme ensures file uniqueness and traceability, making it easy to locate the corresponding source file using the `sha256` field in the JSONL data.
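The lookup described above can be sketched as follows. It assumes the layout `pdf/<batch>/<sha256>.pdf` (the exact subdirectory structure is an assumption based on "consistent with the `data/` directory") and that the file name is the SHA256 of the file's bytes. The function names are hypothetical.

```python
import hashlib
from pathlib import Path


def pdf_path_for(record, pdf_dir, batch):
    """Build the expected PDF path for a record (assumed layout: pdf/<batch>/<sha256>.pdf)."""
    return Path(pdf_dir) / batch / f"{record['sha256']}.pdf"


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large PDFs are not read into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_pdf(record, pdf_dir, batch):
    """Return True if the PDF exists and its content hash matches the record's sha256."""
    path = pdf_path_for(record, pdf_dir, batch)
    return path.is_file() and sha256_of(path) == record["sha256"].lower()
```

Re-hashing the file is optional; it only guards against a truncated or corrupted download.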
## 📝 Data Format
All data files are in JSONL format (one JSON object per line).
### Academic Paper Fields
```json
{
  "sha256": "SHA256 hash of the file",
  "doi": "Digital Object Identifier",
  "title": "Paper title",
  "author": "Author name",
  "keyword": "Keywords (comma-separated)",
  "abstract": "Abstract content",
  "pub_time": "Publication year"
}
```
### Textbook/Ebook Fields
```json
{
  "sha256": "SHA256 hash of the file",
  "isbn": "International Standard Book Number",
  "title": "Book title",
  "author": "Author name",
  "abstract": "Introduction/abstract",
  "category": "Classification number (e.g., Chinese Library Classification)",
  "pub_time": "Publication year",
  "publisher": "Publisher"
}
```
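A quick way to check that a JSONL line carries the fields listed above is sketched below. The constants and the `missing_fields` helper are hypothetical names. Whether the benchmark files omit a key or store an empty string when a value is unavailable is not stated, so this sketch only checks for key presence.

```python
import json

# Field names taken from the two schemas above.
PAPER_FIELDS = ("sha256", "doi", "title", "author", "keyword", "abstract", "pub_time")
BOOK_FIELDS = ("sha256", "isbn", "title", "author", "abstract",
               "category", "pub_time", "publisher")


def missing_fields(line, expected_fields):
    """Parse one JSONL line and list the expected fields that are absent."""
    record = json.loads(line)
    return [name for name in expected_fields if name not in record]
```

Calling `missing_fields(line, PAPER_FIELDS)` on each line of a paper file returns an empty list when the record matches the schema.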
## 📖 Data Examples
### Academic Paper Example
The following image shows an example of metadata fields extracted from an academic paper PDF:

As shown in the image, the following key information needs to be extracted from the paper's first page:
- **DOI**: Digital Object Identifier (e.g., `10.1186/s41038-017-0090-z`)
- **Title**: Paper title
- **Author**: Author name
- **Keyword**: List of keywords
- **Abstract**: Paper abstract
- **pub_time**: Publication time (usually the year)
### Textbook/Ebook Example
The following image shows an example of metadata fields extracted from the copyright page of a Chinese ebook PDF:

As shown in the image, the following key information needs to be extracted from the book's copyright page:
- **ISBN**: International Standard Book Number (e.g., `978-7-5385-8594-0`)
- **Title**: Book title
- **Author**: Author/editor name
- **Publisher**: Publisher name
- **pub_time**: Publication time (year)
- **Category**: Book classification number
- **Abstract**: Content introduction (if available)
These examples demonstrate the core task of the benchmark test: accurately extracting structured metadata information from PDF documents in various formats and languages.
## 📊 Evaluation Metrics
### Core Evaluation Metrics
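This benchmark uses a string similarity-based evaluation method and reports two core metrics: field-level accuracy and overall accuracy. Both are built on the per-field similarity score defined next.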
### Similarity Calculation Rules
This benchmark uses a string similarity algorithm based on `SequenceMatcher`, with the following specific rules:
1. **Empty Value Handling**: One is empty and the other is not → similarity is 0
2. **Complete Match**: Both are identical (including both being empty) → similarity is 1
3. **Case Insensitive**: Convert to lowercase before comparison
4. **Sequence Matching**: Otherwise, use the `SequenceMatcher` similarity ratio, which is computed from the matching blocks (longest contiguous common substrings) of the two strings (range: 0-1)
**Similarity Score Interpretation**:
- `1.0`: Perfect match
- `0.8-0.99`: Highly similar (may have minor formatting differences)
- `0.5-0.79`: Partial match (extracted main information but incomplete)
- `0.0-0.49`: Low similarity (extraction result differs significantly from ground truth)
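The four rules can be sketched in a few lines, assuming `SequenceMatcher` refers to Python's `difflib.SequenceMatcher`. This is an illustration of the rules as written, not the code in `compare.py`; the function name `similarity` and the handling of `None` as an empty value are assumptions.

```python
from difflib import SequenceMatcher


def similarity(predicted, expected):
    """Score two field values in [0, 1] following the rules above."""
    # Assumption: None is treated the same as an empty string.
    a = "" if predicted is None else str(predicted)
    b = "" if expected is None else str(expected)
    # Rule 3: compare case-insensitively.
    a, b = a.lower(), b.lower()
    # Rule 2: identical values (including both empty) score 1.
    if a == b:
        return 1.0
    # Rule 1: exactly one side empty scores 0.
    if not a or not b:
        return 0.0
    # Rule 4: ratio() is 2 * M / T, where M is the number of matched
    # characters and T is the combined length of both strings.
    return SequenceMatcher(None, a, b).ratio()


print(similarity("ABCD", "abce"))  # 0.75 (3 matched characters, 2 * 3 / 8)
```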
#### 1. Field-level Accuracy
**Definition**: The average similarity score for each metadata field.
**Calculation Method**:
```
Field-level Accuracy = Σ(similarity of that field across all samples) / total number of samples
```
**Example**: When evaluating the `title` field on 100 samples, sum the 100 per-sample title similarities and divide by 100 to get the accuracy for that field.
**Use Cases**:
- Identify which fields the model performs well or poorly on
- Optimize extraction capabilities for specific fields
- For example: If `doi` accuracy is 0.95 and `abstract` accuracy is 0.75, the model needs improvement in extracting abstracts
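As a sketch of the field-level formula, the snippet below averages per-sample similarity for one field, matching predictions to benchmark records by `sha256`. The names `similarity` and `field_accuracy` are hypothetical, the compact `similarity` helper restates the rules above using Python's `difflib`, and treating a missing prediction as an empty value is an assumption.

```python
from difflib import SequenceMatcher


def similarity(predicted, expected):
    """Case-insensitive similarity in [0, 1]; None counts as empty (assumption)."""
    a = ("" if predicted is None else str(predicted)).lower()
    b = ("" if expected is None else str(expected)).lower()
    if a == b:
        return 1.0
    if not a or not b:
        return 0.0
    return SequenceMatcher(None, a, b).ratio()


def field_accuracy(predictions, benchmark, field):
    """Mean similarity of one field over all benchmark samples."""
    by_hash = {record["sha256"]: record for record in predictions}
    scores = []
    for truth in benchmark:
        # Assumption: a sample with no prediction is scored as an empty extraction.
        pred = by_hash.get(truth["sha256"], {})
        scores.append(similarity(pred.get(field), truth.get(field)))
    return sum(scores) / len(scores) if scores else 0.0


benchmark = [{"sha256": "a", "title": "Deep Learning"},
             {"sha256": "b", "title": "Algebra"}]
predictions = [{"sha256": "a", "title": "deep learning"}]
print(field_accuracy(predictions, benchmark, "title"))  # 0.5
```

The denominator is the number of benchmark samples, so a missing prediction lowers the score rather than being silently dropped.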
#### 2. Overall Accuracy
**Definition**: The average of all evaluated field accuracies, reflecting the model's overall performance.
**Calculation Method**:
```
Overall Accuracy = Σ(field-level accuracies) / total number of fields
```
**Example**: When evaluating 7 fields (isbn, title, author, abstract, category, pub_time, publisher), sum the 7 field-level accuracies and divide by 7.
**Use Cases**:
- Provide a single quantitative metric for overall model performance
- Facilitate horizontal comparison between different models or methods
- Serve as an overall objective for model optimization
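The overall formula is a plain mean over fields, sketched below. The function name `overall_accuracy` is hypothetical, and the per-field numbers are made-up illustration values, not benchmark results.

```python
def overall_accuracy(key_accuracy):
    """Average the field-level accuracies; every field carries equal weight."""
    if not key_accuracy:
        return 0.0
    return sum(key_accuracy.values()) / len(key_accuracy)


# Hypothetical field-level accuracies for the 7 textbook/ebook fields.
example = {
    "isbn": 0.90, "title": 0.95, "author": 0.85, "abstract": 0.75,
    "category": 0.70, "pub_time": 0.90, "publisher": 0.95,
}
print(round(overall_accuracy(example), 4))  # 0.8571 (6.0 / 7)
```

Because each field is weighted equally, a short field such as `pub_time` influences the overall score as much as a long field such as `abstract`.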
### Using the Evaluation Script
`compare.py` provides a convenient evaluation interface:
```python
from compare import main, write_similarity_data_to_excel
# Define file paths and fields to compare
file_llm = 'data/llm-label_textbook.jsonl' # LLM extraction results
file_bench = 'data/benchmark_textbook.jsonl' # Benchmark data
# For textbooks/ebooks
key_list = ['isbn', 'title', 'author', 'abstract', 'category', 'pub_time', 'publisher']
# For academic papers
# key_list = ['doi', 'title', 'author', 'keyword', 'abstract', 'pub_time']
# Run evaluation and get metrics
accuracy, key_accuracy, detail_data = main(file_llm, file_bench, key_list)
# Output results to Excel (optional)
write_similarity_data_to_excel(key_list, detail_data, "similarity_analysis.xlsx")
# View evaluation metrics
print("Field-level Accuracy:", key_accuracy)
print("Overall Accuracy:", accuracy)
```
### Output Files
The script generates an Excel file containing detailed sample-by-sample analysis:
- `sha256`: File identifier
- For each field (e.g., `title`):
  - `llm_title`: LLM extraction result
  - `benchmark_title`: Benchmark data
  - `similarity_title`: Similarity score (0-1)
## 📈 Statistics
### Data Scale
**First Batch (20250806)**:
- **Ebooks**: 70 records
- **Academic Papers**: 70 records
- **Textbooks**: 71 records
- **Subtotal**: 211 records
**Second Batch (20251022)**:
- **Ebooks**: 354 records
- **Academic Papers**: 399 records
- **Textbooks**: 46 records
- **Subtotal**: 799 records
**Total**: 1010 benchmark test records
The data covers multiple languages (English, Chinese, German, Greek, etc.) and multiple disciplines, with both batches together providing a rich and diverse set of test samples.
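The subtotals and the total above can be re-derived from the per-category counts. The short check below also gives the per-category totals across both batches, which the listing does not state explicitly; the variable names are illustrative.

```python
# Record counts as listed above, keyed by batch directory name.
counts = {
    "20250806": {"ebook": 70, "paper": 70, "textbook": 71},
    "20251022": {"ebook": 354, "paper": 399, "textbook": 46},
}

# Subtotal per batch and grand total.
subtotals = {batch: sum(by_type.values()) for batch, by_type in counts.items()}
total = sum(subtotals.values())

# Total per category across both batches.
per_category = {}
for by_type in counts.values():
    for category, n in by_type.items():
        per_category[category] = per_category.get(category, 0) + n

print(subtotals)     # {'20250806': 211, '20251022': 799}
print(total)         # 1010
print(per_category)  # {'ebook': 424, 'paper': 469, 'textbook': 117}
```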
## 🎯 Application Scenarios
1. **LLM Performance Evaluation**: Assess the ability of large language models to extract metadata from PDFs
2. **Information Extraction System Testing**: Test the accuracy of OCR, document parsing, and other systems
3. **Model Fine-tuning**: Use as training or fine-tuning data to improve model information extraction capabilities
4. **Cross-lingual Capability Evaluation**: Evaluate the model's ability to process multilingual literature
## 🔬 Data Characteristics
- ✅ **Real Data**: Real metadata extracted from actual PDF files
- ✅ **Diversity**: Covers literature from different eras, languages, and disciplines
- ✅ **Challenging**: Includes ancient texts, non-English literature, complex layouts, and other difficult cases
- ✅ **Traceable**: Each record includes SHA256 hash and original path
## 📋 Dependencies
```text
pandas>=1.3.0
openpyxl>=3.0.0
```
Install dependencies:
```bash
pip install pandas openpyxl
```
## 🤝 Contributing
If you would like to:
- Report data errors
- Add new evaluation dimensions
- Expand the dataset
Please submit an Issue or Pull Request.
## 📧 Contact
If you have questions or suggestions, please contact us through Issues.
---
**Last Updated**: December 26, 2025
Provider: maas
Created: 2025-12-30



