vinven7/FormBench
Source: Hugging Face. Updated 2026-04-17; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/vinven7/FormBench
Resource description:
---
language:
- en
license: cc-by-4.0
pretty_name: FormBench
task_categories:
- text-retrieval
task_ids:
- document-retrieval
annotations_creators:
- machine-generated
language_creators:
- found
multilinguality:
- monolingual
source_datasets:
- original
size_categories:
- 1M<n<10M
tags:
- beir
- retrieval
- patents
- materials-science
- formulations
- chemistry
- benchmark
- graded-relevance
- neurips-2026
configs:
- config_name: formbench-structured
  data_files:
  - split: corpus
    path: formbench-structured/corpus.jsonl
  - split: queries
    path: formbench-structured/queries.jsonl
- config_name: formbench-random
  data_files:
  - split: corpus
    path: formbench-random/corpus.jsonl
  - split: queries
    path: formbench-random/queries.jsonl
- config_name: formbench-sample
  data_files:
  - split: corpus
    path: formbench-sample/corpus.jsonl
  - split: queries
    path: formbench-sample/queries.jsonl
---
# FormBench: A Formulation Retrieval Benchmark
FormBench is a large-scale information retrieval benchmark for **formulation science**,
covering adhesives, coatings, polymers, pharmaceuticals, lubricants, agrochemicals, and
related industries. It provides ~1M corpus passages, 55,347 queries, and 4-level graded
relevance judgments (qrels) derived from a domain taxonomy of ~590K US formulation patents.
Two full corpus variants are provided, called **C0** and **C1** in the paper, plus a smaller sample config:
| Paper name | HF config | Passages | Distractor strategy |
|--------|---------|----------|-------------------|
| C1 | `formbench-structured` | 994,609 | Near-miss chunks from tuple patents + random fill |
| C0 | `formbench-random` | 997,312 | Random chunks from non-tuple patents |
| — | `formbench-sample` | 63,058 | Labeled passages only — reviewer entry point (<400 MB) |
## Graded Relevance Scheme
| Score | Meaning |
|-------|---------|
| 3 | Anchor — passage the query was generated from |
| 2 | Hard negative — same taxonomy cluster, different formulation type |
| 1 | Soft negative — different cluster, same macro-domain |
| 0 | Irrelevant; these pairs are not written to the qrels files (BEIR convention) |
For standard BEIR binary evaluation, count any score ≥ 1 as relevant. For strict binary (anchor-only) evaluation, count only score == 3.
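
The two binary views can be derived from the graded qrels with a single threshold. The sketch below assumes qrels in the usual BEIR in-memory shape (`{query_id: {corpus_id: score}}`); the helper name `binarize_qrels` and the toy IDs are illustrative and not part of the dataset.

```python
from typing import Dict

Qrels = Dict[str, Dict[str, int]]


def binarize_qrels(qrels: Qrels, min_score: int) -> Qrels:
    """Keep only pairs whose graded score is >= min_score and map them to 1."""
    binary: Qrels = {}
    for query_id, docs in qrels.items():
        kept = {doc_id: 1 for doc_id, score in docs.items() if score >= min_score}
        if kept:  # drop queries left with no relevant passage
            binary[query_id] = kept
    return binary


# Toy graded qrels with made-up IDs.
graded = {"q1": {"P1:0": 3, "P2:4": 2, "P3:1": 1}}

standard = binarize_qrels(graded, min_score=1)  # standard BEIR view: score >= 1
strict = binarize_qrels(graded, min_score=3)    # strict view: anchor only
```

Keeping the graded qrels as the source of truth and deriving binary views on demand means both evaluation modes can be run from one qrels file.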
## Quick Load
```python
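from pathlib import Path
from beir.datasets.data_loader import GenericDataLoader
from huggingface_hub import snapshot_download

# GenericDataLoader reads a local folder, so download one config first.
# Assumes the qrels sit at <config>/qrels/test.tsv inside the repo.
root = snapshot_download('vinven7/FormBench', repo_type='dataset', allow_patterns=['formbench-structured/*'])
corpus, queries, qrels = GenericDataLoader(str(Path(root) / 'formbench-structured')).load(split='test')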
```
Start with `formbench-sample` (~400 MB) for exploration.
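
Going by the `configs` declared in the card metadata, the sample can probably also be read directly with the Hugging Face `datasets` library. This is a minimal sketch under that assumption: `rows_to_beir_corpus` and `load_sample_corpus` are hypothetical helper names, and the download step needs network access and the `datasets` package.

```python
from typing import Dict, Iterable, Mapping


def rows_to_beir_corpus(rows: Iterable[Mapping]) -> Dict[str, Dict[str, str]]:
    """Convert corpus rows into the BEIR-style {_id: {"title", "text"}} mapping."""
    return {
        row["_id"]: {"title": row.get("title") or "", "text": row["text"]}
        for row in rows
    }


def load_sample_corpus() -> Dict[str, Dict[str, str]]:
    """Download the `corpus` split of `formbench-sample` (assumed layout)."""
    from datasets import load_dataset  # imported lazily; requires `datasets`

    rows = load_dataset("vinven7/FormBench", "formbench-sample", split="corpus")
    return rows_to_beir_corpus(rows)


# Offline check of the conversion with made-up rows.
toy_rows = [
    {"_id": "US0000001:0", "title": "Adhesive", "text": "An epoxy adhesive."},
    {"_id": "US0000001:1", "title": None, "text": "The hardener is an amine."},
]
toy_corpus = rows_to_beir_corpus(toy_rows)
```

Calling `load_sample_corpus()` returns the same structure for the real sample, which can then be passed to any BEIR-compatible retriever.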
## Domain Taxonomy
The taxonomy is a 3-level hierarchy built from entity co-occurrence in ~590K USPTO formulation patents:
- **6 macro-domains**: Life Sciences & Health, Chemicals & Energy, Materials & Polymers,
Coatings/Inks/Adhesives, Electronics & Construction, Other
- **23 clusters**: e.g., Pharmaceutical, Polymer Composites, Lithium Batteries, Coatings & Paints
- **4,899 fine-grained labels**: open-ended, assigned by Claude Haiku
## File Schema
- **corpus.jsonl**: `_id` (`PATENT_ID:CHUNK_IDX`), `title`, `text`, `metadata` (`patent_id`, `patent_title`, `cpc_subclasses`, `year`, `chunk_no`)
- **queries.jsonl**: `_id`, `text`, `metadata` (`patent_id`, `passage_key`, `clustered_category`, `macro_category`, `split`)
- **qrels/{train,dev,test}.tsv**: tab-separated `query-id`, `corpus-id`, `score`, with a header row
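
The schema above can be consumed without any retrieval library. The sketch below reads a qrels file with Python's `csv` module and splits a corpus `_id` into its two parts. It assumes the chunk index is an integer, and the function names and IDs are made up for illustration.

```python
import csv
import io
from typing import Dict, TextIO, Tuple


def split_corpus_id(corpus_id: str) -> Tuple[str, int]:
    """Split 'PATENT_ID:CHUNK_IDX' into (patent_id, chunk_idx)."""
    patent_id, chunk_idx = corpus_id.rsplit(":", 1)
    return patent_id, int(chunk_idx)  # assumes CHUNK_IDX is an integer


def read_qrels(handle: TextIO) -> Dict[str, Dict[str, int]]:
    """Read a tab-separated qrels file whose first row is a header."""
    reader = csv.reader(handle, delimiter="\t")
    next(reader)  # skip the header row
    qrels: Dict[str, Dict[str, int]] = {}
    for row in reader:
        if len(row) != 3:  # ignore blank or malformed lines
            continue
        query_id, corpus_id, score = row
        qrels.setdefault(query_id, {})[corpus_id] = int(score)
    return qrels


# Toy content with made-up IDs, standing in for qrels/test.tsv.
sample_tsv = "query-id\tcorpus-id\tscore\nq1\tUS1234567:12\t3\nq1\tUS7654321:0\t1\n"
qrels = read_qrels(io.StringIO(sample_tsv))
patent_id, chunk_idx = split_corpus_id("US1234567:12")
```

For a real file, pass `open(path, newline="", encoding="utf-8")` instead of the `io.StringIO` object.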
## Responsible AI
**Data source and provenance:**
USPTO patent full text is public domain. No personal data is present.
Passages are excerpted from patent descriptions without modification.
Queries are synthetic: generated by Claude Sonnet 3.5 and filtered by Claude Haiku 3.
NER extraction used Llama-3-8B with a LoRA adapter trained on materials science text.
The taxonomy was constructed via entity co-occurrence Jaccard similarity within CPC subclasses.
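
For readers unfamiliar with the measure, the Jaccard similarity of two entity sets is the size of their intersection divided by the size of their union. The sketch below only illustrates the measure on made-up entity sets; it is not the pipeline used to build the taxonomy.

```python
from typing import Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Return |A ∩ B| / |A ∪ B|, defined here as 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Made-up entity sets for two hypothetical patents.
patent_a = {"epoxy resin", "amine hardener", "silica"}
patent_b = {"epoxy resin", "silica", "titanium dioxide", "defoamer"}

similarity = jaccard(patent_a, patent_b)  # 2 shared entities out of 5 distinct
```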
**Synthetic content:**
Queries are machine-generated (`isSynthetic: true` for the query split).
Passage text is taken verbatim from public patent documents (`isSynthetic: false`).
**Known biases:**
- USPTO corpus over-represents US-origin innovations and large industrial applicants
(major pharma, chemical, and materials companies file disproportionately more patents).
- English-language only; non-English patent filings are excluded.
- Temporal coverage is biased toward 1995–2022 (USPTO digital archive period).
- Qrel scores are taxonomy-derived, not human-annotated; fine-grained discrimination
within a cluster reflects entity co-occurrence Jaccard, not human relevance judgments.
- Formula-only passages (chemical equations with minimal prose) were identified and
removed from the corpus (6 passages, training split only; test metrics unaffected).
**Personal or sensitive information:**
None. All source material is public-domain USPTO patent text. Inventor names present
in raw patent data are not included in corpus passages (description chunks only).
**Social impact:**
FormBench is intended to advance retrieval systems for industrial R&D. Potential
positive impact: faster discovery of relevant prior art for formulation development.
Potential misuse: retrieval systems trained on FormBench could be used to extract
proprietary formulation insights from public patents at scale; appropriate access
controls should be applied in deployment.
**Maintenance:**
Released under CC-BY-4.0. The dataset will remain publicly accessible. Corrections and
community contributions are accepted via the Hugging Face Community tab. Future versions
may include human-validated qrel subsets and multilingual extensions.
## Citation
```bibtex
@misc{formbench2026,
title={FormBench: A Large-Scale Benchmark for Formulation Retrieval in Patent Literature},
author={Venugopal, Vineeth and others},
year={2026},
note={NeurIPS 2026 Evaluations & Datasets Track (submitted)},
url={https://huggingface.co/datasets/vinven7/FormBench}
}
```
Provider:
vinven7



