vinven7/FormBench

Name: vinven7/FormBench
Creator: vinven7
Published: 2026-04-17 06:39:33
License: cc-by-4.0

Hugging Face | Updated 2026-04-17 | Indexed 2026-04-12

Download link:

https://hf-mirror.com/datasets/vinven7/FormBench


Description:

---
language:
- en
license: cc-by-4.0
pretty_name: FormBench
task_categories:
- text-retrieval
task_ids:
- document-retrieval
annotations_creators:
- machine-generated
language_creators:
- found
multilinguality:
- monolingual
source_datasets:
- original
size_categories:
- 1M<n<10M
tags:
- beir
- retrieval
- patents
- materials-science
- formulations
- chemistry
- benchmark
- graded-relevance
- neurips-2026
configs:
- config_name: formbench-structured
  data_files:
  - split: corpus
    path: formbench-structured/corpus.jsonl
  - split: queries
    path: formbench-structured/queries.jsonl
- config_name: formbench-random
  data_files:
  - split: corpus
    path: formbench-random/corpus.jsonl
  - split: queries
    path: formbench-random/queries.jsonl
- config_name: formbench-sample
  data_files:
  - split: corpus
    path: formbench-sample/corpus.jsonl
  - split: queries
    path: formbench-sample/queries.jsonl
---

# FormBench: A Formulation Retrieval Benchmark

FormBench is a large-scale information retrieval benchmark for **formulation science**: adhesives, coatings, polymers, pharmaceuticals, lubricants, agrochemicals, and related industries. It provides ~1M corpus passages, 55,347 queries, and 4-level graded relevance qrels derived from a domain taxonomy of 590K US formulation patents.

Two corpus variants are provided.
In the paper these are called **C0** and **C1**:

| Config | HF name | Passages | Distractor strategy |
|--------|---------|----------|---------------------|
| C1 | `formbench-structured` | 994,609 | Near-miss chunks from tuple patents + random fill |
| C0 | `formbench-random` | 997,312 | Random chunks from non-tuple patents |
| n/a | `formbench-sample` | 63,058 | Labeled passages only; reviewer entry point (<400 MB) |

## Graded Relevance Scheme

| Score | Meaning |
|-------|---------|
| 3 | Anchor: passage the query was generated from |
| 2 | Hard negative: same taxonomy cluster, different formulation type |
| 1 | Soft negative: different cluster, same macro-domain |
| 0 | Irrelevant: not written to the qrels files (BEIR convention) |

Standard BEIR binary evaluation: score ≥ 1. Strict binary (anchor-only): score == 3.

## Quick Load

```python
from pathlib import Path
from beir.datasets.data_loader import GenericDataLoader
from huggingface_hub import snapshot_download

# GenericDataLoader reads a local BEIR folder, so download the config first.
# Assumption: qrels/ sits inside the config folder next to corpus.jsonl.
root = snapshot_download('vinven7/FormBench', repo_type='dataset', allow_patterns='formbench-structured/*')
corpus, queries, qrels = GenericDataLoader(str(Path(root) / 'formbench-structured')).load(split='test')
```

Start with `formbench-sample` (~400 MB) for exploration.

## Domain Taxonomy

3-level hierarchy built from entity co-occurrence in ~590K USPTO formulation patents:

- **6 macro-domains**: Life Sciences & Health, Chemicals & Energy, Materials & Polymers, Coatings/Inks/Adhesives, Electronics & Construction, Other
- **23 clusters**: e.g., Pharmaceutical, Polymer Composites, Lithium Batteries, Coatings & Paints
- **4,899 fine-grained labels**: open-ended, assigned by Claude Haiku

## File Schema

- **corpus.jsonl**: `_id` (PATENT_ID:CHUNK_IDX), `title`, `text`, `metadata` (patent_id, patent_title, cpc_subclasses, year, chunk_no)
- **queries.jsonl**: `_id`, `text`, `metadata` (patent_id, passage_key, clustered_category, macro_category, split)
- **qrels/{train,dev,test}.tsv**: tab-separated query-id, corpus-id, score (with header)

## Responsible AI

**Data source and provenance:** USPTO patent full text is public domain. No personal data is present.
Passages are excerpted from patent descriptions without modification. Queries are synthetic: generated by Claude Sonnet 3.5 and filtered by Claude Haiku 3. NER extraction used Llama-3-8B with a LoRA adapter trained on materials science text. The taxonomy was constructed via entity co-occurrence Jaccard similarity within CPC subclasses.

**Synthetic content:** Queries are machine-generated (`isSynthetic: true` for the query split). Passage text is taken verbatim from public patent documents (`isSynthetic: false`).

**Known biases:**

- The USPTO corpus over-represents US-origin innovations and large industrial applicants (major pharma, chemical, and materials companies file disproportionately more patents).
- English-language only; non-English patent filings are excluded.
- Temporal coverage is biased toward 1995–2022 (USPTO digital archive period).
- Qrel scores are taxonomy-derived, not human-annotated; fine-grained discrimination within a cluster reflects entity co-occurrence Jaccard, not human relevance judgments.
- Formula-only passages (chemical equations with minimal prose) were identified and removed from the corpus (6 passages, training split only; test metrics unaffected).

**Personal or sensitive information:** None. All source material is public-domain USPTO patent text. Inventor names present in raw patent data are not included in corpus passages (description chunks only).

**Social impact:** FormBench is intended to advance retrieval systems for industrial R&D. Potential positive impact: faster discovery of relevant prior art for formulation development. Potential misuse: retrieval systems trained on FormBench could be used to extract proprietary formulation insights from public patents at scale; appropriate access controls should be applied in deployment.

**Maintenance:** Hosted under CC-BY-4.0. The dataset will remain publicly accessible. Corrections and community contributions are accepted via the HuggingFace Community tab.
Future versions may include human-validated qrel subsets and multilingual extensions.

## Citation

```bibtex
@misc{formbench2026,
  title={FormBench: A Large-Scale Benchmark for Formulation Retrieval in Patent Literature},
  author={Venugopal, Vineeth and others},
  year={2026},
  note={NeurIPS 2026 Evaluations \& Datasets Track (submitted)},
  url={https://huggingface.co/datasets/vinven7/FormBench}
}
```
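
## Example: Binarizing Graded Qrels

The graded relevance scheme and the qrels file schema described earlier can be turned into the two binary evaluation views with a few lines of standard-library Python. The following is a minimal sketch, not part of the official tooling: the helper names `load_qrels` and `binarize` are hypothetical, and the sample rows are made up for illustration rather than taken from the dataset.

```python
import csv
import io
from collections import defaultdict


def load_qrels(tsv_text):
    """Parse a BEIR-style qrels TSV (query-id, corpus-id, score) with a header row."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    next(reader)  # skip the header line
    qrels = defaultdict(dict)
    for row in reader:
        if not row:  # tolerate blank lines
            continue
        query_id, corpus_id, score = row
        qrels[query_id][corpus_id] = int(score)
    return dict(qrels)


def binarize(qrels, threshold):
    """Keep only pairs with score >= threshold and map them to relevance 1."""
    return {
        qid: {doc_id: 1 for doc_id, score in docs.items() if score >= threshold}
        for qid, docs in qrels.items()
    }


# Hypothetical rows for illustration only; IDs follow the PATENT_ID:CHUNK_IDX pattern.
SAMPLE = (
    "query-id\tcorpus-id\tscore\n"
    "q1\tUS0000001:4\t3\n"
    "q1\tUS0000002:7\t2\n"
    "q1\tUS0000003:1\t1\n"
)

graded = load_qrels(SAMPLE)
standard = binarize(graded, threshold=1)  # standard BEIR binary: score >= 1
strict = binarize(graded, threshold=3)    # anchor-only: 3 is the maximum score
```

The nested `{query_id: {corpus_id: score}}` dictionaries have the same shape as the qrels returned by BEIR's data loader, so either view should be usable wherever graded qrels are accepted.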

Provider:

vinven7
