
vineetsinghvats/SynthIPData

Hugging Face · Updated 2026-04-08 · Indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/vineetsinghvats/SynthIPData
Resource description:
---
license: mit
task_categories:
- text-classification
- text-generation
language:
- en
tags:
- patent
- legal
- synthetic-data
- office-actions
- USPTO
- rare-categories
- data-augmentation
size_categories:
- 1K<n<10K
---

# SynthIPData: Synthetic Patent Office Action Rejections for Rare Categories

## Dataset Description

SynthIPData is the first synthetic dataset targeting rare patent office action rejection categories. It contains 1,800 synthetic patent rejection documents generated by a LoRA-fine-tuned Mistral-7B model, trained on 6,161 real USPTO office actions.

### Why This Dataset?

Patent office action rejections are critical documents in intellectual property law. However, certain rejection-type × technology-area combinations are severely underrepresented in existing data. For example, 35 USC 101 rejections in Materials/Coatings (USPC 428) account for fewer than 100 documents in the entire 2020-2024 USPTO corpus. This data scarcity prevents AI systems from learning to handle these rare but important cases.

SynthIPData addresses this gap by generating high-fidelity synthetic office actions that are:

- **Indistinguishable from real text** (perplexity ratio 1.04 vs real documents)
- **Effective for retrieval augmentation** (+200% recall for rarest categories)
- **Useful in few-shot learning** (+12% F1 improvement when real data is scarce)

## 8 Rare Categories

| Category | Rejection Type | USPC Class | Technology Area | Real Seeds | Synthetic |
|---|---|---|---|---|---|
| 101_ai_ml | 35 USC 101 | 706 | AI/Neural Networks | 2,424 | 240 |
| 112_ai_ml | 35 USC 112 | 706 | AI/Neural Networks | 1,552 | 240 |
| dp_ai_ml | Double Patenting | 706 | AI/Neural Networks | 1,184 | 120 |
| 101_semiconductors | 35 USC 101 | 257 | Semiconductors | 363 | 240 |
| 101_surgical | 35 USC 101 | 606 | Surgical Instruments | 230 | 240 |
| 101_crypto | 35 USC 101 | 380 | Cryptography | 181 | 240 |
| 101_batteries | 35 USC 101 | 429 | Batteries/Fuel Cells | 152 | 240 |
| 101_materials | 35 USC 101 | 428 | Materials/Coatings | 75 | 240 |

## Data Sources

- **Real seeds**: USPTO Office Actions Weekly Archives (OACT), 2020-2024
- **Category discovery**: USPTO PTOFFACT dataset, 2014-2017 (2.4M office actions analyzed)
- **Synthetic generation**: Mistral-7B fine-tuned with LoRA on 5,544 real office actions

## Evaluation Results

### Text Quality (Perplexity)

Average perplexity ratio: **1.04** (1.0 = identical to real, lower is better)

| Category | Real PPL | Synthetic PPL | Ratio |
|---|---|---|---|
| 101_ai_ml | 6.1 | 6.1 | 0.99 |
| 101_semiconductors | 7.3 | 5.4 | 0.74 |
| 101_surgical | 6.3 | 5.8 | 0.93 |
| 112_ai_ml | 4.6 | 4.7 | 1.04 |

### Retrieval Improvement (Rare Categories)

| Category | Real Only Recall@1 | + SynthIPData | Improvement |
|---|---|---|---|
| 101_materials | 6.7% | 20.0% | +200% |
| 101_batteries | 30.0% | 43.3% | +44% |
| 101_semiconductors | 55.6% | 61.1% | +10% |
| dp_ai_ml | 67.1% | 73.0% | +9% |

### Few-Shot Classification

With only 30 real examples per category:

| Method | F1 Macro |
|---|---|
| Few-shot only | 0.307 |
| + Paraphrasing | 0.311 |
| **+ SynthIPData** | **0.344 (+12%)** |

## Usage

```python
from datasets import load_dataset

dataset = load_dataset("vineetsinghvats/SynthIPData")

# Access synthetic documents
for doc in dataset['synthetic']:
    print(doc['category'], doc['title'])
    print(doc['text'][:200])

# Access real seed metadata
for doc in dataset['real_seeds']:
    print(doc['category'], doc['title'])
```

## Model Training Details

- **Base model**: Mistral-7B-v0.1
- **Fine-tuning**: LoRA (r=16, alpha=32)
- **Training data**: 5,544 real office actions
- **Training loss**: 0.528 (validation: 0.561)
- **Memorization rate**: 8.3% before filtering, 0% after Qdrant-based deduplication

## Citation

```bibtex
@dataset{singh2026synthipdata,
  title={SynthIPData: Synthetic Data Augmentation for Rare Patent Office Action Rejection Categories},
  author={Singh, Vineet},
  year={2026},
  publisher={Hugging Face},
  url={https://huggingface.co/datasets/vineetsinghvats/SynthIPData}
}
```

## License

MIT License

## Contact

Vineet Singh - [GitHub](https://github.com/vineetsingh-vs/synthipdata) | [LinkedIn](https://linkedin.com/in/vineetsingh44)
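
## Appendix: Reading the Improvement Figures

The "Improvement" values in the retrieval and few-shot results above appear to be relative gains over the real-only baseline, not absolute percentage-point differences. The sketch below recomputes them from the rounded numbers in those tables. The helper name `relative_improvement` is hypothetical and not part of the dataset's code. Because the inputs are already rounded, a recomputed value can differ from the reported one by about a point (for example, 101_materials comes out slightly under the reported +200%).

```python
def relative_improvement(baseline: float, augmented: float) -> float:
    """Return the gain of `augmented` over `baseline` as a percentage of `baseline`."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (augmented - baseline) / baseline * 100.0


# Recall@1 values (in percent) copied from the retrieval table:
# category -> (real only, real + SynthIPData)
retrieval = {
    "101_materials": (6.7, 20.0),
    "101_batteries": (30.0, 43.3),
    "101_semiconductors": (55.6, 61.1),
    "dp_ai_ml": (67.1, 73.0),
}

for category, (real_only, with_synth) in retrieval.items():
    gain = relative_improvement(real_only, with_synth)
    print(f"{category}: {gain:+.1f}% relative Recall@1 change")

# Macro F1 values copied from the few-shot table.
few_shot_gain = relative_improvement(0.307, 0.344)
print(f"few-shot F1: {few_shot_gain:+.1f}% relative change")
```

Read this way, the 101_materials gain corresponds to an absolute change of 13.3 percentage points in Recall@1 (6.7% to 20.0%), which is worth keeping in mind when comparing categories with very different baselines.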
Provided by:
vineetsinghvats