
jeffdekerj/bms-images-shards

Hugging Face | Updated 2025-11-13 | Indexed 2025-12-20
Download link:
https://hf-mirror.com/datasets/jeffdekerj/bms-images-shards
Resource description:
---
license: cc0-1.0
task_categories:
- image-to-text
tags:
- chemistry
- molecular-structure
- smiles
- ocr
- computer-vision
- webdataset
- lightonocr
size_categories:
- 1M<n<10M
---

# BMS Molecular Translation - WebDataset Shards

This dataset contains pre-processed WebDataset shards of the BMS Molecular Translation dataset, optimized for fast data loading during model training.

## Dataset Summary

- **Total Size**: 3.8 GB
- **Training shards**: 236 files (3.7 GB) - 2.36M molecular structure images with SMILES
- **Validation shards**: 5 files (0.1 GB) - 48K samples for model validation
- **Test shards**: 3 files (under 0.1 GB) - 24K held-out samples for final evaluation

## Format

Shards are in [WebDataset](https://github.com/webdataset/webdataset) format:

- Sequential tar archives for fast I/O
- Up to 10,000 samples per shard
- Training data pre-shuffled
- Val/test data in original order
- **Tar files are preserved** (not extracted), so they can be read by WebDataset as-is

## Usage

### Download the Dataset

```bash
# Install the Hugging Face Hub client
pip install huggingface_hub

# Option 1: download the entire dataset with the helper script
python download_shards_from_huggingface.py --username jeffdekerj

# Option 2: call the Hugging Face Hub API directly
python - <<'EOF'
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="jeffdekerj/bms-images-shards",
    repo_type="dataset",
    local_dir=".data/webdataset_shards",
)
EOF
```

### Load with WebDataset

```python
from webdataset_loader import BMSWebDataset
from transformers import AutoProcessor

# Processor for the LightOnOCR model
processor = AutoProcessor.from_pretrained("lightonai/LightOnOCR-1B-1025")

# Stream training samples from the downloaded shards
train_dataset = BMSWebDataset(
    shard_dir=".data/webdataset_shards/train/",
    processor=processor,
    user_prompt="Return the SMILES string for this molecule.",
    shuffle_buffer=1000,
)
```

### Train Your Model

```bash
python finetune_lightocr.py \
    --train_shards .data/webdataset_shards/train/ \
    --val_shards .data/webdataset_shards/val/ \
    --per_device_train_batch_size 4 \
    --num_train_epochs 3 \
    --fp16
```

## Benefits

- **2-5x faster** data loading vs individual files
- **Better I/O** performance for network filesystems
- **Lower overhead** with sequential reads
- **Built-in shuffling** without memory overhead
- **Tar files preserved** - no auto-extraction, as happens on Kaggle

## Source Repository

GitHub: https://github.com/JeffDeKerj/lightocr

Complete documentation is available in the repository:

- `docs/WEBDATASET_GUIDE.md` - Complete usage guide
- `docs/HUGGINGFACE_GUIDE.md` - HuggingFace-specific guide
- `docs/FINETUNE_GUIDE.md` - Fine-tuning guide
- `README.md` - Project overview

## Original Dataset

Based on the BMS Molecular Translation competition dataset: https://www.kaggle.com/c/bms-molecular-translation

## Citation

If you use this dataset, please cite both:

1. The original BMS Molecular Translation competition
2. The LightOnOCR model (if applicable to your work)

## License

CC0: Public Domain. Free to use for any purpose.
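To illustrate the shard layout described under Format, the sketch below reads a WebDataset-style tar archive with only the Python standard library. It relies on the general WebDataset convention that consecutive tar members sharing a key (the path up to the first dot of the file name) form one sample. The function name `iter_samples` is hypothetical, and the field extensions stored in these particular shards (for example `png` for the image and `txt` for the SMILES string) are an assumption that this card does not confirm, so inspect a shard before relying on them.

```python
import tarfile
from typing import Dict, Iterator, Tuple


def iter_samples(tar_path: str) -> Iterator[Tuple[str, Dict[str, bytes]]]:
    """Yield (key, fields) for each sample in a WebDataset-style tar shard.

    Consecutive members with the same key are grouped into one sample;
    `fields` maps each member's extension to its raw bytes.
    """
    current_key = None
    fields: Dict[str, bytes] = {}
    with tarfile.open(tar_path, "r") as tar:
        for member in tar:
            if not member.isfile():
                continue  # skip directories and other non-file entries
            dirname, _, filename = member.name.rpartition("/")
            stem, _, ext = filename.partition(".")  # split at the FIRST dot
            key = f"{dirname}/{stem}" if dirname else stem
            if key != current_key:
                if fields:
                    yield current_key, fields  # previous sample is complete
                current_key, fields = key, {}
            fields[ext] = tar.extractfile(member).read()
        if fields:
            yield current_key, fields  # last sample in the shard
```

Iterating over one downloaded shard and printing `sorted(fields)` for the first sample is a quick way to discover the actual field names before writing a loader around them.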
Provider:
jeffdekerj