jeffdekerj/bms-images-shards-256
Source: Hugging Face. Updated 2025-11-14; indexed 2025-12-20.
Download link:
https://hf-mirror.com/datasets/jeffdekerj/bms-images-shards-256
Resource description:
---
license: cc0-1.0
task_categories:
- image-to-text
tags:
- chemistry
- molecular-structure
- smiles
- ocr
- computer-vision
- webdataset
- lightonocr
size_categories:
- 1M<n<10M
---
# BMS Molecular Translation - WebDataset Shards
This dataset contains pre-processed WebDataset shards of the BMS Molecular Translation dataset,
optimized for fast data loading during model training.
## Dataset Summary
- **Total Size**: 0.5 GB
- **Training shards**: 4053 files (0.5 GB), 2.36M molecular structure images paired with SMILES strings
- **Validation shards**: 83 files (less than 0.1 GB; reported as 0.0 GB), 48K samples for model validation
- **Test shards**: 85 files (less than 0.1 GB; reported as 0.0 GB), 24K held-out samples for final evaluation
## Format
Shards are in [WebDataset](https://github.com/webdataset/webdataset) format:
- Sequential tar archives for fast I/O
- 10,000 samples per shard
- Training data is pre-shuffled
- Validation and test data are kept in their original order
- **Tar files are preserved** (not extracted), so WebDataset can stream them directly
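
To make the shard layout concrete, here is a minimal sketch of the WebDataset convention, in which files inside a tar archive that share a basename form one sample. It uses only the Python standard library. The member names (`000000.png`, `000000.txt`) and the choice of `png`/`txt` extensions are assumptions for illustration; the actual member names inside these shards are not documented here and may differ.

```python
import io
import tarfile


def write_shard(samples):
    """Pack (key, image_bytes, smiles) triples into an in-memory tar shard."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for key, image_bytes, smiles in samples:
            # Assumed layout: <key>.png for the image, <key>.txt for the SMILES label.
            for ext, payload in (("png", image_bytes), ("txt", smiles.encode("utf-8"))):
                info = tarfile.TarInfo(name=f"{key}.{ext}")
                info.size = len(payload)
                tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


def read_shard(shard_bytes):
    """Yield one dict per sample by grouping consecutive members that share a key."""
    current_key, sample = None, {}
    with tarfile.open(fileobj=io.BytesIO(shard_bytes), mode="r") as tar:
        for member in tar:
            if not member.isfile():
                continue
            # The key is everything before the first dot; the rest is the extension.
            key, _, ext = member.name.partition(".")
            if key != current_key and sample:
                yield sample
                sample = {}
            current_key = key
            sample["__key__"] = key
            sample[ext] = tar.extractfile(member).read()
    if sample:
        yield sample


# Placeholder bytes stand in for real PNG data.
shard = write_shard([
    ("000000", b"fake-png-0", "CCO"),
    ("000001", b"fake-png-1", "c1ccccc1"),
])
samples = list(read_shard(shard))
```

Because members are read strictly in archive order, a reader never needs to seek, which is what makes sequential tar archives well suited to streaming from disk or network storage.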
## Usage
### Download the Dataset
```bash
# Install the Hugging Face Hub client
pip install huggingface_hub
# Download the entire dataset with the project's helper script
python download_shards_from_huggingface.py --username jeffdekerj --repo_name bms-images-shards-256
# Or call the Hugging Face Hub API directly from Python
python - <<'EOF'
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="jeffdekerj/bms-images-shards-256",
    repo_type="dataset",
    local_dir=".data/webdataset_shards",
)
EOF
```
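
After downloading, it can be worth checking that every shard arrived. The sketch below counts tar files per split and compares the result with the shard counts listed in the Dataset Summary. It assumes the shards sit in `train/`, `val/`, and `test/` subdirectories and end in `.tar`; the `train/` and `val/` directory names appear in the training command in this card, while `test/` and the file extension are assumptions.

```python
from pathlib import Path

# Shard counts taken from the Dataset Summary section of this card.
EXPECTED_SHARDS = {"train": 4053, "val": 83, "test": 85}


def count_shards(root):
    """Return the number of *.tar files found in each split directory."""
    root = Path(root)
    return {
        split: len(list((root / split).glob("*.tar")))
        for split in EXPECTED_SHARDS
    }


def missing_shards(root):
    """Return {split: shortfall} for every split whose count differs from the card."""
    counts = count_shards(root)
    return {
        split: EXPECTED_SHARDS[split] - found
        for split, found in counts.items()
        if found != EXPECTED_SHARDS[split]
    }


if __name__ == "__main__":
    # Path matches the local_dir used in the download example.
    print(count_shards(".data/webdataset_shards"))
    print(missing_shards(".data/webdataset_shards"))
```

An empty dictionary from `missing_shards` means the local copy matches the counts stated in this card.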
### Load with WebDataset
```python
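# BMSWebDataset comes from webdataset_loader.py, a project module rather than
# a PyPI package; it is expected to be on the import path (see Source Repository).
from webdataset_loader import BMSWebDataset
from transformers import AutoProcessor

# Processor for the LightOnOCR model that the shards are prepared for
processor = AutoProcessor.from_pretrained("lightonai/LightOnOCR-1B-1025")

train_dataset = BMSWebDataset(
    shard_dir=".data/webdataset_shards/train/",
    processor=processor,
    user_prompt="Return the SMILES string for this molecule.",
    shuffle_buffer=1000,  # number of samples held in the in-memory shuffle buffer
)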
```
### Train Your Model
```bash
python finetune_lightocr.py \
  --train_shards .data/webdataset_shards/train/ \
  --val_shards .data/webdataset_shards/val/ \
  --per_device_train_batch_size 4 \
  --num_train_epochs 3 \
  --fp16
```
## Benefits
- **2-5x faster** data loading than reading individual image files
- **Better I/O** performance on network filesystems
- **Lower overhead** thanks to sequential reads
- **Built-in shuffling** through a fixed-size buffer, without loading the full dataset into memory
- **Tar files preserved**: archives are not auto-extracted, as they are on Kaggle
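
The buffered shuffling mentioned above can be illustrated with a short, generic sketch. This is not the implementation used by the `webdataset` library or by `BMSWebDataset`; the function name `buffered_shuffle` and its parameters are hypothetical. It shows the general technique: keep a fixed-size buffer, emit a random element from it each time a new sample arrives, and flush the remainder at the end, so memory use is bounded by the buffer size rather than the dataset size.

```python
import random


def buffered_shuffle(stream, buffer_size, seed=None):
    """Yield items from `stream` in approximately random order.

    Only `buffer_size` items are held in memory at any time.
    """
    if buffer_size < 1:
        raise ValueError("buffer_size must be at least 1")
    rng = random.Random(seed)
    buffer = []
    for item in stream:
        if len(buffer) < buffer_size:
            # Fill phase: nothing is emitted until the buffer is full.
            buffer.append(item)
            continue
        # Swap phase: emit a random buffered item and store the new one in its slot.
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = item
    # Flush phase: emit whatever remains, in random order.
    rng.shuffle(buffer)
    yield from buffer


# Shuffle 100 sequential sample indices with a 10-item buffer.
shuffled = list(buffered_shuffle(range(100), buffer_size=10, seed=0))
```

A larger buffer gives a more thorough shuffle at the cost of more memory. With a small buffer, samples can only move a limited distance from their original position, which is one reason pre-shuffled training shards are useful.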
## Source Repository
GitHub: https://github.com/JeffDeKerj/lightonocr
Complete documentation available in the repository:
- `docs/WEBDATASET_GUIDE.md` - Complete usage guide
- `docs/HUGGINGFACE_GUIDE.md` - HuggingFace-specific guide
- `docs/FINETUNE_GUIDE.md` - Fine-tuning guide
- `README.md` - Project overview
## Original Dataset
Based on the BMS Molecular Translation competition dataset:
https://www.kaggle.com/c/bms-molecular-translation
## Citation
If you use this dataset, please cite both:
1. The original BMS Molecular Translation competition
2. The LightOnOCR model (if applicable to your work)
## License
CC0: Public Domain. Free to use for any purpose.
Provider:
jeffdekerj



