Yesianrohn/OCR-Data

Source: Hugging Face | Updated: 2026-04-12 | Indexed: 2026-04-12
Download link:
https://hf-mirror.com/datasets/Yesianrohn/OCR-Data
Resource description:
---
license: apache-2.0
task_categories:
- object-detection
- image-to-text
language:
- zh
- en
tags:
- ocr
- text-detection
- text-recognition
- document-understanding
- scene-text
- handwritten-chinese
pretty_name: OCR Text Detection and Recognition Dataset
size_categories:
- 100K<n<1M
---

# OCR Text Detection and Recognition Dataset

## Dataset Description

A large-scale, multi-source OCR dataset aggregating **14 public benchmarks** for text detection and recognition in both scene images and handwritten documents. Each image is paired with:

- **Transcribed text** for each text region
- **Bounding boxes** (axis-aligned rectangles) for each text region
- **Polygon coordinates** (precise boundary points) for each text region

The dataset is stored in HuggingFace Parquet format with images embedded as raw bytes, enabling efficient streaming and zero-setup loading. Each source benchmark is stored as a separate **split**, so you can load individual subsets or combine them freely.

### Included Benchmarks

| Split | Source | Description |
|-------|--------|-------------|
| `ART` | [ART](https://rrc.cvc.uab.es/?ch=14) | Arbitrary-shaped text in natural scenes |
| `cocotext` | [COCO-Text](https://bgshih.github.io/cocotext/) | Text annotations on MS-COCO images |
| `CTW` | [CTW](https://ctwdataset.github.io/) | Chinese text in the wild |
| `hiertext` | [HierText](https://github.com/google-research-datasets/hiertext) | Hierarchical text in scene images |
| `LSVT` | [LSVT](https://rrc.cvc.uab.es/?ch=16) | Large-scale Street View Text |
| `MTWI` | [MTWI](https://tianchi.aliyun.com/competition/entrance/231651) | Multi-type web images |
| `openvino` | [OpenVINO](https://github.com/openvinotoolkit/open_model_zoo) | Text detection training data |
| `RCTW` | [RCTW-17](https://rctw.vlrlab.net/) | Reading Chinese text in the wild |
| `ReCTS` | [ReCTS](https://rrc.cvc.uab.es/?ch=12) | Reading Chinese text on signboards |
| `SCUT_HCCDoc` | [SCUT-HCCDoc](https://github.com/HCIILAB/SCUT-HCCDoc_Dataset_Release) | Handwritten Chinese text in documents |
| `ShopSign` | [ShopSign](https://github.com/chongshengzhang/shopsign) | Chinese shop sign text |
| `TextOCR` | [TextOCR](https://textvqa.org/textocr/) | Text in natural images (TextVQA) |
| `UberText` | [UberText](https://s3-us-west-2.amazonaws.com/uber-common-public/ubertext/index.html) | Text from Bing Street View imagery |
| `MLT2019` | [MLT 2019](https://rrc.cvc.uab.es/?ch=15) | Multi-lingual scene text |

## Dataset Structure

### Features

| Feature | Type | Description |
|---------|------|-------------|
| `image` | `Image` | The document/scene image (embedded as raw bytes) |
| `texts` | `Sequence[string]` | List of transcribed text strings, one per text region |
| `bboxes` | `Sequence[Sequence[float64]]` | Axis-aligned bounding boxes `[x_min, y_min, x_max, y_max]` for each text region |
| `polygons` | `Sequence[Sequence[float64]]` | Polygon coordinates as flat arrays `[x1, y1, x2, y2, ...]` for each text region |
| `num_text_regions` | `int32` | Total number of text regions in the image |

### Schema

All splits share an identical Arrow/Parquet schema with HuggingFace metadata, so `datasets` will automatically decode the `image` column into PIL Image objects.

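Because `polygons` stores each region as a flat `[x1, y1, x2, y2, ...]` array, it can help to unpack it into point pairs and to sanity-check a record before training on it. The following is a minimal sketch, not part of the dataset's tooling: the helper names `polygon_points`, `enclosing_bbox`, and `check_example` are hypothetical, and the check assumes that each bounding box is the axis-aligned envelope of its polygon and that `num_text_regions` equals the number of entries in `texts`, as in the build script of this card. The sample record is hand-written for illustration and is not taken from the dataset.

```python
from typing import Dict, List, Sequence, Tuple


def polygon_points(flat: Sequence[float]) -> List[Tuple[float, float]]:
    """Turn a flat [x1, y1, x2, y2, ...] polygon into [(x1, y1), (x2, y2), ...]."""
    if len(flat) % 2 != 0:
        raise ValueError("polygon must hold an even number of coordinates")
    return list(zip(flat[0::2], flat[1::2]))


def enclosing_bbox(flat: Sequence[float]) -> List[float]:
    """Axis-aligned [x_min, y_min, x_max, y_max] around a flat polygon."""
    xs, ys = flat[0::2], flat[1::2]
    return [min(xs), min(ys), max(xs), max(ys)]


def check_example(example: Dict, tol: float = 1e-6) -> bool:
    """Return True if the per-region lists of one record agree with each other."""
    texts = example["texts"]
    bboxes = example["bboxes"]
    polygons = example["polygons"]
    # Every region needs exactly one text, one bbox, and one polygon.
    if not (len(texts) == len(bboxes) == len(polygons) == example["num_text_regions"]):
        return False
    for bbox, poly in zip(bboxes, polygons):
        # Reject empty or odd-length polygons before computing the envelope.
        if len(poly) < 2 or len(poly) % 2 != 0:
            return False
        expected = enclosing_bbox(poly)
        if any(abs(a - b) > tol for a, b in zip(bbox, expected)):
            return False
    return True


# Hand-written illustrative record (not real dataset content).
sample = {
    "texts": ["EXIT"],
    "bboxes": [[1.0, 1.0, 12.0, 9.0]],
    "polygons": [[2.0, 3.0, 12.0, 1.0, 11.0, 8.0, 1.0, 9.0]],
    "num_text_regions": 1,
}

print(polygon_points(sample["polygons"][0]))
print(check_example(sample))
```

The same `check_example` function can be applied to a record obtained from `load_dataset`, since only the four annotation fields are read and the image is never touched.
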
## Usage

### Quick Start

```python
from datasets import load_dataset

# Load the full dataset (all splits)
ds = load_dataset("Yesianrohn/OCR-Data")

# Access a specific split
art = ds["ART"]

# View the first example
example = art[0]
print(f"Number of text regions: {example['num_text_regions']}")
print(f"Texts: {example['texts']}")
print(f"Bounding boxes: {example['bboxes']}")
```

### Load a Single Split

```python
from datasets import load_dataset

# Load only the LSVT split
lsvt = load_dataset("Yesianrohn/OCR-Data", split="LSVT")
print(f"LSVT contains {len(lsvt)} examples")
```

### Display an Image with Annotations

```python
from datasets import load_dataset
from PIL import ImageDraw

ds = load_dataset("Yesianrohn/OCR-Data", split="ReCTS")
example = ds[0]

# Work on an RGB copy so the red outlines show up even for grayscale images
image = example["image"].convert("RGB")
draw = ImageDraw.Draw(image)

for text, bbox in zip(example["texts"], example["bboxes"]):
    x_min, y_min, x_max, y_max = bbox
    draw.rectangle([x_min, y_min, x_max, y_max], outline="red", width=2)
    # Pillow's default font does not cover CJK characters, so Chinese labels
    # may not render; pass font=ImageFont.truetype(...) with a CJK-capable
    # font file if you need readable labels.
    draw.text((x_min, y_min - 12), text, fill="red")

image.show()
```

### Streaming Mode (No Download Required)

```python
from datasets import load_dataset

ds = load_dataset("Yesianrohn/OCR-Data", split="hiertext", streaming=True)

for example in ds:
    print(example["texts"])
    break  # just peek at the first example
```

### Combine Multiple Splits

```python
from datasets import load_dataset, concatenate_datasets

ds = load_dataset("Yesianrohn/OCR-Data")
combined = concatenate_datasets([ds["ART"], ds["LSVT"], ds["MTWI"]])
print(f"Combined dataset size: {len(combined)}")
```

### Convert to Pandas DataFrame (without images)

```python
from datasets import load_dataset

ds = load_dataset("Yesianrohn/OCR-Data", split="CTW")

# Drop the image column first so image data is not copied into the DataFrame
df = ds.remove_columns("image").to_pandas()
print(df[["texts", "num_text_regions"]].head())
```

## How to Build This Parquet Dataset

Below is a minimal example showing how to programmatically construct a Parquet file that matches this dataset's schema.
You can adapt it to any data source.

### Parquet Schema

Each Parquet file follows this Arrow schema with HuggingFace metadata:

```
image: struct { bytes: binary, path: string }
texts: list<string>
bboxes: list<list<float64>>    // each inner list is [x_min, y_min, x_max, y_max]
polygons: list<list<float64>>  // each inner list is [x1, y1, x2, y2, ...]
num_text_regions: int32
```

The `image` column uses the HuggingFace `Image` feature convention (a struct with raw `bytes` and a `path` filename), so the `datasets` library will automatically decode it into a PIL Image.

### Build a Parquet File from Scratch

```python
import json
import os

import pyarrow as pa
import pyarrow.parquet as pq

# ---- 1. Define Arrow schema with HuggingFace metadata ----
image_type = pa.struct([
    pa.field("bytes", pa.binary()),
    pa.field("path", pa.string()),
])

hf_features = {
    "image": {"_type": "Image"},
    "texts": {"feature": {"dtype": "string", "_type": "Value"}, "_type": "Sequence"},
    "bboxes": {
        "feature": {"feature": {"dtype": "float64", "_type": "Value"}, "_type": "Sequence"},
        "_type": "Sequence",
    },
    "polygons": {
        "feature": {"feature": {"dtype": "float64", "_type": "Value"}, "_type": "Sequence"},
        "_type": "Sequence",
    },
    "num_text_regions": {"dtype": "int32", "_type": "Value"},
}

schema = pa.schema([
    pa.field("image", image_type),
    pa.field("texts", pa.list_(pa.string())),
    pa.field("bboxes", pa.list_(pa.list_(pa.float64()))),
    pa.field("polygons", pa.list_(pa.list_(pa.float64()))),
    pa.field("num_text_regions", pa.int32()),
], metadata={"huggingface": json.dumps({"info": {"features": hf_features}})})

# ---- 2. Prepare your data (one record per image) ----
# your_data_iterator() is a placeholder: supply your own generator yielding
# (image_path, annotations) pairs, where each annotation is a dict with
# "text" and "polygon" keys.
records = []
for img_path, annotations in your_data_iterator():
    with open(img_path, "rb") as f:
        img_bytes = f.read()

    texts, bboxes, polygons = [], [], []
    for ann in annotations:
        texts.append(ann["text"])
        pts = ann["polygon"]  # [x1,y1,x2,y2,...,xN,yN]
        polygons.append(pts)
        # The bounding box is the axis-aligned envelope of the polygon
        xs, ys = pts[0::2], pts[1::2]
        bboxes.append([min(xs), min(ys), max(xs), max(ys)])

    records.append({
        "image": {"bytes": img_bytes, "path": os.path.basename(img_path)},
        "texts": texts,
        "bboxes": bboxes,
        "polygons": polygons,
        "num_text_regions": len(texts),
    })

# ---- 3. Write to Parquet (chunked for memory efficiency) ----
CHUNK = 200
with pq.ParquetWriter("my_split.parquet", schema, compression="snappy") as writer:
    for i in range(0, len(records), CHUNK):
        chunk = records[i : i + CHUNK]
        batch = pa.record_batch({
            "image": pa.array([r["image"] for r in chunk], type=image_type),
            "texts": pa.array([r["texts"] for r in chunk], type=pa.list_(pa.string())),
            "bboxes": pa.array([r["bboxes"] for r in chunk], type=pa.list_(pa.list_(pa.float64()))),
            "polygons": pa.array([r["polygons"] for r in chunk], type=pa.list_(pa.list_(pa.float64()))),
            "num_text_regions": pa.array([r["num_text_regions"] for r in chunk], type=pa.int32()),
        }, schema=schema)
        writer.write_batch(batch)
```

### Key Points

- **Image Encoding:** Store raw JPEG/PNG bytes directly; do not re-encode. The HuggingFace `datasets` library handles decoding at load time.
- **Bounding Boxes:** Computed as axis-aligned rectangles from polygon vertices: `[min(xs), min(ys), max(xs), max(ys)]`.
- **Memory Efficiency:** Write in chunks (e.g. 200 records) via `ParquetWriter` so that only one chunk is converted to Arrow at a time. Note that the example above still collects every record in `records` before writing; for sources too large to fit in memory, build and write each chunk inside the data loop instead.
- **HuggingFace Metadata:** The `{"huggingface": ...}` key in schema metadata tells the Dataset Viewer how to render each column (especially the `Image` type).
- **Split Naming:** Each `.parquet` file becomes a split. The filename (without extension) is the split name. HuggingFace requires split names to match `\w+(\.\w+)*`, so replace hyphens with underscores.

### Upload to HuggingFace Hub

```bash
pip install huggingface_hub datasets
huggingface-cli login

# Edit upload_to_hf.py with your REPO_ID and DATASET_DIR, then:
python upload_to_hf.py
```

## Citation

If you use this dataset, please cite:

```bibtex
TBD
```

## License

This dataset is released under the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0).
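
As a supplement to the split-naming rule under Key Points, the sketch below derives a split name from a Parquet filename and checks it against the pattern `\w+(\.\w+)*` quoted there. It is only an illustration under stated assumptions: the helper name `split_name_for` is hypothetical, the replacement of every disallowed character with an underscore is one possible convention, and the check mirrors the pattern as written in this card rather than the Hub's own validation code.

```python
import re
from pathlib import Path

# Pattern for valid split names, as quoted in the Key Points above.
SPLIT_NAME_RE = re.compile(r"\w+(\.\w+)*")


def split_name_for(parquet_path: str) -> str:
    """Derive a split name from a Parquet filename (hypothetical helper)."""
    stem = Path(parquet_path).stem  # filename without the .parquet extension
    # Replace anything that is not a word character or a dot (hyphens, spaces, ...)
    name = re.sub(r"[^\w.]", "_", stem)
    if not SPLIT_NAME_RE.fullmatch(name):
        raise ValueError(f"cannot derive a valid split name from {parquet_path!r}")
    return name


for path in ["SCUT-HCCDoc.parquet", "data/RCTW-17.parquet", "LSVT.parquet"]:
    print(path, "->", split_name_for(path))
```

Names that remain invalid after the substitution, such as a stem containing two consecutive dots, raise a `ValueError` instead of being silently altered.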
Provider:
Yesianrohn