
fgaim/GLOCR-Tigrinya

Source: Hugging Face. Updated 2025-07-03. Indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/fgaim/GLOCR-Tigrinya
Description:
---
license: cc-by-4.0
task_categories:
- image-to-text
language:
- ti
tags:
- ocr
- tigrinya
- geez-script
- text-recognition
- geezlab
size_categories:
- 100K<n<1M
pretty_name: GLOCR - GeezLab OCR Dataset
configs:
- config_name: news
  data_files:
  - split: train
    path: data/news/train.parquet
  - split: validation
    path: data/news/validation.parquet
  - split: test
    path: data/news/test.parquet
- config_name: bible
  data_files:
  - split: train
    path: data/bible/train.parquet
  - split: validation
    path: data/bible/validation.parquet
  - split: test
    path: data/bible/test.parquet
- config_name: top150k
  data_files:
  - split: train
    path: data/top150k/train.parquet
  - split: validation
    path: data/top150k/validation.parquet
  - split: test
    path: data/top150k/test.parquet
- config_name: characters
  data_files:
  - split: train
    path: data/characters/train.parquet
  - split: validation
    path: data/characters/validation.parquet
  - split: test
    path: data/characters/test.parquet
- config_name: unsegmented
  data_files:
  - split: train
    path: data/unsegmented/train.parquet
- config_name: all
  data_files:
  - split: train
    path: data/*/train.parquet
  - split: validation
    path: data/*/validation.parquet
  - split: test
    path: data/*/test.parquet
  default: true
---

# GLOCR: GeezLab OCR Dataset

## Overview

GLOCR is a Text Recognition (TR) and Optical Character Recognition (OCR) dataset for the **Tigrinya language**. It contains about 661K image-label pairs drawn from multiple sources. Alongside a characters-only subset, most of the dataset consists of labeled multi-word text images in three categories: news (from the Haddas Ertra newspaper), the Bible, and random trigrams built from the 150K most common Tigrinya words.
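The front matter above maps each config and split to a Parquet file following the pattern `data/<config>/<split>.parquet`. The following is a minimal sketch of that layout in plain Python, which may help when fetching a single file directly rather than going through `load_dataset`. The names `CONFIG_SPLITS`, `parquet_path`, and `files_for_split` are hypothetical helpers introduced here for illustration; they are not part of the dataset or of any library.

```python
# Config -> available splits, transcribed from the YAML front matter.
# The "all" config is omitted because it is a glob over the other directories.
CONFIG_SPLITS = {
    "news": ("train", "validation", "test"),
    "bible": ("train", "validation", "test"),
    "top150k": ("train", "validation", "test"),
    "characters": ("train", "validation", "test"),
    "unsegmented": ("train",),
}


def parquet_path(config: str, split: str) -> str:
    """Return the repo-relative Parquet path for a config/split pair."""
    if config not in CONFIG_SPLITS:
        raise KeyError(f"unknown config: {config!r}")
    if split not in CONFIG_SPLITS[config]:
        raise ValueError(f"config {config!r} has no {split!r} split")
    return f"data/{config}/{split}.parquet"


def files_for_split(split: str) -> list[str]:
    """List the files a glob like data/*/<split>.parquet would match,
    assuming the repository holds only the five directories above."""
    return [
        parquet_path(config, split)
        for config, splits in CONFIG_SPLITS.items()
        if split in splits
    ]


if __name__ == "__main__":
    print(parquet_path("bible", "test"))
    print(files_for_split("validation"))
```

Under this reading, the `train` glob of the `all` config would also match `data/unsegmented/train.parquet`, since that directory has a `train` file but no `validation` or `test` file.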

### Dataset Summary

- **Total samples**: ~661K image-label pairs
- **Total size**: >1.3GB (tar.gz archives)
- **DOI**: [10.7910/DVN/RQTSD2](https://doi.org/10.7910/DVN/RQTSD2)

### Dataset Subsets

| Config | Description | Train | Validation | Test |
|--------|-------------|------:|-----------:|-----:|
| `news` | Newspaper text-lines | 200K | 15K | 15K |
| `bible` | Biblical text-lines | 80K | 10K | 10K |
| `top150k` | Word trigrams | 150K | 15K | 15K |
| `characters` | Single characters | 120K | 15K | 15K |
| `unsegmented` | Full-page scans | 506 | - | - |

## Usage

### Loading a specific subset

```python
from datasets import load_dataset

# Load a specific subset, one of: news, bible, top150k, characters, unsegmented
news = load_dataset("fgaim/GLOCR-Tigrinya", "news")

# Access samples
sample = news["train"][0]
print(sample["text"])
sample["image"].show()
```

### Loading a specific split

```python
from datasets import load_dataset

# Load a specific split of a subset
bible_test = load_dataset("fgaim/GLOCR-Tigrinya", "bible", split="test")

# Access samples; indexing the row first avoids decoding the whole image column
sample = bible_test[0]
print(sample["text"])
sample["image"].show()
```

### Loading all text-line data combined

```python
from datasets import load_dataset

# Load all text-line data combined
all_data = load_dataset("fgaim/GLOCR-Tigrinya", "all")

# Access samples
sample = all_data["train"][0]
print(sample["text"])
sample["image"].show()
```

## Links

- [Harvard Dataverse](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/RQTSD2)
- [GitHub Repository](https://github.com/fgaim/GLOCR)

## Citation

```bibtex
@data{gaim-2021-glocr,
  title = {{GLOCR: GeezLab OCR Dataset}},
  author = {Fitsum Gaim},
  year = {2021},
  month = {April},
  publisher = {Harvard Dataverse},
  version = {1.0},
  doi = {10.7910/DVN/RQTSD2},
  url = {https://doi.org/10.7910/DVN/RQTSD2},
  dataverse = {https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/RQTSD2}
}
```

## License

This work is licensed under a [Creative Commons Attribution-ShareAlike 4.0 International License](https://creativecommons.org/licenses/by-sa/4.0/).

<a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://licensebuttons.net/l/by-sa/4.0/88x31.png" /></a>
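
As a consistency check on the Dataset Subsets table, the per-split counts can be tallied against the stated total of about 661K pairs. This is a minimal sketch: the numbers are transcribed from the table with "K" read as thousands, so they are rounded figures rather than exact sample counts, and `SPLIT_SIZES` is a name introduced here for illustration.

```python
# (train, validation, test) sizes per config, as listed in the table.
# Values given as "200K" etc. are rounded; missing splits are counted as 0.
SPLIT_SIZES = {
    "news": (200_000, 15_000, 15_000),
    "bible": (80_000, 10_000, 10_000),
    "top150k": (150_000, 15_000, 15_000),
    "characters": (120_000, 15_000, 15_000),
    "unsegmented": (506, 0, 0),
}

# Sum the three splits of each subset, then sum across subsets.
per_subset = {name: sum(sizes) for name, sizes in SPLIT_SIZES.items()}
total = sum(per_subset.values())

for name, count in per_subset.items():
    print(f"{name:12s} {count:>8,}")
print(f"{'total':12s} {total:>8,}")  # 660,506, consistent with ~661K
```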
Provider:
fgaim