
Peter

ModelScope community. Updated 2025-11-12; indexed 2025-05-31.
Download link: https://modelscope.cn/datasets/ai-forever/Peter
Resource description:
# Digital Peter

The Peter dataset can be used for reading text from manuscripts written by Peter the Great. The dataset annotations contain end-to-end markup for training detection and OCR models, as well as an end-to-end model for reading text from pages.

The paper is available at http://arxiv.org/abs/2103.09354

## Description

Digital Peter is an educational task with a historical slant, built on several AI technologies (Computer Vision, NLP, and knowledge graphs). The task was prepared jointly with the Saint Petersburg Institute of History (N.P. Lihachov mansion) of the Russian Academy of Sciences, the Federal Archival Agency of Russia, and the Russian State Archive of Ancient Acts.

A detailed description of the problem, including background, can be found in [detailed_description_of_the_task_en.pdf](https://github.com/sberbank-ai/digital_peter_aij2020/blob/master/desc/detailed_description_of_the_task_en.pdf)

The dataset consists of 662 full-page images and 9696 annotated text files. There are 265788 symbols and approximately 50998 words.

## Annotation format

The annotation is in COCO format. The `annotation.json` file should contain the following entries:

- `annotation["categories"]` - a list of dicts with category information (category names and indexes).
- `annotation["images"]` - a list of dicts describing the images. Each dict must contain the fields:
  - `file_name` - the name of the image file.
  - `id` - the image id.
- `annotation["annotations"]` - a list of dicts with markup information. Each dict describes one polygon from the dataset and must contain the following fields:
  - `image_id` - the index of the image on which the polygon is located.
  - `category_id` - the polygon's category index.
  - `attributes` - a dict with additional annotation information. The `translation` subdict holds the text translation for the line.
  - `segmentation` - the coordinates of the polygon, a list of numbers that form x and y coordinate pairs.

## Competition

We held a competition based on the Digital Peter dataset. See the GitHub [repository](https://github.com/sberbank-ai/digital_peter_aij2020) and the competition [page](https://ods.ai/tracks/aij2020) (registration required).
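As a rough illustration of the annotation format described above, the following Python sketch groups polygons by image file and converts each `segmentation` list into (x, y) pairs. It is not taken from the dataset authors' code. The helper names (`load_annotation`, `to_xy_pairs`, `polygons_by_file`) and the sample record are hypothetical, and the contents of `attributes["translation"]` are passed through untouched because their exact structure is not specified here. The sketch also assumes that `segmentation` may be either a flat list or, as in standard COCO files, a list wrapped in an outer list.

```python
import json
from collections import defaultdict


def load_annotation(path):
    """Read an annotation file such as annotation.json into a dict."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def to_xy_pairs(segmentation):
    """Turn [x1, y1, x2, y2, ...] into [(x1, y1), (x2, y2), ...]."""
    # Assumption: standard COCO nests the polygon as [[x1, y1, ...]];
    # unwrap the first polygon if that is the case.
    if segmentation and isinstance(segmentation[0], (list, tuple)):
        segmentation = segmentation[0]
    return list(zip(segmentation[0::2], segmentation[1::2]))


def polygons_by_file(annotation):
    """Group polygon records by the file name of the image they belong to."""
    id_to_name = {img["id"]: img["file_name"] for img in annotation["images"]}
    grouped = defaultdict(list)
    for ann in annotation["annotations"]:
        grouped[id_to_name[ann["image_id"]]].append({
            "category_id": ann["category_id"],
            "points": to_xy_pairs(ann["segmentation"]),
            # Passed through as-is; the inner layout is not documented here.
            "translation": ann.get("attributes", {}).get("translation"),
        })
    return dict(grouped)


# Hypothetical sample with the fields listed above (not real dataset content).
sample = {
    "categories": [{"id": 0, "name": "text_line"}],
    "images": [{"id": 1, "file_name": "page_001.jpg"}],
    "annotations": [
        {
            "image_id": 1,
            "category_id": 0,
            "attributes": {"translation": "placeholder line text"},
            "segmentation": [10, 20, 110, 20, 110, 60, 10, 60],
        },
        {
            "image_id": 1,
            "category_id": 0,
            "attributes": {},
            "segmentation": [[0, 0, 5, 0, 5, 5]],
        },
    ],
}

result = polygons_by_file(sample)
```

With a real file, `polygons_by_file(load_annotation("annotation.json"))` would return the same structure, keyed by image file name.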

Provider: maas
Created: 2025-05-26