
pastvu

ModelScope community · Updated 2025-07-26 · Indexed 2025-07-26
Download link:
https://modelscope.cn/datasets/nyuuzyou/pastvu
Description:
# Dataset Card for PastVu Historical Photographs

### Dataset Summary

This dataset contains approximately 2,093,000 historical photographs from [PastVu.com](https://pastvu.com), spanning the years 1826-2000. PastVu is a collaborative project that allows users to view historical photographs on an interactive map, providing a unique temporal and geographical perspective on historical documentation. The dataset includes images downloaded at the best available resolution and is structured in webdataset format for efficient processing.

The collection covers a wide range of historical subjects including architecture, street scenes, cultural events, portraits, and everyday life. It is primarily focused on Russia and former Soviet territories, though it includes photographs from around the world.

### Languages

The dataset is multilingual:

- Russian (ru): Primary language for titles, descriptions, and metadata
- English (en): Secondary language with some translated content and region names

## Dataset Structure

### Data Fields

Each photograph in the dataset is accompanied by comprehensive JSON metadata containing the following fields:

- `type`: Integer representing the content type
- `adate`: ISO timestamp of when the photo was added to PastVu
- `author`: String containing the photographer's name
- `cid`: Unique content identifier
- `dir`: String indicating image orientation/direction
- `file`: Original URL of the image file
- `geo`: Array containing latitude and longitude coordinates [lat, lon]
- `h`: Original image height in pixels
- `hs`: Scaled image height in pixels
- `ldate`: Last modification timestamp
- `title`: Photograph title/description in original language
- `user`: Object containing uploader information:
  - `login`: Username of the uploader
  - `avatar`: URL to user's avatar image
  - `disp`: Display name
  - `ranks`: Array of user rank badges
  - `sex`: Gender identifier
- `w`: Original image width in pixels
- `ws`: Scaled image width in pixels
- `year`: Primary year the photograph was taken
- `year2`: End year (for photographs spanning multiple years)
- `s`: Status or quality indicator
- `y`: String representation of the year
- `waterh`: Watermark height
- `waterhs`: Scaled watermark height
- `watersignText`: Watermark text content
- `watersignTextApplied`: Timestamp when watermark was applied
- `r2d`: Array containing additional geographical reference data
- `frags`: Array of image fragments or regions of interest
- `regions`: Array of geographical region objects containing:
  - `cid`: Region identifier
  - `title_en`: English region name
  - `title_ru`: Russian region name
  - `phc`: Photo count in region
  - `pac`: Active photo count
  - `cc`: Total content count

### Data Splits

The dataset contains a single split:

| Split | Number of Examples |
| :---- | -----------------: |
| `train` | ~2,093,000 |

Total dataset size: Approximately 2,093,000 entries

### Data Format

- Images are stored in webdataset format across 2,094 tar files
- Tar files are named sequentially: `pastvu_000000.tar` to `pastvu_002093.tar`
- Approximately 1,000 images per tar file
- Images are downloaded at the best available resolution from the original source
- Each photograph has corresponding JSON metadata
- Complete metadata is also provided in `pastvu.jsonl.zst`
- Geographical coordinates are provided for most images, enabling spatial analysis

### Temporal Coverage

The dataset spans 174 years of photography:

- **Start year**: 1826
- **End year**: 2000
- **Peak periods**: Soviet era (1920s-1990s) with extensive documentation

## License Information

### Licensing Structure

The licensing for this dataset follows the original PastVu.com terms and conditions.
Users should be aware that:

- Images may have varying copyright statuses depending on their age and origin
- Many historical photographs may be in the public domain due to age
- Some images may still be under copyright protection
- Attribution to original photographers and PastVu.com is recommended
- Any use of materials published by the user for commercial purposes is only possible with the permission of the copyright holder of the specific image.
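
## Usage Example

The shard layout and JSON metadata described under Data Format and Data Fields can be read with the Python standard library alone. The following is a minimal sketch, not an official loader. It assumes the usual webdataset convention that each sample's files share a base name and sit next to each other in the tar, with the metadata stored in a `.json` member; the card does not state the exact member names or image extension. It also assumes `year` is stored as a number. The helper names `shard_names`, `iter_samples`, and `has_geo_in_range` are hypothetical.

```python
import io
import json
import tarfile
from typing import Iterator, List, Tuple

NUM_SHARDS = 2094  # pastvu_000000.tar .. pastvu_002093.tar, per the card


def shard_names(count: int = NUM_SHARDS) -> List[str]:
    """Return the sequential shard file names described in the card."""
    return [f"pastvu_{i:06d}.tar" for i in range(count)]


def iter_samples(fileobj) -> Iterator[Tuple[str, dict, bytes]]:
    """Yield (key, metadata, image_bytes) for each sample in one shard.

    Assumption: members of one sample share a base name and are adjacent,
    and the metadata member has the extension ".json". Samples missing
    either the metadata or the image are skipped.
    """
    current_key, meta, image = None, None, None
    with tarfile.open(fileobj=fileobj, mode="r") as tar:
        for member in tar:
            if not member.isfile():
                continue
            # Split "dir/123.jpg" into key "dir/123" and extension "jpg".
            dirname, _, base = member.name.rpartition("/")
            stem, _, ext = base.partition(".")
            key = f"{dirname}/{stem}" if dirname else stem
            if key != current_key:
                # A new base name starts a new sample; emit the previous one.
                if current_key is not None and meta is not None and image is not None:
                    yield current_key, meta, image
                current_key, meta, image = key, None, None
            payload = tar.extractfile(member).read()
            if ext.lower() == "json":
                meta = json.loads(payload)
            else:
                image = payload
    # Emit the last sample in the shard.
    if current_key is not None and meta is not None and image is not None:
        yield current_key, meta, image


def has_geo_in_range(meta: dict, start: int, end: int) -> bool:
    """True if the record has [lat, lon] and its `year` lies in [start, end].

    Coordinates are only provided for most images, so `geo` may be absent.
    """
    geo = meta.get("geo")
    year = meta.get("year")
    if not geo or len(geo) != 2 or year is None:
        return False
    return start <= year <= end
```

With a shard downloaded locally, one might open it with `open("pastvu_000000.tar", "rb")`, pass the file object to `iter_samples`, and keep only records where `has_geo_in_range(meta, 1920, 1940)` is true. Reading members sequentially avoids unpacking roughly 1,000 images per shard to disk.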

Provider: maas
Created: 2025-07-24