
SidneyBissoli/sipni-agregados-doses

Hugging Face · Updated 2026-03-01 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/SidneyBissoli/sipni-agregados-doses
Description:
---
language:
- pt
license: cc-by-4.0
tags:
- health
- brazil
- public-health
- parquet
- datasus
- sipni
- vaccination
- immunization
- historical
pretty_name: "SI-PNI — Aggregated Vaccine Doses (Brazil, 1994–2019)"
size_categories:
- 10M<n<100M
task_categories:
- tabular-classification
source_datasets:
- original
---

# SI-PNI — Aggregated Vaccine Doses (Brazil, 1994–2019)

Historical aggregated data on administered vaccine doses from Brazil's National Immunization Program (SI-PNI), covering 26 years of municipality-level records. Converted from legacy .dbf files to Apache Parquet for modern analytical access.

**Part of the [healthbr-data](https://huggingface.co/SidneyBissoli) project**, an open redistribution of Brazilian public health data.

## Summary

| Item | Detail |
|------|--------|
| **Official source** | DATASUS FTP / Ministry of Health |
| **Temporal coverage** | 1994–2019 |
| **Geographic coverage** | All Brazilian municipalities (by state) |
| **Granularity** | Aggregated: one row per municipality × vaccine × dose × age group |
| **Volume** | 84M+ records (674 .dbf files processed) |
| **Format** | Apache Parquet, partitioned by `ano/uf` |
| **Data types** | All fields stored as `string` (preserves original format) |
| **Update frequency** | Static (historical series, no longer updated at source) |
| **License** | CC-BY 4.0 |

## Resumo em português

**SI-PNI — Doses Aplicadas Agregadas (Brasil, 1994–2019)**

Dados históricos agregados de doses aplicadas do Programa Nacional de Imunizações (PNI), cobrindo 26 anos de registros em nível municipal. Convertidos de arquivos .dbf legados para Apache Parquet.

| Item | Detalhe |
|------|---------|
| **Fonte oficial** | FTP DATASUS / Ministério da Saúde |
| **Cobertura temporal** | 1994–2019 |
| **Cobertura geográfica** | Todos os municípios brasileiros (por UF) |
| **Granularidade** | Agregado: uma linha por município × vacina × dose × faixa etária |
| **Volume** | 84M+ registros (674 arquivos .dbf processados) |
| **Formato** | Apache Parquet, particionado por `ano/uf` |
| **Atualização** | Estática (série histórica, não atualizada na fonte) |

> Para documentação completa em português, consulte o
> [repositório do projeto](https://github.com/SidneyBissoli/healthbr-data).

## Data access

Data is hosted on Cloudflare R2 and accessed via an S3-compatible API. The credentials below are **read-only** and intended for public use.

### R (Arrow)

```r
library(arrow)
library(dplyr)

# Point the AWS client at the Cloudflare R2 endpoint (read-only credentials)
Sys.setenv(
  AWS_ENDPOINT_URL = "https://5c499208eebced4e34bd98ffa204f2fb.r2.cloudflarestorage.com",
  AWS_ACCESS_KEY_ID = "28c72d4b3e1140fa468e367ae472b522",
  AWS_SECRET_ACCESS_KEY = "2937b2106736e2ba64e24e92f2be4e6c312bba3355586e41ce634b14c1482951",
  AWS_DEFAULT_REGION = "auto"
)

ds <- open_dataset("s3://healthbr-data/sipni/agregados/doses/", format = "parquet")

# Example: vaccine doses in Acre, 2010
ds |>
  filter(ano == "2010", uf == "AC") |>
  count(IMUNO) |>
  collect()
```

### Python (PyArrow)

```python
import pyarrow.dataset as pds
import pyarrow.fs as fs

# S3-compatible filesystem pointing at the Cloudflare R2 endpoint
s3 = fs.S3FileSystem(
    endpoint_override="https://5c499208eebced4e34bd98ffa204f2fb.r2.cloudflarestorage.com",
    access_key="28c72d4b3e1140fa468e367ae472b522",
    secret_key="2937b2106736e2ba64e24e92f2be4e6c312bba3355586e41ce634b14c1482951",
    region="auto",
)

# Hive-style partitioning exposes `ano` and `uf` as columns
dataset = pds.dataset(
    "healthbr-data/sipni/agregados/doses/",
    filesystem=s3,
    format="parquet",
    partitioning="hive",
)

# Example: rows for Acre, 2010
table = dataset.to_table(
    filter=(pds.field("ano") == "2010") & (pds.field("uf") == "AC")
)
print(table.to_pandas().head())
```

> **Note:** These credentials are **read-only** and safe to use in scripts.
> The bucket does not allow anonymous S3 access; credentials are required.

## File structure

```
s3://healthbr-data/sipni/agregados/doses/
  README.md
  ano=1994/
    uf=AC/
      part-0.parquet
    uf=AL/
      part-0.parquet
    ...
  ano=1995/
    ...
```

## Structural eras

The .dbf files underwent two structural transitions over 26 years:

| Era | Period | Columns | Key difference |
|:---:|--------|:-------:|----------------|
| 1 | 1994–2003 | 7 | Basic structure, 7-digit municipality code |
| 2 | 2004–2012 | 12 | Added dose, age group, and population fields; 7-digit municipality code |
| 3 | 2013–2019 | 12 | Same columns as era 2, but 6-digit municipality code |

All eras are preserved as-is in the Parquet files. The municipality code format (7 vs 6 digits) is kept as originally recorded.

## Schema

Key variables (availability varies by era):

| Variable | Description | Available |
|----------|-------------|:---------:|
| `MUNICIP` | Municipality code (7 digits until 2012, 6 digits from 2013) | All eras |
| `IMUNO` | Vaccine code (per IMUNO.CNV dictionary, 85 entries) | All eras |
| `QT_DOSE` | Number of administered doses | All eras |
| `DOSE` | Dose type (1st, 2nd, booster, etc.) | Eras 2–3 |
| `FX_ETARIA` | Age group | Eras 2–3 |
| `POP` | Target population | Eras 2–3 |

> For the complete vaccine code dictionary (65 unique codes across 26 years),
> see `IMUNO.CNV` from the DATASUS FTP `/PNI/AUXILIARES/` directory.

## Source and processing

**Original source:** 702 .dbf files (dBase III) from the DATASUS FTP server (`ftp://ftp.datasus.gov.br/dissemin/publicos/PNI/DADOS/`). Of these, 674 were successfully processed, 12 were unavailable on the server, and 16 were empty.

**Processing:** .dbf → R (`foreign::read.dbf`) → Parquet (`arrow::write_dataset`) → upload to R2 (`rclone`). No transformations are applied. Consolidated files (UF, BR, IG prefixes) were excluded; only state-level files with municipal granularity are included.
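
Because no transformations are applied, the 7-digit and 6-digit municipality codes coexist in the data, and every field arrives as a string. The following is a minimal sketch of one way to harmonize codes before aggregating across eras. It assumes that the seventh digit is a trailing check digit that can be dropped to obtain the 6-digit form, which this dataset card does not confirm, so verify it against your own data. The function name `normalize_municip` and the sample rows are hypothetical.

```python
from collections import defaultdict


def normalize_municip(code: str) -> str:
    """Return a 6-digit municipality code from a 6- or 7-digit string.

    Assumption: in 7-digit codes the last digit is a check digit,
    so the first six digits identify the municipality.
    """
    code = code.strip()
    if not code.isdigit() or len(code) not in (6, 7):
        raise ValueError(f"unexpected municipality code: {code!r}")
    return code[:6]


# Hypothetical rows mixing the 7-digit (eras 1-2) and 6-digit (era 3) formats.
rows = [
    {"MUNICIP": "1200401", "QT_DOSE": "10"},
    {"MUNICIP": "120040", "QT_DOSE": "5"},
]

# All fields are strings, so QT_DOSE is cast explicitly before summing.
totals = defaultdict(int)
for row in rows:
    totals[normalize_municip(row["MUNICIP"])] += int(row["QT_DOSE"])

print(dict(totals))
```

Raising on unexpected lengths, rather than silently truncating, makes malformed codes visible early instead of merging them into the wrong municipality.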

**Validation:** The sum of all 27 state files matches the national consolidated file (DPNIBR) with zero difference.

## Known limitations

1. **Government data, not ours.** Values are preserved exactly as in the original .dbf files.
2. **Three structural eras.** Column availability and municipality code format change across time periods. Users must handle this in analysis.
3. **All fields are strings.** Preserves original format including municipality code leading digits.
4. **No microdata.** These are aggregated counts, not individual records. For individual-level data from 2020 onward, see `sipni-microdados`.
5. **Static dataset.** The Ministry stopped publishing aggregated .dbf files after 2019. The new SI-PNI system (2020+) produces individual records instead.

## Citation

```bibtex
@misc{healthbrdata,
  author = {Sidney da Silva Bissoli},
  title  = {healthbr-data: Redistribution of Brazilian Public Health Data},
  year   = {2026},
  url    = {https://huggingface.co/datasets/SidneyBissoli/sipni-agregados-doses},
  note   = {Original source: Ministry of Health / DATASUS}
}
```

## Contact

- **GitHub:** [https://github.com/SidneyBissoli](https://github.com/SidneyBissoli)
- **Hugging Face:** [https://huggingface.co/SidneyBissoli](https://huggingface.co/SidneyBissoli)
- **E-mail:** sbissoli76@gmail.com

---

*Last updated: 2026-02-28*

Provided by:
SidneyBissoli