SidneyBissoli/sipni-agregados-doses
Source: Hugging Face. Updated 2026-03-01; indexed 2026-03-29.
Download: https://hf-mirror.com/datasets/SidneyBissoli/sipni-agregados-doses
---
language:
- pt
license: cc-by-4.0
tags:
- health
- brazil
- public-health
- parquet
- datasus
- sipni
- vaccination
- immunization
- historical
pretty_name: "SI-PNI — Aggregated Vaccine Doses (Brazil, 1994–2019)"
size_categories:
- 10M<n<100M
task_categories:
- tabular-classification
source_datasets:
- original
---
# SI-PNI — Aggregated Vaccine Doses (Brazil, 1994–2019)
Historical aggregated data on administered vaccine doses from Brazil's
National Immunization Program (SI-PNI), covering 26 years of
municipality-level records. Converted from legacy .dbf files to Apache
Parquet for modern analytical access.
**Part of the [healthbr-data](https://huggingface.co/SidneyBissoli) project** — open redistribution of Brazilian public health data.
## Summary
| Item | Detail |
|------|--------|
| **Official source** | DATASUS FTP / Ministry of Health |
| **Temporal coverage** | 1994–2019 |
| **Geographic coverage** | All Brazilian municipalities (by state) |
| **Granularity** | Aggregated: one row per municipality × vaccine × dose × age group |
| **Volume** | 84M+ records (674 .dbf files processed) |
| **Format** | Apache Parquet, partitioned by `ano/uf` |
| **Data types** | All fields stored as `string` (preserves original format) |
| **Update frequency** | Static (historical series, no longer updated at source) |
| **License** | CC-BY 4.0 |
## Resumo em português
**SI-PNI — Doses Aplicadas Agregadas (Brasil, 1994–2019)**
Dados históricos agregados de doses aplicadas do Programa Nacional de
Imunizações (PNI), cobrindo 26 anos de registros em nível municipal.
Convertidos de arquivos .dbf legados para Apache Parquet.
| Item | Detalhe |
|------|---------|
| **Fonte oficial** | FTP DATASUS / Ministério da Saúde |
| **Cobertura temporal** | 1994–2019 |
| **Cobertura geográfica** | Todos os municípios brasileiros (por UF) |
| **Granularidade** | Agregado: uma linha por município × vacina × dose × faixa etária |
| **Volume** | 84M+ registros (674 arquivos .dbf processados) |
| **Formato** | Apache Parquet, particionado por `ano/uf` |
| **Atualização** | Estática (série histórica, não atualizada na fonte) |
> Para documentação completa em português, consulte o
> [repositório do projeto](https://github.com/SidneyBissoli/healthbr-data).
## Data access
Data is hosted on Cloudflare R2 and accessed through an S3-compatible API.
The credentials below are **read-only** and intended for public use.
### R (Arrow)
```r
library(arrow)
library(dplyr)

# Point Arrow's S3 client at the Cloudflare R2 endpoint (public read-only keys)
Sys.setenv(
  AWS_ENDPOINT_URL = "https://5c499208eebced4e34bd98ffa204f2fb.r2.cloudflarestorage.com",
  AWS_ACCESS_KEY_ID = "28c72d4b3e1140fa468e367ae472b522",
  AWS_SECRET_ACCESS_KEY = "2937b2106736e2ba64e24e92f2be4e6c312bba3355586e41ce634b14c1482951",
  AWS_DEFAULT_REGION = "auto"
)

# Open the partitioned dataset lazily
ds <- open_dataset("s3://healthbr-data/sipni/agregados/doses/", format = "parquet")

# Example: vaccine doses in Acre, 2010
ds |>
  filter(ano == "2010", uf == "AC") |>
  count(IMUNO) |>
  collect()
```
### Python (PyArrow)
```python
import pyarrow.dataset as pds
import pyarrow.fs as fs

# S3-compatible client for Cloudflare R2 (public read-only keys)
s3 = fs.S3FileSystem(
    endpoint_override="https://5c499208eebced4e34bd98ffa204f2fb.r2.cloudflarestorage.com",
    access_key="28c72d4b3e1140fa468e367ae472b522",
    secret_key="2937b2106736e2ba64e24e92f2be4e6c312bba3355586e41ce634b14c1482951",
    region="auto",
)

# Open the dataset; the ano=/uf= directories are read as Hive-style partitions
dataset = pds.dataset(
    "healthbr-data/sipni/agregados/doses/",
    filesystem=s3,
    format="parquet",
    partitioning="hive",
)

# Example: rows for Acre, 2010
table = dataset.to_table(
    filter=(pds.field("ano") == "2010") & (pds.field("uf") == "AC")
)
print(table.to_pandas().head())
```
> **Note:** These credentials are **read-only** and safe to use in scripts.
> The bucket does not allow anonymous S3 access — credentials are required.
## File structure
```
s3://healthbr-data/sipni/agregados/doses/
  README.md
  ano=1994/
    uf=AC/
      part-0.parquet
    uf=AL/
      part-0.parquet
    ...
  ano=1995/
    ...
```
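The layout above follows the Hive `key=value` directory convention, so a partition prefix can be built or parsed with plain string handling. The following is a minimal sketch; the helper names `partition_prefix` and `parse_partition` are hypothetical and not part of the project.

```python
# Hypothetical helpers for the ano=/uf= layout shown above.
ROOT = "healthbr-data/sipni/agregados/doses"


def partition_prefix(ano: int, uf: str, root: str = ROOT) -> str:
    """Return the object-key prefix for one year/state partition."""
    return f"{root}/ano={ano}/uf={uf.upper()}/"


def parse_partition(path: str) -> dict:
    """Collect the key=value segments found in an object path."""
    parts = {}
    for segment in path.split("/"):
        if "=" in segment:
            key, value = segment.split("=", 1)
            parts[key] = value
    return parts


print(partition_prefix(2010, "ac"))
print(parse_partition("s3://healthbr-data/sipni/agregados/doses/ano=1994/uf=AL/part-0.parquet"))
```

Note that the parsed values come back as strings, which matches the string-typed filters used in the access examples.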
## Structural eras
The .dbf files underwent two structural transitions over 26 years:
| Era | Period | Columns | Key difference |
|:---:|--------|:-------:|----------------|
| 1 | 1994–2003 | 7 | Basic structure, 7-digit municipality code |
| 2 | 2004–2012 | 12 | Added dose, age group, and population fields; 7-digit municipality code |
| 3 | 2013–2019 | 12 | Same columns as era 2, but 6-digit municipality code |
All eras are preserved as-is in the Parquet files. The municipality code
format (7 vs 6 digits) is kept as originally recorded.
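Because the era determines both the column set and the municipality code width, analysis code may want to look up the era from the year. A minimal sketch based only on the table above; the function names are hypothetical.

```python
def structural_era(ano: int) -> int:
    """Map a year to its structural era (1, 2, or 3) per the table above."""
    if 1994 <= ano <= 2003:
        return 1
    if 2004 <= ano <= 2012:
        return 2
    if 2013 <= ano <= 2019:
        return 3
    raise ValueError(f"year {ano} is outside the 1994-2019 coverage")


def municipality_code_length(ano: int) -> int:
    """Eras 1 and 2 use 7-digit municipality codes; era 3 uses 6 digits."""
    return 6 if structural_era(ano) == 3 else 7


print(structural_era(2010), municipality_code_length(2010))
print(structural_era(2015), municipality_code_length(2015))
```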
## Schema
Key variables (availability varies by era):
| Variable | Description | Available |
|----------|-------------|:---------:|
| `MUNICIP` | Municipality code (7 digits until 2012, 6 digits from 2013) | All eras |
| `IMUNO` | Vaccine code (per IMUNO.CNV dictionary, 85 entries) | All eras |
| `QT_DOSE` | Number of administered doses | All eras |
| `DOSE` | Dose type (1st, 2nd, booster, etc.) | Eras 2–3 |
| `FX_ETARIA` | Age group | Eras 2–3 |
| `POP` | Target population | Eras 2–3 |
> For the complete vaccine code dictionary (65 unique codes across 26 years),
> see `IMUNO.CNV` from the DATASUS FTP `/PNI/AUXILIARES/` directory.
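Since every field is stored as a string, numeric columns such as `QT_DOSE` need an explicit conversion before any arithmetic. The sketch below assumes the values are integer text that may carry surrounding whitespace or be blank; that is an assumption about the data, not something the source confirms, so check a sample first. The names `parse_count` and `total_doses` are hypothetical.

```python
from typing import Iterable, Optional


def parse_count(value: Optional[str]) -> Optional[int]:
    """Convert a string count to int; blank or missing values become None."""
    if value is None:
        return None
    text = value.strip()
    if not text:
        return None
    return int(text)


def total_doses(rows: Iterable[dict]) -> int:
    """Sum QT_DOSE over rows, skipping blank or missing values."""
    total = 0
    for row in rows:
        count = parse_count(row.get("QT_DOSE"))
        if count is not None:
            total += count
    return total


# Made-up rows for illustration only; they are not taken from the dataset.
sample = [{"QT_DOSE": "12"}, {"QT_DOSE": " 30 "}, {"QT_DOSE": ""}]
print(total_doses(sample))
```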
## Source and processing
**Original source:** 702 .dbf files (dBase III) from the DATASUS FTP server
(`ftp://ftp.datasus.gov.br/dissemin/publicos/PNI/DADOS/`). Of these, 674
were successfully processed, 12 were unavailable on the server, and 16 were
empty.
**Processing:** .dbf → R (`foreign::read.dbf`) → Parquet (`arrow::write_dataset`)
→ upload to R2 (`rclone`). No transformations are applied. Consolidated
files (UF, BR, IG prefixes) were excluded — only state-level files with
municipal granularity are included.
**Validation:** The sum of all 27 state files matches the national
consolidated file (DPNIBR) with zero difference.
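The check described above amounts to comparing the sum of per-state totals with the national total. A minimal sketch of that comparison, using made-up numbers; the function name and the values are illustrative and do not reproduce the project's actual validation script.

```python
def total_difference(state_totals: dict, national_total: int) -> int:
    """Return national_total minus the sum of the per-state totals."""
    return national_total - sum(state_totals.values())


# Made-up numbers for illustration only; they are not taken from the dataset.
example_states = {"AC": 1200, "AL": 3400, "AM": 5600}
example_national = 10200

print(total_difference(example_states, example_national))
```

A result of zero corresponds to the "zero difference" outcome reported for the real files.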
## Known limitations
1. **Government data, not ours.** Values are preserved exactly as in the
original .dbf files.
2. **Three structural eras.** Column availability and municipality code
format change across time periods. Users must handle this in analysis.
3. **All fields are strings.** This preserves the original format, including
every digit of the municipality code.
4. **No microdata.** These are aggregated counts, not individual records.
For individual-level data from 2020 onward, see `sipni-microdados`.
5. **Static dataset.** The Ministry stopped publishing aggregated .dbf
files after 2019. The new SI-PNI system (2020+) produces individual
records instead.
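One way to handle the municipality code change when joining across eras is to reduce every code to 6 digits. The sketch below assumes that the 7-digit form is the 6-digit code followed by a single check digit, as in the IBGE coding scheme; verify this against a sample of the data before relying on it. The function name is hypothetical.

```python
def to_six_digit(code: str) -> str:
    """Normalize a municipality code to its 6-digit form.

    Assumption: a 7-digit code is the 6-digit code plus a trailing check digit.
    """
    text = code.strip()
    if len(text) == 7:
        return text[:6]
    if len(text) == 6:
        return text
    raise ValueError(f"unexpected municipality code: {code!r}")


print(to_six_digit("1200401"))
print(to_six_digit("120040"))
```

Keeping the result as a string is consistent with the all-string schema.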
## Citation
```bibtex
@misc{healthbrdata,
author = {Sidney da Silva Bissoli},
title = {healthbr-data: Redistribution of Brazilian Public Health Data},
year = {2026},
url = {https://huggingface.co/datasets/SidneyBissoli/sipni-agregados-doses},
note = {Original source: Ministry of Health / DATASUS}
}
```
## Contact
- **GitHub:** [https://github.com/SidneyBissoli](https://github.com/SidneyBissoli)
- **Hugging Face:** [https://huggingface.co/SidneyBissoli](https://huggingface.co/SidneyBissoli)
- **E-mail:** sbissoli76@gmail.com
---
*Last updated: 2026-02-28*