CDC Heart Disease 2020 Semantic Vector Database
Source: Figshare. Updated: 2026-01-24. Indexed: 2026-04-28.
Download link:
https://figshare.com/articles/dataset/CDC_Heart_Disease_2020_Semantic_Vector_Database/31144024
Description:
Overview
This item contains the Semantic Vector Database derived from the CDC Indicators of Heart Disease (2022 UPDATE) dataset. The database was built to support Retrieval-Augmented Generation (RAG) and Explainable AI (XAI) research in cardiovascular informatics. It transforms more than 400,000 tabular clinical records into a high-dimensional semantic space for exact similarity search and evidence-based clinical reasoning.

Technical Specifications
Source Dataset: CDC Behavioral Risk Factor Surveillance System (BRFSS) 2022.
Total Records (N): 445,132 individual respondent profiles.
Embedding Model: sentence-transformers/all-MiniLM-L6-v2 (Transformer-based).
Vector Dimensions: 384.
Vector Library: FAISS (Facebook AI Similarity Search).
Index Type: Flat L2 (Euclidean distance), an exhaustive index chosen for maximum precision.

Methodology
The vectorization process involved three primary stages:
1. Data Preprocessing: Cleaning the 2022 CDC BRFSS raw data, focusing on 40 key health variables (e.g., HadHeartAttack, SmokerStatus, BMI, CovidPos).
2. Semantic Profiling: Converting categorical tabular data into structured natural-language "Clinical Profiles" to preserve the context of comorbidities.
3. Indexing: Generating dense vector embeddings and persisting them in a FAISS index to enable semantic retrieval by an LLM (e.g., Llama-3.1).

Research Utility
This database is intended for researchers working on:
Clinical Decision Support: Identifying high-risk patient cohorts via semantic similarity.
Public Health Trends: Analyzing correlations between COVID-19 history and heart disease within the 2022 population.
Agentic AI: Providing a "Ground Truth" retrieval layer for AI agents to reduce hallucinations in medical queries.

Included Files
index.faiss: The serialized vector index containing the embeddings.
index.pkl: The metadata mapping for the index.
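
Illustrative sketch
The Semantic Profiling and Indexing stages described in the Methodology can be illustrated with a minimal, dependency-free Python sketch. It is not the authors' pipeline: the profile wording, the example field values, and the helper names build_profile, squared_l2, and flat_l2_search are assumptions made for illustration, and three-dimensional toy vectors stand in for the 384-dimensional all-MiniLM-L6-v2 embeddings. The search function is a pure-Python stand-in for what a flat L2 index computes (an exhaustive comparison of the query against every stored vector), not a call into FAISS.

```python
from typing import Dict, List, Sequence, Tuple


def build_profile(record: Dict[str, object]) -> str:
    """Render one tabular row as a single natural-language profile.

    Hypothetical wording: the template actually used for the dataset
    is not given in the description.
    """
    parts = [f"{column} is {value}" for column, value in record.items()]
    return "Respondent profile: " + "; ".join(parts) + "."


def squared_l2(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def flat_l2_search(
    vectors: Sequence[Sequence[float]], query: Sequence[float], k: int
) -> List[Tuple[int, float]]:
    """Exhaustive nearest-neighbour search.

    Scores every stored vector against the query and returns the k
    closest as (row_id, squared_distance) pairs, nearest first.
    """
    scored = [(row_id, squared_l2(vec, query)) for row_id, vec in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1])
    return scored[:k]


# Column names come from the description; the values are made up.
record = {
    "HadHeartAttack": "No",
    "SmokerStatus": "Former smoker",
    "BMI": 27.4,
    "CovidPos": "Yes",
}
profile = build_profile(record)

# Toy 3-dimensional vectors standing in for real 384-dimensional embeddings.
toy_vectors = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
hits = flat_l2_search(toy_vectors, [0.9, 0.1, 0.0], k=2)
```

Here hits holds the two nearest rows as (row id, squared distance) pairs, with row 1 ranked first. In the published database the row ids would presumably be resolved back to their profiles through the metadata mapping in index.pkl. Because every stored vector is scored, the result is exact, and query time grows linearly with the number of records.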
Created:
2026-01-24



