
revanth7667/usa_opioid_overdose

Source: Hugging Face (updated 2024-03-19, indexed 2024-03-04)
Download link:
https://hf-mirror.com/datasets/revanth7667/usa_opioid_overdose
---
license: mit
language:
- en
pretty_name: USA Opioid Overdose
size_categories:
- 10K<n<100K
dataset_info:
  features:
  - name: State
    dtype: string
  - name: State_Code
    dtype: string
  - name: County
    dtype: string
  - name: County_Code
    dtype: string
  - name: Year
    dtype: int64
  - name: Population
    dtype: int64
  - name: Deaths
    dtype: int64
  - name: Original
    dtype: bool
  - name: State_Mortality_Rate
    dtype: float
  - name: County_Mortality_Rate
    dtype: float
---

## Overview

This dataset contains the number of yearly deaths due to **Unintentional** Drug Overdoses in the United States at the county level from 2003 to 2015. To overcome the limitation of the original dataset, it is merged with a population dataset to identify missing combinations, and the missing values are imputed following the logical rules of the source dataset. Users can control the proportion of imputed values in the dataset by using the provided population and flag columns. Additional fields such as state codes and FIPS codes are provided so that the dataset can be merged easily with other datasets.

## Data Structure

The dataset contains the following fields:

1. State (string): Name of the state
2. State_Code (string): 2-character abbreviation of the state
3. County (string): Name of the county
4. County_Code (string): 5-character representation of the county's FIPS code
5. Year (integer): Year
6. Population (integer): Population of the county for the given year
7. Deaths (integer): Number of drug overdose deaths in the county for the given year
8. Original (Boolean): Indicates whether the Deaths value is from the original dataset or imputed
9. State_Mortality_Rate (float): Mortality rate of the state for the given year
10. County_Mortality_Rate (float): Mortality rate of the county for the given year

Notes:

1. The county FIPS code is formatted as a string so that leading zeros are not lost and the code is easier to read.
2. The County_Mortality_Rate, which is provided for convenience, is calculated after the missing values are imputed, so it may not be accurate for all combinations. Refer to the "Original" column to identify the imputed values.

## Data Source

1. Deaths Data: The original source of the data is the US Vital Statistics Agency ([Link](https://www.cdc.gov/nchs/products/vsus.htm)). For this project, however, it was downloaded from a different [source](https://www.dropbox.com/s/kad4dwebr88l3ud/US_VitalStatistics.zip?dl=0) for convenience.
2. Population Data: For consistency with the mortality data, the population data was downloaded from the [CDC Wonder](https://wonder.cdc.gov/bridged-race-v2020.html) portal. Population data is used for two purposes: to calculate the mortality rate and as a master list of counties for the imputation.
3. Other Data: For the convenience of users, additional fields such as county FIPS codes and state codes have been added so that the dataset can be combined easily with other datasets. This is a standard mapping that can be found on the internet.

The raw data files are present in the ``.01_Data/01_Raw`` folder for reference.

## Methodology

One of the primary sources for studying the impact of drug-related deaths is the US Vital Statistics Agency. The report has a limitation: for privacy reasons, US Vital Statistics does not report the deaths in a county if the number of deaths in that county is less than 10. The deaths available in the report are therefore not fully representative, and any analysis performed on them may not be fully accurate.

To overcome this, values in this dataset are imputed for the missing counties based on state-level mortality rates and population limiting factors. While the result may not be 100% representative, it offers an alternative approach to analyzing drug-related deaths that keeps the suppressed counties in the analysis.

After basic data cleaning and merging, the imputation is performed in the following steps:

1. The mortality rate is calculated at the State-Year level using the available data.
2. The master list of State-County combinations is obtained from the population file.
3. For the missing counties, a reverse calculation is performed using the state-level mortality rate and the population of the county. A maximum limit of 9 is imposed on the calculated value to preserve the conditions of the original dataset.
4. A flag column is added to indicate whether the values are original or imputed.

Since the imputations may distort the original trend of the dataset, the population data is left in the dataset, and an additional column indicates whether each value is from the original dataset or imputed. Using the population and flag columns, users of the dataset can decide the proportion of imputed data in the analysis (this is the population limit factor).

The graph below shows the relation between the population limit factor and the % of imputed values in the dataset:

![Plot](.01_Data/Missing_vs_Population.png)

## Files and Folder Structure

1. Data Files: The raw data files are present in the [.01_Data/01_Raw](./.01_Data/01_Raw) folder for reference. The intermediate population and mortality files are present in the [.01_Data/02_Processed](./.01_Data/02_Processed) folder. The final dataset is present in the root folder. The data folder is hidden so that the raw and intermediate files are not loaded by the library.
2. Code Files: The code files are present in the [02_Code](./02_Code) folder.
   - The "*_eda.ipynb" files are the exploratory files, which walk through the processing of the data step by step.
   - The "*_script.py" files are the optimized scripts, which contain only the steps from the EDA files that are required to process the data.

Provided the raw data files are present in the ``.01_Data/01_Raw`` folder, all the other intermediate and final data files can be generated using the script files in the ``02_Code`` folder.

## Disclaimers

1. This dataset has been created purely for educational purposes. The imputation performed is one of many ways to handle the missing data. Please consider the % of imputed data in the dataset before performing any analysis.
2. The dataset does NOT contain data for Alaska, since the original data for it is messy. Users can, however, use the raw files and modify the scripts to include Alaska if required.
3. Only one type of drug-related death is present in the dataset. Refer to the master_eda file for details.
4. Please refer to the original source of the data (links provided in the Data Source section) for any legal or privacy concerns.
Provider:
revanth7667
Summary of the original information

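Dataset Overview

This dataset contains the yearly number of deaths due to unintentional drug overdoses in U.S. counties from 2003 to 2015. To overcome the limitation of the original dataset, it is merged with a population dataset to identify missing combinations, which are then imputed. Users can control the proportion of imputed values in the dataset by using the provided "Population" and "Original" columns. Fields such as state codes and FIPS codes are also provided so that the dataset can be merged easily with other datasets.

Data Structure

The dataset contains the following fields:

  1. State (string): Name of the state
  2. State_Code (string): 2-character abbreviation of the state
  3. County (string): Name of the county
  4. County_Code (string): 5-character representation of the county's FIPS code
  5. Year (integer): Year
  6. Population (integer): Population of the county for the given year
  7. Deaths (integer): Number of drug overdose deaths in the county for the given year
  8. Original (Boolean): Indicates whether the Deaths value is from the original dataset or imputed
  9. State_Mortality_Rate (float): Mortality rate of the state for the given year
  10. County_Mortality_Rate (float): Mortality rate of the county for the given year

Notes:

  1. The county FIPS code is formatted as a string so that leading zeros are not lost and the code is easier to read.
  2. The provided County_Mortality_Rate is calculated after the missing values are imputed, so it may not be accurate for all combinations. Refer to the "Original" column to identify the imputed values.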
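The first note matters in practice because CSV readers tend to infer a column such as County_Code as an integer, which silently drops the leading zero. The following is a minimal sketch of two ways to guard against that with pandas. The sample rows are made up for illustration, and the helper names `read_with_string_fips` and `repair_fips` are hypothetical, not part of the dataset's own scripts.

```python
import io

import pandas as pd

# Tiny in-memory stand-in for a data file; the rows are made up for illustration.
SAMPLE_CSV = """State,State_Code,County,County_Code,Year,Population,Deaths
Alabama,AL,Autauga County,01001,2003,46800,2
Texas,TX,Harris County,48201,2003,3600000,350
"""


def read_with_string_fips(source) -> pd.DataFrame:
    """Read the CSV while forcing County_Code to stay a string."""
    return pd.read_csv(source, dtype={"County_Code": str})


def repair_fips(codes: pd.Series) -> pd.Series:
    """Restore leading zeros on codes that were already parsed as integers."""
    return codes.astype(str).str.zfill(5)


# Reading with an explicit dtype keeps "01001" intact.
df = read_with_string_fips(io.StringIO(SAMPLE_CSV))

# Without the dtype hint, pandas infers an integer column and "01001" becomes 1001.
naive = pd.read_csv(io.StringIO(SAMPLE_CSV))
repaired = repair_fips(naive["County_Code"])
```

Passing the dtype at read time is usually the safer of the two options, since zero-padding after the fact assumes every code really is five characters long.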

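Data Source

  1. Deaths data: The original source of the data is the US Vital Statistics Agency; for convenience, this project downloaded it from a different source.
  2. Population data: For consistency with the deaths data, the population data was downloaded from the CDC Wonder portal. Population data is used to calculate the mortality rate and as the master list of counties for the imputation.
  3. Other data: For the convenience of users, fields such as county FIPS codes and state codes have been added so that the dataset can be merged easily with other datasets.

Methodology

One of the primary sources for studying the impact of drug-related deaths is the US Vital Statistics Agency. Because US Vital Statistics does not report the deaths in a county if the number of deaths is less than 10, the deaths in the report are not fully representative of the actual deaths. To overcome this limitation, values in this dataset are imputed for the missing counties based on state-level mortality rates and population limiting factors. While the result may not be fully representative, it offers an alternative approach to analyzing drug-related deaths.

After data cleaning and merging, the imputation is performed in the following steps:

  1. Calculate the mortality rate at the State-Year level using the available data
  2. Obtain the master list of State-County combinations from the population file
  3. For the missing counties, perform a reverse calculation using the state-level mortality rate and the population of the county, with a maximum limit of 9 on the calculated value to preserve the conditions of the original dataset
  4. Add a flag column to indicate whether the values are original or imputed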
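The four steps above can be sketched with pandas as follows. This is an illustration under stated assumptions, not the code in the ``02_Code`` scripts: the function name `impute_deaths` is hypothetical, the state rate is assumed to be computed from reported counties only, estimates are assumed to be rounded to whole deaths, and state-years with no reported counties are assumed to impute to 0. The sample rows are made up.

```python
import pandas as pd

MAX_IMPUTED_DEATHS = 9  # counts below 10 are not reported in the source data


def impute_deaths(population: pd.DataFrame, deaths: pd.DataFrame) -> pd.DataFrame:
    """Fill unreported county death counts from the state-year mortality rate."""
    keys = ["State", "County", "Year"]

    # Step 2: the population table acts as the master list of county-year rows.
    merged = population.merge(deaths, on=keys, how="left")

    # Step 4: flag rows whose death count came from the original data.
    merged["Original"] = merged["Deaths"].notna()

    # Step 1: state-year mortality rate from the reported counties (assumption).
    reported = merged[merged["Original"]]
    totals = reported.groupby(["State", "Year"])[["Deaths", "Population"]].sum()
    rate = (totals["Deaths"] / totals["Population"]).rename("State_Mortality_Rate")
    merged = merged.merge(rate.reset_index(), on=["State", "Year"], how="left")

    # Step 3: back-calculate deaths for the missing counties and cap them at 9.
    estimate = (merged["State_Mortality_Rate"] * merged["Population"]).round()
    estimate = estimate.clip(upper=MAX_IMPUTED_DEATHS)
    estimate = estimate.fillna(0)  # assumption: no reported counties means 0
    merged["Deaths"] = merged["Deaths"].fillna(estimate).astype("int64")
    return merged


# Made-up example: only county X is reported for state A in 2010.
population = pd.DataFrame({
    "State": ["A", "A", "A"],
    "County": ["X", "Y", "W"],
    "Year": [2010, 2010, 2010],
    "Population": [100_000, 20_000, 4_000],
})
deaths = pd.DataFrame({"State": ["A"], "County": ["X"], "Year": [2010], "Deaths": [50]})
result = impute_deaths(population, deaths)
```

In the example, county Y would receive an estimate of 10 from the state rate, so the cap reduces it to 9, consistent with the rule that unreported counties had fewer than 10 deaths.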

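Since the imputations may distort the original trend of the dataset, the population data is kept in the dataset, and an additional column indicates whether each value is from the original dataset or imputed. Users can use the population and flag columns to decide the proportion of imputed data in the analysis.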
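One way to act on this is to drop counties below a chosen population and check how much of the remaining data is imputed. The sketch below assumes that reading of the population limit, which the text does not spell out; the function name `imputed_share` is hypothetical and the sample rows are made up.

```python
import pandas as pd


def imputed_share(df: pd.DataFrame, min_population: int) -> float:
    """Fraction of imputed rows among counties at or above the population threshold."""
    kept = df[df["Population"] >= min_population]
    if kept.empty:
        return float("nan")
    # "Original" is True for reported values, so its negation marks imputed rows.
    return float((~kept["Original"]).mean())


# Made-up rows: the two smallest counties carry imputed values.
sample = pd.DataFrame({
    "Population": [500, 5_000, 50_000, 500_000],
    "Original": [False, False, True, True],
})

share_all = imputed_share(sample, 0)
share_mid = imputed_share(sample, 5_000)
share_large = imputed_share(sample, 50_000)
```

Raising the threshold lowers the imputed share in this example because small counties are the ones most likely to fall under the reporting limit of 10 deaths.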

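Files and Folder Structure

  1. Data files: The raw data files are in the .01_Data/01_Raw folder. The intermediate population and mortality files are in the .01_Data/02_Processed folder. The final dataset is in the root folder.
  2. Code files: The code files are in the 02_Code folder. The *_eda.ipynb files are exploratory files that walk through the processing of the data step by step. The *_script.py files are optimized scripts that contain only the steps required to process the data.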