Replication Data for: Environmental Protection Agency (EPA) Science Inventory (Raw Data)
Source: NIAID Data Ecosystem (indexed 2026-05-10)
Download link:
https://doi.org/10.7910/DVN/UAABZK
Description:
The downloads for this archive focus on Reports and Presentations. There are four other archives (for a total of 5) containing the same information, packaged differently. This archive contains all downloaded files, uncategorized and unpackaged. There is a master index to help sort through the information: epa_SI_all_index.xlsx. Note that the index has two tabs: one for all of the records in the SI, and one for only downloads.

The other four archives are here:

- All data, chunked with indexes: https://doi.org/10.7910/DVN/SNYNKQ
- SI Reports: https://doi.org/10.7910/DVN/JB58ON
- SI Presentations: https://doi.org/10.7910/DVN/JCWZMA
- SI NEPIS files: https://doi.org/10.7910/DVN/RRPGWI

From the original website (https://cfpub.epa.gov/si/index.cfm): “The Science Inventory is a searchable database of research products primarily from EPA's Office of Research and Development. Science Inventory records provide descriptions of the product, contact information, and links to available printed material or websites.”

As also documented in epaSI_readme.docx in this archive, the overall Science Inventory consists of 94,428 records. However, the majority of those records are web-only entries; that is, they do not have a downloadable file associated with them.

The downloads from the Science Inventory (SI) were made with a set of scripts that scraped the website, downloaded the files, and then catalogued them. The SI website is separated into three main categories:

- Reports
- Presentations
- All others

For files with duplicate filenames, the title of the item was used (from the scraped records), with “filler” words stripped out.

Scripts:

epa_SI_uncat_scraper1.py (Oct 29, 16KB)

Purpose: Scrapes EPA Science Inventory for all document types EXCEPT journals, creates a CSV index

Scope: Configurable page ranges (default 1-10)

Features: Extracts metadata and download URLs from multiple document types (reports, books, data/software, etc.)

Main Features:

1. **Multiple URL Columns**: Auto-detects columns starting with 'url' (or you can specify them manually). Handles up to 31 URL columns.
2. **Intelligent Filename Logic**:
   - **Single download per row**: Uses native filename
   - **Multiple downloads per row**: Adds title prefix (first 4 non-filler words)
   - **Duplicate filenames across records**: Adds letter suffix (a, b, c, etc.)
3. **Filler Word Filtering**: Removes common filler words (to, the, of, and, at, in, for, a, an) when creating title prefixes.
4. **Two-Pass Processing**:
   - Pass 1: Collects all download information
   - Pass 2: Downloads files and applies appropriate naming

Files are automatically renamed after download based on conflict detection, and a conflict report at the end shows which filenames were duplicated.

## Usage:

```python
process_csv(
    csv_file="your_file.csv",
    url_columns=None,       # Auto-detect or specify ['url', 'url1', 'url2', ...]
    title_column='title',   # Column with record titles
    output_dir='downloads',
    max_rows=None           # For testing, set to a number
)
```

epa_multi_url_and_nepis_download3.py (Nov 5, 28KB)

Purpose: Downloads files from the CSV created by the scrapers and handles NEPIS URLs.
Enhanced version of the downloader with better NEPIS handling.

Key feature: Has a get_nepis_download_url() function that parses NEPIS popup pages to find actual PDF download links

NEPIS handling: Constructs popup URL (Display=p%7Cf), parses HTML for PDF link, handles JavaScript patterns using BeautifulSoup/requests

Key additions: Handles multiple URL formats per record; better filename sanitization; title-based prefixes for files; duplicate tracking; statistics by domain

epa_nepis_parse_manual1.py (Nov 6, 17KB)

Purpose: Separates NEPIS URLs from regular downloads for manual processing

Key feature: Creates separate CSV files for NEPIS vs non-NEPIS URLs

Use case: When automated NEPIS parsing fails, this prepares files for manual download

The workflow was:

1. epa_SI_uncat_scraper1.py - Generate index of all documents
2. epa_multi_url_and_nepis_download3.py - Download most files (with best NEPIS handling)
3. epa_nepis_parse_manual1.py - For stubborn NEPIS files that fail automation, separate them for manual download

Scraper scripts are available in the archive and on GitHub.

As of 11/6/2025:

- Scraped 94,031 entries
- Of those, 1,718 have NEPIS links associated with them
- Of the non-NEPIS entries (92,313), there are 12,002 files from those entries
- Of the NEPIS-associated entries (that is, NEPIS links in any download column), 225 had download URLs in columns that were not NEPIS, and of those, only 70 were actual downloads. The remainder of the 225 did not contain valuable content (that is, they had either “NTIS contact” listed, a “follow the URL” sentence, or an instruction to “contact the Program Officer” as the only page in a PDF).

Originally, an advanced search revealed that the records belonged to the following categories (number of records, category):

- 232 ASSESSMENT DOCUMENTs
- 1380 BOOKs
- 1340 BOOK CHAPTERs
- 276 COMMUNICATION PRODUCTs
- 25 CRITERIA...
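
The filename rules listed under Main Features (a title prefix built from the first 4 non-filler words, letter suffixes for duplicate names, and two-pass processing) could be sketched roughly as below. This is an illustrative reconstruction and not the code shipped in the archive: the helper names `title_prefix`, `letter_suffix`, and `plan_filenames`, the underscore separators, and the choice to suffix every member of a duplicate group starting at `a` are all assumptions.

```python
import os
import re
from collections import Counter, defaultdict

# Filler words as listed in the description above.
FILLER_WORDS = {"to", "the", "of", "and", "at", "in", "for", "a", "an"}


def title_prefix(title, max_words=4):
    """Join the first `max_words` non-filler words of a title with underscores."""
    words = re.findall(r"[A-Za-z0-9]+", title)
    kept = [w for w in words if w.lower() not in FILLER_WORDS]
    return "_".join(kept[:max_words])


def letter_suffix(index):
    """Map 0 -> 'a', 1 -> 'b', ..., 25 -> 'z', 26 -> 'aa' (assumed scheme)."""
    letters = ""
    index += 1
    while index > 0:
        index, rem = divmod(index - 1, 26)
        letters = chr(ord("a") + rem) + letters
    return letters


def plan_filenames(records):
    """Return final filenames for `records`, a list of (title, [native filenames])."""
    # Pass 1: collect candidate names; rows with several downloads get a title prefix.
    candidates = []
    for title, filenames in records:
        for name in filenames:
            if len(filenames) > 1:
                name = f"{title_prefix(title)}_{name}"
            candidates.append(name)

    # Pass 2: any name that occurs more than once gets a letter suffix
    # inserted before the extension (separator is an assumption).
    counts = Counter(candidates)
    seen = defaultdict(int)
    final = []
    for name in candidates:
        if counts[name] > 1:
            stem, ext = os.path.splitext(name)
            final.append(f"{stem}_{letter_suffix(seen[name])}{ext}")
            seen[name] += 1
        else:
            final.append(name)
    return final


# Hypothetical records, made up for illustration only.
example = [
    ("Assessment of the Air Quality in Urban Areas", ["summary.pdf", "appendix.pdf"]),
    ("Water Report", ["report.pdf"]),
    ("Soil Report", ["report.pdf"]),
]
print(plan_filenames(example))
```

The sketch only plans names. An actual downloader would also need to fetch each URL and sanitize characters that are invalid in filenames, and it does not guard against a suffixed name colliding with an existing native name.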
Created:
2026-01-20



