Extracted Data for: Environmental Protection Agency (EPA) Science Inventory, Reports
Source: DataONE. Updated: 2026-01-27. Indexed: 2026-02-07.
Download link:
https://search.dataone.org/view/sha256:69d04c28df3275468cce9b4502b944eeffbd593692dcfd56a133afb4b7aef5a9
Description:
This archive is a subset of the full Science Inventory archive; it includes all of the files marked as "Reports".

From the original website (https://cfpub.epa.gov/si/index.cfm): “The Science Inventory is a searchable database of research products primarily from EPA's Office of Research and Development. Science Inventory records provide descriptions of the product, contact information, and links to available printed material or websites.”

As also documented in epaSI_readme.docx in this archive, the Science Inventory consists of 94,428 records. This repository captured PUBLISHED REPORTs as well as PRESENTATIONs, although in the end the files were not clearly categorized in any one way. The records were not sorted into these categories upon download; instead, they are organized by index (Title and Entry ID number, plus page number of entry).

The files were obtained from the Science Inventory (SI) with a set of scripts that scraped the website, downloaded the files, and then catalogued them. The website is separated into three main categories:

  Reports
  Presentations
  All others

All of the types were downloaded together. The contents of each category are also presented in duplicate, grouped by category, for findability. The exception is the NEPIS files (National Service Center for Environmental Publications, also known as NSCEP). These files have a special download mechanism in which downloads are restricted by the hour and by the frequency of download requests, so it is not clear that all NEPIS files were captured.

The files are presented in archives by category (report, presentation, NEPIS), then by page count, and then all together. For files with duplicate filenames, the title of the item (from the scraped records) was used, with “filler” words stripped out.
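The duplicate-filename handling described above can be sketched roughly as follows. This is an illustration only, not the archive's actual code: the filler-word list and the limit of four words are taken from the script notes in this record, while the function signatures, the underscore separator, and the helper name disambiguate are assumptions made for the example.

```python
import re
from collections import Counter

# Filler words as listed in the script notes for this archive.
FILLER_WORDS = {"to", "the", "of", "and", "at", "in", "for", "a", "an"}


def extract_title_prefix(title, max_words=4):
    """Return the first `max_words` non-filler words of a title.

    Joining with underscores is an assumption made for this sketch.
    """
    words = re.findall(r"[A-Za-z0-9]+", title)
    kept = [w for w in words if w.lower() not in FILLER_WORDS]
    return "_".join(kept[:max_words])


def disambiguate(records):
    """Rename only the files whose native filename collides.

    `records` is a list of (title, native_filename) pairs (hypothetical shape).
    Pass 1 counts filenames; pass 2 adds a title prefix to the duplicates.
    """
    counts = Counter(name for _, name in records)
    final_names = []
    for title, name in records:
        prefix = extract_title_prefix(title)
        if counts[name] > 1 and prefix:
            final_names.append(f"{prefix}_{name}")
        else:
            final_names.append(name)
    return final_names


# Made-up records for illustration.
sample = [
    ("A Study of Lead in Water", "report.pdf"),
    ("The Ozone Report", "report.pdf"),
    ("Other", "data.csv"),
]
print(disambiguate(sample))
```

Counting first and renaming second keeps unique filenames untouched, so only the colliding files lose their native names.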
Scripts:

epa_SI_uncat_scraper1.py (Oct 29, 16KB)
Purpose: Scrapes the EPA Science Inventory for all document types EXCEPT journals and creates a CSV index.
Scope: Configurable page ranges (default 1-10).
Features: Extracts metadata and download URLs from multiple document types (reports, books, data/software, etc.).

Main Features:
  Multiple URL Columns: Auto-detects columns starting with 'url' (or you can specify them manually). Handles up to 31 URL columns.
  Intelligent Filename Logic:
    Single download per row: Uses the native filename.
    Multiple downloads per row: Adds a title prefix (first 4 non-filler words).
    Duplicate filenames across records: Adds a letter suffix (a, b, c, etc.).
  Filler Word Filtering: Removes common filler words (to, the, of, and, at, in, for, a, an) when creating title prefixes.
  Two-Pass Processing:
    Pass 1: Collects all download information.
    Pass 2: Downloads files and applies appropriate naming.

Key Changes from Original:
  download() now returns a (success, native_filename) tuple.
  New extract_title_prefix() function for creating prefixes.
  Tracks filename usage with the filename_counter and filename_usage dictionaries.
  Automatically renames files after download based on conflict detection.
  Provides a conflict report at the end showing which filenames were duplicated.

Usage:
    # python
    process_csv(
        csv_file="your_file.csv",
        url_columns=None,        # Auto-detect or specify ['url', 'url1', 'url2', ...]
        title_column='title',    # Column with record titles
        output_dir='downloads',
        max_rows=None            # For testing, set to a number
    )

epa_multi_url_and_nepis_download3.py (Nov 5, 28KB)
Purpose: Downloads files from the CSV created by the scrapers and handles NEPIS URLs.
Enhanced version of the downloader with better NEPIS handling.
Key feature: Has a get_nepis_download_url() function that parses NEPIS popup pages to find the actual PDF download links.
NEPIS handling: Constructs the popup URL (Display=p%7Cf), parses the HTML for the PDF link, and handles JavaScript patterns using BeautifulSoup/requests.
Key additions: Handles multiple URL formats per record; better filename sanitization; title-based prefixes for files; duplicate tracking; statistics by domain.

epa_nepis_parse_manual1.py (Nov 6, 17KB)
Purpose: Separates NEPIS URLs from regular downloads for manual processing.
Key feature: Creates separate CSV files for NEPIS vs non-NEPIS URLs.
Use case: When automated NEPIS parsing fails, this prepares files for manual download.

The workflow was:
  epa_SI_uncat_scraper1.py - Generate index of all documents
  epa_multi_url_and_nepis_download3.py - Download most files (with best NEPIS handling)
  epa_nepis_parse_manual1.py - For stubborn NEPIS files that fail automation, separate them for manual download

Scraper scripts are available in the archive, and in GitHub.

As of 11/6/2025:
  Scraped 94,031 entries.
  Of those, 1,718 have NEPIS links associated with them.
  Of the non-NEPIS entries (92,313), there are 12,002 files from entries.
  Of the NEPIS associated entries (that is, NEPIS links in any download column), 225 were download urls in columns that were not NEPIS, and of those, only 70 were actual downloads. The remainder of the 225 did not contain valuable content (that is,...
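The separation of NEPIS from non-NEPIS entries that epa_nepis_parse_manual1.py is described as performing might look roughly like the sketch below. It is not taken from that script. The host name nepis.epa.gov as the NEPIS marker, the function names, and the sample CSV layout are all assumptions for illustration; the rule that a NEPIS link in any download column makes an entry NEPIS-associated, and the detection of columns starting with 'url', follow the description in this record.

```python
import csv
import io
from urllib.parse import urlparse

# Assumption: NEPIS links are identified by this host name.
NEPIS_HOST = "nepis.epa.gov"


def detect_url_columns(fieldnames):
    """Return the columns whose names start with 'url' (case-insensitive)."""
    return [name for name in fieldnames if name.lower().startswith("url")]


def is_nepis(url):
    """True when the URL's host is the assumed NEPIS host."""
    return urlparse(url.strip()).netloc.lower() == NEPIS_HOST


def split_by_nepis(rows, url_columns):
    """Split rows into (nepis_rows, other_rows).

    A row counts as NEPIS-associated when any of its URL columns
    holds a NEPIS link.
    """
    nepis_rows, other_rows = [], []
    for row in rows:
        urls = [row[col] for col in url_columns if row.get(col)]
        target = nepis_rows if any(is_nepis(u) for u in urls) else other_rows
        target.append(row)
    return nepis_rows, other_rows


# Made-up sample index; the real CSV layout may differ.
SAMPLE = """title,url,url1
Report A,https://cfpub.epa.gov/si/example_a.pdf,
Report B,https://nepis.epa.gov/example_b,https://cfpub.epa.gov/si/example_b.pdf
"""

reader = csv.DictReader(io.StringIO(SAMPLE))
rows = list(reader)
columns = detect_url_columns(reader.fieldnames)
nepis_rows, other_rows = split_by_nepis(rows, columns)
print(len(nepis_rows), len(other_rows))
```

Each list could then be written to its own CSV file with csv.DictWriter. The counts reported above are consistent with a split of this kind: 94,031 scraped entries minus 1,718 NEPIS-associated entries leaves 92,313.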
Created: 2026-01-30



