Extracted Data for: Environmental Protection Agency (EPA) Science Inventory
Source: DataONE. Updated: 2026-01-27. Indexed: 2026-02-07.
Download link:
https://search.dataone.org/view/sha256:2a9749dd69911e99f3413a17ba18063a2dcc748beaf9fd76b103d48643c80906
Resource description:
This archive focuses on Reports and Presentations. There are four other archives (for a total of 5) containing the same information, packaged differently:

This archive = all downloaded files, packaged by download order, with indexes in chunks of 1,000 or fewer
Reports archive = Reports only
Presentations archive = Presentations only
NEPIS archive = NEPIS files only
Raw archive = all raw data files with one master index

Scripts are included with each archive.

From the Original URL website (https://cfpub.epa.gov/si/index.cfm): “The Science Inventory is a searchable database of research products primarily from EPA's Office of Research and Development. Science Inventory records provide descriptions of the product, contact information, and links to available printed material or websites.”

As also documented in the epaSI_readme.docx in this archive, the Science Inventory consists of 94,428 records. This repository captured PUBLISHED REPORTs as well as PRESENTATIONs, although in the end the files were not clearly categorized in any one way: the records were not sorted into those categories upon download. Instead, the records are organized by index (Title and Entry ID number, plus page number of entry).

The downloads from the Science Inventory (SI) were made with a set of scripts that scraped the website, then downloaded the files, and then catalogued them. The website is separated into three main categories:

Reports
Presentations
All others

All of the types were downloaded together. The contents of each category are also presented in duplicate, grouped by category, for findability. The exception is the NEPIS files (National Service Center for Environmental Publications, also known as NSCEP). These files have a special download mechanism that restricts downloads by the hour and by the frequency of download requests. Thus, it is not clear that all NEPIS files were captured.
The files are presented in archives by category (report, presentation, NEPIS), then by page count, and then all together. For the latter, the indexes are the combined 7 indexes as listed below. For files with duplicate filenames, the title of the item was used (from the scraped records), with “filler” words stripped out.

The downloads have been broken into chunks by “pages” of returns from the SI site. The scraper catalogued into pages of 25 records each, and thus the download process was handled by page count:

1 – 100
101 – 500
501 – 1000
1001 – 1250
1251 – 2000
2001 – 3000
3001 – 4000

Scripts:

epa_SI_uncat_scraper1.py (Oct 29, 16KB)
Purpose: Scrapes EPA Science Inventory for all document types EXCEPT journals, creates a CSV index
Scope: Configurable page ranges (default 1-10)
Features: Extracts metadata and download URLs from multiple document types (reports, books, data/software, etc.)

Main Features:
Multiple URL Columns: Auto-detects columns starting with 'url' (or you can specify them manually). Handles up to 31 URL columns.
Intelligent Filename Logic:
    Single download per row: Uses native filename
    Multiple downloads per row: Adds title prefix (first 4 non-filler words)
    Duplicate filenames across records: Adds letter suffix (a, b, c, etc.)
Filler Word Filtering: Removes common filler words (to, the, of, and, at, in, for, a, an) when creating title prefixes.
Two-Pass Processing:
    Pass 1: Collects all download information
    Pass 2: Downloads files and applies appropriate naming

Key Changes from Original:
    download() now returns (success, native_filename) tuple
    New extract_title_prefix() function for creating prefixes
    Tracks filename usage with filename_counter and filename_usage dictionaries
    Automatically renames files after download based on conflict detection
    Provides a conflict report at the end showing which filenames were duplicated

Usage (Python):
    process_csv(
        csv_file="your_file.csv",
        url_columns=None,        # Auto-detect or specify ['url', 'url1', 'url2', ...]
        title_column='title',    # Column with record titles
        output_dir='downloads',
        max_rows=None            # For testing, set to a number
    )

epa_multi_url_and_nepis_download3.py (Nov 5, 28KB)
Purpose: Downloads files from CSV created by scrapers + handles NEPIS URLs. Enhanced version of the downloader with better NEPIS handling
Key feature: Has get_nepis_download_url() function that parses NEPIS popup pages to find actual PDF download links
NEPIS handling: Constructs popup URL (Display=p%7Cf), parses HTML for PDF link, handles JavaScript patterns using BeautifulSoup/requests
Key additions: Handles multiple URL formats per record; Better filename sanitization; Title-based prefixes for files; Duplicate tracking; Statistics by domain

epa_nepis_parse_manual1.py (Nov 6, 17KB)
Purpose: Separates NEPIS URLs from regular downloads for manual processing
Key feature: Creates separate CSV files for NEPIS vs non-NEPIS URLs
Use case: When automated NEPIS parsing fails, this prepares files for manual download

The workflow was: epa_SI_uncat_scraper1.py -...
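
The filename rules described under Intelligent Filename Logic and Two-Pass Processing can be illustrated with a short Python sketch. This is not the code shipped in the archive. The filler-word list, the 4-word prefix, and the extract_title_prefix() name come from the description; assign_filenames() is a hypothetical helper, and the underscore separators and the choice to give every member of a duplicate group a letter (including the first) are assumptions made for illustration.

```python
import os
import re
from collections import Counter

# Filler words as listed in the description.
FILLER_WORDS = {"to", "the", "of", "and", "at", "in", "for", "a", "an"}


def extract_title_prefix(title, max_words=4):
    """Join the first `max_words` non-filler words of a title with underscores.

    The underscore separator is an assumption; the description only says
    that the first 4 non-filler words are used.
    """
    words = re.findall(r"[A-Za-z0-9]+", title)
    kept = [w for w in words if w.lower() not in FILLER_WORDS]
    return "_".join(kept[:max_words])


def assign_filenames(records):
    """Hypothetical helper: map (title, [native filenames]) records to final names.

    Pass 1 collects candidate names; pass 2 resolves duplicates.
    """
    # Pass 1: rows with a single download keep the native filename,
    # rows with several downloads get a title prefix.
    candidates = []
    for title, natives in records:
        for native in natives:
            if len(natives) > 1:
                candidates.append(f"{extract_title_prefix(title)}_{native}")
            else:
                candidates.append(native)
    totals = Counter(candidates)

    # Pass 2: names that occur more than once get a letter suffix a, b, c, ...
    # (assumes fewer than 27 duplicates of any one name).
    seen = Counter()
    final = []
    for name in candidates:
        if totals[name] > 1:
            stem, ext = os.path.splitext(name)
            letter = chr(ord("a") + seen[name])
            seen[name] += 1
            final.append(f"{stem}_{letter}{ext}")
        else:
            final.append(name)
    return final


if __name__ == "__main__":
    demo = [
        ("Air Quality Report", ["report.pdf"]),
        ("Guide to the Analysis of Soil", ["report.pdf"]),
        ("Methods for Water Sampling in Streams", ["main.pdf", "appendix.pdf"]),
    ]
    for name in assign_filenames(demo):
        print(name)
```

Collecting all candidate names before renaming is what makes the duplicate check possible: a collision between two records can only be detected once both have been seen.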
Created:
2026-01-30



