
Extracted Data from: Environmental Protection Agency (EPA) Science Inventory, NEPIS

DataONE | Updated 2026-03-02 | Indexed 2026-04-04
Download link:
https://search.dataone.org/view/sha256:d3cf15c038a781b94f44ce4e0cffa3cddfdd49fd27c1601846be1074cd0f6a1c
Description:
This archive is a subset of the full archive at https://doi.org/10.7910/DVN/SNYNKQ and focuses on National Environmental Publications Internet Site (NEPIS) reports. From the original website (https://cfpub.epa.gov/si/index.cfm): “The Science Inventory is a searchable database of research products primarily from EPA's Office of Research and Development. Science Inventory records provide descriptions of the product, contact information, and links to available printed material or websites.” As also documented in epaSI_readme.docx in this archive, the Science Inventory consists of 94,428 records.

This repository captured PUBLISHED REPORTs as well as PRESENTATIONs, although in the end the files were not clearly categorized in any one way. The records were not categorized by these types at download time. Instead, they are organized by index (title and entry ID number, plus the page number of the entry).

The downloads from the Science Inventory (SI) were made with a set of scripts that scraped the website, downloaded the files, and then catalogued them. The website is separated into three main categories:

- Reports
- Presentations
- All others

All of the types were downloaded together. The contents of each category are also presented in duplicate, grouped by category, for findability. The exception is the NEPIS files (National Service Center for Environmental Publications, also known as NSCEP). These files have a special download mechanism in which downloads are restricted by the hour and by the frequency of download requests, so it is not clear that all NEPIS files were captured.

The files are presented in archives by category (report, presentation, NEPIS), then by page count, and then all together. For the latter, the indexes are the combined 7 indexes as listed below. For files with duplicate filenames, the title of the item (from the scraped records) was used, with “filler” words stripped out.
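The title-based naming described above can be sketched as follows. This is a minimal illustration, not the archive's actual code: the helper names `title_prefix` and `prefixed_filename` are hypothetical, the underscore separator is an assumption, and the filler-word list and four-word limit are taken from the script notes in this description.

```python
import re

# Filler words as listed in the script notes for this archive.
FILLER_WORDS = {"to", "the", "of", "and", "at", "in", "for", "a", "an"}


def title_prefix(title: str, max_words: int = 4) -> str:
    """Return the first `max_words` non-filler words of a title.

    Words are joined with underscores (an assumption; the archive's
    scripts may use a different separator).
    """
    words = re.findall(r"[A-Za-z0-9]+", title)
    kept = [w for w in words if w.lower() not in FILLER_WORDS]
    return "_".join(kept[:max_words])


def prefixed_filename(title: str, native_filename: str) -> str:
    """Prepend the title prefix to a downloaded file's native name."""
    prefix = title_prefix(title)
    return f"{prefix}_{native_filename}" if prefix else native_filename


print(prefixed_filename(
    "Assessment of the Risks to Water Quality in Lakes", "report.pdf"
))
# prints "Assessment_Risks_Water_Quality_report.pdf"
```

Stripping filler words before truncating keeps the prefix informative: the four retained words are more likely to distinguish two records than the first four words of the raw title.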
Scripts:

epa_SI_uncat_scraper1.py (Oct 29, 16KB)

- Purpose: Scrapes the EPA Science Inventory for all document types EXCEPT journals and creates a CSV index
- Scope: Configurable page ranges (default 1-10)
- Features: Extracts metadata and download URLs from multiple document types (reports, books, data/software, etc.)

Main Features:

1. **Multiple URL Columns**: Auto-detects columns starting with 'url' (or you can specify them manually). Handles up to 31 URL columns.
2. **Intelligent Filename Logic**:
   - **Single download per row**: Uses native filename
   - **Multiple downloads per row**: Adds title prefix (first 4 non-filler words)
   - **Duplicate filenames across records**: Adds letter suffix (a, b, c, etc.)
3. **Filler Word Filtering**: Removes common filler words (to, the, of, and, at, in, for, a, an) when creating title prefixes.
4. **Two-Pass Processing**:
   - Pass 1: Collects all download information
   - Pass 2: Downloads files and applies appropriate naming

## Key Changes from Original:

- `download()` now returns `(success, native_filename)` tuple
- New `extract_title_prefix()` function for creating prefixes
- Tracks filename usage with `filename_counter` and `filename_usage` dictionaries
- Automatically renames files after download based on conflict detection
- Provides a conflict report at the end showing which filenames were duplicated

## Usage:

```python
process_csv(
    csv_file="your_file.csv",
    url_columns=None,       # Auto-detect or specify ['url', 'url1', 'url2', ...]
    title_column='title',   # Column with record titles
    output_dir='downloads',
    max_rows=None           # For testing, set to a number
)
```

epa_multi_url_and_nepis_download3.py (Nov 5, 28KB)

- Purpose: Downloads files from the CSV created by the scrapers and handles NEPIS URLs. Enhanced version of the downloader with better NEPIS handling
- Key feature: Has a `get_nepis_download_url()` function that parses NEPIS popup pages to find the actual PDF download links
- NEPIS handling: Constructs the popup URL (Display=p%7Cf), parses the HTML for the PDF link, and handles JavaScript patterns using BeautifulSoup/requests
- Key additions: Handles multiple URL formats per record; better filename sanitization; title-based prefixes for files; duplicate tracking; statistics by domain

epa_nepis_parse_manual1.py (Nov 6, 17KB)

- Purpose: Separates NEPIS URLs from regular downloads for manual processing
- Key feature: Creates separate CSV files for NEPIS vs non-NEPIS URLs
- Use case: When automated NEPIS parsing fails, this prepares files for manual download

The workflow was:

1. epa_SI_uncat_scraper1.py: Generate an index of all documents
2. epa_multi_url_and_nepis_download3.py: Download most files (with the best NEPIS handling)
3. epa_nepis_parse_manual1.py: For stubborn NEPIS files that fail automation, separate them for manual download

Scraper scripts are available in the archive and on GitHub.

As of 11/6/2025:

- Scraped 94,031 entries
- Of those, 1,718 have NEPIS links associated with them
- Of the non-NEPIS entries (92,313), there are 12,002 files from entries.
- Of the NEPIS associated entries (that...
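
As an illustration of the NEPIS vs non-NEPIS separation attributed to epa_nepis_parse_manual1.py above, the sketch below partitions index rows by whether any URL column points at a NEPIS host. It is not the archive's script: the function names `is_nepis` and `split_rows` are hypothetical, the host `nepis.epa.gov` is an assumption about how NEPIS links appear in the index, and the 'url' column prefix follows the auto-detection rule described in the feature list.

```python
from urllib.parse import urlparse

# Assumption: NEPIS links in the index point at this host.
NEPIS_HOST = "nepis.epa.gov"


def is_nepis(url: str) -> bool:
    """True if the URL's host is the (assumed) NEPIS host or a subdomain."""
    host = urlparse(url).hostname or ""
    return host == NEPIS_HOST or host.endswith("." + NEPIS_HOST)


def split_rows(rows, url_columns=None):
    """Partition index rows into (nepis_rows, other_rows).

    `rows` is an iterable of dicts, e.g. from csv.DictReader. When
    `url_columns` is None, every column whose name starts with 'url'
    is checked.
    """
    nepis_rows, other_rows = [], []
    for row in rows:
        cols = url_columns or [c for c in row if c.lower().startswith("url")]
        urls = [row[c] for c in cols if row.get(c)]
        if any(is_nepis(u) for u in urls):
            nepis_rows.append(row)
        else:
            other_rows.append(row)
    return nepis_rows, other_rows
```

Each returned list can be written to its own CSV with `csv.DictWriter`, so the rate-limited NEPIS entries can be handled manually while the rest go through the automated downloader.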
Created:
2026-03-05