Dataset for paper Spotting the Hook: Leveraging Domain Data for Advanced Phishing Detection
Source: NIAID Data Ecosystem; indexed 2026-05-02
Download link:
https://zenodo.org/record/12518089
Description:
The dataset contains DNS records, IP-related features, WHOIS/RDAP information, information from TLS certificate fields, and GeoIP information for 432,572 benign domains from Cisco Umbrella and 68,353 phishing domains from the PhishTank and OpenPhish services. The ground truth for the phishing dataset was double-checked with the VirusTotal (VT) service. Domain names not considered phishing by VT have been removed. The data was collected between March and November 2023. The final assessment of the data was conducted in December 2023.
The dataset is useful for statistical analysis of domain data or feature extraction for training machine learning-based classifiers, e.g. for phishing detection.
Data Files
The data is located in two individual files:
benign.json - data for 432,572 benign domains, and
phishing.json - data for 68,353 phishing domains.
Data Structure
Both files contain a JSON array of records generated using mongoexport. The following table documents the structure of a record. Please note that:
some fields may be missing (they should be interpreted as nulls),
extra fields may be present (they should be ignored),
due to a processing error, the common_name field of the certificate objects always contains trailing symbols: ‘> .
| Field name | Field type | Nullable | Description |
|---|---|---|---|
| domain_name | String | No | The evaluated domain name |
| url | String | No | The source URL for the domain name |
| evaluated_on | Date | No | Date of last collection attempt |
| source | String | No | An identifier of the source |
| sourced_on | Date | No | Date of ingestion of the domain name |
| dns | Object | Yes | Data from DNS scan |
| rdap | Object | Yes | Data from RDAP or WHOIS |
| tls | Object | Yes | Data from TLS handshake |
| ip_data | Array of Objects | Yes | Array of data objects capturing the IP addresses related to the domain name |
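The notes on missing and extra fields can be applied mechanically when loading the files. The following is a minimal sketch, assuming each file fits in memory and is a single JSON array as described; the helper names `normalize_record` and `load_records` are illustrative and not part of the dataset.

```python
import json

# Top-level fields documented in the record structure table.
TOP_LEVEL_FIELDS = (
    "domain_name", "url", "evaluated_on", "source", "sourced_on",
    "dns", "rdap", "tls", "ip_data",
)


def normalize_record(record):
    """Return a dict holding exactly the documented top-level fields.

    Missing fields are interpreted as None (null); extra fields are ignored.
    """
    return {name: record.get(name) for name in TOP_LEVEL_FIELDS}


def load_records(path):
    """Load one dataset file (a JSON array) and normalize every record.

    Assumption: the whole file is read at once; a streaming parser may be
    preferable if memory is limited.
    """
    with open(path, encoding="utf-8") as handle:
        return [normalize_record(record) for record in json.load(handle)]
```

A call such as `load_records("phishing.json")` would then yield a list of dictionaries that all share the same keys, which simplifies later feature extraction.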
DNS data (dns field)
| Field name | Field type | Nullable | Description |
|---|---|---|---|
| A | Array of Strings | No | Array of IPv4 addresses |
| AAAA | Array of Strings | No | Array of IPv6 addresses |
| TXT | Array of Strings | No | Array of raw TXT values |
| CNAME | Object | No | The CNAME target and related IPs |
| MX | Array of Objects | No | Array of objects with the MX target hostname, priority and related IPs |
| NS | Array of Objects | No | Array of objects with the NS target hostname and related IPs |
| SOA | Object | No | All the SOA fields, present if found at the target domain name |
| zone_SOA | Object | No | The SOA fields of the target’s zone (closest point of delegation), present if found and not a record in the target domain directly |
| dnssec | Object | No | Flags describing the DNSSEC validation result for each record type |
| ttls | Object | No | The TTL values for each record type |
| remarks | Object | No | The zone domain name and DNSSEC flags |
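As an illustration of feature extraction from the dns object, the sketch below counts the entries of the array-valued record types and flags the presence of the object-valued ones. It relies only on the field names and types listed for the dns field; the function name `dns_record_counts` and the output keys are hypothetical, and absent keys are treated as empty, following the note that missing fields should be read as nulls.

```python
ARRAY_FIELDS = ("A", "AAAA", "TXT", "MX", "NS")
OBJECT_FIELDS = ("CNAME", "SOA", "zone_SOA")


def dns_record_counts(dns):
    """Summarize one `dns` object as simple numeric and boolean features.

    `dns` may be None because the top-level field is nullable.
    """
    dns = dns or {}
    features = {}
    for name in ARRAY_FIELDS:
        # A missing or null array counts as zero records.
        features["count_" + name] = len(dns.get(name) or [])
    for name in OBJECT_FIELDS:
        features["has_" + name] = dns.get(name) is not None
    return features
```

Returning a flat dictionary keeps the result easy to convert into a feature vector for a classifier.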
RDAP data (rdap field)
| Field name | Field type | Nullable | Description |
|---|---|---|---|
| copyright_notice | String | No | RDAP/WHOIS data usage copyright notice |
| dnssec | Bool | No | DNSSEC presence flag |
| entitites | Object | No | An object with various arrays representing the found related entity types (e.g. abuse, admin, registrant). The arrays contain objects describing the individual entities. |
| expiration_date | Date | Yes | The current date of expiration |
| handle | String | No | RDAP handle |
| last_changed_date | Date | Yes | The date when the domain was last changed |
| name | String | No | The target domain name for which the data in this object are stored |
| nameservers | Array of Strings | No | Nameserver hostnames provided by RDAP or WHOIS |
| registration_date | Date | Yes | First registration date |
| status | Array of Strings | No | The state of the registered object [TODO] |
| terms_of_service_url | String | No | URL of the RDAP usage ToS |
| url | String | No | URL of the RDAP entity |
| whois_server | String | No | WHOIS server address |
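Date fields such as registration_date and evaluated_on can be combined into a domain-age feature. The sketch below is hedged: the description does not state how dates are serialized, so it assumes the MongoDB Extended JSON forms that mongoexport commonly emits (`{"$date": "<ISO 8601 string>"}` or `{"$date": {"$numberLong": "<milliseconds>"}}`) and also accepts a plain ISO 8601 string. The helper names `parse_date` and `domain_age_days` are illustrative.

```python
from datetime import datetime, timezone


def parse_date(value):
    """Parse a date value into a timezone-aware datetime, or return None.

    Assumption: dates are Extended JSON wrappers or plain ISO 8601 strings.
    """
    if isinstance(value, dict):
        value = value.get("$date")
        if isinstance(value, dict):
            # Canonical form: milliseconds since the Unix epoch, as a string.
            millis = int(value["$numberLong"])
            return datetime.fromtimestamp(millis / 1000, tz=timezone.utc)
    if isinstance(value, str):
        # Replace a trailing "Z" so older Python versions can parse it.
        parsed = datetime.fromisoformat(value.replace("Z", "+00:00"))
        if parsed.tzinfo is None:
            parsed = parsed.replace(tzinfo=timezone.utc)
        return parsed
    return None


def domain_age_days(record):
    """Days between first registration and the last collection attempt.

    Returns None when either date is missing or null.
    """
    rdap = record.get("rdap") or {}
    registered = parse_date(rdap.get("registration_date"))
    evaluated = parse_date(record.get("evaluated_on"))
    if registered is None or evaluated is None:
        return None
    return (evaluated - registered).days
```

If the files turn out to use a different date encoding, only `parse_date` would need to change.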
TLS data (tls field)
| Field name | Field type | Nullable | Description |
|---|---|---|---|
| cipher | String | No | TLS cipher suite description according to [TODO] |
| protocol | String | No | One of “TLS”, “TLSv1.2”, “TLSv1.3” |
| certificates | Array of Objects | No | Array of objects representing the certificate chain, the first element is the root certificate |
IP data (elements in the ip_data array)
| Field name | Field type | Nullable | Description |
|---|---|---|---|
| ip | String | No | The IP address |
| from_record | String | No | The type of the DNS record the address was captured from |
| remarks | Object | No | Ping round-trip time, “is alive” flag and rdap/geo/asn evaluation dates |
| rdap | Object | Yes | RDAP data, similar to DNS RDAP, see the JSON Schema for details |
| geo | Object | Yes | Geolocation data from the GeoLite2 City database (e.g. latitude, longitude, city, country, etc.) |
| asn | Object | Yes | Autonomous system data from the GeoLite2 ASN database (ASN, organization, network) |
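The ip_data array can be reduced to a few per-domain aggregates. The sketch below uses only the documented ip, from_record, geo and asn keys and does not assume any inner structure of the geo or asn objects; the function name `summarize_ip_data` and the output keys are hypothetical.

```python
from collections import Counter


def summarize_ip_data(ip_data):
    """Aggregate the `ip_data` array of one record into simple features.

    `ip_data` may be None because the top-level field is nullable.
    """
    entries = ip_data or []
    # Count how many addresses came from each DNS record type.
    by_type = Counter(
        entry["from_record"] for entry in entries if entry.get("from_record")
    )
    return {
        "unique_ips": len({entry["ip"] for entry in entries if entry.get("ip")}),
        "by_record_type": dict(by_type),
        "with_geo": sum(1 for entry in entries if entry.get("geo") is not None),
        "with_asn": sum(1 for entry in entries if entry.get("asn") is not None),
    }
```

For example, a record whose addresses all come from A records and lack geolocation data would report `with_geo` as 0, which may itself be an informative signal.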
Acknowledgements
We would like to thank the OpenPhish Team for granting permission to use and publish their dataset. We also thank VirusTotal for providing us access to the API for research purposes. The research has been supported by the Flow-based Encrypted Traffic Analysis project, no. VJ02010024, granted by the Ministry of the Interior of the Czech Republic and the Smart Information Technology for a Resilient Society project, no. FIT-S-23-8209, granted by Brno University of Technology.
Created:
2024-06-25



