Dataset for paper Spotting the Hook: Leveraging Domain Data for Advanced Phishing Detection
Source: Mendeley Data. Updated: 2024-06-27. Indexed: 2024-06-27.
Download link:
https://zenodo.org/records/12518090
Description:
The dataset contains DNS records, IP-related features, WHOIS/RDAP information, information from TLS certificate fields, and GeoIP information for 432,572 benign domains from Cisco Umbrella and 68,353 phishing domains from the PhishTank and OpenPhish services. The ground truth for the phishing dataset was double-checked with the VirusTotal (VT) service, and domain names that VT did not consider phishing were removed. The data was collected between March and November 2023. The final assessment of the data was conducted in December 2023. The dataset is useful for statistical analysis of domain data or for feature extraction when training machine learning-based classifiers, e.g. for phishing detection.

Data Files

The data is located in two individual files:
- benign.json: data for 432,572 benign domains
- phishing.json: data for 68,353 phishing domains

Data Structure

Both files contain a JSON array of records generated using mongoexport. The tables below document the structure of a record. Please note that:
- some fields may be missing (they should be interpreted as nulls),
- extra fields may be present (they should be ignored),
- due to a processing error, the common_name field of the certificate objects always contains trailing symbols: ‘>
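The three notes above can be handled once, at load time. The following is a minimal sketch rather than an official loader: the helper names `normalize_record` and `load_records` are hypothetical, and the exact characters of the stray `common_name` suffix (taken here to be `'` and `>`) are an assumption that should be checked against the data.

```python
import copy
import json
from typing import Any

# Top-level fields documented for a record; anything else is ignored.
TOP_LEVEL_FIELDS = (
    "domain_name", "url", "evaluated_on", "source", "sourced_on",
    "dns", "rdap", "tls", "ip_data",
)

# ASSUMPTION: the stray suffix consists of the characters ' and >.
# Verify this against the actual files before relying on it.
TRAILING_SYMBOLS = "'>"


def normalize_record(raw: dict[str, Any]) -> dict[str, Any]:
    """Return a copy with missing fields set to None and extra fields dropped."""
    record = {name: copy.deepcopy(raw.get(name)) for name in TOP_LEVEL_FIELDS}

    # Strip the trailing symbols from every certificate's common_name.
    tls = record["tls"]
    if isinstance(tls, dict):
        for cert in tls.get("certificates") or []:
            name = cert.get("common_name")
            if isinstance(name, str):
                cert["common_name"] = name.rstrip(TRAILING_SYMBOLS)
    return record


def load_records(path: str) -> list[dict[str, Any]]:
    """Read one of the JSON array files fully into memory and normalize it."""
    with open(path, encoding="utf-8") as handle:
        return [normalize_record(item) for item in json.load(handle)]
```

Calling `load_records("phishing.json")` would return a list of normalized records. Because each file holds a single JSON array with tens or hundreds of thousands of records, `json.load` reads everything into memory at once; a streaming JSON parser may be a better fit on machines with limited RAM.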
| Field name | Field type | Nullable | Description |
|---|---|---|---|
| domain_name | String | No | The evaluated domain name |
| url | String | No | The source URL for the domain name |
| evaluated_on | Date | No | Date of last collection attempt |
| source | String | No | An identifier of the source |
| sourced_on | Date | No | Date of ingestion of the domain name |
| dns | Object | Yes | Data from DNS scan |
| rdap | Object | Yes | Data from RDAP or WHOIS |
| tls | Object | Yes | Data from TLS handshake |
| ip_data | Array of Objects | Yes | Array of data objects capturing the IP addresses related to the domain name |

DNS data (dns field)

| Field name | Field type | Nullable | Description |
|---|---|---|---|
| A | Array of Strings | No | Array of IPv4 addresses |
| AAAA | Array of Strings | No | Array of IPv6 addresses |
| TXT | Array of Strings | No | Array of raw TXT values |
| CNAME | Object | No | The CNAME target and related IPs |
| MX | Array of Objects | No | Array of objects with the MX target hostname, priority and related IPs |
| NS | Array of Objects | No | Array of objects with the NS target hostname and related IPs |
| SOA | Object | No | All the SOA fields, present if found at the target domain name |
| zone_SOA | Object | No | The SOA fields of the target's zone (closest point of delegation), present if found and not a record in the target domain directly |
| dnssec | Object | No | Flags describing the DNSSEC validation result for each record type |
| ttls | Object | No | The TTL values for each record type |
| remarks | Object | No | The zone domain name and DNSSEC flags |

RDAP data (rdap field)

| Field name | Field type | Nullable | Description |
|---|---|---|---|
| copyright_notice | String | No | RDAP/WHOIS data usage copyright notice |
| dnssec | Bool | No | DNSSEC presence flag |
| entities | Object | No | An object with various arrays representing the found related entity types (e.g. abuse, admin, registrant). The arrays contain objects describing the individual entities. |
| expiration_date | Date | Yes | The current date of expiration |
| handle | String | No | RDAP handle |
| last_changed_date | Date | Yes | The date when the domain was last changed |
| name | String | No | The target domain name for which the data in this object are stored |
| nameservers | Array of Strings | No | Nameserver hostnames provided by RDAP or WHOIS |
| registration_date | Date | Yes | First registration date |
| status | Array of Strings | No | The state of the registered object [TODO] |
| terms_of_service_url | String | No | URL of the RDAP usage ToS |
| url | String | No | URL of the RDAP entity |
| whois_server | String | No | WHOIS server address |

TLS data (tls field)

| Field name | Field type | Nullable | Description |
|---|---|---|---|
| cipher | String | No | TLS cipher suite description according to [TODO] |
| protocol | String | No | One of "TLS", "TLSv1.2", "TLSv1.3" |
| certificates | Array of Objects | No | Array of objects representing the certificate chain; the first element is the root certificate |

IP data (elements in the ip_data array)

| Field name | Field type | Nullable | Description |
|---|---|---|---|
| ip | String | No | The IP address |
| from_record | String | No | The type of the DNS record the address was captured from |
| remarks | Object | No | Ping round-trip time, "is alive" flag and rdap/geo/asn evaluation dates |
| rdap | Object | Yes | RDAP data for the IP address, similar to the domain RDAP data; see the JSON Schema for details |
| geo | Object | Yes | Geolocation data from the GeoLite2 City database (e.g. latitude, longitude, city, country, etc.) |
| asn | Object | Yes | Autonomous system data from the GeoLite2 ASN database (ASN, organization, network) |

Acknowledgements

We would like to thank the OpenPhish Team for granting permission to use and publish their dataset. We also thank VirusTotal for providing us access to the API for research purposes. The research has been supported by the Flow-based Encrypted Traffic Analysis project, no. VJ02010024, granted by the Ministry of the Interior of the Czech Republic and the Smart Information Technology for a Resilient Society project, no. FIT-S-23-8209, granted by Brno University of Technology.
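For the feature-extraction use case mentioned in the description, the record structure documented above could be flattened into a few numeric and categorical features along these lines. This is only an illustrative sketch, not the feature set used in the paper: the function names and the chosen features are hypothetical, and it assumes that dates appear either as ISO 8601 strings or in the `{"$date": "..."}` form that mongoexport writes in relaxed Extended JSON.

```python
from datetime import datetime, timezone
from typing import Any, Optional


def parse_date(value: Any) -> Optional[datetime]:
    """Parse an ISO 8601 string or a {"$date": "..."} wrapper; None on failure."""
    if isinstance(value, dict):
        value = value.get("$date")  # ASSUMPTION: relaxed Extended JSON dates
    if not isinstance(value, str):
        return None
    try:
        parsed = datetime.fromisoformat(value.replace("Z", "+00:00"))
    except ValueError:
        return None
    # Treat naive timestamps as UTC so that subtraction is always valid.
    return parsed if parsed.tzinfo else parsed.replace(tzinfo=timezone.utc)


def extract_features(record: dict[str, Any]) -> dict[str, Any]:
    """Build a flat feature dict; missing or null fields yield 0, False or None."""
    dns = record.get("dns") or {}
    rdap = record.get("rdap") or {}
    tls = record.get("tls") or {}

    # Domain age is measured at collection time (a design choice of this sketch).
    registered = parse_date(rdap.get("registration_date"))
    evaluated = parse_date(record.get("evaluated_on"))
    age_days = (evaluated - registered).days if registered and evaluated else None

    return {
        "domain_length": len(record.get("domain_name") or ""),
        "has_dns": record.get("dns") is not None,
        "has_rdap": record.get("rdap") is not None,
        "has_tls": record.get("tls") is not None,
        "a_count": len(dns.get("A") or []),
        "aaaa_count": len(dns.get("AAAA") or []),
        "mx_count": len(dns.get("MX") or []),
        "ns_count": len(dns.get("NS") or []),
        "txt_count": len(dns.get("TXT") or []),
        "rdap_dnssec": rdap.get("dnssec"),
        "domain_age_days": age_days,
        "tls_protocol": tls.get("protocol"),
        "cert_chain_length": len(tls.get("certificates") or []),
        "ip_count": len(record.get("ip_data") or []),
    }
```

Every lookup goes through `.get(...) or` a default, so a record with null `dns`, `rdap`, `tls` or `ip_data` objects still produces a complete feature dict instead of raising an error.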
Created:
2024-06-27



