Dataset for paper Spotting the Hook: Leveraging Domain Data for Advanced Phishing Detection

Indexed from NIAID Data Ecosystem on 2026-05-02
Download link:
https://zenodo.org/record/12518089
Resource description:
The dataset contains DNS records, IP-related features, WHOIS/RDAP information, TLS certificate fields, and GeoIP information for 432,572 benign domains from Cisco Umbrella and 68,353 phishing domains from the PhishTank and OpenPhish services. The ground truth for the phishing dataset was double-checked with the VirusTotal (VT) service, and domain names that VT did not consider phishing were removed. The data was collected between March and November 2023. The final assessment of the data was conducted in December 2023. The dataset is useful for statistical analysis of domain data or for feature extraction when training machine-learning classifiers, e.g. for phishing detection.

Data Files

The data is located in two individual files:
- benign.json: data for 432,572 benign domains
- phishing.json: data for 68,353 phishing domains

Data Structure

Both files contain a JSON array of records generated using mongoexport. The following table documents the structure of a record. Please note that:
- some fields may be missing (they should be interpreted as nulls),
- extra fields may be present (they should be ignored),
- due to a processing error, the common_name field of the certificate objects always contains the trailing symbols '>.
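The three notes above can be handled once at load time. The following is a minimal Python sketch, not the authors' tooling: the function names (parse_export_date, clean_common_name, normalize_record, load_records) are hypothetical, the location of the certificates under tls.certificates is taken from the record structure table, and the date handling assumes that mongoexport wrote dates in MongoDB Extended JSON form (a "$date" wrapper), which should be checked against the actual files.

```python
import json
from datetime import datetime, timedelta, timezone

# Top-level fields documented as nullable; a missing key is treated as null.
NULLABLE_FIELDS = ("dns", "rdap", "tls", "ip_data")
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def parse_export_date(value):
    """Turn an Extended JSON date into a timezone-aware datetime (or None).

    Assumption: dates appear as {"$date": "<ISO 8601>"} or as
    {"$date": {"$numberLong": "<milliseconds since epoch>"}}.
    """
    if isinstance(value, dict) and "$date" in value:
        value = value["$date"]
    if isinstance(value, dict) and "$numberLong" in value:
        return EPOCH + timedelta(milliseconds=int(value["$numberLong"]))
    if isinstance(value, str):
        return datetime.fromisoformat(value.replace("Z", "+00:00"))
    return None


def clean_common_name(name):
    """Strip the trailing '> debris described in the notes above.

    Assumption: the debris is a quote character (straight or curly)
    followed by '>', and no genuine name ends with those characters.
    """
    if not isinstance(name, str):
        return name
    return name.rstrip().rstrip("'\u2018\u2019>")


def normalize_record(record):
    """Return a copy with missing nullable fields set to None,
    certificate common names cleaned, and top-level dates parsed.
    Extra, undocumented fields are left in place for callers to ignore."""
    out = dict(record)
    for key in NULLABLE_FIELDS:
        out.setdefault(key, None)
    tls = out["tls"]
    if isinstance(tls, dict):
        certs = []
        for cert in tls.get("certificates") or []:
            cert = dict(cert)
            if "common_name" in cert:
                cert["common_name"] = clean_common_name(cert["common_name"])
            certs.append(cert)
        out["tls"] = {**tls, "certificates": certs}
    for key in ("evaluated_on", "sourced_on"):
        out[key] = parse_export_date(out.get(key))
    return out


def load_records(path):
    """Load a whole file (e.g. phishing.json) into memory and normalize it."""
    with open(path, encoding="utf-8") as handle:
        return [normalize_record(item) for item in json.load(handle)]
```

Because json.load reads the whole array into memory, benign.json with its 432,572 records may call for a streaming JSON parser instead. normalize_record works on a single record, so it can be reused unchanged in that case.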
Record (top-level fields)

| Field name | Field type | Nullable | Description |
| --- | --- | --- | --- |
| domain_name | String | No | The evaluated domain name |
| url | String | No | The source URL for the domain name |
| evaluated_on | Date | No | Date of the last collection attempt |
| source | String | No | An identifier of the source |
| sourced_on | Date | No | Date of ingestion of the domain name |
| dns | Object | Yes | Data from the DNS scan |
| rdap | Object | Yes | Data from RDAP or WHOIS |
| tls | Object | Yes | Data from the TLS handshake |
| ip_data | Array of Objects | Yes | Array of data objects capturing the IP addresses related to the domain name |

DNS data (dns field)

| Field name | Field type | Nullable | Description |
| --- | --- | --- | --- |
| A | Array of Strings | No | Array of IPv4 addresses |
| AAAA | Array of Strings | No | Array of IPv6 addresses |
| TXT | Array of Strings | No | Array of raw TXT values |
| CNAME | Object | No | The CNAME target and related IPs |
| MX | Array of Objects | No | Array of objects with the MX target hostname, priority and related IPs |
| NS | Array of Objects | No | Array of objects with the NS target hostname and related IPs |
| SOA | Object | No | All the SOA fields, present if found at the target domain name |
| zone_SOA | Object | No | The SOA fields of the target's zone (closest point of delegation), present if found and the SOA record is not located directly at the target domain name |
| dnssec | Object | No | Flags describing the DNSSEC validation result for each record type |
| ttls | Object | No | The TTL values for each record type |
| remarks | Object | No | The zone domain name and DNSSEC flags |

RDAP data (rdap field)

| Field name | Field type | Nullable | Description |
| --- | --- | --- | --- |
| copyright_notice | String | No | RDAP/WHOIS data usage copyright notice |
| dnssec | Bool | No | DNSSEC presence flag |
| entities | Object | No | An object with arrays for each related entity type found (e.g. abuse, admin, registrant). The arrays contain objects describing the individual entities. |
| expiration_date | Date | Yes | The current date of expiration |
| handle | String | No | RDAP handle |
| last_changed_date | Date | Yes | The date when the domain was last changed |
| name | String | No | The target domain name for which the data in this object are stored |
| nameservers | Array of Strings | No | Nameserver hostnames provided by RDAP or WHOIS |
| registration_date | Date | Yes | First registration date |
| status | Array of Strings | No | The state of the registered object [TODO] |
| terms_of_service_url | String | No | URL of the RDAP usage ToS |
| url | String | No | URL of the RDAP entity |
| whois_server | String | No | WHOIS server address |

TLS data (tls field)

| Field name | Field type | Nullable | Description |
| --- | --- | --- | --- |
| cipher | String | No | TLS cipher suite description according to [TODO] |
| protocol | String | No | One of "TLS", "TLSv1.2", "TLSv1.3" |
| certificates | Array of Objects | No | Array of objects representing the certificate chain; the first element is the root certificate |

IP data (elements in the ip_data array)

| Field name | Field type | Nullable | Description |
| --- | --- | --- | --- |
| ip | String | No | The IP address |
| from_record | String | No | The type of the DNS record the address was captured from |
| remarks | Object | No | Ping round-trip time, "is alive" flag and rdap/geo/asn evaluation dates |
| rdap | Object | Yes | RDAP data, similar to the domain RDAP data; see the JSON Schema for details |
| geo | Object | Yes | Geolocation data from the GeoLite2 City database (e.g. latitude, longitude, city, country) |
| asn | Object | Yes | Autonomous system data from the GeoLite2 ASN database (ASN, organization, network) |

Acknowledgements

We would like to thank the OpenPhish Team for granting permission to use and publish their dataset. We also thank VirusTotal for providing us access to the API for research purposes. The research has been supported by the Flow-based Encrypted Traffic Analysis project, no. VJ02010024, granted by the Ministry of the Interior of the Czech Republic, and the Smart Information Technology for a Resilient Society project, no. FIT-S-23-8209, granted by Brno University of Technology.
Created:
2024-06-25