Custom Dataset: Web / OSINT Data

Source: Databricks (indexed 2025-06-11)
Download link:
https://marketplace.databricks.com/details/572ee525-6212-4344-bccc-6552576e6a73/Hometree-Data-Inc-_Custom-Dataset:-Web-/-OSINT-Data
Resource overview:
**Overview**

Hometree uses its proprietary global, multimodal, multilingual internet context technology as the exclusive data source for all its datasets. Our internet mesh technology continuously streams and processes real-time global internet content, capturing the metadata listed in this catalog, including sources, timestamps, geoip, and comprehensive data provenance. This lineage tracking and methodological transparency ensure source quality and generate actionable metrics for decision-making.

**Use cases**

- **What clean data do you need?** Basic & Premium Data Fields
- **How do you want to explore the data?** Delivery Format & Frequency Solutions
- **What AI enrichment processing support do you require?** Data Overlays
- **What AI modeling support do you require?** Custom Services

**Product details**

Fields typically included in Hometree's datasets:

- authors - Author or authors of the text, as identified by the site. If there are multiple authors, names are listed in a single string with a space and a comma separating each name. Sites often omit authorship, so this field is often absent.
- date_added - Date when the text was added to the database. Uses combined date-time format.
- domain - Amplifying domains. In cases without amplification, the contents of this field are the same as the contents of "url_domain".
- story language - Two-letter ISO 639-1 code indicating the document's original language.
- text - Complete scraped text from the website.
- title - Original text found by crawling the story URL.
- title_translated - English language translation of the document's title field.
- Time Stamp - ts_date - Alternative date string format ("YYYYMMDD") for the date_added field.
- Time Stamp - ts_hour - Alternative format indicating the hour of the day (0-23) when the document was added.
- ip - IP address derived from the domain name.
- url - Full-length URL of the story.
- url_domain - Domain culled from the URL, representing the "source domain" of the document.
- entities - Container field for subfields related to organizations, people, and places identified by Named Entity Recognition (NER).
- entities.orgs - List of strings, each representing the name of one identified organization.
- entities.people - List of strings, each representing the name of one identified person.
- entities.places - List of strings, each representing the name of one identified place.
- geoip - Container field for geographic information derived from the IP address.
- geoip.city_name - Full-length city name derived from the IP address.
- geoip.continent_name - Full-length continent name derived from the IP address.
- geoip.country_iso_code - Two-letter ISO country identification code derived from the IP address.
- geoip.country_name - Full-length country name corresponding to the country ISO code derived from the IP address.
- geoip.location - Container field for latitude and longitude pairs derived from the IP address.
- geoip.location.lat - Latitude derived from the IP address.
- geoip.location.lon - Longitude derived from the IP address.
- geoip.region_iso_code - Two-letter ISO country code followed by a variable-length alphanumeric ISO region code derived from the IP address.
- geoip.region_name - Full-length region name corresponding to the regional ISO code derived from the IP address.

**Additional Insights**

This dataset contains fields that support NLP, network/graph analysis, and AI/ML workflows.
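To make the relationships between the fields described above concrete, here is a minimal Python sketch of how a few of them might be derived or checked on a single record. It is an illustration under stated assumptions, not Hometree's implementation: the sample record and all its values are invented, the helper names are hypothetical, date_added is assumed to be an ISO 8601 string, ts_hour is assumed to be an integer, and url_domain is assumed to be the URL's hostname without further normalization.

```python
from datetime import datetime
from urllib.parse import urlsplit

# Hypothetical record; every value is invented for illustration only.
record = {
    "authors": "Jane Doe, John Roe",
    "date_added": "2025-06-11T14:05:09",  # assumed ISO 8601 date-time
    "domain": "aggregator.example",
    "url": "https://news.example.org/world/story-123",
    "url_domain": "news.example.org",
    "entities": {"orgs": ["Example Corp"], "people": ["Jane Doe"], "places": ["Lisbon"]},
    "geoip": {"country_iso_code": "PT", "location": {"lat": 38.72, "lon": -9.14}},
}


def split_authors(authors):
    """Split the single authors string into a list; the field is often absent."""
    if not authors:
        return []
    return [name.strip() for name in authors.split(",") if name.strip()]


def derive_timestamps(date_added):
    """Return (ts_date, ts_hour): a "YYYYMMDD" string and the hour of day (0-23)."""
    dt = datetime.fromisoformat(date_added)
    return dt.strftime("%Y%m%d"), dt.hour


def source_domain(url):
    """Take the hostname of the story URL (assumed equivalent to url_domain)."""
    return urlsplit(url).hostname or ""


def is_amplified(rec):
    """A record counts as amplified when domain differs from url_domain."""
    return rec["domain"] != rec["url_domain"]


authors = split_authors(record.get("authors"))
ts_date, ts_hour = derive_timestamps(record["date_added"])
print(authors, ts_date, ts_hour, source_domain(record["url"]), is_amplified(record))
```

Reading authors with `record.get` reflects the note that the field is often absent. The amplification check relies only on the stated rule that domain equals url_domain when there is no amplification.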
Provider:
Hometree Data, Inc.