AI / LLM Training Data
Source: Databricks. Indexed 2025-06-26.
Download link:
https://marketplace.databricks.com/details/2c3e6d51-abd6-4ad0-88cb-21ce430fa7c0/Hometree-Data-Inc-_AI-/-LLM-Training-Data
Description:
**Overview**
As large language models (LLMs) become foundational to enterprise AI strategies, the quality of their training data is under increasing scrutiny. While vast quantities of web data are readily available, much of it is noisy, redundant, or unreliable, which poses risks to both model performance and output integrity. High-signal web data from Hometree Data (HTD) addresses this by delivering diverse, up-to-date, and context-rich content that can improve the quality and relevance of LLM training.
**Use cases**
We collect high-signal data from across the open web, including news sources, public forums, government reports, and industry blogs, and then filter and enrich it to ensure accuracy, credibility, and contextual clarity. Unlike raw web crawls, HTD’s curated data is structured and de-duplicated, with metadata and geospatial context that give LLMs valuable signals for understanding nuance, regional variation, and temporal trends. This enhances a model's ability to reason, synthesize insights, and produce grounded, real-world responses.
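As an illustration of how the geospatial and language metadata might be used on the consumer side, the sketch below selects records for one country and one source language. It assumes records arrive as Python dictionaries with a nested `geoip` object, which the listing does not specify; the field names `geoip.country_iso_code` and `story_language` come from the data dictionary in this listing, while `select_records` is a hypothetical helper.

```python
def select_records(records, country_iso_code, language):
    """Keep records geolocated to one country and written in one language.

    Assumes each record is a dict shaped like the data dictionary,
    with `geoip` as a nested dict (an assumption about delivery format).
    """
    selected = []
    for record in records:
        geoip = record.get("geoip") or {}  # geoip may be absent or null
        if (
            geoip.get("country_iso_code") == country_iso_code
            and record.get("story_language") == language
        ):
            selected.append(record)
    return selected
```

Because `geoip` is derived from an IP address, it describes where content is served from, which may differ from the region the content is about.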
For developers and researchers, integrating our curated data into LLM training pipelines means reducing hallucinations, expanding domain knowledge, and accelerating fine-tuning in high-stakes areas such as security, geopolitics, healthcare, and public safety. HTD also supports ongoing model refinement with fresh, real-time data that helps keep LLMs aligned with current events and emerging language patterns.
Hometree Data’s service isn’t just another data source; it’s a competitive advantage in building smarter, safer, and more situationally aware AI systems.
**Product details**
Hometree Data Dictionary, Core Data Elements:

| Field Name | Description |
| --- | --- |
| `summary` | A summary or brief overview of the content. |
| `summary_translated` | English translation of the summary, including summaries of video, radio, blog, or linked content. |
| `title` | The title or headline of the content. |
| `title_translated` | English translation of the title. |
| `full_text_translated` | English translation of all textual content from any media format. |
| `text` | Full body of the scraped content. |
| `url` | Full-length URL of the content. |
| `url_domain` | Domain extracted from the URL, representing the source domain. |
| `domain` | Publishing or amplifying domain. May be the same as `url_domain`. |
| `ip` | IP address associated with the content or domain. |
| `geoip` | Container for geolocation data derived from the IP address. |
| `geoip.city_name` | City name from IP geolocation. |
| `geoip.region_name` | Region name from IP geolocation. |
| `geoip.region_iso_code` | ISO region code (includes country + region) from IP. |
| `geoip.country_name` | Full country name from IP geolocation. |
| `geoip.country_iso_code` | ISO 2-letter country code from IP geolocation. |
| `geoip.continent_name` | Continent name from IP geolocation. |
| `geoip.location.lat` | Latitude coordinate from IP geolocation. |
| `geoip.location.lon` | Longitude coordinate from IP geolocation. |
| `story_language` | Two-letter ISO 639-1 language code of the original content. |
| `authors` | Author(s) of the content, as listed on the site. May be absent. |
| `entities` | Container field for named entities (people, organizations, places). |
| `entities.people` | List of identified people. |
| `entities.orgs` | List of identified organizations. |
| `entities.places` | List of identified places. |
| `date_added` | ISO 8601 timestamp when content was added to the dataset. |
| `ts_date` | Alternative date format for `date_added`, formatted as YYYYMMDD. |
| `ts_hour` | Hour (0–23) the document was added. |
| `radio_transcription` | Text transcription of audio radio content. |
| `radio_translation` | English translation of radio transcription. |
| `video_transcription` | Text transcription of video content. |
| `video_translation` | English translation of video transcription. |
| `translation_file_format` | Format of translation files (e.g., txt, html, mp3, mp4, etc.). |
| `active_processing` | Indicates whether parallel processing was used for real-time needs before final reporting. |
Custom feeds, analysis, and data elements are available on request from the Hometree team.
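To make the data dictionary concrete, here is a minimal sketch of turning one record into a training example. It assumes records are delivered as dictionaries (for example, parsed JSON), which the listing does not state. The helpers `training_text` and `partition_keys` are hypothetical names, and the sample values are invented for illustration; only the field names come from the dictionary.

```python
from datetime import datetime


def training_text(record: dict) -> str:
    """Prefer the English translations, falling back to the original text."""
    title = record.get("title_translated") or record.get("title") or ""
    body = record.get("full_text_translated") or record.get("text") or ""
    return f"{title}\n\n{body}".strip()


def partition_keys(record: dict) -> tuple:
    """Derive (YYYYMMDD, hour) from date_added, mirroring ts_date and ts_hour.

    Assumes date_added is an ISO 8601 string; a trailing "Z" is rewritten
    so that datetime.fromisoformat accepts it on older Python versions.
    """
    added = datetime.fromisoformat(record["date_added"].replace("Z", "+00:00"))
    return added.strftime("%Y%m%d"), added.hour


# Hypothetical sample record (values invented for illustration only).
sample_record = {
    "title": "Titre d'exemple",
    "title_translated": "Example title",
    "text": "Texte d'exemple.",
    "full_text_translated": "Example text.",
    "story_language": "fr",
    "date_added": "2025-06-26T14:05:00Z",
}

print(training_text(sample_record))
print(partition_keys(sample_record))
```

Falling back from the translated fields to the originals keeps records usable when a translation is missing, at the cost of mixing languages in the output; whether that is acceptable depends on the training setup.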
**Additional Insights**
This dataset contains fields that support NLP, network/graph analysis, and AI/ML workflows.
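As one example of the network/graph use mentioned above, the following sketch counts how often named entities co-occur in the same document, producing weighted edges that could feed a graph library. It assumes `entities` is a nested dictionary with `people`, `orgs`, and `places` lists of strings, as the data dictionary suggests; `cooccurrence_edges` is a hypothetical helper, not part of any Hometree tooling.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(records):
    """Count entity pairs that appear together in the same record.

    Returns a Counter mapping (entity_a, entity_b) -> number of records
    containing both, with each pair stored in sorted order.
    """
    edges = Counter()
    for record in records:
        entities = record.get("entities") or {}  # container may be missing
        names = set()
        for kind in ("people", "orgs", "places"):
            names.update(entities.get(kind) or [])
        # Sorting makes (a, b) and (b, a) collapse into a single edge key.
        for pair in combinations(sorted(names), 2):
            edges[pair] += 1
    return edges
```

This version merges people, organizations, and places into one namespace, so identically named entities of different types would be treated as the same node; keeping the type alongside the name would avoid that.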
Provider:
Hometree Data, Inc.



