
SAMPLE Company Data — Entity Graph Master - 529K Stable QIDs | 99.81% Valid Websites | ISO2 | ...

Databricks, indexed 2026-01-07
Download link:
https://marketplace.databricks.com/details/e73a8270-74df-4665-8e7d-f73e7528a0bd/QuantLens_SAMPLE-Company-Data-—-Entity-Graph-Master---529K-Stable-QIDs-99.81%-Valid-Websites-ISO2-
Resource description:
QuantLens Global Entity Graph v9.4 is a **high-integrity global entity dataset** designed for durable joins, enrichment pipelines, analytics, and AI workflows. Instead of relying on brittle scraped identifiers, every record is keyed by a stable **Wikidata QID** (`wikidata_qid`) and includes a resolvable `wikidata_url`, giving you a clean primary key for **entity resolution**, deduplication, and deterministic refresh. If you’ve struggled with duplicate orgs in a CRM, inconsistent naming across sources, or enrichment pipelines that drift, this pack is built to be the **foundation layer** that makes downstream systems easier to maintain.

This release contains **529,008 unique entities** across **140 countries** with **0 duplicate QIDs**, so you can safely treat `wikidata_qid` as a true primary key and join directly into Wikidata’s public knowledge base.

Web coverage is a core strength: **99.98%** of rows contain a website, and we provide `website_domain` plus a strict **`website_valid`** flag (99.81% valid) that excludes blacklisted sources (archives, Wikipedia, social pages) to support reliable domain-based enrichment and discovery workflows.

This is an **entity graph** (organizations + institutions + commercial entities), not just corporations. For “company-only” use cases, filter: `entity_category = "Commercial"` (22.0% / 116,454 entities). Full type granularity is included via `entity_type_qid` and `entity_type_label_en` (10,527 unique P31 types).
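The primary-key claim and the company-only filter described above can be checked with a few lines of Python. This is a minimal sketch, not the vendor's tooling. The column names come from the listing, but the sample rows are made-up placeholders (the QIDs, names, domains, and the `Education` category value are invented for illustration), and it assumes `website_valid` is stored as `1`/`0` in the CSV.

```python
import csv
import io
from collections import Counter

# Hypothetical sample rows: QIDs, names, and domains are invented placeholders,
# not records from the dataset. Only the column names come from the listing.
SAMPLE_CSV = """wikidata_qid,company_name,entity_category,website_domain,website_valid
Q900000001,Example Widgets Ltd,Commercial,example.com,1
Q900000002,Example University,Education,example.edu,1
Q900000003,Example Archive Co,Commercial,web.archive.org,0
"""


def load_rows(text):
    """Parse CSV text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


def duplicate_qids(rows):
    """Return QIDs that occur more than once; an empty list means the key is unique."""
    counts = Counter(row["wikidata_qid"] for row in rows)
    return sorted(qid for qid, n in counts.items() if n > 1)


def commercial_with_valid_site(rows):
    """Keep company-only rows whose website passed the validity flag.

    Assumption: the flag is serialized as the string "1" when valid.
    """
    return [
        row for row in rows
        if row["entity_category"] == "Commercial" and row["website_valid"] == "1"
    ]


rows = load_rows(SAMPLE_CSV)
dupes = duplicate_qids(rows)
companies = commercial_with_valid_site(rows)

print("duplicate QIDs:", dupes)
print("commercial rows with valid sites:", [r["wikidata_qid"] for r in companies])
```

Running the uniqueness check before any join is cheap insurance: if `duplicate_qids` ever returns a non-empty list after a refresh, the join key assumption no longer holds for that build.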

### What you get (68-column schema, engineered for usability)

* **Stable identity & joins:** `wikidata_qid`, `wikidata_url`, `company_name`, `source`, `build_id`, `snapshot_date`
* **Geography:** `country`, `country_code` (ISO-3166-1 alpha-2 where applicable)
* **Web presence:** `website`, `website_domain`, `website_valid`, `website_domain_shared_count`
* **Email (honest provenance split):**
  * `email_raw` (Wikidata P968, sourced)
  * `email_inferred` (heuristic `info@{domain}`, unvalidated)
  * `email_source` (per-row provenance)
* **Quality & modeling helpers:** `data_completeness_score` (0–100), `identifier_density`, `has_public_financials`
* **Classification:** `entity_category`, plus raw `entity_type_*` for custom filtering
* **Collision transparency:** `name_country_collision_flag` to support buyer-side dedupe review

### Coverage highlights

* **Primary key integrity:** 529,008 rows, **0 duplicate QIDs**
* **Websites:** 528,888 / 529,008 (**99.98%**) non-empty
  * `website_valid=1`: 528,014 (**99.81%**)
* **Countries:** `country` present for 528,927 (**99.98%**) (81 blanks)
* **Country codes:** `country_code` present for 528,926 (**99.98%**)
  * 1 historical entity (Austria–Hungary) has no modern ISO2
* **Email provenance (not a contact list):**
  * `email_raw`: 30,696 (**5.8%**), sourced from Wikidata statements (not SMTP/MX validated)
  * `email_inferred`: 497,262 (**94.0%**), heuristic and unvalidated
  * No email: 1,050 (**0.2%**), missing or blacklisted website

### Deduplication & shared-domain reality (transparent by design)

* `name_country_collision_flag=1`: 14,480 rows (**2.74%**), not necessarily duplicates, just shared names
* `website_domain_shared_count > 1`: 114,258 rows (**21.6%**)
  * max shared-domain cluster: **2,977** (directory/collection-like domains exist; buyers can filter by threshold)

### Best-fit use cases

1. **Entity matching & deduplication:** reconcile messy org records across CRM/warehouse using stable QIDs + collision flags
2. **Enrichment seeding:** use validated websites/domains to drive deterministic enrichment and refresh pipelines
3. **Market mapping & segmentation:** slice by geography + entity category/type to build ICPs and coverage maps
4. **Knowledge graphs & AI:** load as a node table keyed by QID; use type labels + quality fields for retrieval, embeddings, and matchers

### Deliverables

* CSV (193.2 MB, UTF-8 BOM), **Parquet/Snappy (51.3 MB)**, Sample CSV (10K rows)
* QA report + SHA-256 manifest included for procurement/engineering verification
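
Since the CSV ships with a UTF-8 BOM and a SHA-256 manifest, a buyer-side intake step might look like the sketch below. It is illustrative only: the manifest's file format is not described in the listing, so the expected digest is computed locally as a stand-in. The demo file, the placeholder QIDs, and the `MAX_SHARED` threshold are assumptions made for the example.

```python
import csv
import hashlib
import os
import tempfile

MAX_SHARED = 1  # assumed threshold: keep only domains used by a single entity


def sha256_of_file(path, chunk_size=1 << 20):
    """Stream the file in chunks so large deliverables need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def read_bom_csv(path):
    """Read a CSV with a UTF-8 BOM; "utf-8-sig" strips the BOM from the first header."""
    with open(path, newline="", encoding="utf-8-sig") as fh:
        return list(csv.DictReader(fh))


# Throwaway demo file standing in for the real download (placeholder QIDs).
payload = (
    "\ufeffwikidata_qid,website_domain_shared_count\n"
    "Q900000001,1\n"
    "Q900000002,3\n"
).encode("utf-8")

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.csv")
    with open(path, "wb") as fh:
        fh.write(payload)

    # In practice this value would be read from the vendor's manifest.
    expected = hashlib.sha256(payload).hexdigest()
    checksum_ok = sha256_of_file(path) == expected
    demo_rows = read_bom_csv(path)

first_column = list(demo_rows[0].keys())[0]
unshared = [
    row for row in demo_rows
    if int(row["website_domain_shared_count"]) <= MAX_SHARED
]

print("checksum ok:", checksum_ok)
print("first column:", first_column)
print("rows on unshared domains:", [r["wikidata_qid"] for r in unshared])
```

Reading with plain `utf-8` instead of `utf-8-sig` would leave the BOM glued to the first header name, so lookups on `wikidata_qid` would fail with a `KeyError`. The shared-domain threshold is a tuning choice: a higher value keeps more rows but lets directory-style domains back in.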
Provider:
QuantLens