SAMPLE Company Data — Entity Graph Master - 529K Stable QIDs | 99.81% Valid Websites | ISO2 | ...
Source: Databricks (indexed 2026-01-07)
Download link:
https://marketplace.databricks.com/details/e73a8270-74df-4665-8e7d-f73e7528a0bd/QuantLens_SAMPLE-Company-Data-—-Entity-Graph-Master---529K-Stable-QIDs-99.81%-Valid-Websites-ISO2-
Description:
QuantLens Global Entity Graph v9.4 is a **high-integrity global entity dataset** designed for durable joins, enrichment pipelines, analytics, and AI workflows. Instead of relying on brittle scraped identifiers, every record is keyed by a stable **Wikidata QID** (`wikidata_qid`) and includes a resolvable `wikidata_url`, giving you a clean primary key for **entity resolution**, deduplication, and deterministic refresh. If you’ve struggled with duplicate orgs in a CRM, inconsistent naming across sources, or enrichment pipelines that drift, this pack is built to be the **foundation layer** that makes downstream systems easier to maintain.
This release contains **529,008 unique entities** across **140 countries** with **0 duplicate QIDs**, so you can safely treat `wikidata_qid` as a true primary key and join directly into Wikidata’s public knowledge base. Web coverage is a core strength: **99.98%** of rows contain a website, and we provide `website_domain` plus a strict **`website_valid`** flag (99.81% valid) that excludes blacklisted sources (archives, Wikipedia, social pages) to support reliable domain-based enrichment and discovery workflows.
This is an **entity graph** (organizations + institutions + commercial entities), not just corporations. For “company-only” use cases, filter:
`entity_category = "Commercial"` (22.0% / 116,454 entities).
Full type granularity is included via `entity_type_qid` and `entity_type_label_en` (10,527 unique P31 types).
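A minimal sketch of the company-only filter, using Python's standard `csv` module. The sample rows, QIDs, names, and the `Education` category value are placeholders made up for illustration; only the column names and the `Commercial` value come from the listing.

```python
import csv
import io

# Illustrative rows only: QIDs, names, and the "Education" value are placeholders.
SAMPLE_CSV = """wikidata_qid,company_name,entity_category,entity_type_label_en
Q900000001,Example Widgets Ltd,Commercial,business
Q900000002,Example University,Education,university
Q900000003,Example Holdings,Commercial,holding company
"""

def commercial_only(rows):
    """Keep rows whose entity_category is exactly 'Commercial'."""
    return [row for row in rows if row["entity_category"] == "Commercial"]

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
companies = commercial_only(rows)
print([row["wikidata_qid"] for row in companies])
```

For finer control, the same pattern could filter on `entity_type_qid` or `entity_type_label_en` instead of the coarse category.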
### What you get (68-column schema, engineered for usability)
* **Stable identity & joins:** `wikidata_qid`, `wikidata_url`, `company_name`, `source`, `build_id`, `snapshot_date`
* **Geography:** `country`, `country_code` (ISO-3166-1 alpha-2 where applicable)
* **Web presence:** `website`, `website_domain`, `website_valid`, `website_domain_shared_count`
* **Email (honest provenance split):**
  * `email_raw` (Wikidata P968, sourced)
  * `email_inferred` (heuristic `info@{domain}`, unvalidated)
  * `email_source` (per-row provenance)
* **Quality & modeling helpers:** `data_completeness_score` (0–100), `identifier_density`, `has_public_financials`
* **Classification:** `entity_category`, plus raw `entity_type_*` for custom filtering
* **Collision transparency:** `name_country_collision_flag` to support buyer-side dedupe review
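Because `wikidata_qid` is described as a duplicate-free primary key, a join can be sketched as a plain dictionary lookup. The example below first checks for duplicate QIDs, then attaches entity rows to a hypothetical CRM export. The `crm_id` field, the CRM records, and all QIDs and names are assumptions for illustration, not part of the dataset.

```python
from collections import Counter

# Illustrative entity rows; QIDs and names are placeholders.
entities = [
    {"wikidata_qid": "Q900000001", "company_name": "Example Widgets Ltd", "country_code": "GB"},
    {"wikidata_qid": "Q900000002", "company_name": "Example University", "country_code": "US"},
]

# Hypothetical CRM records that already carry a QID.
crm_records = [
    {"crm_id": "A-1", "wikidata_qid": "Q900000001"},
    {"crm_id": "A-2", "wikidata_qid": "Q900000404"},  # no match in the entity table
]

def duplicate_qids(rows):
    """Return the QIDs that appear more than once (expected to be empty)."""
    counts = Counter(row["wikidata_qid"] for row in rows)
    return sorted(qid for qid, n in counts.items() if n > 1)

def left_join_on_qid(left, right):
    """Attach the matching entity row (or None) to each left-hand record."""
    index = {row["wikidata_qid"]: row for row in right}
    return [{**rec, "entity": index.get(rec["wikidata_qid"])} for rec in left]

assert duplicate_qids(entities) == []  # safe to index by QID
joined = left_join_on_qid(crm_records, entities)
for rec in joined:
    name = rec["entity"]["company_name"] if rec["entity"] else None
    print(rec["crm_id"], name)
```

Running the duplicate check before building the index matters: a dictionary silently keeps only the last row for a repeated key, so a failed check is the signal to stop rather than join.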
### Coverage highlights
* **Primary key integrity:** 529,008 rows, **0 duplicate QIDs**
* **Websites:** 528,888 / 529,008 (**99.98%**) non-empty
  * `website_valid=1`: 528,014 (**99.81%**)
* **Countries:** `country` present for 528,927 (**99.98%**) (81 blanks)
* **Country codes:** `country_code` present for 528,926 (**99.98%**)
  * 1 historical entity (Austria–Hungary) has no modern ISO2
* **Email provenance (not a contact list):**
  * `email_raw`: 30,696 (**5.8%**), sourced from Wikidata statements (not SMTP/MX validated)
  * `email_inferred`: 497,262 (**94.0%**), heuristic and unvalidated
  * No email: 1,050 (**0.2%**), missing or blacklisted website
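The provenance split above suggests preferring a sourced address and falling back to the inferred one only when needed. The sketch below does that while keeping the provenance attached. It assumes missing values arrive as empty strings, and the labels `sourced` and `inferred` are this example's own, since the listing does not document the actual values of `email_source`.

```python
def pick_email(row):
    """Return (address, provenance), preferring sourced over inferred email.

    The provenance labels are this sketch's own; the dataset's
    email_source values are not documented in the listing.
    """
    raw = (row.get("email_raw") or "").strip()
    if raw:
        return raw, "sourced"
    inferred = (row.get("email_inferred") or "").strip()
    if inferred:
        return inferred, "inferred"
    return None, None

# Illustrative rows; the addresses are placeholders.
rows = [
    {"email_raw": "contact@example.org", "email_inferred": ""},
    {"email_raw": "", "email_inferred": "info@example.com"},
    {"email_raw": "", "email_inferred": ""},
]
picked = [pick_email(row) for row in rows]
print(picked)
```

Neither kind of address is deliverability-checked according to the listing, so downstream systems would likely want their own validation step before sending anything.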
### Deduplication & shared-domain reality (transparent by design)
* `name_country_collision_flag=1`: 14,480 rows (**2.74%**) — not necessarily duplicates, just shared names
* `website_domain_shared_count > 1`: 114,258 rows (**21.6%**)
  * max shared-domain cluster: **2,977** (directory/collection-like domains exist; buyers can filter by threshold)
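One way to apply the threshold filtering mentioned above is to keep only rows with `website_valid=1` whose domain is shared by at most a chosen number of rows. This is a sketch, not a recommended cutoff: the `max_shared` parameter and the sample rows are assumptions, and values are treated as strings as they would arrive from a CSV reader.

```python
def enrichment_candidates(rows, max_shared=1):
    """Keep rows with a valid website whose domain is shared by at most max_shared rows."""
    kept = []
    for row in rows:
        valid = str(row["website_valid"]) == "1"
        # Treat an empty count as 0 so rows without a website do not raise.
        shared = int(row["website_domain_shared_count"] or 0)
        if valid and shared <= max_shared:
            kept.append(row)
    return kept

# Illustrative rows; domains and QIDs are placeholders.
rows = [
    {"wikidata_qid": "Q900000001", "website_domain": "example.com",
     "website_valid": "1", "website_domain_shared_count": "1"},
    {"wikidata_qid": "Q900000002", "website_domain": "directory.example",
     "website_valid": "1", "website_domain_shared_count": "2977"},
    {"wikidata_qid": "Q900000003", "website_domain": "",
     "website_valid": "0", "website_domain_shared_count": ""},
]
strict = enrichment_candidates(rows)                  # unique domains only
loose = enrichment_candidates(rows, max_shared=5000)  # tolerate large clusters
print(len(strict), len(loose))
```

A strict threshold of 1 suits domain-keyed enrichment, where one domain should map to one entity; a looser one keeps subsidiaries or departments that legitimately share a parent site.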
### Best-fit use cases
1. **Entity matching & deduplication:** reconcile messy org records across CRM/warehouse using stable QIDs + collision flags
2. **Enrichment seeding:** use validated websites/domains to drive deterministic enrichment and refresh pipelines
3. **Market mapping & segmentation:** slice by geography + entity category/type to build ICPs and coverage maps
4. **Knowledge graphs & AI:** load as a node table keyed by QID; use type labels + quality fields for retrieval, embeddings, and matchers
### Deliverables
* CSV (193.2 MB, UTF-8 BOM), **Parquet/Snappy (51.3 MB)**, Sample CSV (10K rows)
* QA report + SHA-256 manifest included for procurement/engineering verification
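The SHA-256 manifest can be checked with a few lines of standard-library Python. The listing does not describe the manifest layout, so this sketch assumes `sha256sum`-style lines (`<hex digest>  <filename>`); the file name `sample.csv` and its contents are stand-ins created only for the demo.

```python
import hashlib
import os
import tempfile

def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large deliverables need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def parse_manifest(text):
    """Parse '<hex digest>  <filename>' lines (assumed sha256sum-style layout)."""
    entries = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        digest, name = line.split(maxsplit=1)
        entries[name.lstrip("*")] = digest.lower()
    return entries

def verify(directory, manifest_text):
    """Return {filename: True/False} for every file listed in the manifest."""
    return {
        name: sha256_of_file(os.path.join(directory, name)) == expected
        for name, expected in parse_manifest(manifest_text).items()
    }

# Demo with a throwaway file standing in for a real deliverable.
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "sample.csv"), "wb") as fh:
        fh.write(b"wikidata_qid\n")
    manifest = hashlib.sha256(b"wikidata_qid\n").hexdigest() + "  sample.csv\n"
    result = verify(tmp, manifest)
print(result)
```

If the shipped manifest uses a different layout (for example JSON), only `parse_manifest` would need to change.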
Provider:
QuantLens



