Corp-o-Rate-Community/entity-references
Source: Hugging Face · Updated 2026-03-09 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/Corp-o-Rate-Community/entity-references
---
language:
- en
license: apache-2.0
task_categories:
- text-classification
- feature-extraction
tags:
- entity-linking
- named-entity-recognition
- knowledge-base
- organizations
- people
- sqlite
- vector-search
- embeddings
size_categories:
- 1M<n<10M
pretty_name: Entity References Database
configs:
- config_name: full
  description: Full database with complete source metadata
- config_name: lite
  description: Core fields + embeddings only (recommended)
---
# Entity References Database
A comprehensive entity database for organizations, people, roles, and locations with 768-dimensional embeddings for semantic matching. Built from authoritative sources (GLEIF, SEC, Companies House, Wikidata) for entity linking and named entity disambiguation.
## Dataset Description
- **Repository:** [Corp-o-Rate-Community/entity-references](https://huggingface.co/datasets/Corp-o-Rate-Community/entity-references)
- **Paper:** N/A
- **Point of Contact:** Corp-o-Rate-Community
### Dataset Summary
This dataset provides fast lookup and qualification of named entities using vector similarity search. It stores records from authoritative global sources with embeddings generated by `google/embeddinggemma-300m` (768 dimensions).
**Key Features:**
- **8M+ organization records** from GLEIF, SEC Edgar, Companies House, and Wikidata
- **Notable people** including executives, politicians, athletes, artists, and more
- **Roles and locations** with hierarchical relationships
- **Vector embeddings** for semantic similarity search
- **Canonical linking** across sources (same entity from multiple sources linked)
### Supported Tasks
- **Entity Linking**: Match extracted entity mentions to canonical database records
- **Named Entity Disambiguation**: Distinguish between entities with similar names
- **Knowledge Base Population**: Enrich extracted entities with identifiers and metadata
### Languages
English (en)
## Dataset Structure
### Schema (v2 - Normalized)
The database uses SQLite with the [sqlite-vec](https://github.com/asg017/sqlite-vec) extension for vector similarity search.
#### Organizations Table
| Column | Type | Description |
|--------|------|-------------|
| `id` | INTEGER | Primary key |
| `qid` | INTEGER | Wikidata QID as integer (e.g., 2283 for Q2283) |
| `name` | TEXT | Organization name |
| `name_normalized` | TEXT | Lowercased, normalized name |
| `source_id` | INTEGER FK | Reference to source_types |
| `source_identifier` | TEXT | LEI, CIK, Company Number, etc. |
| `region_id` | INTEGER FK | Reference to locations |
| `entity_type_id` | INTEGER FK | Reference to organization_types |
| `from_date` | TEXT | Founding/registration date (ISO format) |
| `to_date` | TEXT | Dissolution date (ISO format) |
| `canon_id` | INTEGER | ID of canonical record |
| `canon_size` | INTEGER | Size of canonical group |
| `record` | JSON | Full source record (omitted in lite) |
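
The `qid` and `name_normalized` columns can be illustrated with a small sketch. The QID conversion follows the table's own example (Q2283 becomes 2283). The normalization rules shown here (NFKC, case folding, punctuation removal, whitespace collapsing) are an assumption for illustration only; the card does not specify the exact procedure, and all three function names are hypothetical.

```python
import re
import unicodedata


def qid_to_int(qid: str) -> int:
    """Convert a Wikidata QID string such as "Q2283" to its integer form."""
    if not re.fullmatch(r"[Qq]\d+", qid):
        raise ValueError(f"not a Wikidata QID: {qid!r}")
    return int(qid[1:])


def int_to_qid(value: int) -> str:
    """Convert the stored integer back to the usual "Q..." notation."""
    return f"Q{value}"


def normalize_name(name: str) -> str:
    """Assumed normalization: NFKC, casefold, drop punctuation, collapse spaces."""
    text = unicodedata.normalize("NFKC", name).casefold()
    text = re.sub(r"[^\w\s]", " ", text)      # punctuation becomes a space
    return re.sub(r"\s+", " ", text).strip()  # collapse runs of whitespace


print(qid_to_int("Q2283"), int_to_qid(2283), normalize_name("  Microsoft   Corp. "))
```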
#### People Table
| Column | Type | Description |
|--------|------|-------------|
| `id` | INTEGER | Primary key |
| `qid` | INTEGER | Wikidata QID as integer |
| `name` | TEXT | Display name |
| `name_normalized` | TEXT | Lowercased, normalized name |
| `source_id` | INTEGER FK | Reference to source_types |
| `source_identifier` | TEXT | QID, Owner CIK, Person number |
| `country_id` | INTEGER FK | Reference to locations |
| `person_type_id` | INTEGER FK | Reference to people_types |
| `known_for_role_id` | INTEGER FK | Reference to roles |
| `known_for_org` | TEXT | Organization name |
| `known_for_org_id` | INTEGER FK | Reference to organizations |
| `from_date` | TEXT | Role start date (ISO format) |
| `to_date` | TEXT | Role end date (ISO format) |
| `birth_date` | TEXT | Date of birth (ISO format) |
| `death_date` | TEXT | Date of death (ISO format) |
| `record` | JSON | Full source record (omitted in lite) |
#### Roles Table
| Column | Type | Description |
|--------|------|-------------|
| `id` | INTEGER | Primary key |
| `qid` | INTEGER | Wikidata QID (e.g., 484876 for CEO Q484876) |
| `name` | TEXT | Role name (e.g., "Chief Executive Officer") |
| `name_normalized` | TEXT | Normalized name |
| `source_id` | INTEGER FK | Reference to source_types |
| `canon_id` | INTEGER | ID of canonical role |
#### Locations Table
| Column | Type | Description |
|--------|------|-------------|
| `id` | INTEGER | Primary key |
| `qid` | INTEGER | Wikidata QID (e.g., 30 for USA Q30) |
| `name` | TEXT | Location name |
| `name_normalized` | TEXT | Normalized name |
| `source_id` | INTEGER FK | Reference to source_types |
| `source_identifier` | TEXT | ISO code (e.g., "US", "CA") |
| `parent_ids` | TEXT JSON | Parent location IDs in hierarchy |
| `location_type_id` | INTEGER FK | Reference to location_types |
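
As a rough illustration of how `parent_ids` could be used, the sketch below walks the hierarchy upward in a toy in-memory table. It assumes `parent_ids` is a JSON array holding the direct parents' `id` values, which the card does not state explicitly; the sample rows and the `ancestors` helper are hypothetical.

```python
import json
import sqlite3

# Toy table with only the columns needed for the walk (assumed layout).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE locations (id INTEGER PRIMARY KEY, name TEXT, parent_ids TEXT)"
)
conn.executemany(
    "INSERT INTO locations VALUES (?, ?, ?)",
    [
        (1, "North America", "[]"),
        (2, "United States", "[1]"),
        (3, "California", "[2]"),
    ],
)


def ancestors(conn: sqlite3.Connection, location_id: int) -> list[str]:
    """Walk parent_ids upward and return ancestor names, nearest first."""
    names: list[str] = []
    seen = {location_id}
    queue = [location_id]
    while queue:
        current = queue.pop(0)
        row = conn.execute(
            "SELECT parent_ids FROM locations WHERE id = ?", (current,)
        ).fetchone()
        if row is None:
            continue
        for parent_id in json.loads(row[0] or "[]"):
            if parent_id in seen:
                continue  # guard against cycles and repeated parents
            seen.add(parent_id)
            parent = conn.execute(
                "SELECT name FROM locations WHERE id = ?", (parent_id,)
            ).fetchone()
            if parent is not None:
                names.append(parent[0])
                queue.append(parent_id)
    return names


print(ancestors(conn, 3))
```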
#### Embedding Tables (sqlite-vec)
| Table | Columns |
|-------|---------|
| `organization_embeddings` | org_id INTEGER, embedding FLOAT[768] |
| `organization_embeddings_scalar` | org_id INTEGER, embedding INT8[768] |
| `person_embeddings` | person_id INTEGER, embedding FLOAT[768] |
| `person_embeddings_scalar` | person_id INTEGER, embedding INT8[768] |
**Scalar (int8) embeddings** store 1 byte per dimension instead of the 4 bytes of a float32 value, a 75% storage reduction, with ~92% recall at top-100.
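
The storage figure can be checked with a short sketch that packs one 768-dimensional vector both ways. The quantization scheme used here (clip to [-1, 1], scale by 127, round) is an assumption chosen for illustration; the card does not document how the scalar embeddings are actually produced, and the helper names are hypothetical.

```python
import struct

DIMENSIONS = 768


def quantize_int8(vector: list[float]) -> bytes:
    """Map floats in [-1, 1] to int8 values in [-127, 127] (assumed scheme)."""
    clipped = (max(-1.0, min(1.0, x)) for x in vector)
    return struct.pack(f"{len(vector)}b", *(round(x * 127) for x in clipped))


def dequantize_int8(blob: bytes) -> list[float]:
    """Approximate inverse of quantize_int8."""
    return [value / 127 for value in struct.unpack(f"{len(blob)}b", blob)]


vector = [((i % 7) - 3) / 10 for i in range(DIMENSIONS)]  # toy values in [-0.3, 0.3]
float_blob = struct.pack(f"{DIMENSIONS}f", *vector)       # float32: 4 bytes per dimension
int8_blob = quantize_int8(vector)                         # int8: 1 byte per dimension

saving = 1 - len(int8_blob) / len(float_blob)
print(len(float_blob), len(int8_blob), f"{saving:.0%}")
```

The rounding error per dimension is at most half a quantization step, which is why recall drops only slightly when the int8 table is used for a first-pass search.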
#### Enum Lookup Tables
| Table | Values |
|-------|--------|
| `source_types` | gleif, sec_edgar, companies_house, wikidata |
| `people_types` | executive, politician, government, military, legal, professional, academic, artist, media, athlete, entrepreneur, journalist, activist, scientist, unknown |
| `organization_types` | business, fund, branch, nonprofit, ngo, foundation, government, international_org, political_party, trade_union, educational, research, healthcare, media, sports, religious, unknown |
| `simplified_location_types` | continent, country, subdivision, city, district, other |
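
Because the main tables store only integer foreign keys, readable values come from joining against the lookup tables. The sketch below shows the idea on a toy in-memory database. It assumes each lookup table has `id` and `name` columns, which the card does not confirm, and the sample organizations and identifiers are invented placeholders.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Assumed lookup-table layout: (id, name)
    CREATE TABLE source_types (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE organization_types (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE organizations (
        id INTEGER PRIMARY KEY,
        name TEXT,
        source_id INTEGER REFERENCES source_types(id),
        source_identifier TEXT,
        entity_type_id INTEGER REFERENCES organization_types(id)
    );
    INSERT INTO source_types VALUES (1, 'gleif'), (2, 'sec_edgar');
    INSERT INTO organization_types VALUES (1, 'business'), (2, 'fund');
    INSERT INTO organizations VALUES
        (1, 'Example Holdings Ltd', 1, 'EXAMPLELEI0000000001', 1),
        (2, 'Example Growth Fund', 2, '0000000002', 2);
    """
)

# Resolve the integer foreign keys to their enum names.
rows = conn.execute(
    """
    SELECT o.name, s.name AS source, t.name AS entity_type
    FROM organizations AS o
    JOIN source_types AS s ON s.id = o.source_id
    JOIN organization_types AS t ON t.id = o.entity_type_id
    ORDER BY o.id
    """
).fetchall()

for name, source, entity_type in rows:
    print(f"{name}: source={source}, type={entity_type}")
```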
### Data Splits
| Config | File | Size | Contents |
|--------|------|------|----------|
| `lite` | `entities-lite.db` | ~50 GB | Core fields + embeddings only |
| `full` | `entities.db` | ~74 GB | Full records with source metadata |
The lite version is recommended for most use cases.
## Dataset Creation
### Source Data
#### Organizations
| Source | Records | Identifier | Coverage |
|--------|---------|------------|----------|
| [GLEIF](https://www.gleif.org/) | ~3.2M | LEI (Legal Entity Identifier) | Global companies with LEI |
| [SEC Edgar](https://www.sec.gov/) | ~100K+ | CIK (Central Index Key) | All SEC filers |
| [Companies House](https://www.gov.uk/government/organisations/companies-house) | ~5M | Company Number | UK registered companies |
| [Wikidata](https://www.wikidata.org/) | Variable | QID | Notable companies worldwide |
#### People
| Source | Records | Identifier | Coverage |
|--------|---------|------------|----------|
| [Wikidata](https://www.wikidata.org/) | Variable | QID | Notable people with English Wikipedia |
| [SEC Form 4](https://www.sec.gov/) | ~280K/year | Owner CIK | US public company insiders |
| [Companies House](https://www.gov.uk/government/organisations/companies-house) | ~15M+ | Person number | UK company officers |
### Embedding Model
| Property | Value |
|----------|-------|
| Model | `google/embeddinggemma-300m` |
| Dimensions | 768 |
| Framework | sentence-transformers |
| Size | ~300M parameters |
### Canonicalization
Records are linked across sources based on:
**Organizations:**
1. Same LEI (globally unique)
2. Same ticker symbol
3. Same CIK
4. Same normalized name + region
**People:**
1. Same Wikidata QID
2. Same normalized name + same organization
3. Same normalized name + overlapping date ranges
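
The third rule for people depends on comparing `from_date`/`to_date` ranges. A minimal sketch of such a check follows. It assumes full `YYYY-MM-DD` strings and treats a missing bound as open-ended; both are assumptions, and `ranges_overlap` is a hypothetical helper rather than part of the package.

```python
from datetime import date
from typing import Optional


def parse_iso(value: Optional[str], default: date) -> date:
    """Parse a full YYYY-MM-DD string; fall back to `default` when missing."""
    return date.fromisoformat(value) if value else default


def ranges_overlap(
    from_a: Optional[str], to_a: Optional[str],
    from_b: Optional[str], to_b: Optional[str],
) -> bool:
    """True when two [from, to] ranges share at least one day.

    A missing bound is treated as open-ended (an assumption).
    """
    start_a, end_a = parse_iso(from_a, date.min), parse_iso(to_a, date.max)
    start_b, end_b = parse_iso(from_b, date.min), parse_iso(to_b, date.max)
    return start_a <= end_b and start_b <= end_a


print(ranges_overlap("2011-08-24", None, "2015-01-01", "2016-12-31"))  # True
print(ranges_overlap("2000-01-01", "2005-12-31", "2010-01-01", None))  # False
```
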
**Source priority:** gleif > sec_edgar > companies_house > wikidata
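
One way to read the source priority is as a tie-breaker that decides which record in a linked group becomes the canonical one. The sketch below is an illustration under that assumption, not the package's implementation. The record dictionaries, the `pick_canonical` helper, and the choice to break ties by lowest `id` are all hypothetical.

```python
SOURCE_PRIORITY = ["gleif", "sec_edgar", "companies_house", "wikidata"]
RANK = {source: position for position, source in enumerate(SOURCE_PRIORITY)}


def pick_canonical(group: list[dict]) -> dict:
    """Return the record from the highest-priority source; ties go to the lowest id."""
    return min(group, key=lambda record: (RANK[record["source"]], record["id"]))


# A group of records already judged to describe the same entity.
group = [
    {"id": 12, "source": "wikidata", "name": "Example Corp"},
    {"id": 7, "source": "sec_edgar", "name": "EXAMPLE CORP"},
    {"id": 31, "source": "companies_house", "name": "Example Corp Ltd"},
]
canonical = pick_canonical(group)

# Every member points at the canonical record and carries the group size.
linked = [
    {**record, "canon_id": canonical["id"], "canon_size": len(group)}
    for record in group
]
print(canonical["id"], [record["canon_id"] for record in linked])
```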
## Usage
### Installation
```bash
pip install corp-extractor
```
### Download
```bash
# Download lite version (recommended)
corp-extractor db download
# Download full version
corp-extractor db download --full
```
**Storage location:** `~/.cache/corp-extractor/entities-v2.db`
### Search
```bash
# Search organizations
corp-extractor db search "Microsoft"
# Search people
corp-extractor db search-people "Tim Cook"
# Search roles
corp-extractor db search-roles "CEO"
# Search locations
corp-extractor db search-locations "California"
# Check database status
corp-extractor db status
```
### Python API
```python
from statement_extractor.database import OrganizationDatabase, PersonDatabase

# Search organizations
org_db = OrganizationDatabase()
matches = org_db.search_by_name("Microsoft Corporation", top_k=5)
for match in matches:
    print(f"{match.company.name} ({match.company.source}:{match.company.source_id})")
    print(f" Similarity: {match.similarity_score:.3f}")

# Search people
person_db = PersonDatabase()
matches = person_db.search_by_name("Tim Cook", top_k=5)
for match in matches:
    print(f"{match.person.name} - {match.person.known_for_role} at {match.person.known_for_org}")
```
### Use in Pipeline
```python
from statement_extractor.pipeline import ExtractionPipeline

pipeline = ExtractionPipeline()
ctx = pipeline.process("Microsoft CEO Satya Nadella announced new AI features.")
for stmt in ctx.labeled_statements:
    print(f"{stmt.subject_fqn} --[{stmt.statement.predicate}]--> {stmt.object_fqn}")
```
## Technical Details
### Vector Search Performance
| Database Size | Search Time | Memory |
|---------------|-------------|--------|
| 100K records | ~50ms | ~500MB |
| 1M records | ~200ms | ~3GB |
| 8M records | ~500ms | ~20GB |
### Similarity Thresholds
| Score | Interpretation |
|-------|----------------|
| > 0.85 | Strong match (likely same entity) |
| 0.70 - 0.85 | Good match (probable same entity) |
| 0.55 - 0.70 | Moderate match (may need verification) |
| < 0.55 | Weak match (likely different entity) |
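
The bands in the table above translate directly into a small helper. This is a sketch: the function name is hypothetical, and the handling of the exact boundary values (0.85 counted as "good", 0.70 as "good", 0.55 as "moderate") is an assumption, since the table leaves the edges ambiguous.

```python
def interpret_similarity(score: float) -> str:
    """Map a similarity score to the bands in the table above."""
    if score > 0.85:
        return "strong"    # likely same entity
    if score >= 0.70:
        return "good"      # probable same entity
    if score >= 0.55:
        return "moderate"  # may need verification
    return "weak"          # likely different entity


for score in (0.91, 0.78, 0.60, 0.30):
    print(score, interpret_similarity(score))
```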
### Canonical ID Format
| Source | Prefix | Example |
|--------|--------|---------|
| GLEIF | `LEI` | `LEI:INR2EJN1ERAN0W5ZP974` |
| SEC Edgar | `SEC-CIK` | `SEC-CIK:0000789019` |
| Companies House | `UK-CH` | `UK-CH:00445790` |
| Wikidata | `WIKIDATA` | `WIKIDATA:Q2283` |
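
The prefix scheme can be sketched as a pair of helpers that build and split canonical IDs. The mapping from `source_types` names to prefixes is inferred from the two tables in this card, and both function names are hypothetical.

```python
PREFIXES = {
    "gleif": "LEI",
    "sec_edgar": "SEC-CIK",
    "companies_house": "UK-CH",
    "wikidata": "WIKIDATA",
}
SOURCES = {prefix: source for source, prefix in PREFIXES.items()}


def format_canonical_id(source: str, identifier: str) -> str:
    """Build an ID such as "WIKIDATA:Q2283" from a source name and identifier."""
    return f"{PREFIXES[source]}:{identifier}"


def parse_canonical_id(canonical_id: str) -> tuple[str, str]:
    """Split an ID into (source name, identifier), rejecting unknown prefixes."""
    prefix, _, identifier = canonical_id.partition(":")
    if prefix not in SOURCES or not identifier:
        raise ValueError(f"unrecognized canonical ID: {canonical_id!r}")
    return SOURCES[prefix], identifier


print(format_canonical_id("wikidata", "Q2283"))
print(parse_canonical_id("SEC-CIK:0000789019"))
```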
## Building from Source
```bash
# Import data sources
corp-extractor db import-gleif --download
corp-extractor db import-sec --download
corp-extractor db import-companies-house --download
corp-extractor db import-wikidata --limit 100000
corp-extractor db import-people --all --limit 50000
# Link equivalent records
corp-extractor db canonicalize
# Generate scalar embeddings (75% smaller)
corp-extractor db backfill-scalar
# Create lite version for deployment
corp-extractor db create-lite ~/.cache/corp-extractor/entities.db
```
### Wikidata Dump Import (Recommended for Large Imports)
```bash
# Download and import from Wikidata dump (~100GB)
corp-extractor db import-wikidata-dump --download --limit 50000
# Import only people
corp-extractor db import-wikidata-dump --download --people --no-orgs
# Import only locations
corp-extractor db import-wikidata-dump --dump dump.json.bz2 --locations --no-people --no-orgs
# Resume interrupted import
corp-extractor db import-wikidata-dump --dump dump.bz2 --resume
```
## Considerations for Using the Data
### Social Impact
This dataset enables entity linking for NLP applications. Users should be aware that:
- Organization and people records may be incomplete or outdated
- Deceased people are included; their records carry a `death_date`
- Not all notable entities are covered
### Biases
- Coverage is weighted toward English-speaking countries (US, UK) due to source availability
- Wikidata coverage depends on Wikipedia notability criteria
- SEC and Companies House data is limited to their respective jurisdictions
### Limitations
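- Records from different sources are linked through `canon_id`, not merged, so the same entity can appear once per source
- Embedding similarity alone can confuse entities with similar names; borderline scores should be verified against identifiers or metadata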
- Updates require re-importing from source data
## License
Apache 2.0
## Citation
If you use this dataset, please cite:
```bibtex
@dataset{entity_references_2024,
title = {Entity References Database},
author = {Corp-o-Rate-Community},
year = {2024},
publisher = {Hugging Face},
url = {https://huggingface.co/datasets/Corp-o-Rate-Community/entity-references}
}
```
## Dataset Card Authors
Corp-o-Rate-Community
## Dataset Card Contact
Open an issue on the [GitHub repository](https://github.com/corp-o-rate/statement-extractor) for questions or feedback.
Provider: Corp-o-Rate-Community



