Corp-o-Rate-Community/entity-references

Source: Hugging Face. Updated 2026-03-09; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/Corp-o-Rate-Community/entity-references
Description:
---
language:
- en
license: apache-2.0
task_categories:
- text-classification
- feature-extraction
tags:
- entity-linking
- named-entity-recognition
- knowledge-base
- organizations
- people
- sqlite
- vector-search
- embeddings
size_categories:
- 1M<n<10M
pretty_name: Entity References Database
configs:
- config_name: full
  description: Full database with complete source metadata
- config_name: lite
  description: Core fields + embeddings only (recommended)
---

# Entity References Database

A comprehensive entity database for organizations, people, roles, and locations with 768-dimensional embeddings for semantic matching. Built from authoritative sources (GLEIF, SEC, Companies House, Wikidata) for entity linking and named entity disambiguation.

## Dataset Description

- **Repository:** [Corp-o-Rate-Community/entity-references](https://huggingface.co/datasets/Corp-o-Rate-Community/entity-references)
- **Paper:** N/A
- **Point of Contact:** Corp-o-Rate-Community

### Dataset Summary

This dataset provides fast lookup and qualification of named entities using vector similarity search. It stores records from authoritative global sources with embeddings generated by `google/embeddinggemma-300m` (768 dimensions).

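
As a rough illustration of what vector similarity search means here, the sketch below ranks a few candidates against a query by cosine similarity. It uses made-up 3-dimensional vectors rather than real 768-dimensional `google/embeddinggemma-300m` embeddings, and the helper names (`cosine_similarity`, `rank_candidates`) are hypothetical; they are not part of any library that ships with this dataset.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_candidates(query, candidates, top_k=3):
    """Return (name, score) pairs sorted by descending similarity."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in candidates.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy vectors standing in for real embeddings (assumption: illustration only).
candidates = {
    "Microsoft Corporation": [0.9, 0.1, 0.0],
    "Micron Technology": [0.6, 0.7, 0.1],
    "Apple Inc.": [0.0, 0.2, 0.9],
}
query = [1.0, 0.0, 0.0]

ranking = rank_candidates(query, candidates)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

The database itself delegates this ranking to the sqlite-vec extension; the brute-force loop above only shows the underlying comparison.
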

**Key Features:**

- **8M+ organization records** from GLEIF, SEC Edgar, Companies House, and Wikidata
- **Notable people** including executives, politicians, athletes, artists, and more
- **Roles and locations** with hierarchical relationships
- **Vector embeddings** for semantic similarity search
- **Canonical linking** across sources (same entity from multiple sources linked)

### Supported Tasks

- **Entity Linking**: Match extracted entity mentions to canonical database records
- **Named Entity Disambiguation**: Distinguish between entities with similar names
- **Knowledge Base Population**: Enrich extracted entities with identifiers and metadata

### Languages

English (en)

## Dataset Structure

### Schema (v2 - Normalized)

The database uses SQLite with the [sqlite-vec](https://github.com/asg017/sqlite-vec) extension for vector similarity search.

#### Organizations Table

| Column | Type | Description |
|--------|------|-------------|
| `id` | INTEGER | Primary key |
| `qid` | INTEGER | Wikidata QID as integer (e.g., 2283 for Q2283) |
| `name` | TEXT | Organization name |
| `name_normalized` | TEXT | Lowercased, normalized name |
| `source_id` | INTEGER FK | Reference to source_types |
| `source_identifier` | TEXT | LEI, CIK, Company Number, etc. |
| `region_id` | INTEGER FK | Reference to locations |
| `entity_type_id` | INTEGER FK | Reference to organization_types |
| `from_date` | TEXT | Founding/registration date (ISO format) |
| `to_date` | TEXT | Dissolution date (ISO format) |
| `canon_id` | INTEGER | ID of canonical record |
| `canon_size` | INTEGER | Size of canonical group |
| `record` | JSON | Full source record (omitted in lite) |

#### People Table

| Column | Type | Description |
|--------|------|-------------|
| `id` | INTEGER | Primary key |
| `qid` | INTEGER | Wikidata QID as integer |
| `name` | TEXT | Display name |
| `name_normalized` | TEXT | Lowercased, normalized name |
| `source_id` | INTEGER FK | Reference to source_types |
| `source_identifier` | TEXT | QID, Owner CIK, Person number |
| `country_id` | INTEGER FK | Reference to locations |
| `person_type_id` | INTEGER FK | Reference to people_types |
| `known_for_role_id` | INTEGER FK | Reference to roles |
| `known_for_org` | TEXT | Organization name |
| `known_for_org_id` | INTEGER FK | Reference to organizations |
| `from_date` | TEXT | Role start date (ISO format) |
| `to_date` | TEXT | Role end date (ISO format) |
| `birth_date` | TEXT | Date of birth (ISO format) |
| `death_date` | TEXT | Date of death (ISO format) |
| `record` | JSON | Full source record (omitted in lite) |

#### Roles Table

| Column | Type | Description |
|--------|------|-------------|
| `id` | INTEGER | Primary key |
| `qid` | INTEGER | Wikidata QID (e.g., 484876 for CEO Q484876) |
| `name` | TEXT | Role name (e.g., "Chief Executive Officer") |
| `name_normalized` | TEXT | Normalized name |
| `source_id` | INTEGER FK | Reference to source_types |
| `canon_id` | INTEGER | ID of canonical role |

#### Locations Table

| Column | Type | Description |
|--------|------|-------------|
| `id` | INTEGER | Primary key |
| `qid` | INTEGER | Wikidata QID (e.g., 30 for USA Q30) |
| `name` | TEXT | Location name |
| `name_normalized` | TEXT | Normalized name |
| `source_id` | INTEGER FK | Reference to source_types |
| `source_identifier` | TEXT | ISO code (e.g., "US", "CA") |
| `parent_ids` | TEXT JSON | Parent location IDs in hierarchy |
| `location_type_id` | INTEGER FK | Reference to location_types |

#### Embedding Tables (sqlite-vec)

| Table | Columns |
|-------|---------|
| `organization_embeddings` | org_id INTEGER, embedding FLOAT[768] |
| `organization_embeddings_scalar` | org_id INTEGER, embedding INT8[768] |
| `person_embeddings` | person_id INTEGER, embedding FLOAT[768] |
| `person_embeddings_scalar` | person_id INTEGER, embedding INT8[768] |

**Scalar (int8) embeddings** provide 75% storage reduction with ~92% recall at top-100.

#### Enum Lookup Tables

| Table | Values |
|-------|--------|
| `source_types` | gleif, sec_edgar, companies_house, wikidata |
| `people_types` | executive, politician, government, military, legal, professional, academic, artist, media, athlete, entrepreneur, journalist, activist, scientist, unknown |
| `organization_types` | business, fund, branch, nonprofit, ngo, foundation, government, international_org, political_party, trade_union, educational, research, healthcare, media, sports, religious, unknown |
| `simplified_location_types` | continent, country, subdivision, city, district, other |

### Data Splits

| Config | Size | Contents |
|--------|------|----------|
| `entities-lite.db` | ~50GB | Core fields + embeddings only |
| `entities.db` | ~74GB | Full records with source metadata |

The lite version is recommended for most use cases.

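
To make the normalized schema more concrete, the sketch below builds a tiny in-memory SQLite database with a cut-down version of the `organizations` and `source_types` tables and resolves a record to its canonical entry. Only a subset of the documented columns is reproduced, the rows are illustrative rather than copied from the dataset, and two points are assumptions on my part: that a canonical record's `canon_id` points at its own `id`, and the numeric IDs given to `source_types`. The helper names `canonical_for` and `format_qid` are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE source_types (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE organizations (
        id INTEGER PRIMARY KEY,
        qid INTEGER,
        name TEXT,
        name_normalized TEXT,
        source_id INTEGER REFERENCES source_types(id),
        source_identifier TEXT,
        canon_id INTEGER,
        canon_size INTEGER
    );
    -- Assumed enum IDs; the real database may number these differently.
    INSERT INTO source_types VALUES (1, 'gleif'), (2, 'sec_edgar'), (4, 'wikidata');
    -- Three illustrative records for one company, all linked to record 1.
    INSERT INTO organizations VALUES
        (1, NULL, 'MICROSOFT CORPORATION', 'microsoft corporation', 1, 'INR2EJN1ERAN0W5ZP974', 1, 3),
        (2, NULL, 'MICROSOFT CORP', 'microsoft corp', 2, '0000789019', 1, 3),
        (3, 2283, 'Microsoft', 'microsoft', 4, 'Q2283', 1, 3);
    """
)


def canonical_for(connection, name_normalized):
    """Follow canon_id from a matching record to its canonical record."""
    return connection.execute(
        """
        SELECT canon.name, st.name, canon.source_identifier, o.canon_size
        FROM organizations AS o
        JOIN organizations AS canon ON canon.id = o.canon_id
        JOIN source_types AS st ON st.id = canon.source_id
        WHERE o.name_normalized = ?
        """,
        (name_normalized,),
    ).fetchone()


def format_qid(qid):
    """Turn the integer stored in `qid` back into a Wikidata identifier."""
    return None if qid is None else f"Q{qid}"


result = canonical_for(conn, "microsoft")
print(result)
print(format_qid(2283))
```

Looking up the Wikidata record by its normalized name returns the GLEIF record it is linked to, which matches the source priority described under Canonicalization.
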

## Dataset Creation

### Source Data

#### Organizations

| Source | Records | Identifier | Coverage |
|--------|---------|------------|----------|
| [GLEIF](https://www.gleif.org/) | ~3.2M | LEI (Legal Entity Identifier) | Global companies with LEI |
| [SEC Edgar](https://www.sec.gov/) | ~100K+ | CIK (Central Index Key) | All SEC filers |
| [Companies House](https://www.gov.uk/government/organisations/companies-house) | ~5M | Company Number | UK registered companies |
| [Wikidata](https://www.wikidata.org/) | Variable | QID | Notable companies worldwide |

#### People

| Source | Records | Identifier | Coverage |
|--------|---------|------------|----------|
| [Wikidata](https://www.wikidata.org/) | Variable | QID | Notable people with English Wikipedia |
| [SEC Form 4](https://www.sec.gov/) | ~280K/year | Owner CIK | US public company insiders |
| [Companies House](https://www.gov.uk/government/organisations/companies-house) | ~15M+ | Person number | UK company officers |

### Embedding Model

| Property | Value |
|----------|-------|
| Model | `google/embeddinggemma-300m` |
| Dimensions | 768 |
| Framework | sentence-transformers |
| Size | ~300M parameters |

### Canonicalization

Records are linked across sources based on:

**Organizations:**

1. Same LEI (globally unique)
2. Same ticker symbol
3. Same CIK
4. Same normalized name + region

**People:**

1. Same Wikidata QID
2. Same normalized name + same organization
3. Same normalized name + overlapping date ranges

**Source priority:** gleif > sec_edgar > companies_house > wikidata

## Usage

### Installation

```bash
pip install corp-extractor
```

### Download

```bash
# Download lite version (recommended)
corp-extractor db download

# Download full version
corp-extractor db download --full
```

**Storage location:** `~/.cache/corp-extractor/entities-v2.db`

### Search

```bash
# Search organizations
corp-extractor db search "Microsoft"

# Search people
corp-extractor db search-people "Tim Cook"

# Search roles
corp-extractor db search-roles "CEO"

# Search locations
corp-extractor db search-locations "California"

# Check database status
corp-extractor db status
```

### Python API

```python
from statement_extractor.database import OrganizationDatabase, PersonDatabase

# Search organizations
org_db = OrganizationDatabase()
matches = org_db.search_by_name("Microsoft Corporation", top_k=5)
for match in matches:
    print(f"{match.company.name} ({match.company.source}:{match.company.source_id})")
    print(f"  Similarity: {match.similarity_score:.3f}")

# Search people
person_db = PersonDatabase()
matches = person_db.search_by_name("Tim Cook", top_k=5)
for match in matches:
    print(f"{match.person.name} - {match.person.known_for_role} at {match.person.known_for_org}")
```

### Use in Pipeline

```python
from statement_extractor.pipeline import ExtractionPipeline

pipeline = ExtractionPipeline()
ctx = pipeline.process("Microsoft CEO Satya Nadella announced new AI features.")
for stmt in ctx.labeled_statements:
    print(f"{stmt.subject_fqn} --[{stmt.statement.predicate}]--> {stmt.object_fqn}")
```

## Technical Details

### Vector Search Performance

| Database Size | Search Time | Memory |
|---------------|-------------|--------|
| 100K records | ~50ms | ~500MB |
| 1M records | ~200ms | ~3GB |
| 8M records | ~500ms | ~20GB |

### Similarity Thresholds

| Score | Interpretation |
|-------|----------------|
| > 0.85 | Strong match (likely same entity) |
| 0.70 - 0.85 | Good match (probable same entity) |
| 0.55 - 0.70 | Moderate match (may need verification) |
| < 0.55 | Weak match (likely different entity) |

### Canonical ID Format

| Source | Prefix | Example |
|--------|--------|---------|
| GLEIF | `LEI` | `LEI:INR2EJN1ERAN0W5ZP974` |
| SEC Edgar | `SEC-CIK` | `SEC-CIK:0000789019` |
| Companies House | `UK-CH` | `UK-CH:00445790` |
| Wikidata | `WIKIDATA` | `WIKIDATA:Q2283` |

## Building from Source

```bash
# Import data sources
corp-extractor db import-gleif --download
corp-extractor db import-sec --download
corp-extractor db import-companies-house --download
corp-extractor db import-wikidata --limit 100000
corp-extractor db import-people --all --limit 50000

# Link equivalent records
corp-extractor db canonicalize

# Generate scalar embeddings (75% smaller)
corp-extractor db backfill-scalar

# Create lite version for deployment
corp-extractor db create-lite ~/.cache/corp-extractor/entities.db
```

### Wikidata Dump Import (Recommended for Large Imports)

```bash
# Download and import from Wikidata dump (~100GB)
corp-extractor db import-wikidata-dump --download --limit 50000

# Import only people
corp-extractor db import-wikidata-dump --download --people --no-orgs

# Import only locations
corp-extractor db import-wikidata-dump --dump dump.json.bz2 --locations --no-people --no-orgs

# Resume interrupted import
corp-extractor db import-wikidata-dump --dump dump.bz2 --resume
```

## Considerations for Using the Data

### Social Impact

This dataset enables entity linking for NLP applications.
Users should be aware that:

- Organization and people records may be incomplete or outdated
- Deceased historical figures are included; their records carry a `death_date` value
- Not all notable entities are covered

### Biases

- Coverage is weighted toward English-speaking countries (US, UK) due to source availability
- Wikidata coverage depends on Wikipedia notability criteria
- SEC and Companies House data is limited to their respective jurisdictions

### Limitations

- The database does not automatically deduplicate across sources
- Embedding similarity is not perfect for entity disambiguation
- Updates require re-importing from source data

## License

Apache 2.0

## Citation

If you use this dataset, please cite:

```bibtex
@dataset{entity_references_2024,
  title = {Entity References Database},
  author = {Corp-o-Rate-Community},
  year = {2024},
  publisher = {Hugging Face},
  url = {https://huggingface.co/datasets/Corp-o-Rate-Community/entity-references}
}
```

## Dataset Card Authors

Corp-o-Rate-Community

## Dataset Card Contact

Open an issue on the [GitHub repository](https://github.com/corp-o-rate/statement-extractor) for questions or feedback.

Provider:
Corp-o-Rate-Community