cp500/auto-ontology

Name: cp500/auto-ontology
Creator: cp500
Published: 2026-03-02 02:54:02
License: no description provided

Source: Hugging Face. Updated 2026-03-02; indexed 2026-03-29.

Download link:

https://hf-mirror.com/datasets/cp500/auto-ontology

Description:

---
license: cc-by-4.0
tags:
- automotive
- ontology
- common-crawl
- market-intelligence
- hypergraph
- aws
- strands-agents
task_categories:
- text-classification
- feature-extraction
size_categories:
- 100K<n<1M
---

# Auto-Ontology: Automotive Market Intelligence Hypergraph

**Structural Intelligence for the Automotive Industry**

A dual-star hypergraph connecting **94,671 market signals** extracted from Common Crawl to **1,261 vehicles** resolved against the NHTSA vPIC registry. Built as an AWS workshop demonstrating a full market-intelligence pipeline, from raw web crawl to structured competitive analysis.

## The Pipeline

```
Common Crawl (billions)
  → Automotive filter (69K documents)
  → NuExtract structured extraction
  → vPIC entity resolution
  → Temporal stitching
  → Hypergraph (176K nodes, 537K edges)
```

**1. Common Crawl Scraping**: Query the Common Crawl index for automotive domains (autocar.co.uk, motortrend.com, electrek.co, etc.), download WARC records, extract clean article text.

**2. Structured Extraction**: NuExtract-2.0 with an ontology-enforced schema extracts products, market signals, companies, and financial data. Each signal is linked to specific vehicles and features at extraction time.

**3. Normalization**: The "Ralph Wiggum" quality loop validates extractions against the NHTSA vPIC database. Fuzzy matching resolves messy text like "Tesla Model Y Long Range" to canonical product IDs (`prd_tesla_model_y_2024`). Up to 3 refinement passes ensure quality.

**4. Temporal Stitching**: Signals sharing a product are connected with NEXT edges ordered by timestamp, creating event chains that reveal how stories evolve (84,981 temporal edges).

**5. Building the Hypergraph**: Products, Features, Actors, Signals, Locations, and Documents are woven together through 13 typed edge roles into a unified knowledge graph.

**6. Finding Structure**: Spectral analysis, persistent homology, sheaf cohomology, and functor analysis extract competitive landscapes, detect regime changes, and reveal structural gaps.

## Dataset Structure

### Hypergraph (`data/hypergraph/`)

| File | Rows | Description |
|------|------|-------------|
| `nodes.parquet` | 176,324 | All graph nodes (Product, Feature, Actor, Signal, Location, Document) |
| `edges.parquet` | 537,747 | Typed relationships (HAS_FEATURE, ABOUT_PRODUCT, COMPETES_WITH, NEXT, etc.) |
| `signal_index.parquet` | 94,671 | Signal taxonomy: domain, subdomain, L3 code, sentiment, impact, timestamp |
| `product_index.parquet` | 1,261 | Vehicle lookup: make, model, year, body class |

**Node types:**

| Type | Count | Example |
|------|-------|---------|
| Signal | 94,671 | "H2 Mobility announced shutting 22 fuel stations in Germany" |
| Document | 69,093 | autocar.co.uk article (CC-MAIN-2026-04) |
| Actor | 11,189 | Tesla, BMW, CATL, Magna International |
| Product | 1,261 | prd_tesla_model_y_2024 (Tesla Model Y 2024 SUV) |
| Feature | 110 | Anti-lock Braking System, Battery Type, Drive Type |

**Edge roles:**

| Role | Count | Connects |
|------|-------|----------|
| AFFECTS | 186,696 | Signal → Actor |
| SOURCES | 94,671 | Document → Signal |
| NEXT | 84,981 | Signal → Signal (temporal chain) |
| EMITS | 61,151 | Actor → Signal |
| ABOUT_PRODUCT | 55,051 | Signal → Product |
| HAS_FEATURE | 22,521 | Product → Feature |
| COMPETES_WITH | 17,835 | Product ↔ Product |
| ABOUT_FEATURE | 12,297 | Signal → Feature |
| MADE_BY | 1,592 | Product → Actor |
| SUCCEEDS | 952 | Product → Product (model year lineage) |

**Signal taxonomy (L1 domains):**

| Domain | Count | Description |
|--------|-------|-------------|
| P: Product | 41,116 | Launches, architecture, lifecycle, features, quality, brand |
| C: Competitive | 40,065 | Market share, partnerships, M&A, pricing |
| T: Technology | 8,080 | Battery, autonomous, connected, manufacturing |
| M: Market | 3,014 | Tariffs, currency, energy policy, geopolitics |
| F: Financial | 1,030 | Revenue, investment, pricing, valuation |
| S: Supply Chain | 731 | Raw materials, semiconductors, logistics, production |
| R: Regulatory | 387 | Emissions, safety, trade, subsidies |
| ST: Strategic | 248 | EV strategy, market entry/exit, platform, brand |

### vPIC Reference (`data/vpic/`)

The complete NHTSA Vehicle Product Information Catalog: 97 tables covering makes, models, manufacturers, body styles, safety features, and VIN decoding patterns.

| Key Tables | Rows | Description |
|------------|------|-------------|
| `pattern.parquet` | 1,651,121 | VIN decoding patterns |
| `model.parquet` | 31,396 | Vehicle models |
| `manufacturer.parquet` | 22,447 | Manufacturers |
| `make.parquet` | 12,119 | Makes/brands |
| `element.parquet` | ~160 | Vehicle specification elements |
| `bodystyle.parquet` | 71 | Body style classifications |
| `vehicletype.parquet` | 9 | Vehicle type categories |
| + 90 lookup tables | | ABS types, drive types, fuel types, etc. |

### Workshop Files (`workshop/`)

Complete workshop materials for building the pipeline:

| File | Description |
|------|-------------|
| `tutorial.html` | Interactive structural intelligence tutorial with D3.js visualizations |
| `ontology.py` | Hypergraph schema definitions: 6 node types, 13 edge roles, signal taxonomy |
| `seed_vpic.py` | Populate hypergraph from vPIC: products, features, competition edges |
| `Other_sources.md` | Global automotive government data sources (EU, UK, AU, JP, CA) |
| `pipeline/definition.py` | AWS Step Functions pipeline definition |
| `pipeline/extraction_template.py` | NuExtract ontology-enforced extraction schema |
| `requirements.txt` | Minimal Python dependencies |

## Usage

```python
import pandas as pd
from datasets import load_dataset

# Load hypergraph tables
ds = load_dataset("cp500/auto-ontology", data_dir="data/hypergraph")

# Load specific vPIC tables
vpic_makes = load_dataset("cp500/auto-ontology", data_files="data/vpic/make.parquet")

# Query signals
signals = pd.read_parquet("hf://datasets/cp500/auto-ontology/data/hypergraph/signal_index.parquet")
bearish = signals[signals["sentiment"] == "bearish"]
print(f"{len(bearish)} bearish signals")

# Load product index
products = pd.read_parquet("hf://datasets/cp500/auto-ontology/data/hypergraph/product_index.parquet")
tesla = products[products["make"] == "Tesla"]
print(f"{len(tesla)} Tesla vehicles in the ontology")
```

## Attribution

### Common Crawl

The market signals in this dataset were extracted from web pages archived by [Common Crawl](https://commoncrawl.org/). Common Crawl data is released under the [CC-BY-4.0 license](https://creativecommons.org/licenses/by/4.0/). Original content remains the property of the respective publishers. This dataset contains structured extractions (factual market events), not reproductions of original articles.

### NHTSA vPIC

Vehicle product information sourced from the [NHTSA Product Information Catalog and Vehicle Listing (vPIC)](https://vpic.nhtsa.dot.gov/api/). This is a US government public dataset.

## Companion Space

Explore the data interactively: [cp500/auto-ontology Space](https://huggingface.co/spaces/cp500/auto-ontology)

- **Pipeline Narrative**: Scrollytelling walkthrough of the full pipeline with D3.js visualizations
- **Ask the Ontology**: Strands Agent chatbot for querying the hypergraph

## License

This dataset is released under [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/). See attribution section above for source data licensing.
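
## Example: Temporal Stitching Sketch

The temporal stitching step in the pipeline (signals sharing a product, connected by NEXT edges in timestamp order) can be illustrated with a small, self-contained sketch. This is not the workshop's implementation: the tuple layout `(signal_id, product_id, timestamp)`, the function name `stitch_next_edges`, and the toy signal IDs are assumptions made for illustration, and the real tables may use different column names.

```python
from collections import defaultdict
from datetime import datetime


def stitch_next_edges(signals):
    """Link signals that share a product into timestamp-ordered NEXT chains.

    `signals` is an iterable of (signal_id, product_id, iso_timestamp) tuples.
    This layout is a hypothetical stand-in, not the dataset's actual schema.
    """
    by_product = defaultdict(list)
    for signal_id, product_id, timestamp in signals:
        by_product[product_id].append((datetime.fromisoformat(timestamp), signal_id))

    edges = []
    for product_id, items in by_product.items():
        items.sort()  # order by timestamp; ties fall back to signal ID
        # Connect each signal to the next one in time for the same product.
        for (_, src), (_, dst) in zip(items, items[1:]):
            edges.append({"src": src, "dst": dst, "role": "NEXT", "product": product_id})
    return edges


# Toy input (hypothetical IDs and dates), deliberately out of order.
toy_signals = [
    ("sig_3", "prd_tesla_model_y_2024", "2025-03-10T09:00:00"),
    ("sig_1", "prd_tesla_model_y_2024", "2025-01-05T12:00:00"),
    ("sig_2", "prd_tesla_model_y_2024", "2025-02-01T08:30:00"),
    ("sig_4", "prd_other_2024", "2025-01-20T00:00:00"),
]

next_edges = stitch_next_edges(toy_signals)
for edge in next_edges:
    print(f"{edge['src']} -NEXT-> {edge['dst']} ({edge['product']})")
```

A product with a single signal produces no NEXT edge, so a chain of n signals yields n - 1 edges. The sketch assumes each signal belongs to exactly one product; a signal linked to several products would simply appear in several chains.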

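
## Example: Fuzzy Product Resolution Sketch

The normalization step resolves messy mentions such as "Tesla Model Y Long Range" to canonical product IDs with fuzzy matching and up to 3 refinement passes. The dataset card does not say what each pass does, so the sketch below is only one plausible reading: it uses Python's standard `difflib` and, on each pass, drops one more trailing word (typically a trim level) before retrying. The toy `CATALOG`, the function name `resolve_product`, the 0.8 cutoff, and every product ID other than `prd_tesla_model_y_2024` are assumptions for illustration.

```python
import difflib
import re

# Toy catalog (hypothetical): canonical "make model" name -> product ID.
# Only prd_tesla_model_y_2024 appears in the dataset card; the rest are made up.
CATALOG = {
    "tesla model y": "prd_tesla_model_y_2024",
    "tesla model 3": "prd_tesla_model_3_2024",
    "bmw i4": "prd_bmw_i4_2024",
}


def resolve_product(mention, max_passes=3, cutoff=0.8):
    """Return the product ID for a free-text mention, or None if unresolved.

    Each pass drops one more trailing token and retries the fuzzy match.
    This trimming strategy is an assumption, not the workshop's method.
    """
    tokens = re.sub(r"[^a-z0-9 ]+", " ", mention.lower()).split()
    for n_dropped in range(max_passes):
        candidate_tokens = tokens[: len(tokens) - n_dropped]
        if not candidate_tokens:
            break
        candidate = " ".join(candidate_tokens)
        matches = difflib.get_close_matches(candidate, list(CATALOG), n=1, cutoff=cutoff)
        if matches:
            return CATALOG[matches[0]]
    return None


print(resolve_product("Tesla Model Y Long Range"))
print(resolve_product("BMW i4"))
print(resolve_product("Ford Pinto Wagon"))
```

A real resolver would match against the vPIC make and model tables and take the model year into account; the fixed year in these IDs is a simplification.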

