
ULTRAData/ULTRA.Elite.Semantic.Data.SEMNET

Source: Hugging Face · Updated 2026-04-19 · Indexed 2026-04-26
Download link:
https://hf-mirror.com/datasets/ULTRAData/ULTRA.Elite.Semantic.Data.SEMNET
Resource description:
---
license: apache-2.0
language:
- en
tags:
- semantic
- dictionary
- wordnet
- taxonomy
- nlp
- training-data
- knowledge-graph
- species
- gbif
pretty_name: WorldWordSemNet
size_categories:
- 1M<n<10M
---

# WorldWordSemNet

**WorldWordSemNet** is a comprehensive English semantic dictionary dataset mapping **3,981,469 words** to a structured hierarchy of **286 semantic family classes**, built as a training substrate for next-generation language models.

---

## What's in it

Every entry contains:

- **`word`**: the English word or multi-word expression (including scientific species names)
- **`id`**: a hierarchical numeric semantic address (e.g. `4.2.8.319`) encoding the word's position in the semantic tree
- **`family`**: the semantic family label (e.g. `ENTITY.ANIMAL.CANINE`, `ACTION.COMMUNICATION.SPEAK_VERB`, `DESCRIPTOR.SIZE`)
- **`definitions`**: dictionary definitions where **every token in every definition is also mapped to its own semantic ID**, giving a fully grounded semantic graph of language
- **`taxonomy`**: for biological entities, full Linnaean taxonomy (kingdom → phylum → class → order → family → genus → rank → scientificName) for **2,791,312 species**
- **`species_description`**: up to 3 natural-language descriptions per species (202,745 species covered), with every token also semantically mapped

---

## Semantic Hierarchy

The 286 semantic families are organized into a deep hierarchy:

```
ENTITY
ENTITY.ANIMAL
ENTITY.ANIMAL.CANINE
ENTITY.ANIMAL.FELINE
ENTITY.ANIMAL.PRIMATE
...
ENTITY.PLANT
ENTITY.SUBSTANCE
ENTITY.SUBSTANCE.CHEMICAL
ENTITY.SUBSTANCE.FOOD
...
ENTITY.PERSON
ENTITY.DISEASE
ACTION
ACTION.COMMUNICATION.SPEAK_VERB
ACTION.MOVE
...
DESCRIPTOR
RELATION
TIME
PLACE
...
```

Numeric IDs encode the hierarchy directly: `4.2.8.319` means top-level class 4 → subclass 2 → subclass 8 → word 319. Arithmetic operations on IDs are semantically meaningful.
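The dotted ID layout described above can be illustrated with a minimal Python sketch. The helper names (`parse_semantic_id`, `class_path`, `shared_depth`) are hypothetical and not part of the dataset. The sketch assumes that the last component is always the word index and that a longer shared prefix means two words sit closer together in the tree. The card does not spell out what "arithmetic operations" on IDs are, so only prefix comparison is shown.

```python
from typing import Tuple


def parse_semantic_id(semantic_id: str) -> Tuple[int, ...]:
    """Split a dotted ID such as '4.2.8.319' into integer components."""
    return tuple(int(part) for part in semantic_id.split("."))


def class_path(semantic_id: str) -> Tuple[int, ...]:
    """Return all components except the last (assumed to be the word index)."""
    return parse_semantic_id(semantic_id)[:-1]


def shared_depth(id_a: str, id_b: str) -> int:
    """Count how many leading components two IDs have in common."""
    depth = 0
    for a, b in zip(parse_semantic_id(id_a), parse_semantic_id(id_b)):
        if a != b:
            break
        depth += 1
    return depth


dog = "4.2.8.319"  # ID used as the example in this card
print(parse_semantic_id(dog))            # (4, 2, 8, 319)
print(class_path(dog))                   # (4, 2, 8)
print(shared_depth(dog, "4.2.8.113"))    # 3
print(shared_depth(dog, "4.2.23.4481"))  # 2
```

Under these assumptions, two IDs that agree on three leading components fall in the same subclass, while IDs that agree on only two share just the top-level class and first subclass.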
---

## Scale

| Metric | Count |
|---|---|
| Total words / entries | 3,981,469 |
| Words with definitions | 773,952 |
| Words with full taxonomy | 2,791,312 |
| Species with descriptions | 202,745 |
| Semantic families | 286 |
| File size (JSONL) | ~1.65 GB |

---

## Format

JSONL: one JSON object per line. Example entry:

```json
{
  "word": "dog",
  "id": "4.2.8.319",
  "family": "ENTITY.ANIMAL.CANINE",
  "definitions": [
    {
      "gloss": "A mammal of the family Canidae:",
      "tokens": [
        ["a", "6.3.1"],
        ["mammal", "4.2.23.4481"],
        ["of", "1.5.2.2875"],
        ["the", "6.3.379"],
        ["family", "4.12.6.4773"],
        ["canidae", "4.2.8.113"]
      ]
    }
  ],
  "taxonomy": {
    "kingdom": "Animalia",
    "phylum": "Chordata",
    "class": "Mammalia",
    "order": "Carnivora",
    "family": "Canidae",
    "genus": "Canis",
    "rank": "species"
  }
}
```

---

## Intended Use

This dataset was originally built as the semantic substrate for **INFINITY**, a novel AI architecture that does **not use transformers**. INFINITY uses sparse directed matrices and a two-layer semantic tether mechanism, trained entirely on hierarchical numeric IDs rather than token embeddings. It is a fundamentally new approach to generative language modeling.

**WorldWordSemNet should also serve as a major kickstart for transformer-based models.** The combination of:

1. Full semantic family labeling for nearly 4 million words
2. Hierarchically structured numeric IDs that encode meaning arithmetically
3. Definition tokens pre-mapped to their own semantic IDs (a fully grounded semantic graph)
4. Complete species taxonomy and biological descriptions integrated at the word level

...makes this dataset uniquely suited for training models that require deep semantic grounding beyond simple statistical co-occurrence.

---

## License

Apache 2.0.
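As a reading aid for the JSONL layout described in the Format section, here is a minimal Python sketch that parses entries line by line and collects the `(token, id)` pairs from each definition. The helper names `iter_entries` and `definition_tokens` are hypothetical. The sample record is the card's own `dog` example, fed from an in-memory stream so the snippet runs without the ~1.65 GB file. Treating `definitions` as optional is an assumption based on the Scale table, where only 773,952 of the entries have definitions.

```python
import io
import json
from typing import Dict, Iterable, Iterator, List, Tuple


def iter_entries(lines: Iterable[str]) -> Iterator[Dict]:
    """Yield one parsed JSON object per non-empty line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def definition_tokens(entry: Dict) -> List[Tuple[str, str]]:
    """Collect (token, semantic_id) pairs from all definitions of an entry."""
    pairs = []
    for definition in entry.get("definitions", []):  # field assumed optional
        for token, semantic_id in definition.get("tokens", []):
            pairs.append((token, semantic_id))
    return pairs


# The card's example entry, serialized as a single JSONL line.
sample_line = json.dumps({
    "word": "dog",
    "id": "4.2.8.319",
    "family": "ENTITY.ANIMAL.CANINE",
    "definitions": [{
        "gloss": "A mammal of the family Canidae:",
        "tokens": [
            ["a", "6.3.1"], ["mammal", "4.2.23.4481"], ["of", "1.5.2.2875"],
            ["the", "6.3.379"], ["family", "4.12.6.4773"], ["canidae", "4.2.8.113"],
        ],
    }],
})

entries = list(iter_entries(io.StringIO(sample_line + "\n")))
tokens = definition_tokens(entries[0])
print(entries[0]["word"], len(tokens))  # dog 6
```

For the real file, the same generator can be fed an open file handle, for example `open(path, encoding="utf-8")` with `path` pointing at a local copy, so entries are streamed rather than loaded into memory at once.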
Provider:
ULTRAData