ULTRAData/ULTRA.Elite.Semantic.Data.SEMNET
Hugging Face · Updated 2026-04-19 · Indexed 2026-04-26
Download link:
https://hf-mirror.com/datasets/ULTRAData/ULTRA.Elite.Semantic.Data.SEMNET
Resource description:
---
license: apache-2.0
language:
- en
tags:
- semantic
- dictionary
- wordnet
- taxonomy
- nlp
- training-data
- knowledge-graph
- species
- gbif
pretty_name: WorldWordSemNet
size_categories:
- 1M<n<10M
---
# WorldWordSemNet
**WorldWordSemNet** is a comprehensive English semantic dictionary dataset mapping **3,981,469 words** to a structured hierarchy of **286 semantic family classes**, built as a training substrate for next-generation language models.
---
## What's in it
Every entry contains:
- **`word`** — the English word or multi-word expression (including scientific species names)
- **`id`** — a hierarchical numeric semantic address (e.g. `4.2.8.319`) encoding the word's position in the semantic tree
- **`family`** — the semantic family label (e.g. `ENTITY.ANIMAL.CANINE`, `ACTION.COMMUNICATION.SPEAK_VERB`, `DESCRIPTOR.SIZE`)
- **`definitions`** — dictionary definitions where **every token in every definition is also mapped to its own semantic ID** — giving a fully grounded semantic graph of language
- **`taxonomy`** — for biological entities: full Linnaean taxonomy (kingdom → phylum → class → order → family → genus → rank → scientificName) for **2,791,312 species**
- **`species_description`** — up to 3 natural-language descriptions per species (202,745 species covered), with every token also semantically mapped
---
## Semantic Hierarchy
The 286 semantic families are organized into a deep hierarchy:
```
ENTITY
ENTITY.ANIMAL
ENTITY.ANIMAL.CANINE
ENTITY.ANIMAL.FELINE
ENTITY.ANIMAL.PRIMATE
...
ENTITY.PLANT
ENTITY.SUBSTANCE
ENTITY.SUBSTANCE.CHEMICAL
ENTITY.SUBSTANCE.FOOD
...
ENTITY.PERSON
ENTITY.DISEASE
ACTION
ACTION.COMMUNICATION.SPEAK_VERB
ACTION.MOVE
...
DESCRIPTOR
RELATION
TIME
PLACE
...
```
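Numeric IDs encode the hierarchy directly: `4.2.8.319` means top-level class 4 → subclass 2 → subclass 8 → word 319. Arithmetic operations on IDs are therefore intended to be semantically meaningful. One simple case is prefix comparison: the number of leading components two IDs share indicates how close they sit in the tree. In the example entry under Format, `dog` (`4.2.8.319`) and `canidae` (`4.2.8.113`) share `4.2.8`, while `mammal` (`4.2.23.4481`) shares only `4.2`.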
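A minimal sketch of working with these addresses, assuming IDs are always dot-separated integers as in the examples on this card. The helper names `parse_id` and `shared_depth` are illustrative and not part of the dataset.

```python
from typing import Tuple


def parse_id(semantic_id: str) -> Tuple[int, ...]:
    """Split a dotted semantic ID such as '4.2.8.319' into integers."""
    return tuple(int(part) for part in semantic_id.split("."))


def shared_depth(a: str, b: str) -> int:
    """Count the leading components two IDs have in common."""
    depth = 0
    for x, y in zip(parse_id(a), parse_id(b)):
        if x != y:
            break
        depth += 1
    return depth


# IDs copied from the example entry on this card.
dog, canidae, mammal, the = "4.2.8.319", "4.2.8.113", "4.2.23.4481", "6.3.379"
print(shared_depth(dog, canidae))  # 3
print(shared_depth(dog, mammal))   # 2
print(shared_depth(dog, the))      # 0
```

Note that IDs vary in length (`6.3.379` has three components, `4.2.8.319` has four), so comparing by shared prefix is safer than comparing positions from the right.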
---
## Scale
| Metric | Count |
|---|---|
| Total words / entries | 3,981,469 |
| Words with definitions | 773,952 |
| Words with full taxonomy | 2,791,312 |
| Species with descriptions | 202,745 |
| Semantic families | 286 |
| File size (JSONL) | ~1.65 GB |
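The coverage ratios implied by this table can be checked with plain arithmetic. The variable names below are illustrative, and treating described species as a subset of the entries with taxonomy is an assumption rather than something the card states.

```python
# Counts copied from the Scale table.
total_entries = 3_981_469
with_definitions = 773_952
with_taxonomy = 2_791_312
species_with_descriptions = 202_745

coverage = {
    "definitions / all entries": with_definitions / total_entries,
    "taxonomy / all entries": with_taxonomy / total_entries,
    # Assumes described species are a subset of the taxonomy entries.
    "descriptions / taxonomy entries": species_with_descriptions / with_taxonomy,
}

for label, ratio in coverage.items():
    print(f"{label}: {ratio:.1%}")
```

This works out to roughly 19.4% of entries having definitions, 70.1% having taxonomy, and 7.3% of taxonomy entries having descriptions.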
---
## Format
JSONL — one JSON object per line. Example entry:
```json
{
  "word": "dog",
  "id": "4.2.8.319",
  "family": "ENTITY.ANIMAL.CANINE",
  "definitions": [
    {
      "gloss": "A mammal of the family Canidae:",
      "tokens": [
        ["a", "6.3.1"],
        ["mammal", "4.2.23.4481"],
        ["of", "1.5.2.2875"],
        ["the", "6.3.379"],
        ["family", "4.12.6.4773"],
        ["canidae", "4.2.8.113"]
      ]
    }
  ],
  "taxonomy": {
    "kingdom": "Animalia",
    "phylum": "Chordata",
    "class": "Mammalia",
    "order": "Carnivora",
    "family": "Canidae",
    "genus": "Canis",
    "rank": "species"
  }
}
```
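Since the file is around 1.65 GB, reading it one line at a time avoids loading everything into memory. The sketch below is one way to do that, assuming the layout shown above. The helpers `iter_entries` and `in_family` and the filename `semnet.jsonl` are hypothetical (the card does not name the data file), and the second sample record is a placeholder, not real data.

```python
import io
import json
from typing import Dict, Iterable, Iterator


def iter_entries(lines: Iterable[str]) -> Iterator[Dict]:
    """Yield one parsed entry per non-empty JSONL line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def in_family(entry: Dict, prefix: str) -> bool:
    """True if the entry's family equals the prefix or sits beneath it."""
    family = entry.get("family", "")
    return family == prefix or family.startswith(prefix + ".")


# Two-line stand-in for the real file. With the dataset on disk you would
# pass an open file handle instead, e.g.
# open("semnet.jsonl", encoding="utf-8")  (hypothetical filename).
sample = io.StringIO(
    '{"word": "dog", "id": "4.2.8.319", "family": "ENTITY.ANIMAL.CANINE"}\n'
    # Placeholder record: the word and ID below are invented for the demo.
    '{"word": "placeholder", "id": "0.0.0", "family": "DESCRIPTOR.SIZE"}\n'
)

animals = [e["word"] for e in iter_entries(sample) if in_family(e, "ENTITY.ANIMAL")]
print(animals)  # ['dog']
```

Matching on `prefix + "."` rather than a bare `startswith` keeps a label such as `ENTITY.ANIMAL` from accidentally matching a sibling whose name merely begins with the same characters.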
---
## Intended Use
This dataset was originally built as the semantic substrate for **INFINITY** — a novel AI architecture that does **not use transformers**. INFINITY uses sparse directed matrices and a two-layer semantic tether mechanism, trained entirely on hierarchical numeric IDs rather than token embeddings. It is a fundamentally new approach to generative language modeling.
**WorldWordSemNet is also intended as a strong starting point for transformer-based models.** The combination of:
1. Full semantic family labeling for nearly 4 million words
2. Hierarchically structured numeric IDs that encode meaning arithmetically
3. Definition tokens pre-mapped to their own semantic IDs — a fully grounded semantic graph
4. Complete species taxonomy and biological descriptions integrated at the word level
...makes this dataset uniquely suited for training models that require deep semantic grounding beyond simple statistical co-occurrence.
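Point 3 can be made concrete by treating each definition as a set of directed edges from the headword's ID to the IDs of its gloss tokens. This is a sketch under the entry layout shown under Format. `definition_edges` is an illustrative helper, and nothing here describes how INFINITY itself consumes the data, which the card does not specify.

```python
from typing import Dict, List, Tuple

# Trimmed copy of the "dog" example entry from this card.
entry = {
    "word": "dog",
    "id": "4.2.8.319",
    "definitions": [
        {
            "gloss": "A mammal of the family Canidae:",
            "tokens": [
                ["a", "6.3.1"],
                ["mammal", "4.2.23.4481"],
                ["of", "1.5.2.2875"],
                ["the", "6.3.379"],
                ["family", "4.12.6.4773"],
                ["canidae", "4.2.8.113"],
            ],
        }
    ],
}


def definition_edges(entry: Dict) -> List[Tuple[str, str]]:
    """Return (headword_id, token_id) pairs for every definition token."""
    edges = []
    for definition in entry.get("definitions", []):
        for _token, token_id in definition.get("tokens", []):
            edges.append((entry["id"], token_id))
    return edges


edges = definition_edges(entry)
print(len(edges))  # 6
print(edges[0])    # ('4.2.8.319', '6.3.1')
```

Entries without a `definitions` field simply yield no edges, which matters because only a subset of entries carry definitions.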
---
## License
Apache 2.0.
Provider:
ULTRAData



