zekebass/tensor-logic-wikipedia
Source: Hugging Face. Updated 2025-12-08; indexed 2025-12-20.
Download link: https://hf-mirror.com/datasets/zekebass/tensor-logic-wikipedia
Resource description:
---
license: cc-by-sa-3.0
task_categories:
- question-answering
- text-generation
language:
- en
tags:
- tensor-logic
- knowledge-base
- symbolic-ai
- neural-symbolic
- embeddings
- wikipedia
size_categories:
- 100K<n<1M
---
# Tensor Logic Wikipedia Knowledge Base
A structured knowledge base extracted from Wikipedia, designed for hybrid neural-symbolic reasoning.
## Dataset Description
This dataset contains:
- **403,059 facts** in Datalog-style format
- **210,188 entity embeddings** (128 dimensions) learned from relationship patterns
- Source material drawn from **37,000+ Wikipedia articles** (Vital Articles + random sample)
## Files
| File | Description | Size |
|------|-------------|------|
| `facts_only.tl` | Clean facts in `Relation(Subject, Object).` format | 14 MB |
| `all_facts.tl` | Facts with source article comments | 15 MB |
| `entity_embeddings.txt` | Learned embeddings (entity: dim1 dim2 ... dim128) | 257 MB |
## Fact Format
```prolog
IsA(AlbertEinstein, Physicist).
BornYear(AlbertEinstein, 1879).
Nationality(AlbertEinstein, German).
ParentOf(AlbertEinstein, HansAlbertEinstein).
Awarded(AlbertEinstein, NobelPrizeInPhysics).
```
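As a minimal sketch of how lines in this format could be read from Python, the snippet below turns each `Relation(Subject, Object).` line into a tuple. The regular expression and the `parse_fact` helper are illustrative assumptions, not part of the repository; the pattern assumes exactly two arguments and no commas or parentheses inside an argument.

```python
import re

# Assumed shape: Relation(Subject, Object). with exactly two arguments
# and no commas or parentheses inside an argument.
FACT_RE = re.compile(r"^\s*(\w+)\(\s*([^,()]+?)\s*,\s*([^,()]+?)\s*\)\.\s*$")


def parse_fact(line):
    """Return (relation, subject, object), or None if the line is not a fact."""
    match = FACT_RE.match(line)
    return match.groups() if match else None


sample = """\
IsA(AlbertEinstein, Physicist).
BornYear(AlbertEinstein, 1879).
Awarded(AlbertEinstein, NobelPrizeInPhysics).
"""

# Keep only the lines that parse; anything else (blank lines, comments) is skipped.
facts = [f for f in map(parse_fact, sample.splitlines()) if f is not None]
```

Note that every argument comes back as a string, so a value such as `1879` would need an explicit `int(...)` conversion if it is to be used as a number.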
## Relations (3,029 unique)
Common relations include:
- **Identity**: IsA, InstanceOf
- **People**: BornIn, BornYear, DiedYear, Nationality, Occupation, SpouseOf, ParentOf, Awarded
- **Places**: LocatedIn, Capital, Country, FoundedYear
- **Works**: CreatedBy, AuthorOf, DirectedBy, PublishedYear, Genre
- **Concepts**: InfluencedBy, OpposedTo, PartOf
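
With several thousand relation names, grouping facts by relation is a natural first step for exploration. The sketch below assumes the facts are already available as `(relation, subject, object)` tuples; the `index_by_relation` helper and the sample triples are illustrative and not taken from the files.

```python
from collections import Counter, defaultdict


def index_by_relation(facts):
    """Group (relation, subject, object) triples by relation name."""
    index = defaultdict(list)
    for relation, subject, obj in facts:
        index[relation].append((subject, obj))
    return index


# Illustrative triples; the MarieCurie entry is invented for this example.
facts = [
    ("IsA", "AlbertEinstein", "Physicist"),
    ("BornYear", "AlbertEinstein", "1879"),
    ("Nationality", "AlbertEinstein", "German"),
    ("IsA", "MarieCurie", "Physicist"),
]

index = index_by_relation(facts)

# How often each relation occurs.
relation_counts = Counter(relation for relation, _, _ in facts)

# A simple lookup: every subject with IsA(_, Physicist).
physicists = [subject for subject, obj in index["IsA"] if obj == "Physicist"]
```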
## Embedding Format
```
AlbertEinstein: 0.123 -0.456 0.789 ... (128 floats)
MarieCurie: 0.234 -0.567 0.890 ...
```
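A line in this format can be split into an entity name and a list of floats. The following is a minimal sketch; `parse_embedding_line` and `load_embeddings` are hypothetical helper names, and the sample uses three dimensions instead of 128 to stay short.

```python
def parse_embedding_line(line):
    """Split 'Entity: f1 f2 ...' into (entity, [floats])."""
    # Split on the last colon: the float values never contain one.
    name, sep, values = line.rpartition(":")
    if not sep:
        raise ValueError(f"no ':' separator in line: {line!r}")
    return name.strip(), [float(v) for v in values.split()]


def load_embeddings(lines):
    """Build a dict mapping entity name to vector from an iterable of lines."""
    embeddings = {}
    for line in lines:
        if line.strip():  # skip blank lines
            name, vector = parse_embedding_line(line)
            embeddings[name] = vector
    return embeddings


# Shortened 3-dimensional sample; the real file has 128 floats per line.
sample = [
    "AlbertEinstein: 0.123 -0.456 0.789",
    "MarieCurie: 0.234 -0.567 0.890",
]
embeddings = load_embeddings(sample)
```

For the real file, passing an open file handle works the same way, for example `load_embeddings(open("entity_embeddings.txt", encoding="utf-8"))`, assuming the file is UTF-8 encoded.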
Embeddings were trained by gradient descent on a relationship-similarity objective, so that:
- Entities sharing nationality cluster together
- Entities of the same type (Physicist, Composer) cluster together
- Directly related entities (InfluencedBy, SpouseOf) are similar
- Co-creators and same-era individuals cluster together
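
One way to check this clustering is a nearest-neighbour lookup. The sketch below uses cosine similarity, which seems a reasonable choice given that training used a cosine similarity loss, but it is not necessarily how the repository queries the embeddings. The `nearest` helper and the 2-dimensional toy vectors are invented for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def nearest(embeddings, query, k=3):
    """Return the k entity names most similar to `query`, best first."""
    target = embeddings[query]
    scored = [
        (cosine_similarity(target, vector), name)
        for name, vector in embeddings.items()
        if name != query
    ]
    return [name for _, name in sorted(scored, reverse=True)[:k]]


# Toy 2-dimensional vectors, invented for illustration.
toy = {
    "AlbertEinstein": [1.0, 0.1],
    "MarieCurie": [0.9, 0.2],
    "Paris": [-0.2, 1.0],
}
closest = nearest(toy, "AlbertEinstein", k=1)
```

A linear scan like this is fine for spot checks; for repeated queries over all 210,188 entities, a vectorized or indexed search would be a better fit.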
## Usage
### With Julia (TensorLogic)
```bash
# Clone the repo
git clone https://github.com/zekebass/tensor-logic
cd tensor-logic
# Download the data (--repo-type dataset is required for dataset repos)
huggingface-cli download zekebass/tensor-logic-wikipedia --repo-type dataset --local-dir knowledge/cleaned
# Set the Groq API key, then run the REPL
export GROQ_API_KEY="your-key"
julia --project=. knowledge/repl.jl
```
### Direct Download
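```bash
# Using the huggingface_hub CLI
pip install huggingface_hub
huggingface-cli download zekebass/tensor-logic-wikipedia --repo-type dataset
```
Or with Python:
```python
from huggingface_hub import hf_hub_download

# repo_type="dataset" is needed because the default repo type is "model"
path = hf_hub_download(
    repo_id="zekebass/tensor-logic-wikipedia",
    filename="facts_only.tl",
    repo_type="dataset",
)
```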
## Creation Process
1. **Source**: English Wikipedia XML dump (Vital Articles Level 4 + random sample)
2. **Extraction**: Groq API with `llama-3.3-70b-versatile` model
3. **Prompt Engineering**: Iteratively refined to produce clean, atomic facts
4. **Cleanup**: Removed duplicates, unknowns, and malformed entries
5. **Embedding Training**: ~3 minutes on CPU, gradient descent with cosine similarity loss
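
The exact cleanup rules are not documented here. As a rough sketch of what step 4 might look like, the filter below drops malformed lines, facts whose argument is a placeholder such as `Unknown`, and exact duplicates. The regular expression, the `PLACEHOLDERS` set, and the `clean_facts` name are all assumptions made for illustration.

```python
import re

# Assumed fact shape: Relation(Subject, Object). with two arguments.
FACT_RE = re.compile(r"^(\w+)\(\s*([^,()]+?)\s*,\s*([^,()]+?)\s*\)\.$")

# Assumed placeholder values; the real cleanup list is not documented.
PLACEHOLDERS = {"unknown", "none", "n/a"}


def clean_facts(lines):
    """Drop malformed lines, placeholder facts, and exact duplicates."""
    seen = set()
    cleaned = []
    for line in lines:
        match = FACT_RE.match(line.strip())
        if match is None:
            continue  # malformed entry
        relation, subject, obj = match.groups()
        if subject.lower() in PLACEHOLDERS or obj.lower() in PLACEHOLDERS:
            continue  # "unknown"-style fact
        fact = (relation, subject, obj)
        if fact in seen:
            continue  # duplicate
        seen.add(fact)
        cleaned.append(f"{relation}({subject}, {obj}).")
    return cleaned


raw = [
    "IsA(AlbertEinstein, Physicist).",
    "IsA(AlbertEinstein, Physicist).",   # duplicate
    "BornIn(AlbertEinstein, Unknown).",  # placeholder object
    "Awarded(AlbertEinstein",            # malformed
    "BornYear(AlbertEinstein, 1879).",
]
cleaned = clean_facts(raw)
```

The first occurrence of each fact is kept and input order is preserved, which keeps the output stable across runs.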
## Data Source & Attribution
This dataset is derived from [English Wikipedia](https://en.wikipedia.org/).
- **Source**: [Wikimedia Downloads](https://dumps.wikimedia.org/enwiki/) - `enwiki-20250601-pages-articles-multistream.xml.bz2`
- **Dump Date**: June 1, 2025
- **Articles Processed**: ~37,000 ([Wikipedia Vital Articles Level 4](https://en.wikipedia.org/wiki/Wikipedia:Vital_articles/Level/4) ~10K articles + random sample)
- **Original License**: Wikipedia content is licensed under [CC BY-SA 3.0](https://creativecommons.org/licenses/by-sa/3.0/)
**Note**: This dataset contains *extracted structured facts*, not verbatim Wikipedia text. The facts were generated by an LLM reading Wikipedia articles and outputting structured relations.
## Citation
```bibtex
@misc{tensorlogic2025,
title={Tensor Logic Wikipedia Knowledge Base},
author={Bass, Zeke},
year={2025},
publisher={Hugging Face},
url={https://huggingface.co/datasets/zekebass/tensor-logic-wikipedia},
note={Implementation assisted by Claude Opus 4.5 (Anthropic)}
}
```
## Based On
- Paper: ["Tensor Logic: The Language of AI"](https://arxiv.org/abs/2510.12269) by Pedro Domingos
- Implementation: [github.com/zekebass/tensor-logic](https://github.com/zekebass/tensor-logic)
## License
**Dataset**: [CC BY-SA 3.0](https://creativecommons.org/licenses/by-sa/3.0/)
This dataset is derived from Wikipedia content (CC BY-SA 3.0) and is released under the same license to comply with the share-alike requirement.
**Note**: The *code* in the [tensor-logic repository](https://github.com/zekebass/tensor-logic) is MIT licensed. Only this dataset (the extracted facts and embeddings) is CC BY-SA.
Provided by:
zekebass



