
zekebass/tensor-logic-wikipedia

Hugging Face · Updated 2025-12-08 · Indexed 2025-12-20
Download link:
https://hf-mirror.com/datasets/zekebass/tensor-logic-wikipedia
Description:
---
license: cc-by-sa-3.0
task_categories:
- question-answering
- text-generation
language:
- en
tags:
- tensor-logic
- knowledge-base
- symbolic-ai
- neural-symbolic
- embeddings
- wikipedia
size_categories:
- 100K<n<1M
---

# Tensor Logic Wikipedia Knowledge Base

A structured knowledge base extracted from Wikipedia, designed for hybrid neural-symbolic reasoning.

## Dataset Description

This dataset contains:

- **403,059 facts** in Datalog-style format
- **210,188 entity embeddings** (128 dimensions) learned from relationship patterns
- Facts extracted from **37,000+ Wikipedia articles** (Vital Articles + random sample)

## Files

| File | Description | Size |
|------|-------------|------|
| `facts_only.tl` | Clean facts in `Relation(Subject, Object).` format | 14 MB |
| `all_facts.tl` | Facts with source article comments | 15 MB |
| `entity_embeddings.txt` | Learned embeddings (entity: dim1 dim2 ... dim128) | 257 MB |

## Fact Format

```prolog
IsA(AlbertEinstein, Physicist).
BornYear(AlbertEinstein, 1879).
Nationality(AlbertEinstein, German).
ParentOf(AlbertEinstein, HansAlbertEinstein).
Awarded(AlbertEinstein, NobelPrizeInPhysics).
```

## Relations (3,029 unique)

Common relations include:

- **Identity**: IsA, InstanceOf
- **People**: BornIn, BornYear, DiedYear, Nationality, Occupation, SpouseOf, ParentOf, Awarded
- **Places**: LocatedIn, Capital, Country, FoundedYear
- **Works**: CreatedBy, AuthorOf, DirectedBy, PublishedYear, Genre
- **Concepts**: InfluencedBy, OpposedTo, PartOf

## Embedding Format

```
AlbertEinstein: 0.123 -0.456 0.789 ... (128 floats)
MarieCurie: 0.234 -0.567 0.890 ...
```

Embeddings were trained using gradient descent on relationship similarity:

- Entities sharing a nationality cluster together
- Entities of the same type (Physicist, Composer) cluster together
- Directly related entities (InfluencedBy, SpouseOf) are similar
- Co-creators and same-era individuals cluster together

## Usage

### With Julia (TensorLogic)

```bash
# Clone the repo
git clone https://github.com/zekebass/tensor-logic
cd tensor-logic

# Download data (the repository is a dataset, so the repo type must be given)
huggingface-cli download zekebass/tensor-logic-wikipedia --repo-type dataset --local-dir knowledge/cleaned

# Run the REPL
export GROQ_API_KEY="your-key"
julia --project=. knowledge/repl.jl
```

### Direct Download

```bash
# Using the huggingface_hub CLI
pip install huggingface_hub
huggingface-cli download zekebass/tensor-logic-wikipedia --repo-type dataset
```

```python
# Or with Python
from huggingface_hub import hf_hub_download

# repo_type="dataset" is required; the default repo type is "model"
hf_hub_download(
    repo_id="zekebass/tensor-logic-wikipedia",
    filename="facts_only.tl",
    repo_type="dataset",
)
```

## Creation Process

1. **Source**: English Wikipedia XML dump (Vital Articles Level 4 + random sample)
2. **Extraction**: Groq API with the `llama-3.3-70b-versatile` model
3. **Prompt Engineering**: Iteratively refined to produce clean, atomic facts
4. **Cleanup**: Removed duplicates, unknowns, and malformed entries
5. **Embedding Training**: ~3 minutes on CPU, gradient descent with cosine similarity loss

## Data Source & Attribution

This dataset is derived from [English Wikipedia](https://en.wikipedia.org/).

- **Source**: [Wikimedia Downloads](https://dumps.wikimedia.org/enwiki/) - `enwiki-20250601-pages-articles-multistream.xml.bz2`
- **Dump Date**: June 1, 2025
- **Articles Processed**: ~37,000 ([Wikipedia Vital Articles Level 4](https://en.wikipedia.org/wiki/Wikipedia:Vital_articles/Level/4) ~10K articles + random sample)
- **Original License**: Wikipedia content is licensed under [CC BY-SA 3.0](https://creativecommons.org/licenses/by-sa/3.0/)

**Note**: This dataset contains *extracted structured facts*, not verbatim Wikipedia text. The facts were generated by an LLM reading Wikipedia articles and outputting structured relations.

## Citation

```bibtex
@misc{tensorlogic2025,
  title={Tensor Logic Wikipedia Knowledge Base},
  author={Bass, Zeke},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/datasets/zekebass/tensor-logic-wikipedia},
  note={Implementation assisted by Claude Opus 4.5 (Anthropic)}
}
```

## Based On

- Paper: ["Tensor Logic: The Language of AI"](https://arxiv.org/abs/2510.12269) by Pedro Domingos
- Implementation: [github.com/zekebass/tensor-logic](https://github.com/zekebass/tensor-logic)

## License

**Dataset**: [CC BY-SA 3.0](https://creativecommons.org/licenses/by-sa/3.0/)

This dataset is derived from Wikipedia content (CC BY-SA 3.0) and is released under the same license to comply with the share-alike requirement.

**Note**: The *code* in the [tensor-logic repository](https://github.com/zekebass/tensor-logic) is MIT licensed. Only this dataset (the extracted facts and embeddings) is CC BY-SA.
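
## Parsing Example

The fact and embedding formats described above are simple enough to read with the Python standard library. The following is a minimal sketch, not the TensorLogic loader: the helper names `parse_fact`, `parse_embedding`, and `cosine_similarity` are hypothetical, and it assumes one fact or one embedding per line, with anything that does not match the `Relation(Subject, Object).` pattern (such as the source-article comments in `all_facts.tl`) skipped.

```python
import math
import re

# Assumed line shape, taken from the Fact Format section: Relation(Subject, Object).
FACT_RE = re.compile(r"^\s*(\w+)\(\s*([^,()]+?)\s*,\s*([^()]+?)\s*\)\.\s*$")


def parse_fact(line):
    """Return (relation, subject, object), or None if the line is not a fact."""
    match = FACT_RE.match(line)
    return match.groups() if match else None


def parse_embedding(line):
    """Split `Entity: f1 f2 ...` into (entity, [floats])."""
    # rpartition splits on the last colon, so a colon inside a name is tolerated.
    name, _, values = line.rpartition(":")
    return name.strip(), [float(v) for v in values.split()]


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Sample lines copied from the format sections above. The embedding values are
# the three illustrative numbers shown there, not full 128-dimensional vectors.
facts = [
    parse_fact(s)
    for s in (
        "IsA(AlbertEinstein, Physicist).",
        "BornYear(AlbertEinstein, 1879).",
    )
]
einstein = parse_embedding("AlbertEinstein: 0.123 -0.456 0.789")
curie = parse_embedding("MarieCurie: 0.234 -0.567 0.890")
similarity = cosine_similarity(einstein[1], curie[1])
```

To apply the same helpers to the real files, iterate over `facts_only.tl` or `entity_embeddings.txt` line by line and drop the `None` results from `parse_fact`. Note that all parsed values, including years such as `1879`, come back as strings.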
Provider:
zekebass