introspector/datasets
Hugging Face; updated 2026-03-19, indexed 2026-04-05
Download link:
https://hf-mirror.com/datasets/introspector/datasets
Resource description:
---
license: mit
task_categories:
- feature-extraction
- text-classification
language:
- en
- akk
- sux
tags:
- clifford-algebra
- erdfa
- cbor
- fungal-secretome
- sumerian
- cuneiform
- geometric-algebra
- boustrophedon
size_categories:
- n<1K
configs:
- config_name: default
  data_files:
  - split: train
    path: data/shards.jsonl
---
# eRDFa Clifford Algebra Shards
Corpus-agnostic Clifford algebra embeddings (Cl(15,0,0)) of fungal secretome proteins and Sumerian/Akkadian cuneiform texts, packaged as eRDFa CBOR shards.
## Overview
Each shard represents a file from an upstream research repository, parsed through the abstract codec pipeline:
```
Source (FASTA / text) → Embedding → Cl(15,0,0) multivectors → eRDFa CBOR shard
```
Three embedding types share the same algebra:
- **Hebrew**: 22 consonants → 15 grade-1 + 7 grade-2 (original boustrophedon extraction)
- **Secretome**: 20 amino acids → 15 grade-1 + 5 grade-2 (fungal effector proteins)
- **Text**: 26 Latin letters → 15 grade-1 + 11 grade-2 (Sumerian transliterations)
Transport maps between embeddings are grade-signature invariant (cubical type theory path induction).
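The grade counts in the list above can be checked with a minimal sketch. The assignment rule used here is an assumption made for illustration, not the pipeline's actual mapping: the first 15 symbols of an alphabet take the grade-1 blades e1..e15, and the remaining symbols take grade-2 blades in lexicographic order. The names `assign_blades` and `grade_signature` are hypothetical.

```python
from itertools import combinations

N = 15  # number of generators in Cl(15,0,0)

def assign_blades(alphabet):
    """Map each symbol to a basis blade, written as a tuple of generator indices.

    Assumed layout (not taken from the source): the first 15 symbols get the
    grade-1 blades e1..e15; any remaining symbols get grade-2 blades
    e1e2, e1e3, ... in lexicographic order.
    """
    grade1 = [(i,) for i in range(1, N + 1)]
    grade2 = combinations(range(1, N + 1), 2)  # 105 bivectors are available
    extra = max(0, len(alphabet) - N)
    blades = grade1 + [next(grade2) for _ in range(extra)]
    return dict(zip(alphabet, blades))

def grade_signature(mapping):
    """Return (number of grade-1 symbols, number of grade-2 symbols)."""
    grades = [len(blade) for blade in mapping.values()]
    return grades.count(1), grades.count(2)

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"   # 20 standard amino acids
LATIN = "abcdefghijklmnopqrstuvwxyz"   # 26 Latin letters

print(grade_signature(assign_blades(AMINO_ACIDS)))  # (15, 5)
print(grade_signature(assign_blades(LATIN)))        # (15, 11)
```

Under this assumed rule, any alphabet with more than 15 symbols yields the same 15 grade-1 blades and differs only in how many grade-2 blades it uses, which matches the counts listed for the three embeddings.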
## Sources
### Fungal (39 shards)
| Source | Shards | Repository |
|--------|--------|------------|
| predict_secretome | 3 | [fmaguire/predict_secretome](https://github.com/fmaguire/predict_secretome) |
| EffectorP-2.0 | 1 | [JanaSperschneider/EffectorP-2.0](https://github.com/JanaSperschneider/EffectorP-2.0) |
| ancient_fungal_antimicrobials | 35 | [fantin-mesny/Scripts_analysis_ancient_fungal_antimicrobials](https://github.com/fantin-mesny/Scripts_analysis_ancient_fungal_antimicrobials) |
### Sumerian / Akkadian (57 shards)
| Source | Shards | Repository |
|--------|--------|------------|
| Semi-Supervised NMT | 52 | [cdli-gh/Semi-Supervised-NMT-for-Sumerian-English](https://github.com/cdli-gh/Semi-Supervised-NMT-for-Sumerian-English) |
| Akkademia | 4 | [gaigutherz/Akkademia](https://github.com/gaigutherz/Akkademia) |
| OCR Sumerian | 1 | [ancient-world-citation-analysis/OCR_Sumerian](https://github.com/ancient-world-citation-analysis/OCR_Sumerian) |
## Schema
Each JSONL record:
```json
{
  "id": "fungal/effector_p:Scripts/Effector_Testing.fasta",
  "cid": "bafk...",
  "component": {
    "type": "KeyValue",
    "pairs": [
      ["source", "fungal/effector_p"],
      ["repo", "..."],
      ["file", "Scripts/Effector_Testing.fasta"],
      ["kind", "fasta"],
      ["groups", "284"],
      ["symbols", "67890"]
    ]
  },
  "tags": ["fasta", "fungal", "input"]
}
```
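Because `component.pairs` is a list of two-element arrays rather than a JSON object, it is often convenient to turn it into a dictionary. The following is a minimal sketch using only the standard library; the helper name `component_fields` is hypothetical, and the record is the example shown above, written on one line as it would appear in a JSONL file.

```python
import json

# The example record from the schema above, as a single JSONL line.
line = (
    '{"id": "fungal/effector_p:Scripts/Effector_Testing.fasta", '
    '"cid": "bafk...", '
    '"component": {"type": "KeyValue", "pairs": ['
    '["source", "fungal/effector_p"], ["repo", "..."], '
    '["file", "Scripts/Effector_Testing.fasta"], ["kind", "fasta"], '
    '["groups", "284"], ["symbols", "67890"]]}, '
    '"tags": ["fasta", "fungal", "input"]}'
)

def component_fields(record):
    """Turn the component's [key, value] pairs into a dict."""
    component = record["component"]
    if component["type"] != "KeyValue":
        raise ValueError(f"unexpected component type: {component['type']}")
    # If a key were repeated, the last occurrence would win.
    return dict(component["pairs"])

record = json.loads(line)
fields = component_fields(record)
print(fields["kind"])          # fasta
print(int(fields["groups"]))   # 284
```

Note that the numeric fields (`groups`, `symbols`) are stored as strings in the example record, so they need an explicit `int(...)` conversion before arithmetic.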
## Data Files
| File | Format | Description |
|------|--------|-------------|
| `data/shards.jsonl` | JSONL | All 96 shards, one JSON object per line |
| `data/*.cbor` | DA51-tagged CBOR | Individual binary shards |
| `data/inputs.tar` | tar | Bundle of all CBOR shards + manifest |
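A quick sanity check on a binary shard can be sketched without a CBOR library. This assumes that "DA51-tagged" means the top-level item is wrapped in CBOR tag number 0xDA51 (55889); the card does not spell this out, so treat the check and the helper name `has_da51_tag` as assumptions.

```python
DA51_TAG = 0xDA51  # 55889; assumed meaning of "DA51-tagged"

def has_da51_tag(data: bytes) -> bool:
    """Check whether a CBOR item starts with tag number 0xDA51.

    CBOR encodes a tag as major type 6. A tag number that needs two bytes
    uses the initial byte 0xD9, followed by the number in big-endian order.
    """
    return data[:3] == bytes([0xD9]) + DA51_TAG.to_bytes(2, "big")

# Tag 0xDA51 wrapping an empty map (0xA0), built by hand for illustration.
print(has_da51_tag(bytes.fromhex("d9da51a0")))  # True
print(has_da51_tag(bytes.fromhex("a0")))        # False
```

For a real shard, the bytes would come from reading one of the `data/*.cbor` files in binary mode; decoding the payload itself would need a CBOR library.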
## Usage
```python
from datasets import load_dataset
# Loads the default config; its single "train" split reads data/shards.jsonl.
ds = load_dataset("introspector/datasets")
```
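Once records are loaded, they can be tallied per source. The sketch below assumes every `id` follows the `source:file` layout of the schema example; `count_by_source` is a hypothetical helper, and the sample ids other than the first are invented for illustration.

```python
from collections import Counter

def count_by_source(records):
    """Count records per source, using the part of `id` before the first ':'."""
    return Counter(rec["id"].split(":", 1)[0] for rec in records)

sample = [
    {"id": "fungal/effector_p:Scripts/Effector_Testing.fasta"},
    {"id": "example/source:path/to/file.txt"},   # hypothetical id
    {"id": "example/source:path/to/other.txt"},  # hypothetical id
]
counts = count_by_source(sample)
print(counts["fungal/effector_p"])  # 1
print(counts["example/source"])     # 2
```

The same function should work on the loaded split, for example `count_by_source(ds["train"])`, since iterating a split yields one dictionary per record.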
## Pipeline
Built with [shem-hamephorash-72](https://github.com/meta-introspector/shem-hamephorash-72):
```bash
cargo run --release --bin package_inputs -- inputs.toml data/ --jsonl
```
## License
MIT. Upstream repositories retain their original licenses.
Provider:
introspector



