
introspector/datasets

Hugging Face | Updated 2026-03-19 | Indexed 2026-04-05
Download link:
https://hf-mirror.com/datasets/introspector/datasets
Description:
---
license: mit
task_categories:
- feature-extraction
- text-classification
language:
- en
- akk
- sux
tags:
- clifford-algebra
- erdfa
- cbor
- fungal-secretome
- sumerian
- cuneiform
- geometric-algebra
- boustrophedon
size_categories:
- n<1K
configs:
- config_name: default
  data_files:
  - split: train
    path: data/shards.jsonl
---

# eRDFa Clifford Algebra Shards

Corpus-agnostic Clifford algebra embeddings (Cl(15,0,0)) of fungal secretome proteins and Sumerian/Akkadian cuneiform texts, packaged as eRDFa CBOR shards.

## Overview

Each shard represents a file from an upstream research repository, parsed through the abstract codec pipeline:

```
Source (FASTA / text) → Embedding → Cl(15,0,0) multivectors → eRDFa CBOR shard
```

Three embedding types share the same algebra:

- **Hebrew**: 22 consonants → 15 grade-1 + 7 grade-2 (original boustrophedon extraction)
- **Secretome**: 20 amino acids → 15 grade-1 + 5 grade-2 (fungal effector proteins)
- **Text**: 26 Latin letters → 15 grade-1 + 11 grade-2 (Sumerian transliterations)

Transport maps between embeddings are grade-signature invariant (cubical type theory path induction).
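The symbol counts above follow one pattern: with 15 generators, the first 15 symbols of an alphabet can take the grade-1 basis vectors, and any remainder spills into grade-2 blades. A minimal sketch of that bookkeeping follows. The assignment order (alphabet order first, then bivectors in lexicographic index order) and the names `assign_blades`, `grade_signature`, and `N_GENERATORS` are assumptions for illustration, not the pipeline's actual codec.

```python
from itertools import combinations

N_GENERATORS = 15  # Cl(15,0,0) has 15 basis vectors e1..e15


def assign_blades(symbols):
    """Map each symbol to a basis blade, written as a tuple of generator indices.

    Hypothetical scheme: the first 15 symbols get grade-1 blades (i,),
    the rest get grade-2 blades (i, j) in lexicographic order.
    """
    grade1 = [(i,) for i in range(1, N_GENERATORS + 1)]
    grade2 = combinations(range(1, N_GENERATORS + 1), 2)  # 105 bivectors available
    blades = {}
    for k, sym in enumerate(symbols):
        blades[sym] = grade1[k] if k < N_GENERATORS else next(grade2)
    return blades


def grade_signature(blades):
    """Count how many symbols landed on each grade."""
    sig = {}
    for blade in blades.values():
        sig[len(blade)] = sig.get(len(blade), 0) + 1
    return sig


amino_acids = "ACDEFGHIKLMNPQRSTVWY"   # 20 standard amino acids
latin = "abcdefghijklmnopqrstuvwxyz"   # 26 Latin letters

print(grade_signature(assign_blades(amino_acids)))  # {1: 15, 2: 5}
print(grade_signature(assign_blades(latin)))        # {1: 15, 2: 11}
```

Under this scheme a 22-symbol alphabet yields 15 grade-1 and 7 grade-2 assignments, matching the Hebrew row.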
## Sources

### Fungal (39 shards)

| Source | Shards | Repository |
|--------|--------|------------|
| predict_secretome | 3 | [fmaguire/predict_secretome](https://github.com/fmaguire/predict_secretome) |
| EffectorP-2.0 | 1 | [JanaSperschneider/EffectorP-2.0](https://github.com/JanaSperschneider/EffectorP-2.0) |
| ancient_fungal_antimicrobials | 35 | [fantin-mesny/Scripts_analysis_ancient_fungal_antimicrobials](https://github.com/fantin-mesny/Scripts_analysis_ancient_fungal_antimicrobials) |

### Sumerian / Akkadian (57 shards)

| Source | Shards | Repository |
|--------|--------|------------|
| Semi-Supervised NMT | 52 | [cdli-gh/Semi-Supervised-NMT-for-Sumerian-English](https://github.com/cdli-gh/Semi-Supervised-NMT-for-Sumerian-English) |
| Akkademia | 4 | [gaigutherz/Akkademia](https://github.com/gaigutherz/Akkademia) |
| OCR Sumerian | 1 | [ancient-world-citation-analysis/OCR_Sumerian](https://github.com/ancient-world-citation-analysis/OCR_Sumerian) |

## Schema

Each JSONL record:

```json
{
  "id": "fungal/effector_p:Scripts/Effector_Testing.fasta",
  "cid": "bafk...",
  "component": {
    "type": "KeyValue",
    "pairs": [
      ["source", "fungal/effector_p"],
      ["repo", "..."],
      ["file", "Scripts/Effector_Testing.fasta"],
      ["kind", "fasta"],
      ["groups", "284"],
      ["symbols", "67890"]
    ]
  },
  "tags": ["fasta", "fungal", "input"]
}
```

## Data Files

| File | Format | Description |
|------|--------|-------------|
| `data/shards.jsonl` | JSONL | All 96 shards, one JSON object per line |
| `data/*.cbor` | DA51-tagged CBOR | Individual binary shards |
| `data/inputs.tar` | tar | Bundle of all CBOR shards + manifest |

## Usage

```python
from datasets import load_dataset

ds = load_dataset("introspector/datasets")
```

## Pipeline

Built with [shem-hamephorash-72](https://github.com/meta-introspector/shem-hamephorash-72):

```bash
cargo run --release --bin package_inputs -- inputs.toml data/ --jsonl
```

## License

MIT. Upstream repositories retain their original licenses.
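As a note on the Schema section: `component.pairs` stores metadata as a list of two-element arrays rather than as an object, so it can be convenient to turn it into a dictionary before filtering records. The sketch below does this with the standard library only, using the example record from the Schema section with its elided values left as-is. The helper name `pairs_to_dict` is hypothetical, and converting `groups` and `symbols` to integers assumes they are always numeric strings.

```python
import json

# One line of data/shards.jsonl, copied from the Schema example (elisions kept).
line = (
    '{"id": "fungal/effector_p:Scripts/Effector_Testing.fasta", "cid": "bafk...", '
    '"component": {"type": "KeyValue", "pairs": [["source", "fungal/effector_p"], '
    '["repo", "..."], ["file", "Scripts/Effector_Testing.fasta"], ["kind", "fasta"], '
    '["groups", "284"], ["symbols", "67890"]]}, "tags": ["fasta", "fungal", "input"]}'
)


def pairs_to_dict(record):
    """Flatten a record's KeyValue component into a plain dict (hypothetical helper)."""
    meta = dict(record["component"]["pairs"])  # each pair is [key, value]
    # Assumption: these two fields always hold numeric strings.
    for key in ("groups", "symbols"):
        if key in meta:
            meta[key] = int(meta[key])
    return meta


record = json.loads(line)
meta = pairs_to_dict(record)
print(meta["kind"], meta["groups"], meta["symbols"])  # fasta 284 67890
```

To process the whole file, the same helper can be applied to each line of `data/shards.jsonl` after `json.loads`.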

Provider:
introspector