introspector/datasets

Name: introspector/datasets
Creator: introspector
Published: 2026-03-19 20:39:44
License: No description available

Hugging Face | Updated 2026-03-19 | Indexed 2026-04-05

Download link:

https://hf-mirror.com/datasets/introspector/datasets


Description:

---
license: mit
task_categories:
- feature-extraction
- text-classification
language:
- en
- akk
- sux
tags:
- clifford-algebra
- erdfa
- cbor
- fungal-secretome
- sumerian
- cuneiform
- geometric-algebra
- boustrophedon
size_categories:
- n<1K
configs:
- config_name: default
  data_files:
  - split: train
    path: data/shards.jsonl
---

# eRDFa Clifford Algebra Shards

Corpus-agnostic Clifford algebra embeddings (Cl(15,0,0)) of fungal secretome proteins and Sumerian/Akkadian cuneiform texts, packaged as eRDFa CBOR shards.

## Overview

Each shard represents a file from an upstream research repository, parsed through the abstract codec pipeline:

```
Source (FASTA / text) → Embedding → Cl(15,0,0) multivectors → eRDFa CBOR shard
```

Three embedding types share the same algebra:

- **Hebrew**: 22 consonants → 15 grade-1 + 7 grade-2 (original boustrophedon extraction)
- **Secretome**: 20 amino acids → 15 grade-1 + 5 grade-2 (fungal effector proteins)
- **Text**: 26 Latin letters → 15 grade-1 + 11 grade-2 (Sumerian transliterations)

Transport maps between embeddings are grade-signature invariant (cubical type theory path induction).
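
The grade counts quoted for the three embeddings follow one pattern: up to 15 symbols land on grade-1 basis vectors and the remainder on grade-2 bivectors. The sketch below reproduces those counts. It is only an illustration: the function name `grade_signature` and the "fill grade-1 first" rule are assumptions inferred from the numbers above, not taken from the upstream pipeline.

```python
from math import comb

N_GENERATORS = 15  # Cl(15,0,0): 15 basis vectors, all squaring to +1


def grade_signature(alphabet_size: int) -> tuple[int, int]:
    """Return (grade-1 count, grade-2 count) for an alphabet of the given size.

    Assumption: symbols fill the 15 grade-1 slots first and any remainder
    is assigned to grade-2 bivectors, matching the counts in the card.
    """
    if alphabet_size < 0:
        raise ValueError("alphabet size must be non-negative")
    grade1 = min(alphabet_size, N_GENERATORS)
    grade2 = alphabet_size - grade1
    if grade2 > comb(N_GENERATORS, 2):
        raise ValueError("alphabet does not fit in grades 1 and 2")
    return grade1, grade2


for name, size in [("Hebrew", 22), ("Secretome", 20), ("Text", 26)]:
    g1, g2 = grade_signature(size)
    print(f"{name}: {size} symbols -> {g1} grade-1 + {g2} grade-2")

# Capacity of the algebra itself, independent of any alphabet.
print("bivectors available:", comb(N_GENERATORS, 2))
print("full multivector dimension:", 2 ** N_GENERATORS)
```

With 15 generators there are 105 independent bivectors, so each of the three alphabets uses only a small part of grade 2.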

## Sources

### Fungal (39 shards)

| Source | Shards | Repository |
|--------|--------|------------|
| predict_secretome | 3 | [fmaguire/predict_secretome](https://github.com/fmaguire/predict_secretome) |
| EffectorP-2.0 | 1 | [JanaSperschneider/EffectorP-2.0](https://github.com/JanaSperschneider/EffectorP-2.0) |
| ancient_fungal_antimicrobials | 35 | [fantin-mesny/Scripts_analysis_ancient_fungal_antimicrobials](https://github.com/fantin-mesny/Scripts_analysis_ancient_fungal_antimicrobials) |

### Sumerian / Akkadian (57 shards)

| Source | Shards | Repository |
|--------|--------|------------|
| Semi-Supervised NMT | 52 | [cdli-gh/Semi-Supervised-NMT-for-Sumerian-English](https://github.com/cdli-gh/Semi-Supervised-NMT-for-Sumerian-English) |
| Akkademia | 4 | [gaigutherz/Akkademia](https://github.com/gaigutherz/Akkademia) |
| OCR Sumerian | 1 | [ancient-world-citation-analysis/OCR_Sumerian](https://github.com/ancient-world-citation-analysis/OCR_Sumerian) |

## Schema

Each JSONL record:

```json
{
  "id": "fungal/effector_p:Scripts/Effector_Testing.fasta",
  "cid": "bafk...",
  "component": {
    "type": "KeyValue",
    "pairs": [
      ["source", "fungal/effector_p"],
      ["repo", "..."],
      ["file", "Scripts/Effector_Testing.fasta"],
      ["kind", "fasta"],
      ["groups", "284"],
      ["symbols", "67890"]
    ]
  },
  "tags": ["fasta", "fungal", "input"]
}
```

## Data Files

| File | Format | Description |
|------|--------|-------------|
| `data/shards.jsonl` | JSONL | All 96 shards, one JSON object per line |
| `data/*.cbor` | DA51-tagged CBOR | Individual binary shards |
| `data/inputs.tar` | tar | Bundle of all CBOR shards + manifest |

## Usage

```python
from datasets import load_dataset

ds = load_dataset("introspector/datasets")
```

## Pipeline

Built with [shem-hamephorash-72](https://github.com/meta-introspector/shem-hamephorash-72):

```bash
cargo run --release --bin package_inputs -- inputs.toml data/ --jsonl
```

## License

MIT. Upstream repositories retain their original licenses.
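
## Example: flattening a record

The Schema section stores per-shard metadata as a list of `[key, value]` string pairs under `component.pairs`, which is awkward to query directly. A minimal sketch of turning one JSONL line into a flat dictionary is shown below. The helper `flatten_record` and the `NUMERIC_KEYS` set are hypothetical names introduced here, and treating `groups` and `symbols` as integers is an assumption based on the single example record. The elided `cid` and `repo` values are kept exactly as they appear in the card.

```python
import json

# One line of data/shards.jsonl, copied from the Schema example.
EXAMPLE_LINE = (
    '{"id": "fungal/effector_p:Scripts/Effector_Testing.fasta", '
    '"cid": "bafk...", '
    '"component": {"type": "KeyValue", "pairs": ['
    '["source", "fungal/effector_p"], ["repo", "..."], '
    '["file", "Scripts/Effector_Testing.fasta"], ["kind", "fasta"], '
    '["groups", "284"], ["symbols", "67890"]]}, '
    '"tags": ["fasta", "fungal", "input"]}'
)

# Assumption: these keys always hold integer-valued strings.
NUMERIC_KEYS = {"groups", "symbols"}


def flatten_record(record: dict) -> dict:
    """Merge the KeyValue pairs of a shard record into one flat dict."""
    pairs = dict(record["component"]["pairs"])  # [[k, v], ...] -> {k: v}
    for key in NUMERIC_KEYS:
        if key in pairs:
            pairs[key] = int(pairs[key])
    return {
        "id": record["id"],
        "cid": record["cid"],
        "tags": record["tags"],
        **pairs,
    }


flat = flatten_record(json.loads(EXAMPLE_LINE))
print(flat["source"], flat["kind"], flat["groups"], flat["symbols"])
```

The same function can be applied line by line when reading `data/shards.jsonl` directly with `json.loads`.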

