EXOROBOURII/Stanza-TinyStories

Name: EXOROBOURII/Stanza-TinyStories
Creator: EXOROBOURII
Published: 2026-04-24 09:59:28
License: CDLA-Sharing-1.0

Source: Hugging Face. Updated: 2026-04-24. Indexed: 2026-04-26.

Download link:

https://hf-mirror.com/datasets/EXOROBOURII/Stanza-TinyStories

Description:

Stanza-TinyStories-2 is a structurally and morphologically enriched version of the TinyStories dataset (Eldan and Li, 2023). It projects the flat, one-dimensional synthetic text generated by large language models into a fully resolved grammatical and topological space. Every sentence in the 2.7-million-story training split and the 21,000-story validation split has been deterministically parsed to extract Universal Part-of-Speech (UPOS) tags, Universal Dependencies relations (DepRel), named entities (NER), and lemmas. The corpus provides a large-scale, highly constrained repository of directed acyclic graphs (DAGs) for researchers working at the intersection of symbolic linguistics, mechanistic interpretability, and neuro-symbolic AI. The dataset is curated by Exorobourii LLC, is in English, is licensed under CDLA-Sharing-1.0, and is derived from the source dataset roneneldan/TinyStories.
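
To make the "sentence as a DAG" idea concrete, here is a minimal sketch in plain Python of how one parsed sentence could be held in memory and checked for tree structure. The `Token` class, its field names, and the hand-written example sentence are assumptions made for illustration; they are not taken from the dataset's actual schema, which should be checked on the dataset card.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Token:
    """One parsed token. Field names are hypothetical, not the dataset's schema."""
    id: int      # 1-based position in the sentence
    text: str
    lemma: str
    upos: str    # Universal POS tag
    head: int    # id of the governing token, 0 for the sentence root
    deprel: str  # Universal Dependencies relation to the head


def children_by_head(tokens: List[Token]) -> Dict[int, List[int]]:
    """Map each head id (0 = virtual root) to the ids of its dependents."""
    edges: Dict[int, List[int]] = {}
    for tok in tokens:
        edges.setdefault(tok.head, []).append(tok.id)
    return edges


def is_dependency_tree(tokens: List[Token]) -> bool:
    """True if exactly one token attaches to the virtual root and every
    token is reachable from it exactly once (so there are no cycles)."""
    edges = children_by_head(tokens)
    if len(edges.get(0, [])) != 1:
        return False
    seen = set()
    stack = list(edges[0])
    while stack:
        node = stack.pop()
        if node in seen:
            return False
        seen.add(node)
        stack.extend(edges.get(node, []))
    return seen == {tok.id for tok in tokens}


# Hand-annotated example sentence (illustrative, not read from the dataset).
SENTENCE = [
    Token(1, "The", "the", "DET", 2, "det"),
    Token(2, "cat", "cat", "NOUN", 3, "nsubj"),
    Token(3, "sat", "sit", "VERB", 0, "root"),
    Token(4, ".", ".", "PUNCT", 3, "punct"),
]

# Two tokens that govern each other, with nothing attached to the root.
CYCLIC = [
    Token(1, "a", "a", "DET", 2, "det"),
    Token(2, "b", "b", "NOUN", 1, "dep"),
]
```

Calling `is_dependency_tree(SENTENCE)` accepts the well-formed parse, while `is_dependency_tree(CYCLIC)` rejects the pair of tokens that point at each other. A dependency tree is a special case of a DAG, so a check of this kind is one plausible sanity test to run on records before building graph-based features from them.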

Provider:

EXOROBOURII
