hellosindh/indus-script-synthetic

Name: hellosindh/indus-script-synthetic
Creator: hellosindh
Published: 2026-04-10 23:57:12
License: 暂无描述

Hugging Face2026-04-10 更新2026-04-12 收录

下载链接：

https://hf-mirror.com/datasets/hellosindh/indus-script-synthetic

下载链接

链接失效反馈

官方服务：

资源简介：

--- language: - und tags: - indus-script - ancient-scripts - synthetic-data - nlp - archaeology - grammar-analysis license: cc-by-4.0 task_categories: - text-generation - fill-mask pretty_name: Synthetic Indus Script Dataset size_categories: - 1K<n<10K --- # Synthetic Indus Script Dataset This dataset contains 5,000 synthetic Indus Script sequences produced by a two-stage training and generation pipeline built on 3,310 real archaeological inscriptions. Stage 1 — Train on real inscriptions: Four models were trained independently on the 3,310 real sequences. TinyBERT was trained as both a masked language model (predicting missing signs) and a sequence classifier (valid vs corrupted). An N-gram RTL model was trained to learn right-to-left transition probabilities. ELECTRA was trained as a token-level discriminator. DeBERTa was trained as a sequence-level discriminator. Each model learned a different aspect of Indus Script grammar. Stage 2 — Generate synthetic sequences: NanoGPT (a small transformer, 153K parameters) was trained on the same 3,310 sequences and used to generate candidate sequences in RTL order. Each candidate was then scored by all four models. Only sequences passing a combined ensemble threshold of 85% (BERT 50% + N-gram 25% + ELECTRA 25%) were kept. Sequences that exactly matched real inscriptions were separated as seal reproductions. All duplicates were removed. This produced 5,000 novel sequences with 752 exact seal matches as validation evidence. Stage 3 — Retrain on combined data: The 5,000 synthetic sequences were combined with the 3,310 real sequences to form a combined corpus of 8,310 sequences. All five models were retrained on this larger dataset. TinyBERT classifier accuracy improved from 78.4% to 89.0%. NanoGPT perplexity dropped from 32.5 to 13.3. The retrained models were then used to generate a final 5,000 sequences — the ones in this dataset — with each sequence carrying a confidence score from the ensemble, verified novel (not in real corpus), and validated against the 752 seal reproduction benchmark. However, we identify a fundamental limitation: vocabulary coverage is restricted (~20–30% of sign types), reflecting the intrinsic sparsity of the underlying corpus, where the majority of signs occur extremely infrequently. As a result, the dataset is suitable for studying sequential structure and grammar, but not for token frequency or full distributional analysis. --- ## Files | File | Description | |---|---| | `synthetic_indus_5k.jsonl` | 5,000 synthetic sequences with model scores | | `sign_index.json` | All 641 signs: glyph, sign ID, role, corpus frequency | --- ## What the Models Discovered ### Reading Direction **Both real and synthetic agree: RTL (right-to-left)** - Real corpus: H2 LTR = 3.41 bits, H2 RTL = 3.00 bits - Synthetic corpus: H2 LTR = 3.41 bits, H2 RTL = 3.00 bits (identical) - Lower entropy = more grammatical structure - RTL has ~12% stronger structure than LTR in both corpora ### Grammar Strength (Entropy Chain) **Both real and synthetic confirm language-like grammar** | Level | Real | Synthetic | Match | |---|---|---|---| | H1 — sign entropy | 6.03 bits | 5.42 bits | ⚠ synthetic less diverse | | H2 — bigram entropy | 3.41 bits | — | ✅ same grammar proven | | H3 — trigram entropy | 2.39 bits | — | ✅ language-like decay | | Redundancy | 43.5% | — | ✅ same structure | H1 is lower in synthetic (5.42 vs 6.03) because the model uses fewer distinct signs — 475 of 641 real signs appear so rarely that NanoGPT never learned to generate them. ### Zipf's Law **Both corpora follow Zipf's law — confirming language-like token distribution** - Real: R² = 0.968, slope = 1.89 - Synthetic: follows same rank-frequency pattern (top signs identical) - This is a strong indicator that Indus Script is a structured writing system ### Sign Roles (RTL reading) **Real and synthetic agree on all major positional roles** | Role | Signs | Real rate | Synthetic agrees | |---|---|---|---| | PREFIX (RTL terminal) | T638, T604, T406, T496 | 28.5%, 11.1%, 10.0%, 6.5% | ✅ yes | | SUFFIX (RTL initial) | T123, T122, T701, T741 | 5.1%, 4.8%, 4.4%, 3.8% | ✅ yes | | CORE (medial) | T101, T268, T177, T243 | dominant in middle | ✅ yes | ### Locked Formulas (Mutual Information) **Both real and synthetic reproduce the same high-MI sign pairs** | Pair | MI (real) | Appears in synthetic | Count in synthetic | |---|---|---|---| | T123 → T609 | 0.108 bits | ✅ yes | common | | T638 → T177 | 0.097 bits | ✅ yes | common | | T101 → T741 | 0.084 bits | ✅ yes | common | | T638 → T653 | 0.061 bits | ✅ yes | common | | T406 → T638 | 0.035 bits | ✅ yes | common | These formulas appeared on 100–200 real seals each. The model independently reproduced them without being told they were significant. ### 752 Seal Reproductions The NanoGPT generator independently reproduced 752 exact inscription patterns from the real corpus. This is the strongest validation — the model learned real grammatical patterns, not noise. --- ## Where Real and Synthetic Differ ### 1. Vocabulary Coverage | | Real | Synthetic | |---|---|---| | Unique signs used | 641 | ~168 | | Signs appearing >10 times | 166 | ~100 | **Why:** 475 of 641 real signs appear ≤10 times across 3,310 inscriptions. NanoGPT assigns near-zero probability to signs it rarely sees during training. This is a hard limit of the archaeological record — not enough inscriptions exist to give NanoGPT reliable training signal for rare signs. Any synthetic corpus claiming higher coverage would be fabricating distributions unsupported by evidence. ### 2. Sequence Length | | Real | Synthetic | |---|---|---| | Avg length | ~4.5 signs | ~7.5 signs | | Length-2 sequences | 29.6% | 0.6% | **Why:** NanoGPT architecture uses BOS/EOS tokens and a block size of 20. It naturally generates sequences of 5–10 tokens. Short 2-sign sequences like T604→T123 are common in real inscriptions (administrative stamps) but NanoGPT rarely produces them because they look "incomplete" to the model. ### 3. Transition Sharpness | | Real | Synthetic | |---|---|---| | Conditional prob deviation | — | 0.23 | | T638→T177 rate | 20.0% | 23.5% | | T123→T609 rate | 60.3% | varies | **Why:** The model learned dominant transitions very well and overuses them. Real inscriptions show more variation because different scribes, sites, and time periods produced slightly different formulaic patterns. The synthetic corpus reflects the statistical average, not the full range. ### 4. Start/End Entropy | | Real | Synthetic | |---|---|---| | START entropy | 3.12 bits (constrained) | 1.74 bits (more constrained) | | END entropy | 6.07 bits (diverse) | 5.02 bits (less diverse) | **Why:** NanoGPT starts almost every sequence with T638 (the dominant PREFIX sign) because it dominates the training distribution. Real scribes used a wider range of opening signs. The model has learned that T638 is a safe start but hasn't learned when to use the alternatives. --- ## Quality Statistics | Metric | Value | |---|---| | Avg BERT score | 0.9517 | | Avg N-gram score | 0.8946 | | Avg ELECTRA score | 0.9283 | | Avg ensemble score | 0.9316 | | Avg sequence length | 7.53 | | BERT AUC (real vs corrupted) | 0.915 — Excellent | | N-gram AUC (real vs corrupted) | 0.941 — Excellent | | Adversarial accuracy | 0.438 — near 0.5 (indistinguishable) | | Reconstruction (30% masked) | 46.4% — strong grammar | | KL divergence | 0.119 — good match | --- ## Suitable Use Cases ✅ Grammar augmentation for NLP experiments on Indus Script ✅ Pretraining sequence models where vocabulary coverage >26% is not required ✅ Benchmarking sequence validity classifiers ✅ Studying positional grammar and sign role patterns ✅ Reproducing or extending this grammar analysis pipeline ❌ Token frequency or vocabulary distribution studies (coverage too low) ❌ Claimed decipherment or meaning attribution ❌ Full realistic simulation of Indus inscriptions --- ## Models Models used for this data generation can be found here: [hellosindh/indus-script-models](https://huggingface.co/hellosindh/indus-script-models) --- ## Citation ```bibtex @dataset{indus_synthetic_2025, title = {Synthetic Indus Script Dataset}, year = {2025}, note = {5,000 grammar-validated synthetic sequences generated by 4-model ensemble. 752 exact seal reproductions validated. RTL direction, Zipf distribution, and positional grammar confirmed.}, license = {CC-BY-4.0} } ```

语言：未指定标签：印度河文字（Indus Script）、古代文字（Ancient Scripts）、合成数据（Synthetic Data）、自然语言处理（NLP）、考古学（Archaeology）、语法分析（Grammar Analysis）许可：CC BY 4.0（知识共享署名4.0国际许可协议）任务类别：文本生成（Text Generation）、掩码填充（Fill-Mask）美观名称：合成印度河文字数据集规模类别：1000 < 样本数 < 10000 # 合成印度河文字数据集本数据集包含5000条合成印度河文字（Indus Script）序列，其生成基于3310条真实考古铭文，通过两阶段训练与生成流水线完成。 ## 阶段1：基于真实铭文训练本次共基于3310条真实序列独立训练了四款模型：TinyBERT同时被训练为掩码语言模型（预测缺失字符）与序列分类器（区分有效与篡改序列）；N-gram RTL模型被训练以学习从右至左的转移概率；ELECTRA被训练为Token级判别器；DeBERTa被训练为序列级判别器。四款模型各自学习印度河文字语法的不同维度。 ## 阶段2：生成合成序列 NanoGPT（一款参数量为153K的小型Transformer）基于相同的3310条序列完成训练，用于生成从右至左顺序的候选序列。随后所有候选序列将由四款模型共同打分，仅通过85%集成阈值（BERT占比50%、N-gram占比25%、ELECTRA占比25%）的序列得以保留。与真实铭文完全匹配的序列将被分离为印章复刻样本，同时移除所有重复序列。最终生成5000条全新序列，其中包含752条与真实印章铭文完全匹配的样本作为验证依据。 ## 阶段3：基于合并数据集再训练将5000条合成序列与3310条真实序列合并，得到包含8310条序列的综合语料库。随后基于该更大规模的数据集对全部五款模型进行再训练：TinyBERT分类器的准确率从78.4%提升至89.0%；NanoGPT的困惑度从32.5降至13.3。再训练后的模型被用于生成最终的5000条序列（即本数据集所包含的样本），每条序列均附带集成模型给出的置信度得分，且经验证为全新样本（未出现在真实语料库中），同时通过了752条印章复刻样本的基准验证。然而，本数据集存在一项根本性局限：词汇覆盖范围受限（仅覆盖约20%~30%的字符类型），这反映了原始语料库本身的稀疏性——绝大多数字符出现频率极低。因此，本数据集适用于序列结构与语法研究，但不适用于Token频率或全分布分析。 --- ## 文件列表 | 文件 | 描述 | |---|---| | `synthetic_indus_5k.jsonl` | 包含5000条合成序列及各模型打分结果 | | `sign_index.json` | 全部641个字符的相关信息：字形、字符ID、功能、语料库出现频率 | --- ## 模型的发现 ### 阅读方向 **真实与合成数据集均证实：阅读方向为从右至左（RTL）** - 真实语料库：H2（二元熵）LTR方向=3.41比特，H2 RTL方向=3.00比特 - 合成语料库：H2 LTR方向=3.41比特，H2 RTL方向=3.00比特（二者完全一致） - 熵值越低代表语法结构越强 - 两个语料库中，RTL方向的结构强度均比LTR方向高出约12% ### 语法强度（熵链） **真实与合成数据集均证实其具备类语言语法** | 层级 | 真实语料库 | 合成数据集 | 匹配情况 | |---|---|---|---| | H1 — 字符熵 | 6.03比特 | 5.42比特 | ⚠ 合成数据集多样性更低 | | H2 — 二元熵 | 3.41比特 | 无数据 | ✅ 证实语法结构一致 | | H3 — 三元熵 | 2.39比特 | 无数据 | ✅ 呈现类语言的熵衰减规律 | | 冗余度 | 43.5% | 无数据 | ✅ 结构一致 | 合成数据集的H1值更低（5.42 vs 6.03），原因是模型仅能生成较少的不同字符：641个真实字符中有475个出现频率极低，NanoGPT从未学习过生成这些字符。 ### 齐普夫定律（Zipf's Law） **两个语料库均符合齐普夫定律，证实其Token分布具备类语言特征** - 真实语料库：决定系数R²=0.968，斜率=1.89 - 合成语料库：遵循相同的秩频分布规律（高频字符与真实语料库完全一致） - 这一结果有力证明印度河文字是一套结构化书写系统。 ### 字符功能（基于RTL阅读顺序） **真实与合成数据集在主要位置功能上完全一致** | 功能分类 | 字符ID | 真实语料库占比 | 合成数据集匹配情况 | |---|---|---|---| | 前缀（RTL末端） | T638、T604、T406、T496 | 28.5%、11.1%、10.0%、6.5% | ✅ 匹配 | | 后缀（RTL起始） | T123、T122、T701、T741 | 5.1%、4.8%、4.4%、3.8% | ✅ 匹配 | | 核心（中间位置） | T101、T268、T177、T243 | 集中出现于序列中部 | ✅ 匹配 | ### 固定搭配（基于互信息） **真实与合成数据集均重现了相同的高互信息字符对** | 字符对 | 真实语料库互信息值 | 合成数据集是否出现 | 合成数据集出现频次 | |---|---|---|---| | T123 → T609 | 0.108比特 | ✅ 是 | 高频出现 | | T638 → T177 | 0.097比特 | ✅ 是 | 高频出现 | | T101 → T741 | 0.084比特 | ✅ 是 | 高频出现 | | T638 → T653 | 0.061比特 | ✅ 是 | 高频出现 | | T406 → T638 | 0.035比特 | ✅ 是 | 高频出现 | 这些搭配在真实印章铭文中各出现100~200次，而模型在未被提前告知这些搭配重要性的前提下，自主重现了这些模式。 ### 752条印章复刻样本 NanoGPT生成器自主复刻了752条与真实语料库完全一致的铭文模式，这是最强有力的验证依据：说明模型学习到的是真实的语法模式，而非随机噪声。 --- ## 真实与合成数据集的差异 ### 1. 词汇覆盖范围 | 指标 | 真实语料库 | 合成数据集 | |---|---|---| | 唯一使用字符数 | 641 | ~168 | | 出现次数>10次的字符数 | 166 | ~100 | **原因**：641个真实字符中有475个在3310条铭文中的出现次数≤10次。NanoGPT对训练期间极少见到的字符会赋予近乎为零的生成概率。这是考古记录本身的固有局限：现有铭文数量不足以让NanoGPT获得稀有字符的可靠训练信号。任何声称拥有更高词汇覆盖范围的合成语料库，均属于捏造不符合真实证据的字符分布。 ### 2. 序列长度 | 指标 | 真实语料库 | 合成数据集 | |---|---|---| | 平均长度 | ~4.5个字符 | ~7.5个字符 | | 长度为2的序列占比 | 29.6% | 0.6% | **原因**：NanoGPT架构使用BOS/EOS Token，且块大小为20，因此其自然生成的序列长度通常为5~10个Token。真实铭文中长度为2的序列（如T604→T123，多为行政印章）十分常见，但NanoGPT极少生成此类序列，因为对模型而言这类序列看起来“不完整”。 ### 3. 转移概率尖锐度 | 指标 | 真实语料库 | 合成数据集 | |---|---|---| | 条件概率偏差 | 无数据 | 0.23 | | T638→T177 出现率 | 20.0% | 23.5% | | T123→T609 出现率 | 60.3% | 存在波动 | **原因**：模型对高频转移模式学习充分，但过度依赖这些模式。真实铭文则呈现更多样性，因为不同抄写者、遗址与时代会产生略有差异的固定搭配模式。合成语料库仅反映统计平均情况，而非完整的分布范围。 ### 4. 起始/结束熵 | 指标 | 真实语料库 | 合成数据集 | |---|---|---| | 起始熵 | 3.12比特（存在约束） | 1.74比特（约束更强） | | 结束熵 | 6.07比特（多样性更高） | 5.02比特（多样性更低） | **原因**：NanoGPT几乎所有序列均以T638（占比最高的前缀字符）开头，因为该字符在训练分布中占主导地位。真实抄写者则会使用更多样的起始字符。模型仅学习到T638是安全的起始选择，但未掌握何时使用其他替代字符。 --- ## 质量统计 | 指标 | 数值 | |---|---| | 平均BERT得分 | 0.9517 | | 平均N-gram得分 | 0.8946 | | 平均ELECTRA得分 | 0.9283 | | 平均集成得分 | 0.9316 | | 平均序列长度 | 7.53 | | BERT AUC（区分真实与篡改序列） | 0.915 — 优秀 | | N-gram AUC（区分真实与篡改序列） | 0.941 — 优秀 | | 对抗准确率 | 0.438 — 接近0.5（难以区分） | | 掩码重构（30%字符被掩码） | 46.4% — 语法表现优异 | | KL散度 | 0.119 — 匹配度良好 | --- ## 适用场景 ✅ 用于印度河文字NLP实验的语法数据增强 ✅ 无需超过26%词汇覆盖范围的序列模型预训练 ✅ 序列有效性分类器的基准测试 ✅ 位置语法与字符功能模式研究 ✅ 复现或扩展本语法分析流水线 ❌ Token频率或词汇分布研究（词汇覆盖度过低） ❌ 声称的文字破译或语义归因 ❌ 印度河铭文的全真实模拟 --- ## 所用模型本数据集生成所用的模型可通过以下链接获取： [hellosindh/indus-script-models](https://huggingface.co/hellosindh/indus-script-models) --- ## 引用 bibtex @dataset{indus_synthetic_2025, title = {Synthetic Indus Script Dataset}, year = {2025}, note = {5,000 grammar-validated synthetic sequences generated by 4-model ensemble. 752 exact seal reproductions validated. RTL direction, Zipf distribution, and positional grammar confirmed.}, license = {CC-BY-4.0} }

提供机构：

hellosindh

搜集汇总

数据集介绍

构建方式

在考古学与计算语言学交叉领域，合成数据为破解古代文字系统提供了新途径。本数据集通过精心设计的三阶段流程构建而成：首先基于3310条真实考古铭文，分别训练了TinyBERT、N-gram RTL、ELECTRA和DeBERTa四个模型，各自捕捉印度河文字的不同语法特征；随后利用NanoGPT生成候选序列，并采用集成评分机制筛选出通过85%置信阈值的序列，剔除重复项后获得5000条新颖序列，其中包含752条与真实印章完全匹配的验证样本；最后将合成数据与原始数据合并重新训练模型，使分类准确率从78.4%提升至89.0%，最终生成带有置信度评分且经过严格验证的数据集。

使用方法

在应用层面，本数据集适用于特定范围的学术研究。研究者可将其用于增强印度河文字的语法分析实验，或作为序列模型预训练的基础材料，尤其适合研究位置语法与符号角色模式等结构性特征。数据集配套的集成评分机制为序列有效性分类器提供了基准测试框架，其多模型验证流程也可供相关研究方法论参考。需要明确的是，由于符号覆盖率仅约26%，该数据集不适用于词汇分布研究或声称破译的用途，而应聚焦于语法结构的探索与验证。

背景与挑战

背景概述

印度河文字作为古代哈拉帕文明的重要遗存，其符号系统的结构与语法特性长期困扰着考古学与计算语言学领域。2025年发布的Indus Script Synthetic数据集，由研究团队通过多阶段生成流程构建，旨在应对真实铭文样本稀缺的困境。该数据集以3,310条真实考古铭文为基础，融合了TinyBERT、NanoGPT等模型的学习成果，生成了5,000条符合统计语法规律的人工序列。其核心研究问题聚焦于通过合成数据验证印度河文字的序列结构、阅读方向及符号角色分布，为探索这一未解文字系统的内在逻辑提供了可计算的研究载体，显著增强了该领域的数据驱动分析能力。

当前挑战

该数据集致力于解决印度河文字语法结构解析的领域挑战，即在铭文样本有限且符号出现频次高度稀疏的条件下，推断其潜在的序列生成规则与统计规律。构建过程中的主要挑战体现在词汇覆盖度的固有局限：由于真实铭文中超过475种符号出现频次极低，生成模型难以学习其分布，导致合成数据仅能涵盖约26%的符号类型，无法完整反映原始词汇的多样性。此外，模型在序列长度分布与起始符号多样性方面与真实数据存在偏差，反映了统计模型在捕捉考古记录中 scribal 变异与语境细微差异时的结构性困难。

常用场景

经典使用场景

在古文字学与计算语言学交叉领域，该数据集为研究印度河文字的结构特性提供了关键资源。其经典应用场景在于通过合成序列增强对原始铭文语法模式的分析能力，尤其适用于探索该文字系统的右向左阅读方向、位置角色分配以及高频符号组合的固定公式。研究者可借助这些经过多模型集成验证的合成数据，深入剖析序列内部的统计规律与语言类似特征，从而在有限考古实物基础上拓展对文字系统组织原则的理解。

解决学术问题

该数据集有效缓解了印度河文字研究因实物铭文稀缺导致的样本不足问题。通过生成语法有效的合成序列，它支持学者对文字系统的序列结构、熵链衰减模式及齐夫定律分布进行量化验证，为‘印度河文字是否具有语言结构’这一长期争议提供了计算证据。其意义在于将传统考古学的定性描述转化为可计算、可复现的统计模型，推动了古代文字研究从经验推测向数据驱动范式的转型。

实际应用

在实际应用层面，该数据集可作为自然语言处理模型的特定领域预训练资源，尤其适用于低资源场景下的序列生成与掩码预测任务。考古数字化项目可将其用于构建文字识别系统的数据增强管道，提升模型对残缺铭文的补全能力。同时，教育领域也可利用这些合成序列设计交互式学习工具，帮助学习者直观理解印度河文字的排列规律与结构特征，促进公众对古代文明的认知。

数据集最近研究