achulz/mayan-mt5-qeqchi-dataset
Source: Hugging Face. Updated: 2026-03-05. Indexed: 2026-03-29.
Download link:
https://hf-mirror.com/datasets/achulz/mayan-mt5-qeqchi-dataset
---
license: apache-2.0
language:
- en
- es
- kek
tags:
- translation
- mayan
- synthetic
---
# Mayan-mT5: Q'eqchi' Synthetic Parallel Corpus (Phase 1)
This repository contains the JSONL training and validation datasets used to fine-tune the `mayan-mt5-qeqchi-adapter`. The corpus consists of parallel sentence pairs mapping English and Spanish to Q'eqchi', formatted specifically for bidirectional sequence-to-sequence training.
## Repository Cross-Links
* **Model Adapter (Hugging Face):** [achulz/mayan-mt5-qeqchi-adapter](https://huggingface.co/achulz/mayan-mt5-qeqchi-adapter)
* **Training Code & Generator (GitHub):** [achulzhanov/mayan-mt5](https://github.com/achulzhanov/mayan-mt5)
## Important Note: Synthetic Data
**This dataset is entirely synthetic.** It was not scraped from native speakers or human-translated documents.
The sentences were generated programmatically by a custom rule-based engine that draws on English, Spanish, and Q'eqchi' lexicons and combines their entries through grammatical templates. Word frequencies follow a Zipf's law distribution to mimic natural language patterns.
Due to the synthetic nature of the data, this corpus is strictly intended for Phase 1 baseline training and academic research. It lacks the natural entropy, idiomatic nuance, and cultural context of human-generated text.
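To illustrate the Zipf-weighted template filling described above, here is a minimal sketch. It is not the generator from the linked repository: the function names, the exponent `s`, the template, and the English-only toy lexicons are all assumptions made for illustration.

```python
import random

def zipf_weights(n, s=1.0):
    """Return normalized Zipf weights for ranks 1..n (weight proportional to 1 / rank**s)."""
    raw = [1.0 / (rank ** s) for rank in range(1, n + 1)]
    total = sum(raw)
    return [w / total for w in raw]

def sample_words(lexicon, k, rng, s=1.0):
    """Draw k words (with replacement) from a rank-ordered lexicon using Zipf weights."""
    return rng.choices(lexicon, weights=zipf_weights(len(lexicon), s), k=k)

# Hypothetical toy lexicons, ordered from most to least frequent.
# The real engine uses English, Spanish, and Q'eqchi' lexicons.
NOUNS = ["house", "water", "road", "bird", "stone"]
VERBS = ["sees", "finds", "carries"]

def fill_template(rng):
    """Fill one hypothetical subject-verb-object template."""
    subject, obj = sample_words(NOUNS, 2, rng)
    verb = sample_words(VERBS, 1, rng)[0]
    return f"The {subject} {verb} the {obj}."

if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so repeated runs give the same sentences
    for _ in range(3):
        print(fill_template(rng))
```

With `s=1.0`, the first-ranked word is drawn twice as often as the second and three times as often as the third, which is the skew a Zipf distribution is meant to reproduce.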
## Dataset Structure
The files are provided in JSONL format. Each line is a JSON object representing a single translation pair, laid out so that the Hugging Face `datasets` library can ingest it directly.
* `mT5_train_v4.jsonl`: Primary training split.
* `mT5_val_v4.jsonl`: Primary validation split.
* `mT5_val_mini_v4.jsonl`: A lightweight validation split for rapid evaluation during training steps.
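Because this card does not list the field names inside each record, a reasonable first step is to read a split and inspect its keys. The sketch below uses only the Python standard library; `read_jsonl` and `field_names` are hypothetical helper names, and the path in the usage block assumes the file has been downloaded to the working directory.

```python
import json

def read_jsonl(path):
    """Yield one parsed JSON object per non-empty line of a JSONL file."""
    with open(path, encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines, e.g. a trailing newline
            try:
                yield json.loads(line)
            except json.JSONDecodeError as err:
                raise ValueError(f"{path}:{line_number}: invalid JSON") from err

def field_names(path, limit=100):
    """Collect the keys seen in the first `limit` records to discover the schema."""
    keys = set()
    for index, record in enumerate(read_jsonl(path)):
        if index >= limit:
            break
        keys.update(record)
    return sorted(keys)

if __name__ == "__main__":
    # Assumes the mini validation split sits in the current directory.
    print(field_names("mT5_val_mini_v4.jsonl"))
```

The same files can also be passed to the generic JSON loader of the `datasets` library, for example `load_dataset("json", data_files={"train": "mT5_train_v4.jsonl", "validation": "mT5_val_v4.jsonl"})`.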
Provider: achulz



