sol-r/historica-instruct
收藏Hugging Face2026-03-23 更新2026-03-29 收录
下载链接:
https://hf-mirror.com/datasets/sol-r/historica-instruct
下载链接
链接失效反馈官方服务:
资源简介:
---
language:
- la
- grc
- en
- he
- cop
- non
- ang
- enm
- got
- cu
- xcl
- de
- fr
- it
- es
- ru
license: cc-by-sa-4.0
task_categories:
- text-generation
- translation
tags:
- instruction-tuning
- ancient-languages
- historical-languages
- translation
- multilingual
- latin
- ancient-greek
- coptic
- old-norse
- old-english
- gothic
- uncensored
size_categories:
- 100K<n<1M
---
# historica-instruct
instruction-response pairs for fine-tuning language models on ancient/historical
language tasks and general capability. 150k examples: 50k domain-specific
historical language tasks across 5 task types, plus 100k filtered general
instruction following.
## overview
| | count |
|---|---:|
| total examples | 150,000 |
| domain-specific (historica) | 50,000 |
| general instructions | 100,000 |
## domain-specific split (50k)
5 task types using 130 instruction templates in 8 languages:
| task | count | description |
|---|---:|---|
| translate forward | 12,500 | ancient → modern language translation |
| translate reverse | 12,500 | modern → ancient language translation |
| continuation | 12,500 | monolingual text continuation |
| analysis | 7,500 | language identification, grammar analysis, content summary |
| gapfill | 5,000 | reconstruct missing/damaged text |
### translation directions
bidirectional — the model learns to translate both FROM and INTO ancient
languages:
| direction | examples |
|---|---:|
| eng → lat | 4,920 |
| lat → eng | 4,904 |
| eng → grc | 3,832 |
| grc → eng | 3,807 |
| eng → ang | 886 |
| ang → eng | 881 |
| eng → heb | 847 |
| heb → eng | 837 |
| eng → non | 827 |
| non → eng | 860 |
| + coptic, gothic, OCS, armenian directions | ~2,400 |
### source languages
| language | code | examples |
|---|---|---:|
| Latin | lat | 16,867 |
| English | eng | 16,222 |
| Ancient Greek | grc | 9,148 |
| Middle English | enm | 2,792 |
| Old Norse | non | 1,713 |
| Old English | ang | 937 |
| Coptic | cop | 882 |
| Hebrew | heb | 837 |
| Norwegian | nob | 191 |
| Sami | sme | 123 |
| + German, French, Danish, OCS, Gothic | | < 100 each |
### instruction languages
| language | examples |
|---|---:|
| English | ~17k |
| Russian | ~8k |
| German | ~4k |
| Italian | ~4k |
| Spanish | ~4k |
| French | ~4k |
| Latin | ~4k |
| Modern Greek | ~4k |
### church content
christian/patristic texts deprioritized:
- tagged christian tradition: 15% keep rate
- keyword-detected church content: 30% keep rate
- pagan mythology, secular literature, law, philosophy proportionally overrepresented
## general split (100k)
| source | examples | description |
|---|---:|---|
| [teknium/OpenHermes-2.5](https://hf.co/datasets/teknium/OpenHermes-2.5) | 50,000 | GPT-4 generated, diverse tasks. no safety filtering. apache-2.0. |
| [allenai/tulu-3-sft-mixture](https://hf.co/datasets/allenai/tulu-3-sft-mixture) | 30,000 | high-quality SFT mix (FLAN, No Robots, OpenAssistant). refusals filtered. ODC-BY. |
| [CohereForAI/aya_dataset](https://hf.co/datasets/CohereForAI/aya_dataset) | 20,000 | human-annotated multilingual instruction pairs. 200+ languages. apache-2.0. |
### refusal filtering
18 regex patterns remove safety-theater responses from tulu-3 (~15% filtered):
"I'm sorry, but I can't...", "against my programming", "harmful or offensive",
etc. important for an academic tool discussing ancient warfare, pagan religion,
slavery without unnecessary moralizing.
## schema
| column | type | description |
|---|---|---|
| `instruction` | string | instruction prompt |
| `response` | string | expected response |
| `source` | string | task origin |
| `src_lang` | string | source text language |
| `tgt_lang` | string | target language |
| `instr_lang` | string | instruction language |
## data sources
- domain-specific built from [sol-r/historica-pairs](https://hf.co/datasets/sol-r/historica-pairs)
and [sol-r/historica-corpus](https://hf.co/datasets/sol-r/historica-corpus)
- instruction templates: 130 templates across 4 task types (translate, continue,
analyze, gapfill) in 8 languages with correct grammar per language
(Russian declension, German articles, French prepositions, etc.)
## intended use
instruction fine-tuning of encoder-decoder or decoder-only models. designed as a
complete SFT mix:
- 67% general capability (openhermes + filtered tulu-3 + aya multilingual)
- 33% domain-specific ancient language tasks (5 task types, bidirectional)
## license
domain-specific: CC BY-SA 4.0. general: apache-2.0 (OpenHermes, Aya), ODC-BY (tulu-3).
提供机构:
sol-r



