five

sol-r/historica-instruct

收藏
Hugging Face2026-03-23 更新2026-03-29 收录
下载链接:
https://hf-mirror.com/datasets/sol-r/historica-instruct
下载链接
链接失效反馈
官方服务:
资源简介:
--- language: - la - grc - en - he - cop - non - ang - enm - got - cu - xcl - de - fr - it - es - ru license: cc-by-sa-4.0 task_categories: - text-generation - translation tags: - instruction-tuning - ancient-languages - historical-languages - translation - multilingual - latin - ancient-greek - coptic - old-norse - old-english - gothic - uncensored size_categories: - 100K<n<1M --- # historica-instruct instruction-response pairs for fine-tuning language models on ancient/historical language tasks and general capability. 150k examples: 50k domain-specific historical language tasks across 5 task types, plus 100k filtered general instruction following. ## overview | | count | |---|---:| | total examples | 150,000 | | domain-specific (historica) | 50,000 | | general instructions | 100,000 | ## domain-specific split (50k) 5 task types using 130 instruction templates in 8 languages: | task | count | description | |---|---:|---| | translate forward | 12,500 | ancient → modern language translation | | translate reverse | 12,500 | modern → ancient language translation | | continuation | 12,500 | monolingual text continuation | | analysis | 7,500 | language identification, grammar analysis, content summary | | gapfill | 5,000 | reconstruct missing/damaged text | ### translation directions bidirectional — the model learns to translate both FROM and INTO ancient languages: | direction | examples | |---|---:| | eng → lat | 4,920 | | lat → eng | 4,904 | | eng → grc | 3,832 | | grc → eng | 3,807 | | eng → ang | 886 | | ang → eng | 881 | | eng → heb | 847 | | heb → eng | 837 | | eng → non | 827 | | non → eng | 860 | | + coptic, gothic, OCS, armenian directions | ~2,400 | ### source languages | language | code | examples | |---|---|---:| | Latin | lat | 16,867 | | English | eng | 16,222 | | Ancient Greek | grc | 9,148 | | Middle English | enm | 2,792 | | Old Norse | non | 1,713 | | Old English | ang | 937 | | Coptic | cop | 882 | | Hebrew | heb | 837 | | Norwegian | nob | 191 | | Sami | sme | 123 | | + German, French, Danish, OCS, Gothic | | < 100 each | ### instruction languages | language | examples | |---|---:| | English | ~17k | | Russian | ~8k | | German | ~4k | | Italian | ~4k | | Spanish | ~4k | | French | ~4k | | Latin | ~4k | | Modern Greek | ~4k | ### church content christian/patristic texts deprioritized: - tagged christian tradition: 15% keep rate - keyword-detected church content: 30% keep rate - pagan mythology, secular literature, law, philosophy proportionally overrepresented ## general split (100k) | source | examples | description | |---|---:|---| | [teknium/OpenHermes-2.5](https://hf.co/datasets/teknium/OpenHermes-2.5) | 50,000 | GPT-4 generated, diverse tasks. no safety filtering. apache-2.0. | | [allenai/tulu-3-sft-mixture](https://hf.co/datasets/allenai/tulu-3-sft-mixture) | 30,000 | high-quality SFT mix (FLAN, No Robots, OpenAssistant). refusals filtered. ODC-BY. | | [CohereForAI/aya_dataset](https://hf.co/datasets/CohereForAI/aya_dataset) | 20,000 | human-annotated multilingual instruction pairs. 200+ languages. apache-2.0. | ### refusal filtering 18 regex patterns remove safety-theater responses from tulu-3 (~15% filtered): "I'm sorry, but I can't...", "against my programming", "harmful or offensive", etc. important for an academic tool discussing ancient warfare, pagan religion, slavery without unnecessary moralizing. ## schema | column | type | description | |---|---|---| | `instruction` | string | instruction prompt | | `response` | string | expected response | | `source` | string | task origin | | `src_lang` | string | source text language | | `tgt_lang` | string | target language | | `instr_lang` | string | instruction language | ## data sources - domain-specific built from [sol-r/historica-pairs](https://hf.co/datasets/sol-r/historica-pairs) and [sol-r/historica-corpus](https://hf.co/datasets/sol-r/historica-corpus) - instruction templates: 130 templates across 4 task types (translate, continue, analyze, gapfill) in 8 languages with correct grammar per language (Russian declension, German articles, French prepositions, etc.) ## intended use instruction fine-tuning of encoder-decoder or decoder-only models. designed as a complete SFT mix: - 67% general capability (openhermes + filtered tulu-3 + aya multilingual) - 33% domain-specific ancient language tasks (5 task types, bidirectional) ## license domain-specific: CC BY-SA 4.0. general: apache-2.0 (OpenHermes, Aya), ODC-BY (tulu-3).
提供机构:
sol-r
5,000+
优质数据集
54 个
任务类型
进入经典数据集
二维码
社区交流群

面向社区/商业的数据集话题

二维码
科研交流群

面向高校/科研机构的开源数据集话题

数据驱动未来

携手共赢发展

商业合作