apertus-pretrain-romansh
Source: ModelScope community (魔搭社区). Updated 2025-12-05, indexed 2025-09-27.
Download link:
https://modelscope.cn/datasets/swiss-ai/apertus-pretrain-romansh
Description:
This dataset consists of three different parts: monolingual Romansh data; polylingual data, or more precisely data translated from Romansh into German, French, Italian, or English; and synthetic data.
The polylingual data consists of aligned and non-aligned data. The synthetic data was created by interweaving the translation data and prefacing it with the sentence "This is a text translated from SOURCE LANGUAGE to Rumantsch Grischun".
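As a rough illustration of the synthetic-data step, here is a minimal Python sketch. It assumes that "interweaving" means alternating aligned source and Romansh sentences, which the description does not spell out; the function name `make_synthetic_sample` and the newline separator are hypothetical and not taken from the preprocessing scripts.

```python
from itertools import chain

# Prefix sentence quoted in the dataset description; SOURCE LANGUAGE is filled in per sample.
PREFIX = "This is a text translated from {source_language} to Rumantsch Grischun"


def make_synthetic_sample(source_sentences, romansh_sentences, source_language):
    """Interleave aligned sentence pairs and prepend the prefix sentence.

    Assumption: the two lists are sentence-aligned and of equal length.
    """
    if len(source_sentences) != len(romansh_sentences):
        raise ValueError("expected aligned sentence lists of equal length")
    # zip pairs each source sentence with its Romansh counterpart;
    # chain.from_iterable flattens the pairs into one alternating sequence.
    interleaved = chain.from_iterable(zip(source_sentences, romansh_sentences))
    header = PREFIX.format(source_language=source_language)
    return "\n".join([header, *interleaved])
```

For example, `make_synthetic_sample(["Guten Tag."], ["Bun di."], "German")` yields the prefix line followed by the German and then the Romansh sentence.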
Each record carries an "idiom" metadata field if a specific idiom was provided; otherwise the data is implicitly assumed to be in Rumantsch Grischun.
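The fallback rule for the idiom can be sketched as follows. This assumes each record is a Python dict with an optional `"idiom"` key; the actual schema is not given in the description, and `resolve_idiom` is a hypothetical helper.

```python
# Default stated in the dataset description when no idiom is provided.
DEFAULT_IDIOM = "Rumantsch Grischun"


def resolve_idiom(record):
    """Return the record's idiom, falling back to the default when it is missing or empty."""
    idiom = record.get("idiom")
    return idiom if idiom else DEFAULT_IDIOM
```

With this helper, a record such as `{"text": "...", "idiom": "Sursilvan"}` resolves to Sursilvan, while a record without the field resolves to Rumantsch Grischun.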
The data consists of:
- Law texts and announcements from the municipalities of: Sagogn (Sursilvan), Lantsch (Surmiran), Zernez (Vallader), Ilanz (Sursilvan)
- Law Texts from the Canton of Grisons in Rumantsch Grischun (https://www.gr-lex.gr.ch/app/rm/systematic/texts_of_law)
- The Bilingual Corpus on GitHub (https://github.com/ZurichNLP/RumantschCorpora/tree/master)
- Online Dictionaries from the Lia Rumantscha in Surmiran, Sutsilvan and Sursilvan (www.pledarigrond.ch)
- Romansh Websites on Wikipedia
Below I give a token count using `alehc/swissai-tokenizer`. Mixed language tags like 'de/roh' indicate that either the translation text was not aligned or that it is synthetic data using the two languages. Note that the synthetic data token count is inflated by the above-mentioned prefix.
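A per-language token count of this kind could be reproduced along the following lines. This is a sketch only: the record fields `"language"` and `"text"` are hypothetical, and the tokenizer is passed in as a plain callable so the example runs without downloading anything.

```python
from collections import Counter


def count_tokens_by_language(records, encode):
    """Sum token counts per language tag (e.g. 'roh' or 'de/roh').

    `encode` is any callable that maps a string to a sequence of tokens.
    """
    totals = Counter()
    for record in records:
        totals[record["language"]] += len(encode(record["text"]))
    return dict(totals)


# Assumption: the tokenizer named in the text can be loaded with the
# Hugging Face `transformers` library, in which case one could pass:
#   from transformers import AutoTokenizer
#   tokenizer = AutoTokenizer.from_pretrained("alehc/swissai-tokenizer")
#   counts = count_tokens_by_language(records, tokenizer.encode)
```

Note that `tokenizer.encode` may add special tokens by default, which would shift the totals slightly.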

[Three charts with token statistics; images not included here.]

Note that all data has been preprocessed using the pipeline in https://github.com/swiss-ai/Swiss-AI-Romansh-Scripts.
Please feel free to contact me if you have any comments regarding the data (niklasc@icloud.com).
Provider:
maas
Created:
2025-09-04



