apertus-pretrain-swiss
收藏魔搭社区2025-12-05 更新2025-12-06 收录
下载链接:
https://modelscope.cn/datasets/swiss-ai/apertus-pretrain-swiss
下载链接
链接失效反馈官方服务:
资源简介:
# Swiss Pretrain Data
This dataset provides a large collection of open-access and license-compliant Swiss data sources for language model training.
The dataset includes the following sources:
| Name | Internal ID | Tokens (B) | Description |
|------|-------------|------------|-------------|
| [Curia Vista](https://www.parlament.ch/en/ratsbetrieb/curia-vista) | `curiavista` | 0.5 | Legal and administrative documents from the Swiss database of parliamentary proceedings. |
| [enscheidsuche](http://entscheidsuche.ch/) | `enscheidsuche_html` | 4.5 | Swiss court decisions, sampled at 50% for balance. |
| [FineWeb-2](https://huggingface.co/datasets/HuggingFaceFW/fineweb-2) | `fineweb-2-swissfilter-quality_10-filterrobots` | 1.8 | A Swiss-only subset of FineWeb-2, filtered for top 10% quality. |
| [Romansh Data](https://huggingface.co/datasets/swiss-ai/romansh_data) | `roh_data` |0.1 | Various Romansh-language data sources, as detailed in the [dataset documentation](https://huggingface.co/datasets/swiss-ai/romansh_data). |
All sampling, filtering and data-preparation scripts can be found in [our dedicated GitHub repository](https://github.com/swiss-ai/data-pipeline-pretrain).
Feel free to [reach out](mailto:sven.najem-meyer@epfl.ch) for any questions or suggestions 😊
# 瑞士预训练数据(Swiss Pretrain Data)
本数据集收录了大量可公开获取且符合许可协议的瑞士数据源,用于大语言模型(Large Language Model, LLM)的训练。
本数据集包含以下数据源:
| 名称 | 内部标识符 | Token数(十亿) | 描述 |
|------|-------------|------------|-------------|
| [Curia Vista](https://www.parlament.ch/en/ratsbetrieb/curia-vista) | `curiavista` | 0.5 | 源自瑞士议会议事数据库的法律与行政文书。 |
| [enscheidsuche](http://entscheidsuche.ch/) | `enscheidsuche_html` | 4.5 | 瑞士法院判决数据集,按50%比例采样以保证数据均衡。 |
| [FineWeb-2](https://huggingface.co/datasets/HuggingFaceFW/fineweb-2) | `fineweb-2-swissfilter-quality_10-filterrobots` | 1.8 | FineWeb-2的瑞士专属子集,经筛选保留Top 10%高质量数据。 |
| [Romansh 语料集](https://huggingface.co/datasets/swiss-ai/romansh_data) | `roh_data` | 0.1 | 各类罗曼什语数据源,详细说明见[数据集文档](https://huggingface.co/datasets/swiss-ai/romansh_data)。 |
所有采样、筛选及数据预处理脚本均可在[专属GitHub仓库](https://github.com/swiss-ai/data-pipeline-pretrain)中获取。
如有任何疑问或建议,欢迎随时[联系我们](mailto:sven.najem-meyer@epfl.ch) 😊
提供机构:
maas
创建时间:
2025-09-04



