windprak/steuerllm_pretraining_dataset
Source: Hugging Face. Updated 2026-02-12; indexed 2026-03-29.
Download link: https://hf-mirror.com/datasets/windprak/steuerllm_pretraining_dataset
---
language:
- de
license: cc-by-nc-4.0
task_categories:
- text-generation
---
# SteuerLLM Pretraining Dataset
[Project page](https://steuerllm.i5.ai.fau.de) | [Paper](https://arxiv.org/abs/2602.11081) | [GitHub](https://github.com/windprak/steuerllm)
A pretraining dataset for German tax law, filtered from FineWeb. It was used for the continual pretraining stage of **SteuerLLM**, a specialized large language model for German tax law analysis.
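A minimal sketch of how the data might be inspected with the Hugging Face `datasets` library. The repository ID comes from this card; the split name `train` and the helper name `peek` are assumptions, since the card does not document splits or column names.

```python
from itertools import islice

# Repository ID as given on this card.
REPO_ID = "windprak/steuerllm_pretraining_dataset"


def peek(n: int = 3, split: str = "train"):
    """Return the first n records of the dataset as a list of dicts.

    Assumption: a split called "train" exists; the card does not say so.
    """
    # Imported lazily so this module loads even without the library installed.
    from datasets import load_dataset  # pip install datasets

    # streaming=True reads records on demand instead of downloading every file.
    stream = load_dataset(REPO_ID, split=split, streaming=True)
    return list(islice(stream, n))
```

Because the column names are not documented here, a reasonable first step is to call `peek(1)` and look at the keys of the returned record before building a training pipeline on top of it.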
## Dataset Description
The SteuerLLM pretraining dataset is a domain-specific subset filtered from large-scale web corpora. It identifies and extracts tax-related content from German web data in order to adapt base model representations to the specialized terminology, precise legal language, and structure of German tax legislation.
For more information on the model and the full pipeline, please visit the [GitHub repository](https://github.com/windprak/steuerllm).
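The card does not describe how tax-related content was identified. The sketch below shows one simple way such a domain filter could work, using keyword density. It is an illustration only, not the authors' pipeline: the term list, the threshold, and the function names are all hypothetical.

```python
import re
from typing import Iterable, Iterator

# Hypothetical term stems; the actual filter criteria are not given in this card.
TAX_STEMS = ("steuer", "finanzamt", "abgabenordnung")

_TOKEN = re.compile(r"\w+")


def tax_term_density(text: str) -> float:
    """Fraction of tokens that contain one of the hypothetical tax stems."""
    tokens = _TOKEN.findall(text.lower())
    if not tokens:
        return 0.0
    # Substring matching catches German compounds such as "Einkommensteuer".
    hits = sum(1 for tok in tokens if any(stem in tok for stem in TAX_STEMS))
    return hits / len(tokens)


def filter_tax_documents(
    docs: Iterable[str], min_density: float = 0.05
) -> Iterator[str]:
    """Yield only documents whose tax-term density reaches the threshold."""
    for doc in docs:
        if tax_term_density(doc) >= min_density:
            yield doc


docs = [
    "Die Einkommensteuer wird vom Finanzamt festgesetzt.",
    "Das Wetter in Erlangen ist heute sonnig.",
]
kept = list(filter_tax_documents(docs))
```

With these two sample sentences, only the first one passes the filter. A production filter would more likely combine such heuristics with a trained classifier; the repository linked above is the place to check what was actually used.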
## Citation
```bibtex
@article{steuerllm,
author = {Wind, Sebastian and Sopa, Jeta and Schmid, Laurin and Jackl, Quirin and Kiefer, Sebastian and Wu, Fei and Mayr, Martin and Köstler, Harald and Wellein, Gerhard and Maier, Andreas and Tayebi Arasteh, Soroosh},
title = {SteuerLLM: Local specialized large language model for German tax law analysis},
year = {2026},
journal = {arXiv preprint arXiv:2602.11081},
url = {https://arxiv.org/abs/2602.11081}
}
```
## License
CC BY-NC 4.0 (`cc-by-nc-4.0`), for research use only.
Provider: windprak



