cis-lmu/bavarian_to_english

Source: Hugging Face. Updated 2026-03-21; indexed 2025-04-12.
Download link:
https://hf-mirror.com/datasets/cis-lmu/bavarian_to_english
Resource description:
---
configs:
- config_name: bar-eng
  data_files:
  - split: train
    path: bar.parquet
language:
- bar
- eng
task_categories:
- translation
---

## Bavarian to English

Due to the scarcity of high-quality Bavarian–German parallel corpora, we use `GPT-4` to translate the Bavarian portion of the Wikipedia into English.

## Citation

```
@inproceedings{lin-etal-2025-construction,
    title = "Construction-Based Reduction of Translationese for Low-Resource Languages: A Pilot Study on {B}avarian",
    author = {Lin, Peiqin and Thaler, Marion and Goschala, Daniela and Kargaran, Amir Hossein and Liu, Yihong and Martins, Andr{\'e} F. T. and Sch{\"u}tze, Hinrich},
    editor = "Hahn, Michael and Rani, Priya and Kumar, Ritesh and Shcherbakov, Andreas and Sorokin, Alexey and Serikov, Oleg and Cotterell, Ryan and Vylomova, Ekaterina",
    booktitle = "Proceedings of the 7th Workshop on Research in Computational Linguistic Typology and Multilingual NLP",
    month = aug,
    year = "2025",
    address = "Vienna, Austria",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.sigtyp-1.13/",
    doi = "10.18653/v1/2025.sigtyp-1.13",
    pages = "114--121",
    ISBN = "979-8-89176-281-7",
    abstract = "When translating into a low-resource language, a language model can have a tendency to produce translations that are close to the source (e.g., word-by-word translations) due to a lack of rich low-resource training data in pretraining. Thus, the output often is translationese that differs considerably from what native speakers would produce naturally. To remedy this, we synthetically create a training set in which the frequency of a construction unique to the low-resource language is artificially inflated. For the case of Bavarian, we show that, after training, the language model has learned the unique construction and that native speakers judge its output as more natural. Our pilot study suggests that construction-based mitigation of translationese is a promising approach. Code and artifacts are available at \url{https://github.com/cisnlp/BayernGPT}."
}
```
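The card above declares a single configuration, `bar-eng`, with one `train` split stored in `bar.parquet`. The following is a minimal sketch of how that configuration might be loaded with the Hugging Face `datasets` library. The helper names `parquet_url` and `load_train_split` are hypothetical, and the direct-download URL assumes the file sits on the repository's default `main` branch on huggingface.co. The card does not list column names, so the sketch does not assume any.

```python
# Values taken from the dataset card's YAML front matter.
REPO_ID = "cis-lmu/bavarian_to_english"
CONFIG_NAME = "bar-eng"
SPLIT = "train"
PARQUET_FILE = "bar.parquet"


def parquet_url(repo_id=REPO_ID, filename=PARQUET_FILE, revision="main"):
    """Build a direct download URL for a file in a Hub dataset repository.

    Assumption: the file lives on the given revision (default branch 'main').
    """
    return f"https://huggingface.co/datasets/{repo_id}/resolve/{revision}/{filename}"


def load_train_split():
    """Load the 'train' split of the 'bar-eng' configuration (needs network access)."""
    # Imported lazily so parquet_url() works without the `datasets` package installed.
    from datasets import load_dataset

    return load_dataset(REPO_ID, CONFIG_NAME, split=SPLIT)
```

Calling `load_train_split()` and then inspecting `column_names` on the result is a reasonable first step, since the card itself does not document the schema.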
Provider:
cis-lmu