josefbednar/11m-czech-sentences
Source: Hugging Face. Updated 2025-11-04. Indexed 2025-11-15.
Download link:
https://hf-mirror.com/datasets/josefbednar/11m-czech-sentences
Description:
This dataset contains 11 million raw Czech sentences drawn from the SYN2006PUB corpus, created by filtering out sentences that contain at least one comma. It is a relatively small but very clean and accurate sample of written Czech, suitable for fine-tuning or training large language models and other models.
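The comma filter described above can be illustrated with a minimal sketch. It assumes that "filtering out" means sentences containing a comma were removed, which is one reading of the description and is not confirmed by the source. The function name and the sample sentences are hypothetical and are not taken from the dataset.

```python
def drop_sentences_with_commas(sentences):
    """Return only the sentences that contain no comma.

    Assumption: this mirrors one reading of the dataset description,
    where sentences with at least one comma are discarded.
    """
    return [s for s in sentences if "," not in s]


# Hypothetical sample sentences, not taken from the dataset.
sample = [
    "Praha je hlavní město České republiky.",
    "Když prší, zůstávám doma.",
    "Dnes je hezky.",
]

kept = drop_sentences_with_commas(sample)
for sentence in kept:
    print(sentence)
```

If the opposite reading is intended (keeping only sentences with a comma), the condition would simply be inverted.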
Provided by:
josefbednar



