pauhidalgoo/patufet-educat
Source: Hugging Face. Updated: 2024-08-24. Indexed: 2024-12-14.
Download link:
https://hf-mirror.com/datasets/pauhidalgoo/patufet-educat
Description:
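The `patufet-educat` dataset contains Catalan educational content filtered from the CulturaX dataset, and is inspired by the fineweb-edu dataset. Texts were scored with a Gemini model, then classified and filtered with FastText. Its features include the text, timestamp, URL, source, and educational score. The filtering process ran into several challenges, such as the trade-off between content quality and quantity and the exclusion of sensitive topics, but the dataset is still considered a valuable resource for improving the quality of Catalan language models.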
The `patufet-educat` dataset is a filtered version of the Catalan portion of the CulturaX dataset that keeps only educational content. It was built in three steps: annotation, classification, and filtering. For annotation, the Gemini-1.5-flash model scored text samples from the OSCAR corpus, and a FastText classifier was then trained on those annotations. For filtering, a score threshold of 3 was chosen to balance classifier performance against the limited availability of Catalan texts. The dataset has not been formally evaluated, but it is expected to improve the quality of Catalan language models. The README also discusses the challenges encountered, such as the scarcity of high-quality Catalan content and the exclusion of certain sensitive topics. The dataset is licensed under the terms of CulturaX, which follows the licenses of mC4 and OSCAR.
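As a rough illustration of the final filtering step described above, the sketch below keeps only records whose educational score reaches the threshold of 3. The field name `score`, the inclusive comparison, and the sample records are assumptions made for illustration; they are not taken from the dataset card, and this is not the author's actual pipeline.

```python
from typing import Iterable, Iterator

# Hypothetical column name; check the dataset card for the real one.
SCORE_FIELD = "score"
# Threshold of 3 comes from the description; treating it as inclusive is an assumption.
THRESHOLD = 3


def keep_educational(records: Iterable[dict], threshold: float = THRESHOLD) -> Iterator[dict]:
    """Yield only the records whose educational score meets the threshold."""
    for record in records:
        score = record.get(SCORE_FIELD)
        # Records without a score are dropped rather than guessed at.
        if score is not None and score >= threshold:
            yield record


# Made-up toy records, used only to exercise the filter.
sample = [
    {"text": "Les fraccions representen parts d'un tot.", "score": 4},
    {"text": "Compra ara amb descompte!", "score": 1},
    {"text": "La fotosíntesi transforma la llum en energia.", "score": 3},
    {"text": "Sense puntuació."},
]

kept = list(keep_educational(sample))
```

Raising `threshold` would trade quantity for quality, which is the balance the description says motivated the choice of 3.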
Provider:
pauhidalgoo



