
pauhidalgoo/patufet-educat

Source: Hugging Face. Updated 2024-08-24. Indexed 2024-12-14.
Download link:
https://hf-mirror.com/datasets/pauhidalgoo/patufet-educat
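Description:
The `patufet-educat` dataset contains Catalan educational content filtered from the CulturaX dataset, inspired by the fineweb-edu dataset. Texts were scored with a Gemini model, then classified and filtered with FastText. Its features include the text, timestamp, URL, source, and an educational score. The filtering process ran into some challenges, such as the trade-off between content quality and quantity and the exclusion of sensitive topics, but the dataset is still considered a valuable resource for improving the quality of Catalan language models.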

The `patufet-educat` dataset is a filtered subset of the Catalan portion of the CulturaX dataset, restricted to educational content. It was built in three steps: annotation, classification, and filtering. For annotation, the Gemini-1.5-flash model scored text samples from the OSCAR corpus, and a FastText classifier was then trained on these annotations. Filtering used a score threshold of 3, chosen to balance classifier performance against the limited availability of Catalan texts. The dataset has not been formally evaluated, but it is expected to improve the quality of Catalan language models. The README also discusses the challenges encountered, such as the limited availability of high-quality Catalan content and the exclusion of certain sensitive topics. The dataset is licensed under the terms of CulturaX, which follows the licenses of mC4 and OSCAR.
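The threshold step described above can be illustrated with a minimal sketch. It is not the author's pipeline: the column name `edu_score`, the helper `filter_educational`, the sample rows, and the inclusive comparison (`>=`) are all assumptions made for illustration. Only the threshold value of 3 comes from the description.

```python
from typing import Any, Dict, Iterable, Iterator

# The description gives a threshold of 3. Whether the original comparison
# is inclusive is not stated, so ">=" here is an assumption.
EDU_THRESHOLD = 3


def filter_educational(
    rows: Iterable[Dict[str, Any]],
    threshold: float = EDU_THRESHOLD,
    score_key: str = "edu_score",  # hypothetical column name
) -> Iterator[Dict[str, Any]]:
    """Yield only the rows whose educational score meets the threshold."""
    for row in rows:
        # Rows without a score are treated as non-educational and dropped.
        if row.get(score_key, 0) >= threshold:
            yield row


# Made-up rows that mimic the listed features (text, URL, educational score).
sample = [
    {"text": "La fotosíntesi és el procés...", "url": "https://example.org/a", "edu_score": 4},
    {"text": "Ofertes de la setmana", "url": "https://example.org/b", "edu_score": 1},
    {"text": "Història de Catalunya", "url": "https://example.org/c", "edu_score": 3},
]

kept = list(filter_educational(sample))
kept_urls = [row["url"] for row in kept]
```

Raising the threshold would keep fewer but higher-scoring texts, which is the quality-versus-quantity trade-off the description mentions for a language with limited available data.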
Provided by:
pauhidalgoo