
TucanoBR/GigaVerbo-Text-Filter

Source: Hugging Face. Updated 2025-07-24. Indexed 2025-04-12.
Download link:
https://hf-mirror.com/datasets/TucanoBR/GigaVerbo-Text-Filter
Resource description:
GigaVerbo Text-Filter is a dataset of 110,000 randomly selected samples drawn from the 9 non-synthetic subsets of GigaVerbo. It was used to train the text-quality filters described in "[Tucano: Advancing Neural Text Generation for Portuguese](https://arxiv.org/abs/2411.07854)". The text embeddings were created with [sentence-transformers/LaBSE](https://huggingface.co/sentence-transformers/LaBSE). All scores were generated by GPT-4o.
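
As a rough illustration of the embedding step mentioned above, the following Python sketch shows one way to embed texts with LaBSE through the `sentence-transformers` library. It is not the authors' pipeline: the helper names `load_labse` and `embed_texts` are hypothetical, and the sketch assumes `sentence-transformers` is installed and that the model weights can be downloaded.

```python
from typing import List, Sequence


def load_labse():
    """Load the LaBSE encoder (assumes sentence-transformers is installed).

    The import is done lazily so the rest of this module can be used
    with any object exposing an `encode(list_of_str)` method.
    """
    from sentence_transformers import SentenceTransformer

    return SentenceTransformer("sentence-transformers/LaBSE")


def embed_texts(texts: Sequence[str], encoder) -> List[List[float]]:
    """Return one embedding (a list of floats) per input text.

    `encoder` is any object with an `encode` method that accepts a list
    of strings and returns one vector per string, such as the model
    returned by `load_labse()`.
    """
    if not texts:
        return []
    vectors = encoder.encode(list(texts))
    # Convert each row (NumPy array or list) to a plain list of floats.
    return [[float(x) for x in row] for row in vectors]
```

Calling `embed_texts(["Olá, mundo!"], load_labse())` should return a single vector per text, which could then be fed to a downstream quality classifier. The choice of classifier and the names of the dataset columns are not specified here and are left open.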
Provider:
TucanoBR