TucanoBR/GigaVerbo-Text-Filter

Name: TucanoBR/GigaVerbo-Text-Filter
Creator: TucanoBR
Published: 2025-07-24 08:07:17
License: No description available

Source: Hugging Face. Updated 2025-07-24. Indexed 2025-04-12.

Download link:

https://hf-mirror.com/datasets/TucanoBR/GigaVerbo-Text-Filter

Description:

GigaVerbo Text-Filter is a dataset of 110,000 randomly selected samples drawn from 9 subsets of GigaVerbo, specifically the subsets that are not synthetic. It was used to train the text-quality filters described in "[Tucano: Advancing Neural Text Generation for Portuguese](https://arxiv.org/abs/2411.07854)". The text embeddings were created with [sentence-transformers/LaBSE](https://huggingface.co/sentence-transformers/LaBSE), and all scores were generated by GPT-4o.
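
As a rough illustration of how the pieces described above could fit together, the sketch below loads the dataset from the Hugging Face Hub, encodes text with LaBSE, and turns quality scores into binary keep/drop labels. It is a minimal sketch, not the authors' training code: the `train` split name, the `0.8` threshold, and the helper names `to_labels`, `load_samples`, `embed`, and `main` are all assumptions made for illustration, and the dataset's actual column names should be inspected before relying on any of them.

```python
from typing import List, Sequence

DATASET_ID = "TucanoBR/GigaVerbo-Text-Filter"
ENCODER_ID = "sentence-transformers/LaBSE"


def to_labels(scores: Sequence[float], threshold: float = 0.8) -> List[int]:
    """Binarize quality scores: 1 = keep, 0 = drop.

    The threshold is an illustrative assumption, not a value from the source.
    """
    return [1 if s >= threshold else 0 for s in scores]


def load_samples(split: str = "train"):
    """Download the dataset from the Hugging Face Hub (needs network access).

    The split name is an assumption; adjust it if the dataset differs.
    """
    # Imported lazily so the pure-Python helper above works without `datasets`.
    from datasets import load_dataset

    return load_dataset(DATASET_ID, split=split)


def embed(texts: Sequence[str]):
    """Encode texts with LaBSE; returns one embedding vector per text."""
    # Imported lazily: loading the model downloads its weights on first use.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(ENCODER_ID)
    return model.encode(list(texts))


def main() -> None:
    """Hypothetical entry point: inspect the schema before going further."""
    samples = load_samples()
    # Column names are not stated in the description, so print them first.
    print(samples.column_names)
```

Calling `main()` requires the `datasets` package and network access. Once the real text and score columns are known, `embed` and `to_labels` can supply features and targets to any standard classifier.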

Provided by:

TucanoBR
