ShiniChien/binhvq-news-dedup-filter-tokenize
Source: Hugging Face. Updated 2025-02-11; indexed 2025-02-15.
Download link:
https://hf-mirror.com/datasets/ShiniChien/binhvq-news-dedup-filter-tokenize
Description:
The dataset provides the fields of a tokenized text corpus: input_ids (token ID sequence of the text), attention_mask (mask for the attention mechanism), labels (label sequence), position_ids (position indices), and length (sequence length). It has a single train split with approximately 1,191,500 examples (about 1.19 million), totaling about 2.1 GB. In the default configuration, the training data files are those whose names start with train.
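To make the relationship between the five fields concrete, here is a minimal Python sketch of what one example might look like. It rests on assumptions the listing does not confirm: that labels mirror input_ids (the usual causal language modeling convention), that position_ids count up from 0, and that examples are stored unpadded. The helper name build_record and the token IDs are hypothetical.

```python
from typing import Dict, List, Union

Record = Dict[str, Union[int, List[int]]]


def build_record(token_ids: List[int]) -> Record:
    """Assemble one example with the five fields named in the description.

    Hypothetical helper for illustration only; it is not taken from the
    dataset's own preprocessing code.
    """
    n = len(token_ids)
    return {
        "input_ids": list(token_ids),
        # 1 marks a real token; all ones because this sketch adds no padding.
        "attention_mask": [1] * n,
        # Assumption: labels copy input_ids, as in causal LM training.
        "labels": list(token_ids),
        # Assumption: positions simply run 0..n-1.
        "position_ids": list(range(n)),
        "length": n,
    }


# Made-up token IDs, not produced by any real tokenizer.
record = build_record([101, 2023, 2003, 102])
```

To read the real data, the usual route would be the Hugging Face datasets library, for example load_dataset("ShiniChien/binhvq-news-dedup-filter-tokenize", split="train"), optionally with streaming=True to avoid downloading the full 2.1 GB up front.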
Provider:
ShiniChien



