bismarck91/enA-frA-glm-xc-src-tgt-tokenized-combined-fr-en

Name: bismarck91/enA-frA-glm-xc-src-tgt-tokenized-combined-fr-en
Creator: bismarck91
Published: 2025-10-24 04:54:37
License: Not specified

Source: Hugging Face. Updated: 2025-10-24. Indexed: 2025-11-15.

Download link:

https://hf-mirror.com/datasets/bismarck91/enA-frA-glm-xc-src-tgt-tokenized-combined-fr-en

Description:

Each example consists of three sequences: input IDs, an attention mask, and labels. Input IDs are stored as int32, attention masks as int8, and labels as int64. The training split contains more than 5 million examples and totals approximately 20 GB. A default configuration specifies the file paths of the training data.
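The figures above allow a rough back-of-the-envelope check of how long a typical example is. The sketch below is illustrative only and rests on several assumptions that the description does not confirm: that the three sequences in an example have the same length, that "20 GB" means 20e9 bytes of uncompressed data, and that the example count is exactly 5 million. The column names `input_ids`, `attention_mask`, and `labels` are also assumed, and the helper `peek_first_example` is hypothetical; it relies on the `datasets` library's `load_dataset` with `streaming=True` so that the full download is not needed to inspect one record.

```python
# Byte widths of the three per-token fields, taken from the stated dtypes.
# The column names are assumptions; the description does not list them.
BYTES_PER_TOKEN = {
    "input_ids": 4,       # int32
    "attention_mask": 1,  # int8
    "labels": 8,          # int64
}


def bytes_per_example(num_tokens: int) -> int:
    """Raw payload of one example, assuming all three sequences have num_tokens entries."""
    return num_tokens * sum(BYTES_PER_TOKEN.values())


def estimated_mean_tokens(total_bytes: float, num_examples: int) -> float:
    """Average sequence length implied by a total size and an example count."""
    return total_bytes / num_examples / sum(BYTES_PER_TOKEN.values())


# Assumed round figures: 20 GB = 20e9 bytes, exactly 5 million examples.
mean_tokens = estimated_mean_tokens(20e9, 5_000_000)
print(f"roughly {mean_tokens:.0f} tokens per example")


def peek_first_example():
    """Hypothetical helper: stream one training record without a full download.

    Requires the `datasets` package and network access, so it is not called here.
    """
    from datasets import load_dataset

    ds = load_dataset(
        "bismarck91/enA-frA-glm-xc-src-tgt-tokenized-combined-fr-en",
        split="train",
        streaming=True,
    )
    return next(iter(ds))
```

Because the real count is "more than 5 million" and on-disk formats are often compressed, the result is an order-of-magnitude estimate, not a measured statistic. Calling `peek_first_example()` and inspecting the returned dictionary would be the way to confirm the actual column names and lengths.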

Provider:

bismarck91
