skymizer/fineweb-edu-dedup-train-5B-by-Llama-3.2-3B-tokenizer-2048-pack-pad
Source: Hugging Face | Updated: 2024-12-23 | Indexed: 2025-02-15
Download link:
https://hf-mirror.com/datasets/skymizer/fineweb-edu-dedup-train-5B-by-Llama-3.2-3B-tokenizer-2048-pack-pad
Resource description:
This dataset contains tokenized text data; its README does not explicitly state the intended application or the nature of the content. Each example has five fields: input_ids (the numerical token representation of the text), attention_mask (marks which positions hold valid tokens), labels (possibly labels or target values), position_ids (the position of each token in the text), and length (the length of the text). The dataset provides a training split of approximately 102 GB containing 5,583,350 examples.
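The README does not say how these fields are produced, so the following is only a minimal sketch of how a record with the same five field names could be built. It assumes a block size of 2048 (inferred from the dataset name), and uses a hypothetical padding id `PAD_ID`, the common `-100` ignore value for padded label positions, and position ids that restart for each packed document. None of these choices is confirmed by the source, and `pack_and_pad` is an illustrative helper, not part of any library.

```python
from typing import Dict, List, Union

BLOCK_SIZE = 2048     # assumption: taken from the "2048" in the dataset name
PAD_ID = 0            # hypothetical padding token id
IGNORE_INDEX = -100   # common convention for label positions excluded from the loss


def pack_and_pad(
    docs: List[List[int]], block_size: int = BLOCK_SIZE
) -> Dict[str, Union[List[int], int]]:
    """Pack whole tokenized documents into one block, then pad to block_size."""
    input_ids: List[int] = []
    position_ids: List[int] = []
    for doc in docs:
        if len(input_ids) + len(doc) > block_size:
            break  # this sketch simply stops when the next document does not fit
        input_ids.extend(doc)
        position_ids.extend(range(len(doc)))  # assumption: positions restart per document

    length = len(input_ids)      # number of real (non-padding) tokens
    pad = block_size - length    # number of padding positions to append
    return {
        "input_ids": input_ids + [PAD_ID] * pad,
        "attention_mask": [1] * length + [0] * pad,
        "labels": input_ids + [IGNORE_INDEX] * pad,
        "position_ids": position_ids + [0] * pad,
        "length": length,
    }


# Two toy "documents" packed into a block of 8 positions.
example = pack_and_pad([[11, 12, 13], [21, 22]], block_size=8)
```

In this toy case the five real tokens are followed by three padding positions, so `attention_mask` is 1 for the first five entries and 0 for the rest. Whether the actual dataset stores `length` as a scalar, or resets `position_ids` at document boundaries, would need to be checked against the data itself.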
Provider:
skymizer



