tvu-vlinhd11/pretrain-dataset-T2048-13B

Source: Hugging Face · Updated 2025-12-10 · Indexed 2025-12-20
Download link:
https://hf-mirror.com/datasets/tvu-vlinhd11/pretrain-dataset-T2048-13B
Resource description:
---
license: apache-2.0
task_categories:
- text-generation
language:
- en
- vi
tags:
- pretrain
- tokenized
- packed-sequences
size_categories:
- 1M<n<10M
---

# Pretrain Dataset (Tokenized)

This dataset contains tokenized and packed sequences ready for LLM pretraining.

## Dataset Details

| Property | Value |
|----------|-------|
| **Sequences** | 6,474,097 |
| **Sequence Length** | 2048 |
| **Tokenizer** | `./vn_spm_v3_fast2/` |
| **Total Tokens** | 13,258,950,332 |
| **Shards** | 13 |
| **Created** | 2025-12-10 |

## Dataset Structure

Each sample contains:

- `input_ids`: List of token IDs (length: 2048)
- `attention_mask`: Attention mask (1 for real tokens, 0 for padding)

## Usage

```python
from datasets import load_dataset

dataset = load_dataset("tvu-vlinhd11/pretrain-dataset-T2048-13B")
train_data = dataset["train"]

sample = train_data[0]
input_ids = sample["input_ids"]
attention_mask = sample["attention_mask"]
```

## License

Apache 2.0
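
The figures in the Dataset Details table can be cross-checked with a little arithmetic. The sketch below is not part of the original dataset card. It assumes that "Total Tokens" counts only real (non-padding) tokens, so that any gap between the number of available token slots and the stated total would correspond to padding positions where `attention_mask` is 0. The helper `count_real_tokens` and the `toy_sample` dictionary are hypothetical and only illustrate the sample layout described above.

```python
# Figures copied from the Dataset Details table.
SEQUENCES = 6_474_097
SEQUENCE_LENGTH = 2048
TOTAL_TOKENS = 13_258_950_332


def count_real_tokens(sample):
    """Count positions whose attention_mask is 1 (real tokens, not padding)."""
    return sum(sample["attention_mask"])


# Total number of token slots across all packed sequences.
capacity = SEQUENCES * SEQUENCE_LENGTH

# Assumption: the gap between capacity and Total Tokens is padding.
padding_slots = capacity - TOTAL_TOKENS

print("token slots:", capacity)
print("implied padding slots:", padding_slots)

# Hypothetical miniature sample with the same two fields as the dataset
# (real samples have length 2048; this one has length 4 for readability).
toy_sample = {"input_ids": [5, 9, 2, 0], "attention_mask": [1, 1, 1, 0]}
print("real tokens in toy sample:", count_real_tokens(toy_sample))
```

Under that assumption, the table is internally consistent with almost fully packed sequences: the implied number of padding slots is tiny relative to the total. Applying `count_real_tokens` to actual samples loaded with `load_dataset` would be one way to confirm this.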
Provider:
tvu-vlinhd11