RikkaBotan/FineDataset_13B_JpEn

Name: RikkaBotan/FineDataset_13B_JpEn
Creator: RikkaBotan
Published: 2025-11-23 01:45:39
License: odc-by

Source: Hugging Face. Updated 2025-11-23. Indexed 2025-12-20.

Download link:

https://hf-mirror.com/datasets/RikkaBotan/FineDataset_13B_JpEn

Dataset description:

---
license: odc-by
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
dataset_info:
  features:
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 55216152831
    num_examples: 6103514
  download_size: 28823398445
  dataset_size: 55216152831
---

# Saint Iberis Pretraining Dataset

A curated bilingual corpus designed for training **Saint Iberis**, an LLM developed by **RikkaBotan** with a gentle, introspective personality and strong bilingual (JA/EN) capabilities.

This dataset blends high-quality Japanese and English web and PDF corpora, with token ratios chosen to balance linguistic coverage and domain diversity.

---

## 📚 Dataset Overview

The dataset is composed of four major sources:

| Key           | Description                       | Language | Tokens |
|---------------|-----------------------------------|----------|--------|
| fineweb2_ja   | FineWeb2 (Japanese)               | JA       | 2.75B  |
| finepdfs_ja   | FinePDFs (Japanese subset)        | JA       | 1.00B  |
| finewebedu_en | FineWeb Edu (English educational) | EN       | 7.00B  |
| finepdfs_en   | FinePDFs (English subset)         | EN       | 2.25B  |

**Total tokens:** **13B**

The distribution emphasizes:

- High-quality educational English web data
- Solid Japanese coverage using both web and structured PDF extractions
- A balanced domain mixture suitable for reasoning, linguistic fluency, and instruction-following

---

## 🔧 Dataset Configuration

Below are the Hugging Face dataset sources and subsets used:

```bash
"fineweb2_ja": {"hf": "hotchpotch/fineweb-2-edu-japanese", "subset": "default"}
"finepdfs_ja": {"hf": "HuggingFaceFW/finepdfs", "subset": "jpn_Jpan"}
"finewebedu_en": {"hf": "HuggingFaceFW/fineweb-edu", "subset": "sample-350BT"}
"finepdfs_en": {"hf": "HuggingFaceFW/finepdfs", "subset": "eng_Latn"}
```

# 🌸 About us

A Japanese independent researcher with a shy and pampered personality. Twin-tail hair is a charm point. Interested in NLP. Usually uses Python and C.

![RikkaBotan_Logo](https://cdn-uploads.huggingface.co/production/uploads/6629ba7d59854b02da014f64/vo4azDEv3SZNVDB6O609i.png)
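
The token counts in the Dataset Overview table imply a particular Japanese/English mix. The sketch below works out that mix and shows one possible way to peek at a few documents. The function names (`source_shares`, `language_shares`, `stream_texts`) are hypothetical helpers introduced here for illustration and are not part of the dataset card. `stream_texts` assumes the `datasets` package is installed and that the Hub is reachable; the `text` field name comes from the card's `dataset_info`.

```python
# Token counts (in billions) copied from the Dataset Overview table.
TOKENS_B = {
    "fineweb2_ja": 2.75,
    "finepdfs_ja": 1.00,
    "finewebedu_en": 7.00,
    "finepdfs_en": 2.25,
}

# Language of each source, also taken from the table.
LANGUAGE = {
    "fineweb2_ja": "JA",
    "finepdfs_ja": "JA",
    "finewebedu_en": "EN",
    "finepdfs_en": "EN",
}


def source_shares(tokens):
    """Return each source's fraction of the total token count."""
    total = sum(tokens.values())
    return {key: count / total for key, count in tokens.items()}


def language_shares(tokens, language):
    """Aggregate the per-source token fractions by language."""
    shares = {}
    for key, share in source_shares(tokens).items():
        lang = language[key]
        shares[lang] = shares.get(lang, 0.0) + share
    return shares


def stream_texts(limit=3):
    """Yield the first `limit` documents without downloading the full corpus.

    Assumption: the `datasets` package is installed and the Hub is reachable.
    """
    from datasets import load_dataset  # lazy import: optional dependency

    ds = load_dataset(
        "RikkaBotan/FineDataset_13B_JpEn", split="train", streaming=True
    )
    for row in ds.take(limit):
        yield row["text"]


if __name__ == "__main__":
    for lang, share in language_shares(TOKENS_B, LANGUAGE).items():
        print(f"{lang}: {share:.1%}")
```

With the table's figures, Japanese accounts for roughly 29% of the tokens and English for roughly 71%. Streaming is used in `stream_texts` because the card lists a download size of about 28.8 GB, so iterating lazily avoids fetching the whole corpus just to inspect a few rows.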

Provider:

RikkaBotan
