
RikkaBotan/FineDataset_13B_JpEn

Source: Hugging Face. Updated 2025-11-23. Indexed 2025-12-20.
Download link:
https://hf-mirror.com/datasets/RikkaBotan/FineDataset_13B_JpEn
Resource description:
---
license: odc-by
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
dataset_info:
  features:
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 55216152831
    num_examples: 6103514
  download_size: 28823398445
  dataset_size: 55216152831
---

# Saint Iberis Pretraining Dataset

A curated bilingual corpus designed for training **Saint Iberis**, an LLM developed by **RikkaBotan** with a gentle, introspective personality and strong bilingual (JA/EN) capabilities.

The dataset blends high-quality Japanese and English web and PDF corpora, with token ratios chosen to balance linguistic coverage and domain diversity.

---

## 📚 Dataset Overview

The dataset is composed of four major sources:

| Key           | Description                       | Language | Tokens |
|---------------|-----------------------------------|----------|--------|
| fineweb2_ja   | FineWeb2 (Japanese)               | JA       | 2.75B  |
| finepdfs_ja   | FinePDFs (Japanese subset)        | JA       | 1.00B  |
| finewebedu_en | FineWeb Edu (English educational) | EN       | 7.00B  |
| finepdfs_en   | FinePDFs (English subset)         | EN       | 2.25B  |

**Total tokens:** **13B**

The distribution emphasizes:

- High-quality educational English web data
- Solid Japanese coverage using both web and structured PDF extractions
- A balanced domain mixture suitable for reasoning, linguistic fluency, and instruction-following

---

## 🔧 Dataset Configuration

Below are the Hugging Face dataset sources and subsets used:

```bash
"fineweb2_ja": {"hf": "hotchpotch/fineweb-2-edu-japanese", "subset": "default"}
"finepdfs_ja": {"hf": "HuggingFaceFW/finepdfs", "subset": "jpn_Jpan"}
"finewebedu_en": {"hf": "HuggingFaceFW/fineweb-edu", "subset": "sample-350BT"}
"finepdfs_en": {"hf": "HuggingFaceFW/finepdfs", "subset": "eng_Latn"}
```

# 🌸 About us

A Japanese independent researcher with a shy and pampered personality. Twin-tail hair is a charm point. Interested in NLP. Usually uses Python and C.
![RikkaBotan_Logo](https://cdn-uploads.huggingface.co/production/uploads/6629ba7d59854b02da014f64/vo4azDEv3SZNVDB6O609i.png)
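
As a rough sanity check on the token ratios and size figures quoted in this card, the sketch below recomputes each source's share of the 13B-token budget and a few derived statistics. The function names (`mixture_shares`, `language_shares`) are hypothetical helpers introduced here for illustration, and grouping by the `_ja` / `_en` key suffix is an assumption based on the key names in the overview table; none of this is part of the dataset's own tooling.

```python
# Token budgets in billions, copied from the Dataset Overview table.
TOKENS_B = {
    "fineweb2_ja": 2.75,
    "finepdfs_ja": 1.00,
    "finewebedu_en": 7.00,
    "finepdfs_en": 2.25,
}

# Size figures copied from the dataset card metadata.
NUM_BYTES = 55_216_152_831
NUM_EXAMPLES = 6_103_514
DOWNLOAD_SIZE = 28_823_398_445


def mixture_shares(tokens):
    """Return each source's fraction of the total token budget."""
    total = sum(tokens.values())
    return {key: value / total for key, value in tokens.items()}


def language_shares(tokens):
    """Sum the shares per language, assuming keys end in '_ja' or '_en'."""
    by_lang = {}
    for key, share in mixture_shares(tokens).items():
        lang = key.rsplit("_", 1)[-1]
        by_lang[lang] = by_lang.get(lang, 0.0) + share
    return by_lang


if __name__ == "__main__":
    for key, share in mixture_shares(TOKENS_B).items():
        print(f"{key:15s} {share:6.1%}")
    for lang, share in language_shares(TOKENS_B).items():
        print(f"{lang:15s} {share:6.1%}")
    print(f"avg bytes per example: {NUM_BYTES / NUM_EXAMPLES:,.0f}")
    print(f"download / dataset size: {DOWNLOAD_SIZE / NUM_BYTES:.2f}")
```

Under these figures the Japanese sources account for about 28.8% of the tokens and the English sources for about 71.2%. Each example averages roughly 9 KB of text, and the download is about 52% of the uncompressed dataset size.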
Provider:
RikkaBotan