Raziel1234/MiniWebText-4

Name: Raziel1234/MiniWebText-4
Creator: Raziel1234
Published: 2025-11-20 13:46:09
License: No description available

Hugging Face · Updated 2025-11-20 · Indexed 2025-12-20

Download link:

https://hf-mirror.com/datasets/Raziel1234/MiniWebText-4


Resource description:

---
license: mit
task_categories:
- text-generation
tags:
- code
- chemistry
- biology
- legal
- art
- climate
- agent
- not-for-all-audiences
- finance
- medical
size_categories:
- 100K<n<1M
---

# MiniWebText-4

**MiniWebText-4** is a high-quality, multilingual text dataset designed for training small to medium-sized language models. It is a carefully curated, smaller version of the full **WebText-4**, optimized for efficiency without sacrificing linguistic richness or semantic diversity.

---

## 🌐 Overview

- **Dataset Name:** MiniWebText-4
- **Type:** Text corpus for language model pre-training and fine-tuning
- **Lines:** 382,468 text lines
- **Tokens:** Approximately 391 million tokens, with each line tokenized to 1024 tokens
- **Languages:** Multilingual, including English, Hebrew, Japanese, Chinese, Kannada (ಅವಧಯಲಲ), Arabic, and more
- **Purpose:** Ideal for small-to-medium model training, research experiments, and multilingual language model development

---

## 📚 Sources

MiniWebText-4 is collected from a diverse set of online sources to ensure broad linguistic and contextual coverage:

- Wikipedia articles in multiple languages
- Community forums and discussion boards
- Reddit threads across various categories
- Blogs, educational sites, and tech platforms
- Code repositories and developer communities
- Publicly available social media content

This diversity is intended to help models trained on MiniWebText-4 handle multiple languages, context types, and content styles.

---

## ⚡ Dataset Features

- **High-quality text:** Preprocessed with advanced cleaning to remove HTML, scripts, URLs, and unwanted characters.
- **Multilingual coverage:** Includes Western, Semitic, East Asian, and Indic languages for diverse language understanding.
- **Tokenization-ready:** Each line can be used directly with standard tokenizers, with 1024 tokens per line for training consistency.
- **Efficient size:** Smaller than the full WebText-4, making it suitable for resource-limited experiments while still providing hundreds of millions of tokens for effective model learning.
- **Balanced content:** Contains general knowledge, technical information, community discussions, and entertainment content, creating a well-rounded dataset for diverse model training.

---

## 🏗 Use Cases

- Pre-training and fine-tuning small-to-medium language models
- Multilingual language model experiments
- NLP research and benchmarks
- Text generation, summarization, and understanding tasks

---

## 🔧 Getting Started

1. Clone the dataset repository or download the dataset files.
2. Use your preferred tokenizer (e.g., GPT tokenizer, BPE, or SentencePiece).
3. Feed the tokenized lines into your training pipeline.

MiniWebText-4 is ready for immediate use in both academic research and experimental model development.

---

## ⚠️ License & Usage

MiniWebText-4 is open for research and commercial use under the permissive MIT license. Attribution is appreciated when using this dataset in publications or models.

---

**MiniWebText-4** provides an efficient, high-quality, and multilingual dataset that balances size, diversity, and usability, making it well suited to modern language model training without the overhead of extremely large corpora.
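
The Getting Started steps above (download, tokenize, feed into training) can be sketched roughly as follows. This is a minimal illustration, not the author's pipeline. The card does not say which tokenizer produced the 1024-token lines, so a different tokenizer will give different lengths and the sketch re-packs token IDs into fixed-size blocks. `toy_tokenize` is a hypothetical byte-level stand-in for a real tokenizer, and the `"train"` split and `"text"` column in `iter_lines_from_hub` are assumptions that should be checked against the actual repository files.

```python
from typing import Callable, Iterable, Iterator, List

BLOCK_SIZE = 1024    # tokens per line, as stated in the dataset card
NUM_LINES = 382_468  # line count stated in the dataset card


def expected_total_tokens(num_lines: int = NUM_LINES,
                          block_size: int = BLOCK_SIZE) -> int:
    """Total tokens if every line holds exactly block_size tokens."""
    return num_lines * block_size


def iter_lines_from_hub() -> Iterator[str]:
    """Stream text lines from the Hugging Face Hub.

    Needs `pip install datasets` and network access.
    ASSUMPTIONS: a "train" split and a "text" column exist; verify first.
    """
    from datasets import load_dataset  # lazy import so the rest runs offline

    ds = load_dataset("Raziel1234/MiniWebText-4", split="train", streaming=True)
    for row in ds:
        yield row["text"]


def toy_tokenize(line: str) -> List[int]:
    """Hypothetical stand-in tokenizer: one token ID per UTF-8 byte."""
    return list(line.encode("utf-8"))


def pack_into_blocks(lines: Iterable[str],
                     tokenize: Callable[[str], List[int]],
                     block_size: int = BLOCK_SIZE) -> Iterator[List[int]]:
    """Concatenate token IDs across lines and yield full blocks of block_size."""
    buffer: List[int] = []
    for line in lines:
        buffer.extend(tokenize(line))
        while len(buffer) >= block_size:
            yield buffer[:block_size]
            buffer = buffer[block_size:]
    # A trailing partial block is dropped in this sketch.


if __name__ == "__main__":
    # 382,468 lines x 1024 tokens, matching the card's "approximately 391 million"
    print(expected_total_tokens())  # 391647232

    # Offline demo on two in-memory lines with a small block size
    for block in pack_into_blocks(["hello world", "שלום"], toy_tokenize, block_size=8):
        print(block)
```

To run against the real data, pass `iter_lines_from_hub()` as `lines` and replace `toy_tokenize` with the encode function of whichever tokenizer the model uses. Dropping the trailing partial block keeps every training example the same length, at the cost of discarding a few tokens at the end of the stream.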

Provider:

Raziel1234
