m1llion-ai-high-end-group/m1llion-lang

Source: Hugging Face. Updated 2026-01-29; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/m1llion-ai-high-end-group/m1llion-lang
Resource description:
---
license: mit
task_categories:
- translation
size_categories:
- 1B<n<10B
---

# 🌍 M1llion-Lang: Multilingual Instruction Dataset with Emoji Expression

[![Hugging Face](https://img.shields.io/badge/🤗-Dataset-yellow.svg)](https://huggingface.co/datasets/your-username/m1llion-lang)
[![License](https://img.shields.io/badge/License-CC%20BY%204.0-blue.svg)](https://creativecommons.org/licenses/by/4.0/)
[![Size](https://img.shields.io/badge/Size-4GB-green.svg)]()
[![Languages](https://img.shields.io/badge/Languages-20+-orange.svg)]()

**M1llion-Lang** is a high-quality, large-scale multilingual instruction dataset designed for training and fine-tuning large language models (LLMs) to understand and generate text in 20+ languages with natural emoji expression and cultural nuance.

## 📊 Dataset Overview

- **Total Size**: ~4GB (JSON Lines format)
- **Languages**: 20 languages covering 95%+ of global internet users
- **Format**: Conversational JSONL with ShareGPT-compatible structure
- **Special Feature**: Emoji-enriched responses for emotional intelligence training
- **License**: CC BY 4.0

### Supported Languages

| Language | Code | Size | Emoji Freq | Scripts |
|----------|------|------|------------|---------|
| 🇺🇸 English | `en` | ~200MB | 30% | Latin |
| 🇨🇳 Chinese | `zh` | ~200MB | 40% | Hanzi |
| 🇪🇸 Spanish | `es` | ~200MB | 35% | Latin |
| 🇫🇷 French | `fr` | ~200MB | 30% | Latin |
| 🇩🇪 German | `de` | ~200MB | 25% | Latin |
| 🇯🇵 Japanese | `ja` | ~200MB | 50% | Kanji/Kana |
| 🇸🇦 Arabic | `ar` | ~200MB | 30% | Arabic |
| 🇮🇳 Hindi | `hi` | ~200MB | 35% | Devanagari |
| ... and 12 more | | | | |

## 🏗️ Dataset Structure

Each conversation follows this schema:

```json
{
  "id": "en_00000001",
  "language": "en",
  "language_name": "English",
  "conversation_type": "qa|instruction|chat|coding|creative|reasoning",
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful, friendly AI assistant..."
    },
    {
      "role": "user",
      "content": "Explain quantum computing in simple terms 🎯"
    },
    {
      "role": "assistant",
      "content": "Quantum computing is like... 💡"
    }
  ],
  "metadata": {
    "timestamp": "2024-01-29T12:00:00",
    "version": "1.0.0",
    "emoji_count": 3
  }
}
```

## 🎯 Use Cases

- **Multilingual LLM Training**: Pre-training and fine-tuning for polyglot models
- **Emoji Understanding**: Training models to use emojis contextually and culturally appropriately
- **Cross-lingual Transfer**: Improving zero-shot performance across language families
- **Conversational AI**: Chatbots with emotional intelligence and cultural awareness
- **Code-switching Research**: Natural multilingual conversation flow

## 🚀 Quick Start

### Installation

```bash
pip install datasets huggingface_hub tqdm
```

### Loading the Dataset

```python
from datasets import load_dataset

# Load specific language
dataset_en = load_dataset("your-username/m1llion-lang", "en")

# Load all languages
dataset = load_dataset("your-username/m1llion-lang")

# Access conversations
for item in dataset['train']:
    print(item['messages'][1]['content'])  # User query
    print(item['messages'][2]['content'])  # Assistant response with emojis
```

### Generating Locally

```bash
# Clone repository
git clone https://github.com/your-username/m1llion-lang.git
cd m1llion-lang

# Generate full 4GB dataset
python generate_dataset.py --output ./data --size-per-lang 200

# Push to your own Hugging Face repo
python push_to_hub.py --path ./data --repo your-username/m1llion-lang
```

## 📈 Dataset Statistics

- **Total Conversations**: ~2.5M
- **Average Conversation Length**: 3.2 turns
- **Unique Emojis**: 40+ distinct emojis
- **Emoji Distribution**: Contextually weighted by cultural usage patterns
- **Vocabulary Coverage**: Comprehensive coverage of daily, technical, and creative vocabulary

### Conversation Type Distribution

- **QA (Question Answering)**: 25%
- **Instruction Following**: 30%
- **Multi-turn Chat**: 20%
- **Coding/Technical**: 15%
- **Creative Writing**: 5%
- **Reasoning/Logic**: 5%

## 🎨 Emoji Methodology

Emojis are not randomly inserted but strategically placed based on:

1. **Cultural Context**: Different emoji frequencies per language (e.g., higher in Japanese, moderate in German)
2. **Semantic Relevance**: Emojis match the sentiment and topic of conversation
3. **Positional Awareness**: End-of-sentence for emphasis, mid-sentence for emotional breaks
4. **Frequency Control**: Natural occurrence rates (not over-saturated)

## 🔍 Quality Assurance

- **Language Verification**: Native speaker validation scripts
- **Format Validation**: JSON Schema enforcement
- **Content Filtering**: Toxicity and PII removal pipelines
- **Diversity Checks**: Ensured topic and style variety
- **Unicode Compliance**: Proper handling of RTL scripts (Arabic) and CJK characters

## 🤝 Contributing

We welcome contributions to expand language coverage or improve quality!

1. Fork the repository
2. Create your feature branch (`git checkout -b feature/amazing-improvement`)
3. Commit your changes (`git commit -m 'Add some amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-improvement`)
5. Open a Pull Request

## 📜 Citation

If you use this dataset in your research, please cite:

```bibtex
@dataset{m1llion_lang_2024,
  author = {Your Name},
  title = {M1llion-Lang: A Multilingual Instruction Dataset with Emoji Expression},
  year = {2024},
  publisher = {Hugging Face},
  url = {https://huggingface.co/datasets/your-username/m1llion-lang}
}
```

## 📄 License

This dataset is licensed under the Creative Commons Attribution 4.0 International License (CC BY 4.0).

You are free to:

- **Share**: copy and redistribute the material in any medium or format
- **Adapt**: remix, transform, and build upon the material for any purpose, even commercially

Under the following terms:

- **Attribution**: You must give appropriate credit, provide a link to the license, and indicate if changes were made.

## 🙏 Acknowledgments

- Inspired by ShareGPT, Alpaca, and MultiPL-E datasets
- Emoji frequency research based on linguistic studies of digital communication
- Community contributors and validators from 20+ countries

---

**Made with 💖 and 🤖 for the global NLP community**
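
For readers who download the raw JSONL files, the record schema and the conversation type distribution described in this card can be sanity-checked with a few lines of Python. The following is a minimal sketch, not the dataset's own validation tooling: the helper names `validate_record` and `type_distribution` are hypothetical, the field names and allowed values are copied from the schema in this card, and the rule that an `id` starts with its language code is an assumption inferred from the single sample id `en_00000001`.

```python
import json
from collections import Counter

# Field names and allowed values come from the schema shown in this card.
# The helper names below are hypothetical and not part of the dataset repo.
ALLOWED_TYPES = {"qa", "instruction", "chat", "coding", "creative", "reasoning"}
ALLOWED_ROLES = {"system", "user", "assistant"}
REQUIRED_KEYS = {"id", "language", "language_name",
                 "conversation_type", "messages", "metadata"}


def validate_record(record):
    """Return a list of problems found in one record (empty if none)."""
    missing = REQUIRED_KEYS - record.keys()
    if missing:
        return [f"missing keys: {sorted(missing)}"]
    problems = []
    if record["conversation_type"] not in ALLOWED_TYPES:
        problems.append(
            f"unknown conversation_type: {record['conversation_type']!r}")
    # Assumption: ids are prefixed with the language code, as in "en_00000001".
    if not record["id"].startswith(record["language"] + "_"):
        problems.append("id does not start with the language code")
    for i, message in enumerate(record["messages"]):
        if message.get("role") not in ALLOWED_ROLES:
            problems.append(
                f"message {i}: unexpected role {message.get('role')!r}")
        if not isinstance(message.get("content"), str):
            problems.append(f"message {i}: content is not a string")
    return problems


def type_distribution(lines):
    """Share of each conversation_type across an iterable of JSONL lines."""
    counts = Counter(json.loads(line)["conversation_type"]
                     for line in lines if line.strip())
    total = sum(counts.values())
    return {name: count / total for name, count in counts.items()}


# The sample record from this card, with one concrete conversation_type.
sample = {
    "id": "en_00000001",
    "language": "en",
    "language_name": "English",
    "conversation_type": "qa",
    "messages": [
        {"role": "system", "content": "You are a helpful, friendly AI assistant..."},
        {"role": "user", "content": "Explain quantum computing in simple terms 🎯"},
        {"role": "assistant", "content": "Quantum computing is like... 💡"},
    ],
    "metadata": {"timestamp": "2024-01-29T12:00:00", "version": "1.0.0",
                 "emoji_count": 3},
}

print(validate_record(sample))  # prints []
```

Passing an open file handle to `type_distribution` (for example `open("en.jsonl", encoding="utf-8")`, where the file name is a placeholder) would give per-type shares that can be compared against the percentages listed under Conversation Type Distribution.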
Provider:
m1llion-ai-high-end-group