
kotorii1/EnVi-Tech-Reasoning-SFT

Source: Hugging Face | Updated: 2025-12-05 | Indexed: 2025-12-20
Download link:
https://hf-mirror.com/datasets/kotorii1/EnVi-Tech-Reasoning-SFT
Resource description:
---
language:
- en
- vi
license: mit
task_categories:
- translation
- text-generation
- question-answering
tags:
- nvidia
- system-engineering
- cuda
- reasoning
- technical-translation
- synthetic
size_categories:
- 10K<n<100K
pretty_name: EnVi Tech & Reasoning SFT
---

# 🚀 EnVi-Tech-Reasoning-SFT

> **A high-quality, curated English-Vietnamese parallel corpus focused on System Engineering, AI/MLOps, and Logical Reasoning.**

## 📖 Overview

Standard English-Vietnamese datasets (like OPUS-100) often fail to translate technical terminology correctly (e.g., translating "latency" as "sự trễ nải" instead of "độ trễ", or "driver" as "tài xế" instead of "trình điều khiển").

**EnVi-Tech-Reasoning-SFT** is designed to bridge this gap. It contains **15,115** carefully curated and synthetically generated sentence pairs, specifically optimized for fine-tuning **Small Language Models (SLMs)** like TinyLlama, Qwen, or Phi-3 for technical NMT (Neural Machine Translation) tasks.

## 📊 Dataset Distribution

The dataset is strategically balanced to prioritize technical accuracy while maintaining natural conversational capabilities.

| Domain Category | Count | Percentage | Description |
| :--- | :--- | :--- | :--- |
| **Technology & Engineering** | **7,464** | **49.38%** | Hardware (CUDA, GPU), Coding (Git, Algo), ML Ops. |
| **Logical Reasoning** | **4,100** | **27.13%** | Algorithmic logic, Math word problems, Commonsense reasoning. |
| **Social & Cultural** | **2,051** | **13.57%** | Gen Z slang, Idioms, Drama, Natural conversation. |
| **Business & Formal** | **1,500** | **9.92%** | Formal emails, Financial reports, Business etiquette. |
| **Total** | **15,115** | **100%** | |

## 💡 Example Data

The dataset uses a JSONL format with an explicit `category` field for easy filtering.

### 1. Tech: Hardware & System

```json
{
  "en": "We hit a bottleneck due to low memory bandwidth on the GPU.",
  "vi": "Chúng ta gặp nút thắt cổ chai do băng thông bộ nhớ trên GPU quá thấp.",
  "category": "tech_hardware"
}
```

### 2. Tech: Coding & ML Ops

```json
{
  "en": "The validation loss started diverging after epoch 50.",
  "vi": "Loss trên tập kiểm thử bắt đầu phân kỳ sau epoch thứ 50.",
  "category": "tech_ml_ops"
}
```

### 3. Social: Slang & Idioms (Cultural Nuance)

```json
{
  "en": "Don't ghost me like that, bro.",
  "vi": "Đừng có bơ tôi như thế chứ ông bạn.",
  "category": "social_genz"
}
```

### 4. Logic & Reasoning

```json
{
  "en": "If the server response time is > 200ms, trigger an alert. Current time is 150ms.",
  "vi": "Nếu thời gian phản hồi máy chủ > 200ms, hãy kích hoạt cảnh báo. Thời gian hiện tại là 150ms.",
  "category": "logic_algo"
}
```

## 🛠️ Creation Process (The Engineering Pipeline)

This dataset was not merely scraped; it was engineered using a **Synthetic Data Generation Pipeline** powered by Gemini 2.5 Flash to ensure high quality and domain specificity.

1. **Topic Definition:** Defined 10+ specific sub-domains (e.g., `tech_cuda`, `logic_math`, `social_slang`) relevant to modern AI engineering requirements.
2. **Prompt Engineering:** Used advanced prompting techniques to enforce "Cultural Accuracy" (e.g., forcing the model to use Vietnamese tech slang like "con bug", "train model").
3. **Data Validation:** Automatic filtering to remove malformed JSON and ensure alignment between English and Vietnamese pairs.
4. **Label Consolidation:** Merged granular topics into 4 main categories for efficient training.
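
The card does not publish the filtering or consolidation code, so the following is only a minimal sketch of what steps 3 and 4 might look like. It assumes each JSONL line carries the `en`, `vi`, and `category` fields shown in the examples. The function names (`parse_record`, `consolidate`, `filter_lines`), the `group` field, and the prefix-based mapping in `GROUPS` (including the `business` prefix) are hypothetical and not taken from the original pipeline.

```python
import json

REQUIRED_KEYS = ("en", "vi", "category")

# Hypothetical prefix-to-group mapping. The card names the four groups
# but does not publish the actual consolidation rules.
GROUPS = {
    "tech": "Technology & Engineering",
    "logic": "Logical Reasoning",
    "social": "Social & Cultural",
    "business": "Business & Formal",
}


def parse_record(line):
    """Return a cleaned record, or None if the line should be dropped."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return None  # malformed JSON
    if not isinstance(record, dict):
        return None  # valid JSON, but not an object
    for key in REQUIRED_KEYS:
        value = record.get(key)
        # A pair with a missing or empty side cannot be aligned.
        if not isinstance(value, str) or not value.strip():
            return None
    return {key: record[key].strip() for key in REQUIRED_KEYS}


def consolidate(category):
    """Map a granular label such as 'tech_cuda' to one of the four groups."""
    prefix = category.split("_", 1)[0]
    return GROUPS.get(prefix)


def filter_lines(lines):
    """Keep only well-formed records whose category maps to a known group."""
    kept = []
    for line in lines:
        record = parse_record(line)
        if record is None:
            continue
        group = consolidate(record["category"])
        if group is None:
            continue
        record["group"] = group
        kept.append(record)
    return kept
```

This only checks structure (parseable JSON, non-empty string fields, a recognized category prefix). Checking that the Vietnamese side actually translates the English side would need a separate step that this sketch does not attempt.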

## 💻 How to Use

You can load this dataset directly with Hugging Face `datasets`:

```python
from datasets import load_dataset

dataset = load_dataset("kotorii1/EnVi-Tech-Reasoning-SFT")

# Filter for Technical data only
tech_data = dataset.filter(lambda x: "tech" in x["category"])
print(tech_data["train"][0])
```

## ⚖️ License

This dataset is released under the **MIT License**. Feel free to use it for research, commercial projects, or fine-tuning your own models.

-----

*Created by Kotori - Focused on High-Performance AI Systems.*
Provider:
kotorii1