kotorii1/EnVi-Tech-Reasoning-SFT
Source: Hugging Face. Updated 2025-12-05; indexed 2025-12-20.
Download link:
https://hf-mirror.com/datasets/kotorii1/EnVi-Tech-Reasoning-SFT
Description:
---
language:
- en
- vi
license: mit
task_categories:
- translation
- text-generation
- question-answering
tags:
- nvidia
- system-engineering
- cuda
- reasoning
- technical-translation
- synthetic
size_categories:
- 10K<n<100K
pretty_name: EnVi Tech & Reasoning SFT
---
# 🚀 EnVi-Tech-Reasoning-SFT
> **A high-quality, curated English-Vietnamese parallel corpus focused on System Engineering, AI/MLOps, and Logical Reasoning.**
## 📖 Overview
General-purpose English-Vietnamese corpora (such as OPUS-100) often render technical terminology incorrectly, for example translating "latency" as "sự trễ nải" instead of "độ trễ", or "driver" as "tài xế" instead of "trình điều khiển".
**EnVi-Tech-Reasoning-SFT** is designed to bridge this gap. It contains **15,115** carefully curated and synthetically generated sentence pairs, specifically optimized for fine-tuning **Small Language Models (SLMs)** like TinyLlama, Qwen, or Phi-3 for technical NMT (Neural Machine Translation) tasks.
## 📊 Dataset Distribution
The dataset is weighted toward technical content, which makes up roughly half of the pairs, while keeping enough reasoning, conversational, and formal text to preserve natural language ability.
| Domain Category | Count | Percentage | Description |
| :--- | :--- | :--- | :--- |
| **Technology & Engineering** | **7,464** | **49.38%** | Hardware (CUDA, GPU), Coding (Git, Algo), ML Ops. |
| **Logical Reasoning** | **4,100** | **27.13%** | Algorithmic logic, Math word problems, Commonsense reasoning. |
| **Social & Cultural** | **2,051** | **13.57%** | Gen Z slang, Idioms, Drama, Natural conversation. |
| **Business & Formal** | **1,500** | **9.92%** | Formal emails, Financial reports, Business etiquette. |
| **Total** | **15,115** | **100%** | |
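As a quick sanity check, the percentages in the table can be recomputed from the counts. This is a minimal sketch; the `counts` dictionary simply restates the table and is not read from the dataset itself.

```python
# Counts copied from the distribution table above.
counts = {
    "Technology & Engineering": 7464,
    "Logical Reasoning": 4100,
    "Social & Cultural": 2051,
    "Business & Formal": 1500,
}

total = sum(counts.values())

# Share of each domain, as a percentage rounded to two decimals.
shares = {name: round(100 * n / total, 2) for name, n in counts.items()}

for name, pct in shares.items():
    print(f"{name}: {counts[name]} pairs ({pct}%)")
print(f"Total: {total} pairs")
```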
## 💡 Example Data
The dataset uses the JSONL format: each line is one JSON object with an `en` field, a `vi` field, and an explicit `category` field for easy filtering.
### 1. Tech: Hardware & System
```json
{
"en": "We hit a bottleneck due to low memory bandwidth on the GPU.",
"vi": "Chúng ta gặp nút thắt cổ chai do băng thông bộ nhớ trên GPU quá thấp.",
"category": "tech_hardware"
}
```
### 2. Tech: Coding & ML Ops
```json
{
"en": "The validation loss started diverging after epoch 50.",
"vi": "Loss trên tập kiểm thử bắt đầu phân kỳ sau epoch thứ 50.",
"category": "tech_ml_ops"
}
```
### 3. Social: Slang & Idioms (Cultural Nuance)
```json
{
"en": "Don't ghost me like that, bro.",
"vi": "Đừng có bơ tôi như thế chứ ông bạn.",
"category": "social_genz"
}
```
### 4. Logic & Reasoning
```json
{
"en": "If the server response time is > 200ms, trigger an alert. Current time is 150ms.",
"vi": "Nếu thời gian phản hồi máy chủ > 200ms, hãy kích hoạt cảnh báo. Thời gian hiện tại là 150ms.",
"category": "logic_algo"
}
```
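If you work with the raw JSONL rather than the `datasets` library, records like the ones above can be parsed with the standard library alone. The sketch below assumes one JSON object per line; `parse_jsonl` is a hypothetical helper, and the two sample lines are copied from the examples in this card.

```python
import json

# Two records from the examples above, one JSON object per line (JSONL).
jsonl_text = """\
{"en": "We hit a bottleneck due to low memory bandwidth on the GPU.", "vi": "Chúng ta gặp nút thắt cổ chai do băng thông bộ nhớ trên GPU quá thấp.", "category": "tech_hardware"}
{"en": "Don't ghost me like that, bro.", "vi": "Đừng có bơ tôi như thế chứ ông bạn.", "category": "social_genz"}
"""

def parse_jsonl(text):
    """Parse JSONL text into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]

records = parse_jsonl(jsonl_text)

# Keep only records whose category starts with the "tech" prefix.
tech_records = [r for r in records if r["category"].startswith("tech")]

for r in tech_records:
    print(r["category"], "|", r["en"], "->", r["vi"])
```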
## 🛠️ Creation Process (The Engineering Pipeline)
This dataset was generated rather than scraped: a **Synthetic Data Generation Pipeline** powered by Gemini 2.5 Flash produced the pairs, with each step aimed at domain-specific vocabulary and consistent quality.
1. **Topic Definition:** Defined 10+ specific sub-domains (e.g., `tech_cuda`, `logic_math`, `social_slang`) relevant to modern AI engineering requirements.
2. **Prompt Engineering:** Wrote prompts that enforce "Cultural Accuracy" (e.g., requiring the model to use the Vietnamese tech slang engineers actually use, such as "con bug" and "train model").
3. **Data Validation:** Automatically filtered out malformed JSON and checked that English and Vietnamese pairs are aligned.
4. **Label Consolidation:** Merged granular topics into 4 main categories for efficient training.
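Steps 3 and 4 might look roughly like the sketch below. This is not the author's actual pipeline code: `validate_line`, `consolidate`, and the prefix-to-category mapping (including the `business` prefix) are assumptions made for illustration, and the check covers only JSON validity and non-empty fields, not semantic alignment. The first sample line comes from this card; the other two are deliberately broken placeholders.

```python
import json

# Hypothetical mapping from a sub-domain prefix to one of the four main categories.
MAIN_CATEGORIES = {
    "tech": "Technology & Engineering",
    "logic": "Logical Reasoning",
    "social": "Social & Cultural",
    "business": "Business & Formal",
}

def validate_line(line):
    """Return the parsed record, or None if the line is malformed or incomplete."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return None
    if not isinstance(record, dict):
        return None
    # Every record needs non-empty string values for en, vi, and category.
    for key in ("en", "vi", "category"):
        value = record.get(key)
        if not isinstance(value, str) or not value.strip():
            return None
    return record

def consolidate(category):
    """Map a granular label such as 'tech_cuda' to a main category, or None."""
    prefix = category.split("_", 1)[0]
    return MAIN_CATEGORIES.get(prefix)

raw_lines = [
    '{"en": "The validation loss started diverging after epoch 50.", "vi": "Loss trên tập kiểm thử bắt đầu phân kỳ sau epoch thứ 50.", "category": "tech_ml_ops"}',
    '{"en": "placeholder with an empty translation", "vi": "", "category": "logic_math"}',
    '{"en": "placeholder with truncated JSON", "vi": "',
]

valid = [r for r in map(validate_line, raw_lines) if r is not None]
for r in valid:
    print(r["category"], "->", consolidate(r["category"]))
```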
## 💻 How to Use
You can load this dataset directly with Hugging Face `datasets`:
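```python
from datasets import load_dataset

# Download the dataset from the Hugging Face Hub (returns a DatasetDict).
dataset = load_dataset("kotorii1/EnVi-Tech-Reasoning-SFT")

# Filter for technical data only: keep rows whose category contains "tech".
tech_data = dataset.filter(lambda x: "tech" in x["category"])

# Inspect the first technical example in the train split.
print(tech_data["train"][0])
```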
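For supervised fine-tuning, each pair usually has to be turned into a prompt and a target. The sketch below shows one way to do that; the prompt wording and the `to_sft_example` helper are assumptions, not a format prescribed by this dataset, so adapt them to your model's instruction or chat template.

```python
# Hypothetical prompt template; adapt it to your model's chat or instruction format.
LANGUAGE_NAMES = {"en": "English", "vi": "Vietnamese"}

def to_sft_example(record, source="en", target="vi"):
    """Turn one {en, vi, category} record into a prompt/completion pair."""
    prompt = (
        f"Translate the following {LANGUAGE_NAMES[source]} text "
        f"into {LANGUAGE_NAMES[target]}:\n{record[source]}"
    )
    return {"prompt": prompt, "completion": record[target]}

# Record copied from the example data in this card.
record = {
    "en": "We hit a bottleneck due to low memory bandwidth on the GPU.",
    "vi": "Chúng ta gặp nút thắt cổ chai do băng thông bộ nhớ trên GPU quá thấp.",
    "category": "tech_hardware",
}

example = to_sft_example(record)
print(example["prompt"])
print(example["completion"])
```

With the dataset loaded as shown earlier, `dataset["train"].map(to_sft_example)` should add the `prompt` and `completion` columns to every row. Swapping `source` and `target` gives the Vietnamese-to-English direction.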
## ⚖️ License
This dataset is released under the **MIT License**. Feel free to use it for research, commercial projects, or fine-tuning your own models.
-----
*Created by Kotori - Focused on High-Performance AI Systems.*
Provider:
kotorii1



