hayas11/maritime-arabic-industrial-ai

Name: hayas11/maritime-arabic-industrial-ai
Creator: hayas11
Published: 2026-03-28 17:01:16
License: 暂无描述

Hugging Face2026-03-28 更新2026-03-29 收录

下载链接：

https://hf-mirror.com/datasets/hayas11/maritime-arabic-industrial-ai

下载链接

链接失效反馈

官方服务：

资源简介：

--- language: - ar - en tags: - maritime - arabic - industrial-ai - sft - rlhf - multi-turn-conversations - glossary-(term-level) - error-correction-pairs - translation-pairs-(en↔ar) - ship-stability - naval-architecture - STCW pretty_name: Maritime Technical Arabic - Industrial AI Dataset size_categories: - n<1K task_categories: - text-generation - translation license: cc-by-4.0 configs: - config_name: error_correction data_files: "dataset/error_correction/*" - config_name: multi_turn_conversations data_files: "dataset/multi_turn_conversations/*" - config_name: rlhf_ranking data_files: "dataset/rlhf_ranking/*" - config_name: sft data_files: "dataset/sft/*" - config_name: translation_pairs data_files: "dataset/translation_pairs/*" - config_name: glossaries data_files: "dataset/glossaries/*" --- # Maritime Technical Arabic — Industrial AI Dataset **Author:** Mohammed Almalki - Marine Engineering & Naval Architecture **Purpose:** Bridging the gap between general Arabic NLP and heavy industry technical language for AI models. --- ## The Problem This Dataset Solves This dataset solves the problem of specialized knowledge. By converting complex engineering research into a structured format, it allows AI developers to create tools that can accurately answer questions on Naval Research & Analysis, Marine Technology, Naval Architecture & Marine Engineering in the Arabic language a task that general-purpose models currently struggle with due to a lack of domain-specific training data. Synthetic Augmentation: 90% of examples were generated by LLM pipelines (using the sources) and then validated, not original human-authored content --- ## Dataset Summary This dataset is a proof-of-concept, not a production-scale resource but a curated technical corpus focused on maritime engineering and future-tech feasibility within the Arabic-speaking world. Sources include academic research, technical textbooks, and feasibility studies covering many aspects. ## Methodology & AI Orchestration To transform dense academic PDFs and texts into a structured dataset, the following pipeline was implemented: * **Programmatic Data Cleaning:** Developed custom LLM workflows to extract technical specifications, formulas, and core arguments from unstructured research papers, normalizing terminology across Arabic and English technical lexicons. * **Synthetic Data Augmentation:** Generated complex Q&A pairs and summarizations based on the core research to expand the dataset’s utility for training RAG (Retrieval-Augmented Generation) systems. * **Human-in-the-Loop (HITL) Validation:** Performed rigorous manual audits of technical terms (e.g., "Nuclear Feasibility," "Electromagnetic Induction") to ensure that the Arabic translation and conceptual mapping remained scientifically accurate. ## Quality Assurance & Validation Validation Methodology 1 - Human-in-the-Loop Review (100% coverage) -All error correction pairs manually reviewed for technical accuracy -Glossary terms cross-referenced against maritime engineering textbooks -Multi-turn conversations checked for logical flow and coherence 2 - Consistency Checks -All technical terms in SFT pairs validated against master glossary -Translation pairs checked for semantic equivalence (not literal translation) -Domain terminology usage consistent across all subsets 3 - Accuracy Metrics -Spot-check validation: 50 random error correction pairs reviewed - 100% accuracy -Glossary-to-dataset consistency: of terms appear correctly in context -Translation pair accuracy (technical meaning preservation) ## Dataset Contents ### 1. Glossary - Term Maritime Arabic Glossary A bilingual terminology reference covering domain-specific terms across naval architecture, marine propulsion, nuclear engineering, and maritime law. | Field | Type | Description | |--------------------|-------------|-------------------------------------------------------| | `english_term` | string | Technical term in English | | `arabic_term` | string | Verified Arabic translation | | `arabic_definition`| string | Arabic definition/explanation of the term | ### 2. SFT - Supervised Fine-Tuning Prompt-Response Pairs Single-turn instruction-response pairs designed for supervised fine-tuning. Each entry presents a technical question a practitioner or researcher might ask, paired with a detailed, accurate Arabic response grounded in the source research. | Field | Type | Description | |-------------------|--------|-----------------------------------------------------------------| | `instruction` | string | Technical question in Arabic | | `response` | string | Expert-level Arabic answer (typically 100–300 words) | | `metadata_topic` | string | Broad topic tag (e.g. "nuclear propulsion", "submarine design") | | `metadata_source` | string | Research paper or chapter the response is derived from | **Designed for:** `transformers` SFT pipelines **Format compatibility:** Converts directly to ShareGPT or Alpaca format. ### 4. RLHF Ranking — Preference Pairs Preference pairs for reward model training or Direct Preference Optimization (DPO). Each entry presents the same prompt with two responses one technically accurate and one containing a plausible hallucination or imprecision labelled as chosen and rejected. The "rejected" responses are not randomly wrong. They represent the category of errors Arabic LLMs most commonly produce on technical maritime content: wrong numerical values, institution name confusions, and conceptual misattributions that are internally coherent but factually incorrect. | Field | Type | Description | |------------|--------|----------------------------------------------------------| | `prompt` | string | Technical question or scenario in Arabic | | `chosen` | string | Accurate, expert-level Arabic response | | `rejected` | string | Plausible but incorrect or imprecise Arabic response | **Designed for:** DPO, IPO, KTO trainers (TRL `DPOTrainer`, Axolotl `dpo`). **Direct compatibility:** Hugging Face `trl` DPO format (prompt/chosen/rejected). ## 4. Multi-turn conversations - Dialogues consisting of multiple back-and-forth exchanges Realistic dialogues simulating a domain expert being consulted on maritime engineering topics. Each conversation contains between 3 and 8 turns, covering follow-up questions, clarification requests, and technical elaboration patterns that single-turn SFT data cannot capture. | Field | Type | Description | |----------------|--------------|----------------------------------------------------------| | `messages` | list[object] | Ordered list of turns | | `role` | string | `"user"` or `"assistant"` | | `content` | string | The message content in Arabic | | `topic` | string | Conversation topic tag | **Designed for:** Chat fine-tuning with `apply_chat_template` (ChatML, LLaMA-3 Instruct, Qwen, Mistral Instruct formats). ## 5. Translation pairs (EN-AR) Parallel sentence pairs for technical translation evaluation and fine-tuning. Sentences are drawn from engineering specifications, research abstracts, and safety documentation contexts where literal translation fails and domain-faithful rendering is required. | Field | Type | Description | |------------|--------|-------------------------------------------| | `en` | string | Source sentence in English | | `ar` | string | Expert Arabic translation | | `domain` | string | Sub-domain of the sentence | **Designed for:** Fine-tuning translation models (NLLB, SeamlessM4T) and evaluating Arabic technical translation quality. ## 6. Error correction pairs Pairs of incorrect and correct Arabic technical statements. The incorrect version represents a believable hallucination wrong facts, swapped terms, or plausible-sounding but inaccurate claims drawn from the same research domain as the correct version. | Field | Type | Description | |-------------|--------|-----------------------------------------------------| | `incorrect` | string | Hallucinated or factually wrong statement in Arabic | | `correct` | string | Accurate corrected statement in Arabic | | `topic` | string | Sub-domain of the pair | **Designed for:** Training models on factual grounding, building hallucination-detection classifiers, and constructing RLHF rejected samples. ## Intended use - RAG (Retrieval-Augmented Generation): Optimized for building AI-powered technical assistants that can query complex maritime research - Domain-Specific Fine-Tuning: Ideal for fine-tuning LLMs to understand the specific linguistic patterns and technical naming of the Arabic maritime engineering sector. - Technical Translation (Cross-Lingual): A baseline for improving the translation accuracy of high-stakes engineering terms from English to Arabic, ensuring scientific "faithfulness" rather than just literal translation. ## Source & Derivation The technical content in this dataset is derived from graduate-level engineering research in naval architecture and marine engineering and from textbooks, covering: power systems, hull design, propulsion,marine propulsion, electrical systems, auxiliary machinery, Descriptions of mechanical failures, diagnostic steps, and repair procedures. - maritime-engineering - industrial-ai - safety-protocols - naval-architecture - technical-arabic - auxiliary-machinery - port-operations All content has undergone human-in-the-loop validation by a Marine Engineering & Naval Architecture specialist to ensure scientific accuracy. ## Citation ```bibtex @dataset{hayas11_2025maritimearabic, author = {hayas11}, title = {Maritime Technical Arabic — Industrial AI Dataset}, year = {2025}, publisher = {Hugging Face}, url = {https://huggingface.co/datasets/hayas11/maritime-arabic-industrial-ai} } ``` --- ## License Creative Commons Attribution 4.0 International (CC BY 4.0) Free to use for research and model training with attribution.

提供机构：

hayas11

搜集汇总

数据集介绍

构建方式

在航海工程与工业人工智能交叉领域，构建专业数据集面临术语精确性与技术深度双重挑战。本数据集通过程序化数据清洗流程，从非结构化的学术文献中提取技术规范与核心论点，并借助合成数据增强技术生成问答对以扩展训练样本。为确保科学准确性，采用人机协同验证机制，由航海工程专家对术语翻译与概念映射进行严格审核，形成结构化的多模态语料库。

特点

该数据集以航海工程为核心，涵盖船舶稳定性、海军建筑等专业领域，具备鲜明的领域特异性。其设计包含术语表、监督微调对、多轮对话等多种子集，支持从基础术语学习到复杂对话生成的完整训练流程。特别值得注意的是，数据集通过精心构造的错误修正对与偏好排序样本，模拟了专业模型中常见的幻觉类型，为模型事实性校准提供了关键训练素材。

使用方法

针对检索增强生成系统开发，可利用术语表与监督微调对构建领域知识库，实现精准的技术问答。在模型微调层面，多轮对话子集适用于聊天模板适配，而偏好排序数据可直接用于直接偏好优化等对齐训练。对于跨语言技术翻译任务，平行句对为专业术语的语义保持翻译提供了评估基准与训练数据。

背景与挑战

背景概述

在自然语言处理领域，专业术语的精准处理与跨语言技术文档的准确理解一直是核心难题。Maritime Technical Arabic — Industrial AI Dataset 由海事工程与造船学专家 Mohammed Almalki 于2025年创建，旨在弥合通用阿拉伯语自然语言处理与重工业技术语言之间的鸿沟。该数据集聚焦于船舶稳定性、造船学及STCW公约等海事工程领域，通过结构化呈现复杂的工程研究成果，专门用于训练能够精准理解并生成阿拉伯语技术内容的人工智能模型。其核心研究问题在于解决通用大模型因缺乏领域特定数据而难以准确处理阿拉伯语海事技术文本的困境，为阿拉伯语世界的工业人工智能应用提供了关键的数据基础。

当前挑战

该数据集致力于解决海事工程领域阿拉伯语技术文本的精准理解与生成挑战，具体包括技术术语的跨语言准确对齐、复杂工程概念的忠实传达，以及对抗模型在专业领域产生事实性幻觉的问题。在构建过程中，挑战主要源于将非结构化的学术PDF与文本转化为高质量结构化数据的复杂性。这涉及开发定制化的大语言模型工作流以提取技术规格与公式，并确保阿拉伯语与英语技术词汇的术语规范化。此外，尽管采用了合成数据增强策略以扩展数据集规模，但维持生成内容在核工程、电磁感应等高度专业化主题上的科学准确性，仍需依赖严格的人工循环验证流程，这构成了数据集构建的主要瓶颈。

常用场景

经典使用场景

在阿拉伯语自然语言处理领域，针对海事工程与船舶设计等重工业技术语言的专门化需求日益凸显。该数据集通过提供涵盖船舶稳定性、核推进系统及海事法规等多主题的阿拉伯语技术语料，经典应用于领域特定大型语言模型的监督微调与强化学习对齐。其精心构建的指令-响应对与多轮对话数据，能够有效训练模型理解复杂工程概念，生成符合专业规范的阿拉伯语技术内容，从而弥合通用模型与专业领域之间的知识鸿沟。

衍生相关工作

围绕该数据集，已衍生出多项聚焦于低资源技术语言建模的经典研究工作。例如，基于其监督微调数据构建的领域适配模型，被用于探索专业术语在预训练模型中的注入与对齐机制。其强化学习从人类反馈排序数据则促进了针对技术内容事实性校准的直接偏好优化方法在阿拉伯语场景下的应用。此外，该数据集的双语术语资源常被用作基准，以评估和改进神经机器翻译模型在处理海事工程等专业领域文本时的概念保真度与术语一致性。

数据集最近研究