Ik45/data-science-en-id

Name: Ik45/data-science-en-id
Creator: Ik45
Published: 2026-04-06 09:07:57
License: No description provided

Hugging Face | Updated 2026-04-06 | Indexed 2026-03-29

Download link:

https://hf-mirror.com/datasets/Ik45/data-science-en-id


Description:

---
license: mit
task_categories:
- text-generation
language:
- en
- id
base_model:
- wikimedia/wikipedia
- ccdv/arxiv-summarization
tags:
- data-science
- scientific
- machine-translation
- nlp
- wikipedia-subset
- arxiv-subset
dataset_info:
  splits:
  - name: train
size_categories:
- 10M<n<100M
---

# Data Science EN-ID Parallel Corpus (Scientific Domain)

## Dataset Description

This dataset is a curated English-Indonesian (EN-ID) parallel corpus designed for the **Scientific** and **Data Science** domains. It was developed to support the training of Neural Machine Translation (NMT) models and Large Language Models (LLMs) so that they better handle technical terminology, academic structures, and formal scientific language.

- **Primary Languages:** English (EN) and Indonesian (ID)
- **Domain:** Data Science, Artificial Intelligence, Machine Learning, and General Science.
- **Applications:** Neural Machine Translation, Domain Adaptation, Cross-lingual Information Retrieval.

## Source Data & Origin

This dataset is a specialized extension and refined subset derived from two primary sources available on Hugging Face:

1. **[wikimedia/wikipedia](https://huggingface.co/datasets/wikimedia/wikipedia):** We extracted the English-Indonesian parallel subsets, focusing on articles categorized under Science, Technology, and Mathematics.
2. **[ccdv/arxiv-summarization](https://huggingface.co/datasets/ccdv/arxiv-summarization):** We used the arXiv metadata and document summaries to build a scientific corpus, filtering for the Computer Science and Data Science domains.

## Extraction & Refinement Pipeline

To transform these general-purpose datasets into a domain-specific parallel corpus, the following pipeline was implemented:

### 1. Domain-Specific Filtering

Instead of using the entire Wikipedia or arXiv dump, we applied a **Lexical Filter** using a dictionary of 500+ technical keywords.
A sentence pair is included only if it contains terminology relevant to Data Science (e.g., *Neural Networks, Statistical Inference, Heuristics*).

### 2. Scientific Alignment

For the arXiv data, we performed a custom alignment process to pair English scientific abstracts with their Indonesian technical equivalents, ensuring that the formal tone and academic nomenclature are preserved.

### 3. Noise Reduction (Regex-Based)

Since arXiv and Wikipedia data often contain LaTeX code, citations (e.g., `[1]`, `(Author, 2023)`), and HTML artifacts, we applied a cleaning script so that the final output consists of clean, natural language sentences.

### 4. Deduplication

We cross-referenced both sources to remove overlapping entries, so that the `Ik45/data-science-en-id` corpus is diverse and free from redundant training signals.

## Creation Methodology

Based on the processing pipeline in `ScriptTextTranslationDomainIlmiah.ipynb`, the dataset was constructed through several key stages:

### 1. Data Collection

The corpus was aggregated from various scientific articles, academic papers, and technical datasets. The focus was kept strictly on high-quality technical content to ensure domain relevance.

### 2. Preprocessing & Data Cleaning

- **Deduplication:** Removal of redundant pairs to ensure data diversity and prevent model overfitting.
- **Noise Reduction:** Cleaning of non-textual characters, artifacts from PDF extractions, and broken symbols.
- **Normalization:** Standardizing text formats to maintain consistency across the entire corpus.

### 3. Alignment

Sentence-level alignment was performed to ensure that English technical terms are correctly mapped to their appropriate Indonesian counterparts within a scientific context.
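
The filtering, noise reduction, and deduplication stages described above can be illustrated with a minimal sketch. It is not the code from `ScriptTextTranslationDomainIlmiah.ipynb`, which is not reproduced here. The three-term `KEYWORDS` set stands in for the 500+ keyword dictionary, and the regular expressions, function names, and exact-match deduplication key are all assumptions made for illustration.

```python
import re

# Hypothetical stand-in for the 500+ keyword dictionary mentioned in the card.
KEYWORDS = {"neural network", "statistical inference", "heuristic"}

# Assumed patterns for the kinds of noise the card lists.
NOISE_PATTERNS = (
    re.compile(r"\[\d+(?:\s*,\s*\d+)*\]"),       # numeric citations: [1], [2, 3]
    re.compile(r"\([A-Z][^()]*?,\s*\d{4}\)"),    # author-year citations: (Author, 2023)
    re.compile(r"\$[^$]*\$"),                    # inline LaTeX: $x^2$
    re.compile(r"<[^>]+>"),                      # HTML tags
)
WHITESPACE = re.compile(r"\s+")
SPACE_BEFORE_PUNCT = re.compile(r"\s+([.,;:!?])")


def clean(text):
    """Strip citations, inline LaTeX, and HTML tags, then tidy spacing."""
    for pattern in NOISE_PATTERNS:
        text = pattern.sub(" ", text)
    text = WHITESPACE.sub(" ", text).strip()
    return SPACE_BEFORE_PUNCT.sub(r"\1", text)


def is_in_domain(english_text):
    """Lexical filter: keep text that mentions at least one keyword."""
    lowered = english_text.lower()
    return any(keyword in lowered for keyword in KEYWORDS)


def build_corpus(pairs):
    """Clean, filter, and deduplicate an iterable of (en, id) sentence pairs."""
    seen = set()
    kept = []
    for en, idn in pairs:
        en, idn = clean(en), clean(idn)
        if not en or not idn or not is_in_domain(en):
            continue
        key = (en.lower(), idn.lower())  # case-insensitive exact-match dedup
        if key in seen:
            continue
        seen.add(key)
        kept.append((en, idn))
    return kept
```

Cleaning runs before deduplication in this sketch so that two pairs differing only in citation markers or markup collapse to the same key. Near-duplicate detection (for example, fuzzy matching) is outside its scope.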
## Usage

You can load this dataset using the Hugging Face `datasets` library:

```python
from datasets import load_dataset

# Load the dataset
dataset = load_dataset("Ik45/data-science-en-id")

# Preview a sample
print(dataset['train'][0])
```
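
Because the card lists the corpus in the 10M to 100M size category, it may be more convenient to inspect a few records without downloading the full split. The following is a minimal sketch using the streaming mode of `datasets`. The helper names `preview` and `preview_remote` are hypothetical, and the card does not document the column names, so the records are returned as-is.

```python
from itertools import islice


def preview(examples, n=3):
    """Return the first n records from any iterable of examples."""
    return list(islice(examples, n))


def preview_remote(n=3):
    """Stream the first n training records (needs `datasets` and network access)."""
    from datasets import load_dataset

    # streaming=True yields records lazily instead of downloading the whole split.
    stream = load_dataset("Ik45/data-science-en-id", split="train", streaming=True)
    return preview(stream, n)
```

Calling `preview_remote()` and printing each returned record is one way to find out which column names the dataset uses before writing a training pipeline against it.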

Provider:

Ik45
