Ik45/data-science-en-id
Source: Hugging Face. Updated 2026-04-06; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/Ik45/data-science-en-id
Resource description:
---
license: mit
task_categories:
- text-generation
language:
- en
- id
base_model:
- wikimedia/wikipedia
- ccdv/arxiv-summarization
tags:
- data-science
- scientific
- machine-translation
- nlp
- wikipedia-subset
- arxiv-subset
dataset_info:
splits:
- name: train
size_categories:
- 10M<n<100M
---
# Data Science EN-ID Parallel Corpus (Scientific Domain)
## Dataset Description
This dataset is a curated English-Indonesian (EN-ID) parallel corpus for the **Scientific** and **Data Science** domains. It was developed to support the training of Neural Machine Translation (NMT) models and Large Language Models (LLMs) so that they handle technical terminology, academic structure, and formal scientific language more reliably.
- **Primary Languages:** English (EN) and Indonesian (ID)
- **Domain:** Data Science, Artificial Intelligence, Machine Learning, and General Science.
- **Applications:** Neural Machine Translation, Domain Adaptation, Cross-lingual Information Retrieval.
## Source Data & Origin
This dataset is a specialized extension and refined subset derived from two primary high-quality sources available on Hugging Face:
1. **[wikimedia/wikipedia](https://huggingface.co/datasets/wikimedia/wikipedia):** We extracted the English-Indonesian parallel subsets, specifically focusing on articles categorized under Science, Technology, and Mathematics.
2. **[ccdv/arxiv-summarization](https://huggingface.co/datasets/ccdv/arxiv-summarization):** We used the arXiv metadata and document summaries to build a scientific corpus, filtering for the Computer Science and Data Science domains.
## Extraction & Refinement Pipeline
To transform these general-purpose datasets into a domain-specific parallel corpus, the following pipeline was implemented:
### 1. Domain-Specific Filtering
Instead of using the entire Wikipedia or arXiv dump, we applied a **Lexical Filter** using a dictionary of 500+ technical keywords. A sentence pair is included only if it contains terminology relevant to Data Science (e.g., *Neural Networks, Statistical Inference, Heuristics*).
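A keyword filter of this kind could look roughly like the sketch below. The card does not publish the 500+ term dictionary or say which side of the pair is checked, so the three-item `KEYWORDS` list, the function names, and the choice to test only the English side are all assumptions made for illustration.

```python
import re

# Hypothetical keyword list: the real 500+ term dictionary is not published.
KEYWORDS = ["neural network", "statistical inference", "heuristic"]

# One case-insensitive pattern. The word boundary is only on the left,
# so "heuristic" also matches the plural "heuristics".
_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(k) for k in KEYWORDS) + r")",
    re.IGNORECASE,
)


def is_in_domain(en_sentence: str) -> bool:
    """Return True if the English sentence mentions at least one keyword."""
    return _PATTERN.search(en_sentence) is not None


def filter_pairs(pairs):
    """Keep only (en, id) pairs whose English side passes the keyword check.

    Assumption: only the English side is tested.
    """
    return [(en, idn) for en, idn in pairs if is_in_domain(en)]
```

Compiling all keywords into a single alternation keeps the check to one regex scan per sentence, which matters when the candidate pool runs to millions of pairs.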
### 2. Scientific Alignment
For the arXiv data, we performed a custom alignment process to pair English scientific abstracts with their Indonesian technical equivalents, ensuring that the formal tone and academic nomenclature are preserved.
### 3. Noise Reduction (Regex-Based)
Since arXiv and Wikipedia data often contain LaTeX code, citations (e.g., `[1]`, `(Author, 2023)`), and HTML artifacts, we applied a regex-based cleaning script so that the final output consists of clean, natural-language sentences.
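The cleaning script itself is not included in the card. As a minimal sketch of what regex-based cleaning for these artifact types might involve, the patterns below are illustrative assumptions, not the rules actually used to build the corpus.

```python
import re

# Illustrative patterns only; order matters (math is removed before
# stand-alone LaTeX commands so that commands inside $...$ go with it).
_RULES = [
    re.compile(r"\$[^$]*\$"),                      # inline LaTeX math: $...$
    re.compile(r"\\[a-zA-Z]+(?:\{[^{}]*\})*"),     # LaTeX commands: \cite{...}
    re.compile(r"<[^>]+>"),                        # HTML tags
    re.compile(r"\[\d+(?:\s*[,-]\s*\d+)*\]"),      # [1], [2, 3], [4-6]
    re.compile(r"\([A-Z][^()]*?,\s*\d{4}[a-z]?\)"),  # (Author, 2023)
]


def clean_sentence(text: str) -> str:
    """Strip LaTeX, citation, and HTML residue, then tidy whitespace."""
    for pattern in _RULES:
        text = pattern.sub(" ", text)
    # Collapse the runs of spaces left behind by the deletions.
    text = re.sub(r"\s+", " ", text).strip()
    # Remove a space that now sits directly before punctuation.
    return re.sub(r"\s+([.,;:])", r"\1", text)


print(clean_sentence("Deep learning <b>works</b> well [1] (Smith, 2023)."))
```

In a parallel corpus the same function would need to run on both sides of each pair, and pairs that become empty after cleaning would need to be dropped.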
### 4. Deduplication
We cross-referenced both sources to remove overlapping entries, ensuring that the `Ik45/data-science-en-id` corpus is diverse and free from redundant training signals.
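One common way to remove overlapping entries across sources is to hash a normalized form of each pair and keep the first occurrence. The sketch below assumes exact-match deduplication after lowercasing and whitespace collapsing; whether the original pipeline used exact or fuzzy matching is not stated, and the function names are hypothetical.

```python
import hashlib


def _pair_key(en: str, idn: str) -> str:
    """Hash a case- and whitespace-insensitive form of the sentence pair."""
    def norm(s: str) -> str:
        return " ".join(s.lower().split())

    joined = norm(en) + "\t" + norm(idn)
    return hashlib.sha1(joined.encode("utf-8")).hexdigest()


def deduplicate(*sources):
    """Merge several lists of (en, id) pairs, keeping first occurrences."""
    seen = set()
    unique = []
    for source in sources:
        for en, idn in source:
            key = _pair_key(en, idn)
            if key not in seen:
                seen.add(key)
                unique.append((en, idn))
    return unique
```

Storing fixed-length digests rather than the full text keeps the memory footprint of `seen` predictable for a large corpus.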
## Creation Methodology
Based on the processing pipeline in `ScriptTextTranslationDomainIlmiah.ipynb`, the dataset was constructed through several key stages:
### 1. Data Collection
The corpus was aggregated from scientific articles, academic papers, and technical datasets. Collection was restricted to high-quality technical content to ensure domain relevance.
### 2. Preprocessing & Data Cleaning
- **Deduplication:** Removal of redundant pairs to ensure data diversity and prevent model overfitting.
- **Noise Reduction:** Cleaning of non-textual characters, artifacts from PDF extractions, and broken symbols.
- **Normalization:** Standardizing text formats to maintain consistency across the entire corpus.
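The card does not specify which normalization rules were applied. A plausible minimal version, assuming Unicode NFKC normalization, straightening of curly quotes, and whitespace collapsing, might be:

```python
import unicodedata

# Assumed mapping: curly quotes to their ASCII equivalents.
_QUOTES = str.maketrans({
    "\u201c": '"', "\u201d": '"',
    "\u2018": "'", "\u2019": "'",
})


def normalize_text(text: str) -> str:
    """Apply NFKC, straighten quotes, and collapse whitespace."""
    # NFKC folds compatibility characters such as ligatures and
    # non-breaking spaces into their plain forms.
    text = unicodedata.normalize("NFKC", text)
    text = text.translate(_QUOTES)
    return " ".join(text.split())
```

Applying the same function to both the English and Indonesian sides keeps the two halves of each pair consistent, which also makes later exact-match deduplication more effective.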
### 3. Alignment
Rigorous sentence-level alignment was performed to ensure that English technical terms are correctly mapped to their appropriate Indonesian counterparts within a scientific context.
## Usage
You can easily load this dataset using the Hugging Face `datasets` library:
```python
from datasets import load_dataset
# Load the dataset
dataset = load_dataset("Ik45/data-science-en-id")
# Preview a sample
print(dataset['train'][0])
```
Provider:
Ik45



