Ik45/wikipedia_dataset_science_en_id
Source: Hugging Face. Updated 2026-04-04; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/Ik45/wikipedia_dataset_science_en_id
Resource description:
---
language:
- en
- id
license: mit
configs:
- config_name: science_translation # name of the first subset
  data_files:
  - split: train
    path: "science_translation/parallel_dataset_science_en_id.parquet"
- config_name: synthetic_latex # name of the second subset
  data_files:
  - split: train
    path: "synthetic_latex/data*.parquet"
size_categories:
- 100K<n<1M
task_categories:
- translation
- text-generation
- sentence-similarity
tags:
- science
- wikipedia
- bilingual
- alignment
---
# Wikipedia Dataset Science (English - Indonesian)
## Dataset Description
This dataset contains **122,433 aligned sentence pairs** extracted from English and Indonesian Wikipedia science articles. It is suited to natural language processing (NLP) tasks such as machine translation, cross-lingual alignment, and fine-tuning large language models (LLMs) to better handle scientific terminology in Indonesian.
- **Language(s):** English (`en`) and Indonesian (`id`)
- **Domain:** Science (Physics, Chemistry, Biology, Mathematics, Astronomy, Computer science, Engineering, Medicine)
- **License:** MIT
- **Number of Rows:** 122,433
- **Dataset Size:** ~21.1 MB
## Dataset Structure
### Data Instances
A typical instance in the dataset includes the aligned sentence pair, a similarity score, and the source Wikipedia article titles.
```json
{
  "en": "Applied physics is rooted in the fundamental truths and basic concepts of the physical sciences...",
  "id": "Fisika terapan juga berkaitan dengan pemanfaatan prinsip-prinsip ilmiah dalam perangkat dan sistem praktis...",
  "score": 0.852182,
  "en_title": "Applied physics",
  "id_title": "Fisika terapan"
}
```
### Data Fields
- `en`: The sentence in English.
- `id`: The translated or corresponding sentence in Indonesian.
- `score`: The semantic alignment score computed with the `paraphrase-multilingual-MiniLM-L12-v2` model. Higher scores indicate a closer semantic match.
- `en_title`: The title of the source English Wikipedia article.
- `id_title`: The title of the source Indonesian Wikipedia article.
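As a minimal sketch of how these fields might be read, the following loads the `science_translation` configuration named in the card's metadata with the Hugging Face `datasets` library and checks a record against the documented schema. The helper names `check_record` and `load_science_translation` are illustrative and not part of the dataset; the loader assumes `datasets` is installed and the Hub is reachable.

```python
from typing import Any, Mapping

# Field names and Python types as documented in the Data Fields list.
EXPECTED_FIELDS = {
    "en": str,
    "id": str,
    "score": float,
    "en_title": str,
    "id_title": str,
}


def check_record(record: Mapping[str, Any]) -> bool:
    """Return True if the record has every documented field with the expected type."""
    return all(
        name in record and isinstance(record[name], expected)
        for name, expected in EXPECTED_FIELDS.items()
    )


def load_science_translation():
    """Load the train split of the `science_translation` config (needs network access)."""
    # Imported lazily so check_record stays usable without the library installed.
    from datasets import load_dataset

    return load_dataset(
        "Ik45/wikipedia_dataset_science_en_id",
        "science_translation",
        split="train",
    )
```

Calling `load_science_translation()` and passing its first row to `check_record` is one quick way to confirm that the downloaded data matches the schema described in this card.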
## Data Collection and Cleaning Processes
### 1. Data Crawling
- **Domain Targeting:** The crawl started from eight root scientific categories: Physics, Chemistry, Biology, Mathematics, Astronomy, Computer science, Engineering, and Medicine. The script recursively traversed subcategories up to a depth of 2 to gather relevant scientific articles.
- **Cross-Lingual Matching:** To establish the English-Indonesian document pairs, the Wikipedia API's `langlinks` property (`lllang="id"`) was used, so each English article is paired with the Indonesian article that Wikipedia itself links as its equivalent.
- **Text Extraction:** Article text was fetched using the API's `extracts` property with `explaintext=True`, which returns plain text with HTML and Wikipedia markup already removed.
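The crawl described above can be sketched as a depth-limited breadth-first walk plus one query per article. This is an illustration under stated assumptions, not the author's script: `collect_articles`, `page_query_params`, and the `fetch_members` callback are hypothetical names, and the sketch assumes root categories sit at depth 0 so that subcategories at depths 1 and 2 are still visited.

```python
from collections import deque
from typing import Callable, Dict, Iterable, Set, Tuple

ROOT_CATEGORIES = [
    "Physics", "Chemistry", "Biology", "Mathematics",
    "Astronomy", "Computer science", "Engineering", "Medicine",
]

# Hypothetical callback: given a category name, return (subcategories, article titles).
# In practice it could wrap a MediaWiki `list=categorymembers` request.
MemberFetcher = Callable[[str], Tuple[Iterable[str], Iterable[str]]]


def collect_articles(
    roots: Iterable[str], fetch_members: MemberFetcher, max_depth: int = 2
) -> Set[str]:
    """Breadth-first walk of the category tree, descending at most max_depth levels."""
    roots = list(roots)
    articles: Set[str] = set()
    seen: Set[str] = set(roots)  # guards against category cycles
    queue = deque((root, 0) for root in roots)
    while queue:
        category, depth = queue.popleft()
        subcategories, pages = fetch_members(category)
        articles.update(pages)
        if depth >= max_depth:
            continue  # do not descend below the depth limit
        for sub in subcategories:
            if sub not in seen:
                seen.add(sub)
                queue.append((sub, depth + 1))
    return articles


def page_query_params(title: str) -> Dict[str, str]:
    """MediaWiki query parameters requesting the Indonesian langlink and a plain-text extract."""
    return {
        "action": "query",
        "format": "json",
        "titles": title,
        "prop": "langlinks|extracts",
        "lllang": "id",      # only return the link to the Indonesian article
        "explaintext": "1",  # plain text instead of HTML
    }
```

The dictionary returned by `page_query_params` is meant to be sent as query parameters to the English Wikipedia API endpoint; articles whose response carries no `langlinks` entry would have no Indonesian counterpart and could be skipped.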
### 2. Data Cleaning & Splitting
- **Sentence Tokenization:** The extracted plain text of both the English and the Indonesian articles was split into individual sentences using `nltk.sent_tokenize`.
- **Filtering:** Leading and trailing whitespace was stripped, and any sentence shorter than 20 characters was discarded to remove incomplete sentences, headers, and noisy fragments.
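A minimal sketch of this cleaning step follows. The function names are illustrative, and two details are assumptions because the card does not state them: that the length check is applied after stripping whitespace, and which Punkt language model was passed to `nltk.sent_tokenize` for Indonesian text (the sketch simply exposes the `language` argument).

```python
from typing import Iterable, List

MIN_CHARS = 20  # minimum sentence length stated in the card


def clean_sentences(sentences: Iterable[str], min_chars: int = MIN_CHARS) -> List[str]:
    """Strip surrounding whitespace and drop sentences shorter than min_chars characters."""
    stripped = (s.strip() for s in sentences)
    return [s for s in stripped if len(s) >= min_chars]


def split_and_clean(text: str, language: str = "english") -> List[str]:
    """Tokenize plain text into sentences with NLTK, then apply the length filter."""
    # Lazy import: requires NLTK and its Punkt sentence tokenizer data to be installed.
    from nltk import sent_tokenize

    return clean_sentences(sent_tokenize(text, language=language))
```

Keeping the filter separate from the tokenizer makes the 20-character rule easy to test on its own, without downloading tokenizer data.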
### 3. Sentence Alignment & Scoring
To ensure high-quality translation pairs, the sentences were aligned using semantic similarity:
- **Cross-Lingual Embeddings:** The `paraphrase-multilingual-MiniLM-L12-v2` model from Sentence-Transformers (hosted on Hugging Face) was used to generate vector embeddings for both the English and the Indonesian sentences.
- **Similarity Calculation:** The `score` column is the cosine similarity between the English and Indonesian sentence embeddings. This matches sentences that share the same underlying meaning even when they are not literal word-for-word translations or when sentence structure differs between the two languages.
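The scoring step can be illustrated with a small sketch. The cosine similarity itself follows the description above; the matching rule in `best_matches` (each English sentence takes its single highest-scoring Indonesian sentence) is an assumption, since the card does not say how candidates were paired. All function names are illustrative, and `embed` needs the `sentence-transformers` package and a model download.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def cosine_similarity(u: Vector, v: Vector) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def best_matches(
    en_vecs: Sequence[Vector], id_vecs: Sequence[Vector]
) -> List[Tuple[int, int, float]]:
    """For each English vector, return (en_index, id_index, score) of its closest Indonesian vector."""
    matches: List[Tuple[int, int, float]] = []
    for i, u in enumerate(en_vecs):
        scores = [cosine_similarity(u, v) for v in id_vecs]
        if not scores:
            continue  # no Indonesian candidates for this article
        j = max(range(len(scores)), key=scores.__getitem__)
        matches.append((i, j, scores[j]))
    return matches


def embed(sentences: List[str]):
    """Encode sentences with the multilingual model named in the card."""
    # Lazy import so the similarity helpers work without the library installed.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
    return model.encode(sentences)
```

Because this rule always returns some match, unrelated sentences can still be paired with a low score, which is one reason a score threshold matters downstream.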

## Usage Limitations & Tips
- **Filtering by Quality:** Users are strongly encouraged to filter the dataset on the `score` column. For strict machine translation tasks, keeping only pairs above a high threshold (e.g., `score > 0.85` or `score > 0.90`) yields the highest-quality parallel data, at the cost of fewer pairs. Lower-scoring pairs can be retained for general pre-training or broader language modeling.
- **Wikipedia Nature:** The data reflects the style and potential biases of encyclopedic text.
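The score filter recommended above can be sketched on plain Python rows as follows. The helper name `filter_by_score` is illustrative, and the default threshold of 0.85 is just the first example value from the tip, not a tuned recommendation.

```python
from typing import Any, Dict, Iterable, List

Row = Dict[str, Any]


def filter_by_score(rows: Iterable[Row], threshold: float = 0.85) -> List[Row]:
    """Keep rows whose alignment score is strictly above the threshold."""
    return [row for row in rows if row["score"] > threshold]
```

With a dataset loaded through the Hugging Face `datasets` library, the same selection can be written as `ds.filter(lambda row: row["score"] > 0.85)`.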
## Citation
If you use this dataset in your research or projects, please link back to this repository.
Provider:
Ik45



