
Ik45/wikipedia_dataset_science_en_id

Source: Hugging Face | Updated: 2026-04-04 | Indexed: 2026-04-12
Download link:
https://hf-mirror.com/datasets/Ik45/wikipedia_dataset_science_en_id
Resource description:
---
language:
- en
- id
license: mit
configs:
- config_name: science_translation  # name of the first subset
  data_files:
  - split: train
    path: "science_translation/parallel_dataset_science_en_id.parquet"
- config_name: synthetic_latex  # name of the second subset
  data_files:
  - split: train
    path: "synthetic_latex/data*.parquet"
size_categories:
- 100K<n<1M
task_categories:
- translation
- text-generation
- sentence-similarity
tags:
- science
- wikipedia
- bilingual
- alignment
---

# Wikipedia Dataset Science (English - Indonesian)

## Dataset Description

This dataset contains **122,433 aligned sentence pairs** extracted from Wikipedia science articles in English and Indonesian. It is suited to Natural Language Processing (NLP) tasks such as machine translation, cross-lingual alignment, and fine-tuning Large Language Models (LLMs) to better understand scientific terminology in Indonesian.

- **Language(s):** English (`en`) and Indonesian (`id`)
- **Domain:** Science (Physics, Chemistry, Biology, Mathematics, Astronomy, Computer science, Engineering, Medicine)
- **License:** MIT
- **Number of Rows:** 122,433
- **Dataset Size:** ~21.1 MB

## Dataset Structure

### Data Instances

A typical instance in the dataset includes the aligned sentence pair, a similarity score, and the source Wikipedia article titles.

```json
{
  "en": "Applied physics is rooted in the fundamental truths and basic concepts of the physical sciences...",
  "id": "Fisika terapan juga berkaitan dengan pemanfaatan prinsip-prinsip ilmiah dalam perangkat dan sistem praktis...",
  "score": 0.852182,
  "en_title": "Applied physics",
  "id_title": "Fisika terapan"
}
```

### Data Fields

- `en`: The text / sentence in English.
- `id`: The translated or corresponding text / sentence in Indonesian.
- `score`: The semantic alignment score computed using the `paraphrase-multilingual-MiniLM-L12-v2` model. Higher scores indicate a closer semantic match.
- `en_title`: The title of the source English Wikipedia article.
- `id_title`: The title of the source Indonesian Wikipedia article.

## Data Collection and Cleaning Processes

### 1. Data Crawling

- **Domain Targeting:** The crawling process started from eight root scientific categories: Physics, Chemistry, Biology, Mathematics, Astronomy, Computer science, Engineering, and Medicine. The script recursively traversed subcategories up to a depth of 2 to gather relevant scientific articles.
- **Cross-Lingual Matching:** To establish the English-Indonesian document pairs, the Wikipedia API's `langlinks` property (`lllang="id"`) was used. This ensures that the English article and the Indonesian article are officially linked as equivalents on Wikipedia.
- **Text Extraction:** Article text was fetched using the API's `extracts` property with `explaintext=True`, which returns plain text with raw HTML and Wikipedia markup already stripped.

### 2. Data Cleaning & Splitting

- **Sentence Tokenization:** The extracted plain text for both English and Indonesian articles was tokenized into individual sentences using `nltk.sent_tokenize`.
- **Filtering:** Surrounding whitespace was stripped, and any sentence shorter than 20 characters was discarded to remove incomplete sentences, headers, and noisy fragments.

### 3. Sentence Alignment & Scoring

To ensure high-quality translation pairs, the sentences were aligned using semantic similarity:

- **Cross-Lingual Embeddings:** The `paraphrase-multilingual-MiniLM-L12-v2` model from Hugging Face's Sentence-Transformers was used to generate vector embeddings for both the English and Indonesian sentences.
- **Similarity Calculation:** The `score` column is the cosine similarity between the English and Indonesian sentence embeddings. This method matches sentences with the same underlying meaning, even if they are not literal word-for-word translations or if the sentence structures differ between the two languages.
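The length filter from step 2 and the cosine similarity from step 3 can be sketched in plain Python. This is a minimal illustration, not the dataset author's script: the function names `clean_sentences` and `cosine_similarity` are hypothetical, and the short vectors in the usage lines stand in for the much longer embeddings that the `paraphrase-multilingual-MiniLM-L12-v2` model would produce.

```python
import math

MIN_CHARS = 20  # minimum sentence length stated in the dataset card


def clean_sentences(sentences):
    """Strip surrounding whitespace and drop sentences shorter than MIN_CHARS."""
    stripped = (s.strip() for s in sentences)
    return [s for s in stripped if len(s) >= MIN_CHARS]


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy inputs for illustration only; real inputs would be tokenized article
# sentences and model embeddings.
kept = clean_sentences(["  Physics  ", "Applied physics is rooted in the physical sciences."])
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```

Because cosine similarity depends only on the angle between the two vectors, a pair of sentences can score highly even when their lengths or word order differ, which is consistent with the behaviour described above.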
![Dataset Creation Workflow](workflow.png)

## Usage Limitations & Tips

- **Filtering by Quality:** Users are strongly encouraged to filter the dataset based on the `score` column. For strict machine translation tasks, filtering pairs with a higher score threshold (e.g., `score > 0.85` or `0.90`) will yield the highest quality parallel data. Lower scores can be retained for general pre-training or broader language modeling.
- **Wikipedia Nature:** The data reflects the style and potential biases present in encyclopedic texts.

## Citation

If you use this dataset in your research or projects, please link back to this repository.
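As a sketch of the score-based filtering recommended under Usage Limitations & Tips, the snippet below keeps only rows above a threshold. The helper `filter_by_score` is hypothetical, and the rows are toy data: the first reuses the score and titles from the example instance in the card, while the other two are made-up placeholders.

```python
def filter_by_score(rows, min_score=0.85):
    """Keep rows whose `score` is strictly greater than min_score."""
    return [row for row in rows if row["score"] > min_score]


# Toy rows for illustration; only the first mirrors the card's example instance.
rows = [
    {"en_title": "Applied physics", "id_title": "Fisika terapan", "score": 0.852182},
    {"en_title": "Placeholder A", "id_title": "Placeholder A (id)", "score": 0.61},
    {"en_title": "Placeholder B", "id_title": "Placeholder B (id)", "score": 0.93},
]

relaxed = filter_by_score(rows)                 # threshold 0.85
strict = filter_by_score(rows, min_score=0.90)  # threshold 0.90
```

With the Hugging Face `datasets` library, the same predicate could presumably be passed to `Dataset.filter` after loading the `science_translation` config; that step is left out here so the sketch runs without network access.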
Provider:
Ik45