horelulus/ID_REG_Parsed
Source: Hugging Face · Updated 2026-03-28 · Indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/horelulus/ID_REG_Parsed
Resource description:
---
license: apache-2.0
---
# 🇮🇩 Indonesian Regulation Parsed Dataset
This dataset contains the **parsed content of over 250,000 Indonesian regulations**, extracted directly from PDFs at the article level, **without chunking**. The result is **more than 3.5 million rows**, each representing an individual article or clause.
---
## 📄 Dataset Overview
* **Source**: Parsed from the original PDF collection of Indonesian regulations (\~250K documents)
* **Parsed Units**: Articles, clauses, or sections as atomic units (no chunking applied)
* **Size**: Over **3.5 million rows** of parsed text data
* **Token Count**: Approximately **2 billion tokens** in total
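
The figures above imply some rough per-unit averages, which can help when budgeting context windows or storage. The short calculation below is only a back-of-envelope sketch: it treats the rounded figures from this card as exact, so the results (about 14 rows per document, about 571 tokens per row, about 8,000 tokens per document) are order-of-magnitude estimates, not measured statistics.

```python
# Approximate figures quoted in the overview above.
DOCUMENTS = 250_000        # "~250K documents"
ROWS = 3_500_000           # "over 3.5 million rows"
TOKENS = 2_000_000_000     # "approximately 2 billion tokens"

# Derived averages (rough, because the inputs are rounded).
rows_per_document = ROWS / DOCUMENTS
tokens_per_row = TOKENS / ROWS
tokens_per_document = TOKENS / DOCUMENTS

print(f"rows per document:   ~{rows_per_document:.0f}")
print(f"tokens per row:      ~{tokens_per_row:.0f}")
print(f"tokens per document: ~{tokens_per_document:.0f}")
```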
---
## ⚙️ Processing Pipeline
* **Data Flow**:
    1. Pull PDFs from the **[ID\_REG repository](https://huggingface.co/datasets/Azzindani/ID_REG)**
    2. Parse and extract articles using automated PDF text extraction methods
    3. Push cleaned, parsed text to this Hugging Face repository
* **Infrastructure**:
    * Utilized **6 simultaneous Google Colab instances** for parallel processing
    * Cloud-to-cloud architecture enables direct transfer without local storage bottlenecks
    * Total processing time: \~30 hours
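
This card does not publish the parsing code, so the following is only an illustrative sketch of two ideas from the pipeline above: dividing a list of PDFs among six parallel workers, and splitting extracted text into article-level units. It assumes each article begins on its own line with a heading of the form `Pasal <number>` (the Indonesian word for "article"). The helper names `split_articles` and `shard`, the file names, and the sample text are all hypothetical.

```python
import re

# Assumption: every article starts on its own line as "Pasal <number>".
ARTICLE_HEADING = re.compile(r"^[ \t]*Pasal[ \t]+(\d+[A-Z]?)[ \t]*$", re.MULTILINE)


def split_articles(text):
    """Return (article_number, body) pairs, one per 'Pasal' heading found."""
    matches = list(ARTICLE_HEADING.finditer(text))
    articles = []
    for i, match in enumerate(matches):
        body_start = match.end()
        # The body runs until the next heading, or to the end of the text.
        body_end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        articles.append((match.group(1), text[body_start:body_end].strip()))
    return articles


def shard(items, num_workers, worker_index):
    """Round-robin split: worker k takes items k, k + n, k + 2n, ..."""
    return items[worker_index::num_workers]


# Made-up placeholder text, not taken from any real regulation.
sample = "Pasal 1\nIsi pasal pertama.\n\nPasal 2\nIsi pasal kedua.\n"
for number, body in split_articles(sample):
    print(number, "->", body)

# Hypothetical file names, divided among six workers as in the setup above.
files = [f"doc_{i}.pdf" for i in range(10)]
for worker in range(6):
    print(worker, shard(files, 6, worker))
```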
---
## 🧠 Use Cases
* Fine-grained legal text analysis and search
* Article-level classification, summarization, or information retrieval
* Training large language models on legal text with natural article boundaries
* Developing Indonesian regulation compliance tools
---
## ⚠️ Disclaimer
This dataset is provided **exclusively for research and development** purposes. It contains parsed content derived from publicly available legal PDFs, and no guarantees are made regarding completeness or legal accuracy.
---
## 🙏 Acknowledgments
* Hosted on **[Hugging Face](https://huggingface.co/)** — a wonderful platform for open data sharing
* Processing powered by **[Google Colab](https://colab.research.google.com/)** — scalable cloud resources that made fast parallel parsing possible
---
Provider:
horelulus



