
horelulus/ID_REG_Parsed

Source: Hugging Face · Updated 2026-03-28 · Indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/horelulus/ID_REG_Parsed
Resource description:
---
license: apache-2.0
---

# 🇮🇩 Indonesian Regulation Parsed Dataset

This dataset contains the **parsed content of over 250,000 Indonesian regulations**, extracted directly from PDFs at the article level, **without chunking**, resulting in **3.5+ million rows** representing individual articles and clauses.

---

## 📄 Dataset Overview

* **Source**: Parsed from the original PDF collection of Indonesian regulations (~250K documents)
* **Parsed Units**: Articles, clauses, or sections as atomic units (no chunking applied)
* **Size**: Over **3.5 million rows** of parsed text data
* **Token Count**: Approximately **2 billion tokens** in total

---

## ⚙️ Processing Pipeline

* **Data Flow**:
  1. Pull PDFs from the **[ID_REG repository](https://huggingface.co/datasets/Azzindani/ID_REG)**
  2. Parse and extract articles using automated PDF text extraction methods
  3. Push cleaned, parsed text to this Hugging Face repository
* **Infrastructure**:
  * Utilized **6 simultaneous Google Colab instances** for parallel processing
  * Cloud-to-cloud architecture enables direct transfer without local storage bottlenecks
  * Total processing time: ~30 hours

---

## 🧠 Use Cases

* Fine-grained legal text analysis and search
* Article-level classification, summarization, or information retrieval
* Training large language models on legal text with natural article boundaries
* Developing Indonesian regulation compliance tools

---

## ⚠️ Disclaimer

This dataset is provided **exclusively for research and development** purposes. It contains parsed content derived from publicly available legal PDFs, and no guarantees are made regarding completeness or legal accuracy.

---

## 🙏 Acknowledgments

* Hosted on **[Hugging Face](https://huggingface.co/)**, a wonderful platform for open data sharing
* Processing powered by **[Google Colab](https://colab.research.google.com/)**, whose scalable cloud resources made fast parallel parsing possible

---
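
The dataset card does not describe how articles were actually extracted, so the following is only a minimal sketch of what article-level splitting (as opposed to fixed-size chunking) could look like. It assumes that each article begins with a heading line of the form `Pasal N`, the usual article heading in Indonesian regulations. The function name `split_into_articles`, the row keys, and the sample text are hypothetical and not taken from the dataset.

```python
import re

# Assumption: an article starts with a line containing only "Pasal <number>",
# optionally followed by a capital letter (e.g. "Pasal 12A").
ARTICLE_HEADING = re.compile(r"^[ \t]*Pasal[ \t]+(\d+[A-Z]?)[ \t]*$", re.MULTILINE)


def split_into_articles(text: str) -> list[dict]:
    """Return one row per article; text before the first heading is dropped."""
    matches = list(ARTICLE_HEADING.finditer(text))
    rows = []
    for i, match in enumerate(matches):
        # An article body runs until the next heading or the end of the text.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        rows.append({
            "article": match.group(1),
            "text": text[match.end():end].strip(),
        })
    return rows


# Hypothetical sample text, for illustration only.
sample = """BAB I
KETENTUAN UMUM

Pasal 1
Dalam Peraturan ini yang dimaksud dengan Menteri adalah menteri.

Pasal 2
(1) Peraturan ini berlaku sejak tanggal diundangkan.
(2) Ketentuan lebih lanjut diatur dengan Peraturan Menteri.
"""

rows = split_into_articles(sample)
for row in rows:
    print(row["article"], "->", row["text"][:40])
```

Splitting on headings keeps each article whole regardless of length, which matches the "natural article boundaries" the card mentions; real PDF text would likely need extra handling for headers, footers, and amended articles.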
Provider:
horelulus