
jimjunior/cocis-web-info

Hugging Face | Updated 2026-03-28 | Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/jimjunior/cocis-web-info
Resource description:
---
license: mit
task_categories:
- question-answering
language:
- en
tags:
- education
- makerere
pretty_name: COCIS WEB INFO
size_categories:
- n<1K
configs:
- config_name: default
  data_files:
  - split: train
    path: "chunks/*.json"
---

# COCIS WEB INFO

## Dataset Summary

This dataset contains information about the Makerere University College of Computing and Information Science, scraped from its official website and related websites.

The dataset consists of approximately **513 JSON chunks**, designed for streaming and parallel processing. Each chunk represents a discrete unit of data structured for machine learning tasks.

Because the data is sharded into 513 files, this repository supports the `datasets` library's streaming mode, which lets users train models without downloading the entire dataset into RAM. This is useful in resource-constrained environments and in high-concurrency CI/CD pipelines.

## Repository Structure

The data is organized into a `chunks/` directory to keep the root level clean:

```text
.
├── README.md       # This file
└── chunks/         # Directory containing 513 JSON files
    ├── chunk_1.json
    ├── chunk_2.json
    └── ...
```

## Usage

You can load this dataset directly using the Hugging Face `datasets` library:

```python
from datasets import load_dataset

# Standard loading: downloads and caches every shard
dataset = load_dataset("jimjunior/cocis-web-info")

# Streaming mode (recommended for many shards): records are fetched lazily
streamed_dataset = load_dataset("jimjunior/cocis-web-info", streaming=True)
print(next(iter(streamed_dataset["train"])))
```

## Maintenance and Contributions

This dataset was created as part of the 2026 undergraduate CSC Machine Learning assignment. It is actively maintained by [Beingana Jim Junior](https://www.linkedin.com/in/jim-junior-beingana/).

The code used to collect and manage this data can be found at [https://github.com/jim-junior/SW-ML-1-NLP-Project](https://github.com/jim-junior/SW-ML-1-NLP-Project).

## Citation

```text
@misc{junior2026dataset,
  author = {Jim Junior, B.},
  title = {513-Chunk JSON Dataset},
  year = {2026},
  publisher = {Hugging Face},
  journal = {Hugging Face Hub},
}
```
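
For readers working from a local clone of the repository, the `chunks/` layout described under Repository Structure can also be read lazily without the `datasets` library. The following is a minimal sketch, not part of the dataset's own tooling: the helper names `chunk_index` and `iter_chunks` are hypothetical, and because the card does not document the schema of each chunk, the code simply yields whatever JSON value each file contains.

```python
import json
import re
from pathlib import Path
from typing import Any, Iterator


def chunk_index(path: Path) -> int:
    """Return N for a file named chunk_N.json, or -1 if the name does not match."""
    match = re.search(r"chunk_(\d+)\.json$", path.name)
    return int(match.group(1)) if match else -1


def iter_chunks(chunks_dir: str) -> Iterator[Any]:
    """Yield the parsed JSON content of each chunk file, one file at a time.

    Files are visited in numeric order (chunk_2 before chunk_10), and only
    one file is held in memory at any moment.
    """
    paths = sorted(Path(chunks_dir).glob("chunk_*.json"), key=chunk_index)
    for path in paths:
        with path.open(encoding="utf-8") as handle:
            yield json.load(handle)
```

Calling `next(iter_chunks("chunks"))` from the repository root would return the content of `chunk_1.json`, assuming the files follow the `chunk_N.json` naming shown in the directory tree. Sorting by the numeric suffix avoids the lexicographic order in which `chunk_10.json` would come before `chunk_2.json`.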
Provided by:
jimjunior