jimjunior/cocis-web-info
Hugging Face · Updated 2026-03-28 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/jimjunior/cocis-web-info
Resource description:
---
license: mit
task_categories:
- question-answering
language:
- en
tags:
- education
- makerere
pretty_name: COCIS WEB INFO
size_categories:
- n<1K
configs:
- config_name: default
  data_files:
  - split: train
    path: "chunks/*.json"
---
# COCIS WEB INFO
## Dataset Summary
This dataset contains information about the Makerere University College of Computing and Information Sciences (CoCIS), scraped from the college's official website and related websites.
The dataset consists of approximately **513 JSON chunks**, designed for high-performance streaming and parallel processing. Each chunk represents a discrete unit of data structured for machine learning tasks.
Because the data is sharded into 513 files, the repository works with the `datasets` library's streaming mode, which lets users iterate over records without first downloading the entire dataset or loading it into RAM. This is useful in resource-constrained environments and in high-concurrency CI/CD pipelines.
## Repository Structure
The data is organized into a `chunks/` directory to maintain a clean root level:
```text
.
├── README.md        # This file
└── chunks/          # Directory containing 513 JSON files
    ├── chunk_1.json
    ├── chunk_2.json
    └── ...
```
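For a local clone of the repository, the layout above can be walked with the standard library alone. The following is a minimal sketch, not part of the dataset's own tooling: the helper name `iter_chunks` is hypothetical, and the internal schema of each chunk is not documented in this card, so the code only parses each file and reports its top-level type.

```python
import json
import re
from pathlib import Path


def iter_chunks(chunks_dir):
    """Yield (path, parsed_json) for every chunk_<n>.json file, in numeric order."""

    def chunk_index(path):
        # Extract <n> from "chunk_<n>.json" so that chunk_10 sorts after chunk_2.
        match = re.fullmatch(r"chunk_(\d+)\.json", path.name)
        return int(match.group(1)) if match else -1

    # Skip any file that does not follow the chunk_<n>.json naming pattern.
    paths = [p for p in Path(chunks_dir).glob("chunk_*.json") if chunk_index(p) >= 0]
    for path in sorted(paths, key=chunk_index):
        with path.open(encoding="utf-8") as handle:
            yield path, json.load(handle)


if __name__ == "__main__":
    # "chunks" is the directory shown in the tree, relative to a local clone (assumed).
    for path, payload in iter_chunks("chunks"):
        print(path.name, type(payload).__name__)
```

Sorting on the numeric suffix rather than the raw filename avoids the lexicographic order `chunk_1`, `chunk_10`, `chunk_100`, which matters only if the order of chunks is meaningful for your use.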
## Usage
You can load this dataset directly with the Hugging Face `datasets` library:
```python
from datasets import load_dataset

# Standard loading: downloads and caches the full dataset
dataset = load_dataset("jimjunior/cocis-web-info")

# Streaming mode (recommended for many shards): records are fetched lazily
streamed_dataset = load_dataset("jimjunior/cocis-web-info", streaming=True)
print(next(iter(streamed_dataset["train"])))
```
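When exploring the data in streaming mode, it can help to look at only the first few records. The sketch below is one possible way to do that; the helper names `peek` and `show_first_records` are hypothetical, and because the record fields are not documented in this card, the code prints whatever keys it finds instead of assuming a schema.

```python
from itertools import islice


def peek(records, n=3):
    """Return the first n items of any iterable, leaving the rest unread."""
    return list(islice(records, n))


def show_first_records(repo_id="jimjunior/cocis-web-info", n=3):
    """Print the field names of the first n streamed records (requires network access)."""
    # Imported lazily so that peek() stays usable without `datasets` installed.
    from datasets import load_dataset

    streamed = load_dataset(repo_id, split="train", streaming=True)
    for record in peek(streamed, n):
        # Field names are not documented in the card, so list them rather than guess.
        print(sorted(record.keys()))
```

Calling `show_first_records()` fetches only as much data as is needed to produce the first three records, which is usually enough to check the field names before writing a full preprocessing step.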
## Maintenance and Contributions
This dataset was created as part of the 2026 undergraduate CSC Machine Learning assignment. It is actively maintained by [Beingana Jim Junior](https://www.linkedin.com/in/jim-junior-beingana/).
The code used to collect and manage this data is available at [https://github.com/jim-junior/SW-ML-1-NLP-Project](https://github.com/jim-junior/SW-ML-1-NLP-Project).
## Citation
```text
@misc{junior2026dataset,
author = {Jim Junior, B.},
title = {513-Chunk JSON Dataset},
year = {2026},
publisher = {Hugging Face},
journal = {Hugging Face Hub},
}
```
Provided by:
jimjunior



