AmyIvan/mosaic-acl2026
Source: Hugging Face. Updated 2026-04-08; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/AmyIvan/mosaic-acl2026
Resource description:
---
pretty_name: MOSAIC
license: cc-by-nc-sa-4.0
language:
- zh
- en
task_categories:
- summarization
- text-generation
- feature-extraction
size_categories:
- 10K<n<100K
tags:
- education
- multimodal
- subtitles
- knowledge-graph
- slides
---
# MOSAIC
## Dataset Summary
MOSAIC is a course-centric multimodal dataset released with an ACL 2026 paper. The dataset centers on `mosaic.jsonl`, a JSONL file that stores course-level metadata together with nested video-level summaries, subtitles, captions, and auxiliary references.
The dataset also includes:
- `data/graph_p_results/`: course-level knowledge graph JSON files, keyed by the `kg` field of each course record
- `data/all.csv`: URL-to-filename mapping for slide references
- `data/pdfs/shard_xx/`: sharded reference slide PDFs
## Supported Tasks
- multimodal educational data understanding
- subtitle and caption analysis
- document-aware summarization
- course knowledge graph grounding
- retrieval over linked videos, graphs, and slides
## Languages
The dataset is primarily in Chinese, with a smaller amount of English content in slide titles, references, and course materials.
## Dataset Structure
```text
.
├── README.md
└── data/
    ├── mosaic.jsonl
    ├── all.csv
    ├── graph_p_results/
    │   ├── BIT-1001604004.json
    │   └── ...
    └── pdfs/
        ├── shard_00/
        ├── shard_01/
        └── ...
```
## Data Instances
### Main file: `data/mosaic.jsonl`
Each line is one course record with the following top-level fields:
- `url`
- `course_title`
- `contents`
- `kg`
- `caption_anno`
- `overview`
- `objectives`
- `prerequisites`
- `references`
Each video entry inside `contents[*].courses[*]` contains:
- `video_url`
- `srt_url`
- `summary`
- `subtitle`
- `caption`
- `video_title`
- `ref`
The `ref` object includes:
- `cate`: reference category
- `doc`: list of reference document URLs
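The nesting described above can be walked with a few lines of Python. This is a minimal sketch, not an official loader: it assumes `contents` is a list of chapter-like objects that each carry a `courses` list, and that `ref` may be missing or null for some videos. The helper names `iter_videos` and `reference_docs` are hypothetical.

```python
import json
from typing import Any, Dict, Iterator, List, Tuple


def iter_videos(jsonl_path: str) -> Iterator[Tuple[Dict[str, Any], Dict[str, Any]]]:
    """Yield (course_record, video_entry) pairs from a MOSAIC-style JSONL file."""
    with open(jsonl_path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            course = json.loads(line)
            # Assumption: `contents` is a list of objects, each with a `courses` list.
            for chapter in course.get("contents") or []:
                for video in chapter.get("courses") or []:
                    yield course, video


def reference_docs(video: Dict[str, Any]) -> List[str]:
    """Return the document URLs in `ref.doc`, or an empty list if absent."""
    ref = video.get("ref") or {}
    return list(ref.get("doc") or [])
```

For example, `sum(1 for _ in iter_videos("data/mosaic.jsonl"))` counts the video entries, and `reference_docs(video)` lists the slide URLs attached to a single video.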
### Knowledge graphs: `data/graph_p_results/*.json`
Each knowledge graph file contains a top-level object with:
- `code`
- `message`
- `sampled`
- `traceId`
- `result`
The main graph payload is stored in:
- `result.mocKgNodeDtoList`
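Reading the graph payload amounts to following the `result.mocKgNodeDtoList` path. The sketch below assumes the payload is a list of node objects and makes no assumption about the fields inside each node, since the card does not document them. The function name `load_kg_nodes` is hypothetical.

```python
import json
from typing import Any, Dict, List


def load_kg_nodes(path: str) -> List[Dict[str, Any]]:
    """Return the node list stored under result.mocKgNodeDtoList."""
    with open(path, encoding="utf-8") as handle:
        payload = json.load(handle)
    # Fall back to empty containers if `result` or the node list is missing or null.
    result = payload.get("result") or {}
    return result.get("mocKgNodeDtoList") or []
```

Returning an empty list for a missing payload keeps downstream loops simple. Callers that need to distinguish an empty graph from a malformed file can inspect the top-level `code` and `message` fields themselves.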
### Slide mapping: `data/all.csv`
Columns:
- `doc_url`: document URL referenced in `mosaic.jsonl`
- `filename`: corresponding PDF filename
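The mapping can be loaded into a dictionary for lookups by URL. This sketch assumes the CSV has a header row with exactly the column names listed above and is UTF-8 encoded. The function name `load_slide_mapping` is hypothetical.

```python
import csv
from typing import Dict


def load_slide_mapping(csv_path: str) -> Dict[str, str]:
    """Map each doc_url to its PDF filename."""
    # utf-8-sig also accepts plain UTF-8 and strips a byte-order mark if present.
    with open(csv_path, encoding="utf-8-sig", newline="") as handle:
        reader = csv.DictReader(handle)
        return {row["doc_url"]: row["filename"] for row in reader}
```

A URL taken from a video's `ref.doc` list can then be resolved with `mapping.get(url)`, which returns `None` when the URL has no mapped PDF.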
### PDFs: `data/pdfs/shard_xx/`
Reference slide PDFs are sharded into directories of up to 500 files each for more reliable upload and browsing.
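Because the card does not state how files are assigned to shards, one way to locate a PDF is to index every shard directory once by filename. The sketch below assumes the `filename` values in `data/all.csv` are bare filenames that match the files on disk. The function name `index_pdfs` is hypothetical.

```python
from pathlib import Path
from typing import Dict


def index_pdfs(pdf_root: str) -> Dict[str, Path]:
    """Map each filename to its path under pdf_root/shard_*/."""
    index: Dict[str, Path] = {}
    for path in sorted(Path(pdf_root).glob("shard_*/*")):
        if path.is_file():
            # Keep the first occurrence if a filename appears in several shards.
            index.setdefault(path.name, path)
    return index
```

With 10,566 PDFs and at most 500 files per shard, the index spans at least 22 shard directories, so building it once is cheaper than scanning the shards for every lookup.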
## Dataset Creation
MOSAIC is constructed from public courses on iCourse163, a major Chinese MOOC platform. The source data follows a four-level hierarchy of course, chapter, video, and topic. Each course provides course-level metadata such as objectives and prerequisite information; chapters group related videos and associated slide decks; videos include timestamped ASR transcripts, instructor-provided knowledge-point outlines, and summaries; and topics correspond to the predefined knowledge points used for alignment.

Because the platform does not provide high-quality alignment between transcripts, topic inventories, and slides, the dataset constructs these links from scratch. MOSAIC is released in two subsets:

- MOSAIC-G: a fully human-annotated gold benchmark built from 6 diverse courses, with utterance-level topic labels and utterance-to-slide alignment
- MOSAIC-S: a large silver subset covering the remaining courses, produced with DORA, a two-stage pipeline that first refines noisy topic inventories and then performs joint segmentation and topic assignment

For slide linkage in MOSAIC-S, the paper describes an automatic pipeline combining title matching, rule-based filtering, and LLM verification.
## Statistics
| Metric | Value |
| --- | ---: |
| Courses | 179 |
| Videos | 14,942 |
| Knowledge graph JSON files | 167 |
| PDF files | 10,566 |
| Slide mapping rows | 10,566 |
| Raw size | ~12.17 GB (11.34 GiB) |
## Licensing Information
This dataset is released under **CC BY-NC-SA 4.0**.
## Citation Information
```bibtex
@inproceedings{ai-etal-2026-mosaic,
title = {MOSAIC: A Large-Scale Multimodal Open-Course Segmentation and Alignment Corpus in Chinese},
author = {Ai, Yuming and Fan, Shuai and Xu, Hua and Kong, Fang},
booktitle = {Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics},
year = {2026}
}
```
Provider:
AmyIvan



