
AmyIvan/mosaic-acl2026

Source: Hugging Face | Updated: 2026-04-08 | Indexed: 2026-04-12
Download link:
https://hf-mirror.com/datasets/AmyIvan/mosaic-acl2026
Resource description:
---
pretty_name: MOSAIC
license: cc-by-nc-sa-4.0
language:
- zh
- en
task_categories:
- summarization
- text-generation
- feature-extraction
size_categories:
- 10K<n<100K
tags:
- education
- multimodal
- subtitles
- knowledge-graph
- slides
---

# MOSAIC

## Dataset Summary

MOSAIC is a course-centric multimodal dataset released with an ACL 2026 paper. The dataset centers on `mosaic.jsonl`, a JSONL file that stores course-level metadata together with nested video-level summaries, subtitles, captions, and auxiliary references.

The dataset also includes:

- `data/graph_p_results/`: course-level knowledge graph JSON files keyed by `kg`
- `data/all.csv`: URL-to-filename mapping for slide references
- `data/pdfs/shard_xx/`: sharded reference slide PDFs

## Supported Tasks

- multimodal educational data understanding
- subtitle and caption analysis
- document-aware summarization
- course knowledge graph grounding
- retrieval over linked videos, graphs, and slides

## Languages

The dataset is primarily in Chinese, with a smaller amount of English content in slide titles, references, and course materials.

## Dataset Structure

```text
.
├── README.md
└── data/
    ├── mosaic.jsonl
    ├── all.csv
    ├── graph_p_results/
    │   ├── BIT-1001604004.json
    │   └── ...
    └── pdfs/
        ├── shard_00/
        ├── shard_01/
        └── ...
```

## Data Instances

### Main file: `data/mosaic.jsonl`

Each line is one course record with the following top-level fields:

- `url`
- `course_title`
- `contents`
- `kg`
- `caption_anno`
- `overview`
- `objectives`
- `prerequisites`
- `references`

Each video entry inside `contents[*].courses[*]` contains:

- `video_url`
- `srt_url`
- `summary`
- `subtitle`
- `caption`
- `video_title`
- `ref`

The `ref` object includes:

- `cate`: reference category
- `doc`: list of reference document URLs

### Knowledge graphs: `data/graph_p_results/*.json`

Each knowledge graph file contains a top-level object with:

- `code`
- `message`
- `sampled`
- `traceId`
- `result`

The main graph payload is stored in:

- `result.mocKgNodeDtoList`

### Slide mapping: `data/all.csv`

Columns:

- `doc_url`: document URL referenced in `mosaic.jsonl`
- `filename`: corresponding PDF filename

### PDFs: `data/pdfs/shard_xx/`

Reference slide PDFs are sharded into directories of up to 500 files each for more reliable upload and browsing.

## Dataset Creation

MOSAIC is constructed from public courses on iCourse163, a major Chinese MOOC platform. The source data follows a four-level hierarchy of course, chapter, video, and topic. Each course provides course-level metadata such as objectives and prerequisite information; chapters group related videos and associated slide decks; videos include timestamped ASR transcripts, instructor-provided knowledge-point outlines, and summaries; and topics correspond to the predefined knowledge points used for alignment. Because the platform does not provide high-quality alignment between transcripts, topic inventories, and slides, the dataset constructs these links from scratch.

MOSAIC is released in two subsets:

- MOSAIC-G, a fully human-annotated gold benchmark built from 6 diverse courses, with utterance-level topic labels and utterance-to-slide alignment.
- MOSAIC-S, a large silver subset covering the remaining courses, produced with DORA, a two-stage pipeline that first refines noisy topic inventories and then performs joint segmentation and topic assignment.

For slide linkage in MOSAIC-S, the paper describes an automatic pipeline combining title matching, rule-based filtering, and LLM verification.

## Statistics

| Metric | Value |
| --- | ---: |
| Courses | 179 |
| Videos | 14,942 |
| Knowledge graph JSON files | 167 |
| PDF files | 10,566 |
| Slide mapping rows | 10,566 |
| Raw size | ~12.17 GB (11.34 GiB) |

## Licensing Information

This dataset is released under **CC BY-NC-SA 4.0**.

## Citation Information

```bibtex
@inproceedings{ai-etal-2026-mosaic,
  title = {MOSAIC: A Large-Scale Multimodal Open-Course Segmentation and Alignment Corpus in Chinese},
  author = {Ai, Yuming and Fan, Shuai and Xu, Hua and Kong, Fang},
  booktitle = {Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics},
  year = {2026}
}
```
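
The file layout described under Data Instances can be walked with the Python standard library alone. The sketch below is illustrative, not an official loader: the helper names (`iter_videos`, `load_slide_map`, `slide_filenames`, `find_pdf`) are hypothetical, and it assumes that `contents` is a list of chapter objects each holding a `courses` list, that `ref` may be missing or null, that `all.csv` has a header row with the `doc_url` and `filename` columns, and that `filename` is a bare file name located in one of the `shard_*` directories.

```python
import csv
import json
from pathlib import Path


def iter_videos(jsonl_path):
    """Yield (course_title, video_entry) pairs from a mosaic.jsonl-style file."""
    with open(jsonl_path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:  # tolerate blank lines
                continue
            course = json.loads(line)
            # Assumption: contents -> list of chapters, each with a "courses" list.
            for chapter in course.get("contents") or []:
                for video in chapter.get("courses") or []:
                    yield course.get("course_title"), video


def load_slide_map(csv_path):
    """Map each doc_url to its PDF filename (assumes a header row)."""
    with open(csv_path, encoding="utf-8", newline="") as fh:
        return {row["doc_url"]: row["filename"] for row in csv.DictReader(fh)}


def slide_filenames(video, slide_map):
    """Return PDF filenames for the URLs in video["ref"]["doc"], skipping unmapped URLs."""
    ref = video.get("ref") or {}  # assumption: ref can be absent or null
    return [slide_map[url] for url in ref.get("doc") or [] if url in slide_map]


def find_pdf(pdf_root, filename):
    """Search the shard_* directories under pdf_root; return the path or None."""
    for shard in sorted(Path(pdf_root).glob("shard_*")):
        candidate = shard / filename
        if candidate.is_file():
            return candidate
    return None
```

With a local copy of the dataset, a typical call would be `slide_map = load_slide_map("data/all.csv")` followed by a loop over `iter_videos("data/mosaic.jsonl")`, passing each video to `slide_filenames` and each resulting name to `find_pdf("data/pdfs", name)`. Streaming line by line avoids loading the whole JSONL file into memory at once.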
Provider:
AmyIvan