
A Chinese Multimodal Dataset for Music Theory Knowledge Generation

Source: DataCite Commons. Updated: 2025-08-05. Indexed: 2026-05-05.
Download link:
https://www.scidb.cn/detail?dataSetId=c41cdbae3b844dc0afb67311003eef6e
Description:
This dataset was constructed through a systematic multi-stage process. Raw data were drawn from 1.13 million heterogeneous samples in the English-language MusicPile-sft dataset, plus 50,000 supplementary entries generated via the Spark Platform API. Processing combined automated Python 3.9 scripts with manual verification. Key tools included PyArrow (v12.0) for Parquet format conversion, regular expressions (Python's re module) for musical notation validation, and the DeepSeek API (v2.3) for translating specialized terms. The dataset covers musical literature from the 20th century to 2025, with spatial resolution at China's provincial level. Historical figures are annotated with birth and death years (e.g., "Huang Zi (1904-1938)"), giving annual temporal resolution.

The dataset is stored as a single JSON file (multi-field_music_datasets.json) containing 41,000 rigorously aligned text-score pairs. Each record has three core fields: instruction (task directive), input (content input), and output (target output). Measurement units follow musical conventions: "M:3/4" denotes three quarter-note beats per measure, and "A4" corresponds to the standard 440 Hz pitch. Notational errors are kept below 0.7% and stem primarily from inconsistent transposition markings (e.g., the enharmonically equivalent spellings "Bb" and "A#"), which were standardized through modal normalization rules.

The file is UTF-8 encoded JSON and is compatible with mainstream analysis tools (e.g., Python's json module and the pandas library). For ABC notation-specific symbols (e.g., "^F" indicating F♯), the dataset includes a comprehensive symbol reference table (over 1,200 terms) and enforces syntactic consistency through regular-expression patterns (e.g., r"[_^=]?[A-Ga-g][#b]?[0-9]?"). Traditional Chinese music ornamentation markers (e.g., "~" for portamento) have extended annotation fields detailing performance techniques.
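
To illustrate how a pattern of this kind might be applied, here is a minimal Python sketch that checks individual note tokens against the example expression quoted above. The function names (is_valid_note_token, invalid_tokens) are hypothetical, and using a whole-token match is an assumption; the dataset's actual validation scripts are not published in this description and may work differently.

```python
import re

# Example pattern quoted from the dataset description:
# optional accidental prefix, a note letter, optional '#'/'b', optional digit.
NOTE_PATTERN = re.compile(r"[_^=]?[A-Ga-g][#b]?[0-9]?")


def is_valid_note_token(token: str) -> bool:
    """Return True if the entire token matches the quoted pattern.

    Whole-token matching (fullmatch) is an assumption of this sketch.
    """
    return NOTE_PATTERN.fullmatch(token) is not None


def invalid_tokens(tokens):
    """Return the tokens that do not match the pattern, in input order."""
    return [t for t in tokens if not is_valid_note_token(t)]
```

For example, invalid_tokens(["^F", "A4", "X9"]) would flag only "X9", since "X" is not a note letter in the pattern. Note that the quoted expression is only an example from the description and does not cover all of ABC notation.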
The dataset has undergone three-tier quality validation: primary (automated rule filtering), intermediate (cross-model consistency checking), and advanced (expert sampling review), achieving 96.5% text-score alignment accuracy and 98.2% terminological translation precision.
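
As a rough idea of what an automated rule filter over the three core fields could look like, the following Python sketch loads the JSON file and reports records whose instruction, input, or output field is missing or not a string. It assumes the file's top level is a JSON array of objects, which the description does not state; check_record and load_records are hypothetical helper names, not part of the dataset's tooling.

```python
import json

# Field names taken from the dataset description.
REQUIRED_FIELDS = ("instruction", "input", "output")


def check_record(record):
    """Return a list of problems found in one record (empty if none)."""
    if not isinstance(record, dict):
        return ["record is not a JSON object"]
    problems = []
    for field in REQUIRED_FIELDS:
        # Empty strings are allowed; only the type and presence are checked.
        if not isinstance(record.get(field), str):
            problems.append(f"{field}: missing or not a string")
    return problems


def load_records(path):
    """Load the dataset file, assuming a top-level JSON array (unverified)."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    if not isinstance(data, list):
        raise ValueError("expected a top-level JSON array")
    return data
```

A typical use would be to call load_records("multi-field_music_datasets.json") and collect the indices of records for which check_record returns a non-empty list.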
Provider:
Science Data Bank
Created:
2025-07-08