A Chinese Multimodal Dataset for Music Theory Knowledge Generation
Source: DataCite Commons | Updated: 2025-08-05 | Indexed: 2026-05-05
Download link:
https://www.scidb.cn/detail?dataSetId=c41cdbae3b844dc0afb67311003eef6e
Description:
This dataset was constructed through a systematic multi-stage process. The raw data comes from 1.13 million heterogeneous samples in the English-language MusicPile-sft dataset, plus 50,000 supplementary entries generated via the Spark Platform API. Processing combined automated Python 3.9 scripts with manual verification. Key tools were PyArrow (v12.0) for Parquet format conversion, regular expressions (Python's re module) for musical notation validation, and the DeepSeek API (v2.3) for translating specialized terms. Temporally, the dataset covers musical literature from the 20th century to 2025, with spatial resolution at China's provincial level. Historical figures are annotated with birth and death years (e.g., "Huang Zi (1904-1938)"), giving annual-scale temporal resolution.
The dataset is stored as a single JSON file (multi-field_music_datasets.json) containing 41,000 rigorously aligned text-score pairs. Each record comprises three core fields: instruction (task directive), input (content input), and output (target output). Measurement units follow musical conventions: "M:3/4" denotes three quarter-note beats per measure, and "A4" corresponds to the standard 440 Hz pitch. Notational errors are kept below 0.7% and originate primarily from transposition marking inconsistencies (e.g., equivalent representations of "Bb" and "A#"), which were standardized through modal normalization rules.
The file is UTF-8 encoded JSON and is compatible with mainstream analytical tools (e.g., Python's json module, the Pandas library). For ABC notation-specific symbols (e.g., "^F" indicating F♯), the dataset includes a comprehensive symbol reference table (over 1,200 terms) and enforces syntactic consistency through regular-expression patterns (e.g., r"[_^=]?[A-Ga-g][#b]?[0-9]?"). Traditional Chinese music ornamentation markers (e.g., "~" for portamento) have extended annotation fields detailing performance techniques.
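As a rough illustration of the record layout described above, the following sketch parses a JSON array and keeps only records that carry all three core fields as strings. It assumes the file's top level is an array of objects, which the description does not state. The helper name load_records and the sample records are hypothetical and are not taken from the dataset.

```python
import json

# The three core fields named in the dataset description.
REQUIRED_FIELDS = ("instruction", "input", "output")


def load_records(text):
    """Parse a JSON array and keep records that have all three core fields as strings.

    Assumption: the top level of the JSON document is a list of objects.
    """
    records = json.loads(text)
    return [
        r for r in records
        if isinstance(r, dict)
        and all(isinstance(r.get(field), str) for field in REQUIRED_FIELDS)
    ]


# Hypothetical sample data for illustration only (not real dataset content).
SAMPLE = json.dumps(
    [
        {
            "instruction": "Write a short melody in 3/4 time.",
            "input": "",
            "output": "X:1\nM:3/4\nK:C\nC D E|",
        },
        {"instruction": "record with missing fields"},
    ],
    ensure_ascii=False,
)

valid = load_records(SAMPLE)
print(len(valid))
```

To apply the same check to the published file, one could read multi-field_music_datasets.json with open(..., encoding="utf-8") and pass its contents to load_records, provided the top-level structure matches the assumption above.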
The dataset has undergone three-tier quality validation: primary (automated rule filtering), intermediate (cross-model consistency checking), and advanced (expert sampling review), achieving 96.5% text-score alignment accuracy and 98.2% terminological translation precision.
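The automated rule filtering tier is not specified in detail. As one possible sketch, the note pattern quoted in the description can be applied to individual note tokens with Python's re module. The function invalid_tokens and the token list are hypothetical, and how the dataset authors tokenize scores or apply the pattern is an assumption here.

```python
import re

# Pattern quoted in the dataset description: optional ABC accidental (_ ^ =),
# a note letter, an optional "#" or "b", and an optional single digit.
NOTE_PATTERN = re.compile(r"[_^=]?[A-Ga-g][#b]?[0-9]?")


def invalid_tokens(tokens):
    """Return the tokens that do not match the note pattern in full."""
    return [t for t in tokens if NOTE_PATTERN.fullmatch(t) is None]


# Hypothetical tokens for illustration: the last two are deliberately malformed.
tokens = ["^F", "Bb", "A4", "_e", "H2", "^^"]
print(invalid_tokens(tokens))  # prints ['H2', '^^']
```

Using fullmatch rather than search means a token passes only if the whole string fits the pattern, so stray characters before or after a note are flagged.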
Provider:
Science Data Bank
Created:
2025-07-08



