sander-wood/wikimusictext

Name: sander-wood/wikimusictext
Creator: sander-wood
Published: 2023-12-28 15:09:23
License: MIT

Source: Hugging Face. Updated 2023-12-28; indexed 2024-03-04.

Download link:

https://hf-mirror.com/datasets/sander-wood/wikimusictext

Resource description:

---
license: mit
task_categories:
- text-classification
- text2text-generation
pretty_name: wikimt
size_categories:
- 1K<n<10K
language:
- en
tags:
- music
---

## Dataset Summary

In [CLaMP: Contrastive Language-Music Pre-training for Cross-Modal Symbolic Music Information Retrieval](https://ai-muzic.github.io/clamp/), we introduce WikiMusicText (WikiMT), a new dataset for the evaluation of semantic search and music classification. It includes 1010 lead sheets in ABC notation sourced from Wikifonia.org, each accompanied by a title, artist, genre, and description. The title and artist information is extracted from the score, whereas the genre labels are obtained by matching keywords from the Wikipedia entries and assigned to one of 8 classes (Jazz, Country, Folk, R&B, Pop, Rock, Dance, and Latin) that loosely mimic the GTZAN genres. The description is obtained by using BART-large to summarize and clean the corresponding Wikipedia entry. Additionally, the natural language information within the ABC notation is removed.

WikiMT is a unique resource to support the evaluation of semantic search and music classification. However, the dataset was curated from publicly available sources, and the genre and description information may be limited in accuracy and completeness. Further research is needed to explore the potential biases and limitations of the dataset and to develop strategies to address them.

## How to Access Music Score Metadata for ABC Notation

To access metadata related to ABC notation music scores from the WikiMT dataset, follow these steps:

1. **Locate the Wikifonia MusicXML Data Link:** Start by visiting the discussion thread on the forum to find the download link for the Wikifonia dataset in MusicXML format (with a .mxl extension). You can find the discussion here: [Download for Wikifonia all 6,675 Lead Sheets](http://www.synthzone.com/forum/ubbthreads.php/topics/384909/Download_for_Wikifonia_all_6,6).

2. **Run the Provided Code:** Once you have found the Wikifonia MusicXML data link, execute the provided Python code below. This code will handle the following tasks:
   - Automatically download the "wikimusictext.jsonl" dataset, which contains metadata associated with music scores.
   - Automatically download the "xml2abc.py" conversion script, with special thanks to the author, Willem (Wim).
   - Prompt you for the Wikifonia data URL, as follows:

```
Enter the Wikifonia URL: [Paste your URL here]
```

Paste the URL pointing to the Wikifonia.zip file and press Enter. The code below will take care of downloading, processing, and extracting the music score metadata, making it ready for your research or applications.

```python
import io
import json
import os
import subprocess
import sys
import zipfile

# Install the required packages if they are not installed.
# sys.executable makes pip target the interpreter running this script.
try:
    from unidecode import unidecode
except ImportError:
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', 'unidecode'])
    from unidecode import unidecode

try:
    from tqdm import tqdm
except ImportError:
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', 'tqdm'])
    from tqdm import tqdm

try:
    import requests
except ImportError:
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', 'requests'])
    import requests


def filter_lines(lines):
    # Filter out all lines that include language information
    # (named filter_lines to avoid shadowing the built-in filter)
    music = ""
    for line in lines:
        if line[:2] in ['A:', 'B:', 'C:', 'D:', 'F:', 'G:', 'H:', 'I:', 'N:', 'O:', 'R:', 'r:', 'S:', 'T:', 'W:', 'w:', 'X:', 'Z:'] \
        or line == '\n' \
        or (line.startswith('%') and not line.startswith('%%score')):
            continue
        else:
            if "%" in line and not line.startswith('%%score'):
                # Drop the trailing inline comment and the character before it
                line = "%".join(line.split('%')[:-1])
                music += line[:-1] + '\n'
            else:
                music += line + '\n'
    return music


def load_music(filename):
    # Convert the MusicXML file to ABC notation with xml2abc.py.
    # Passing an argument list avoids shell quoting issues in file names.
    p = subprocess.Popen(
        [sys.executable, 'xml2abc_145/xml2abc.py', '-m', '2', '-c', '6', '-x', filename],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE
    )
    out, err = p.communicate()

    output = out.decode('utf-8').replace('\r', '')  # Capture standard output
    music = unidecode(output).split('\n')
    music = filter_lines(music).strip()

    return music


def download_and_extract(url):
    print(f"Downloading {url}")

    # Send an HTTP GET request to the URL and get the response
    response = requests.get(url, stream=True)

    if response.status_code == 200:
        # Create a BytesIO object and write the HTTP response content into it
        zip_data = io.BytesIO()
        total_size = int(response.headers.get('content-length', 0))

        with tqdm(total=total_size, unit='B', unit_scale=True) as pbar:
            for data in response.iter_content(chunk_size=1024):
                pbar.update(len(data))
                zip_data.write(data)

        # Use the zipfile library to extract the archive into the current directory
        print("Extracting the zip file...")
        with zipfile.ZipFile(zip_data, "r") as zip_ref:
            zip_ref.extractall(".")

        print("Done!")

    else:
        print("Failed to download the file. HTTP response code:", response.status_code)


# URL of the JSONL file
wikimt_url = "https://huggingface.co/datasets/sander-wood/wikimusictext/resolve/main/wikimusictext.jsonl"

# Local filename to save the downloaded file
local_filename = "wikimusictext.jsonl"

# Download the file and save it locally
response = requests.get(wikimt_url)
if response.status_code == 200:
    with open(local_filename, 'wb') as file:
        file.write(response.content)
    print(f"Downloaded '{local_filename}' successfully.")
else:
    print(f"Failed to download. Status code: {response.status_code}")

# Download the xml2abc.py script (special thanks to Wim Vree for creating this script)
download_and_extract("https://wim.vree.org/svgParse/xml2abc.py-145.zip")

# Download the Wikifonia dataset
wikifonia_url = input("Enter the Wikifonia URL: ")
download_and_extract(wikifonia_url)

# Load the metadata records, one JSON object per line
wikimusictext = []
with open("wikimusictext.jsonl", "r", encoding="utf-8") as f:
    for line in f.readlines():
        wikimusictext.append(json.loads(line))

# Attach the converted ABC notation to each record
updated_wikimusictext = []

for song in tqdm(wikimusictext):
    filename = song["artist"] + " - " + song["title"] + ".mxl"
    filepath = os.path.join("Wikifonia", filename)
    song["music"] = load_music(filepath)
    updated_wikimusictext.append(song)

# Write the updated records back to the same file
with open("wikimusictext.jsonl", "w", encoding="utf-8") as f:
    for song in updated_wikimusictext:
        f.write(json.dumps(song, ensure_ascii=False) + "\n")
```

By following these steps and running the provided code, you can efficiently access ABC notation music scores from the WikiMT dataset. Just ensure you have the metadata, the `xml2abc.py` script, and the correct download link before starting. Enjoy your musical journey!

## Copyright Disclaimer

WikiMT was curated from publicly available sources, and all rights to the original content and data remain with their respective copyright holders. The dataset is made available for research and educational purposes, and any use, distribution, or modification of the dataset should comply with the terms and conditions set forth by the original data providers.

## BibTeX entry and citation info

```
@misc{wu2023clamp,
      title={CLaMP: Contrastive Language-Music Pre-training for Cross-Modal Symbolic Music Information Retrieval},
      author={Shangda Wu and Dingyao Yu and Xu Tan and Maosong Sun},
      year={2023},
      eprint={2304.11029},
      archivePrefix={arXiv},
      primaryClass={cs.SD}
}
```
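As a rough illustration of the text-removal step described in the summary, the sketch below strips text-bearing information fields and comments from a small, made-up ABC tune. It is a simplified variant of the README's filter, not the exact routine: the name `strip_text_fields` and the sample tune are hypothetical, and the prefix set assumes `G:` where the README's list has `G`.

```python
# Info-field prefixes treated as carrying natural-language text.
# Assumption: taken from the README's filter list, with 'G:' in place of 'G'.
TEXT_FIELDS = {"A:", "B:", "C:", "D:", "F:", "G:", "H:", "I:", "N:", "O:",
               "R:", "r:", "S:", "T:", "W:", "w:", "X:", "Z:"}


def strip_text_fields(abc: str) -> str:
    """Drop text-bearing info fields, blank lines, and comments from ABC text."""
    kept = []
    for line in abc.split("\n"):
        # Skip text fields and empty lines
        if line[:2] in TEXT_FIELDS or not line.strip():
            continue
        # Skip whole-line comments, but keep %%score directives
        if line.startswith("%") and not line.startswith("%%score"):
            continue
        # Cut a trailing inline comment from a music line
        if "%" in line and not line.startswith("%%score"):
            line = line.rsplit("%", 1)[0].rstrip()
        kept.append(line)
    return "\n".join(kept)


# Made-up tune used only for demonstration
sample = """X:1
T:Example Tune
C:Some Composer
L:1/8
M:4/4
K:C
% a comment line
C2 D2 E2 F2 | G4 z4 |] % final bar
w: la la la la la
"""

print(strip_text_fields(sample))
```

Only the unit length, meter, key, and music lines survive, which matches the stated goal of leaving no title, composer, or lyric text in the score.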

Provider:

sander-wood

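Original information summary

Dataset overview

Dataset name: WikiMusicText (WikiMT)

Purpose: Evaluation of semantic search and music classification.

Dataset contents:

Music files: 1010 lead sheets in ABC notation, sourced from Wikifonia.org.
Metadata: Each lead sheet comes with a title, artist, genre, and description.
- Title and artist: Extracted from the score.
- Genre: Obtained by matching keywords from Wikipedia entries and assigned to one of 8 classes (Jazz, Country, Folk, R&B, Pop, Rock, Dance, Latin).
- Description: Produced by using the BART-large model to summarize and clean the corresponding Wikipedia entry.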
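To make the record layout above concrete, here is a minimal sketch that parses JSONL records and counts genres. The three records are made up so the example runs without the real file. The `title` and `artist` keys appear in the README script, while the `genre` and `description` key names are assumptions based on the field list above.

```python
import io
import json
from collections import Counter

# Made-up records standing in for lines of wikimusictext.jsonl.
# Assumption: "genre" and "description" are the actual key names.
sample_jsonl = io.StringIO(
    '{"title": "Tune A", "artist": "Artist A", "genre": "Jazz", "description": "..."}\n'
    '{"title": "Tune B", "artist": "Artist B", "genre": "Folk", "description": "..."}\n'
    '{"title": "Tune C", "artist": "Artist C", "genre": "Jazz", "description": "..."}\n'
)


def load_records(handle):
    """Parse one JSON object per non-empty line."""
    return [json.loads(line) for line in handle if line.strip()]


records = load_records(sample_jsonl)

# Count how many records fall into each genre class
genre_counts = Counter(record["genre"] for record in records)
print(genre_counts.most_common())
```

With the real file, replacing `sample_jsonl` with `open("wikimusictext.jsonl", encoding="utf-8")` should give the distribution over the 8 genre classes, provided the key names match.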

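Dataset characteristics:

A unique resource that supports the evaluation of cross-modal symbolic music information retrieval.
The dataset was curated from publicly available sources, so the genre and description information may be limited in accuracy and completeness.

Dataset size: 1K<n<10K

Language: English (en)

License: MIT

Task categories:

Text classification
Text-to-text generation

Tags:

Music

How to access music score metadata

Locate the Wikifonia MusicXML data link: Find the download link by visiting the forum discussion thread.
Run the provided Python code: It automatically downloads the "wikimusictext.jsonl" dataset and the "xml2abc.py" conversion script, then processes the music score metadata.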
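Before running the conversion, it may help to check that the extracted archive contains the files the script expects. The sketch below is one possible pre-flight check, assuming the `Wikifonia/<artist> - <title>.mxl` naming used by the README script. The helper names `expected_path` and `find_missing` and the sample record are hypothetical.

```python
import os


def expected_path(song, root="Wikifonia"):
    """Build the path the README script would look for: '<artist> - <title>.mxl'."""
    return os.path.join(root, song["artist"] + " - " + song["title"] + ".mxl")


def find_missing(songs, root="Wikifonia"):
    """Return the records whose MusicXML file is not present under root."""
    return [song for song in songs if not os.path.isfile(expected_path(song, root))]


# Made-up record; the directory is deliberately one that should not exist
songs = [{"artist": "Artist A", "title": "Tune A"}]
missing = find_missing(songs, root="nonexistent_dir_for_demo")
print(f"{len(missing)} of {len(songs)} files missing")
```

A non-empty result usually points to a wrong download URL or an archive extracted to a different directory.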

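Copyright disclaimer

The WikiMT dataset was curated from publicly available sources, and all rights to the original content and data remain with their respective copyright holders. The dataset is provided for research and educational purposes only, and any use, distribution, or modification should comply with the terms and conditions set by the original data providers.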
