seungheondoh/musical-word-embedding

Name: seungheondoh/musical-word-embedding
Creator: seungheondoh
Published: 2024-04-23 11:09:39
License: no description available

Source: Hugging Face. Updated 2024-04-23; indexed 2024-06-12.

Download link:

https://hf-mirror.com/datasets/seungheondoh/musical-word-embedding


Resource description:

---
dataset_info:
  features:
  - name: token
    dtype: string
  - name: content
    dtype: string
  - name: vector
    sequence: float32
  splits:
  - name: tag
    num_bytes: 2740380
    num_examples: 2227
  - name: artist
    num_bytes: 46025354
    num_examples: 37002
  - name: track
    num_bytes: 898952880
    num_examples: 697812
  download_size: 1387722409
  dataset_size: 947718614
configs:
- config_name: default
  data_files:
  - split: tag
    path: data/tag-*
  - split: artist
    path: data/artist-*
  - split: track
    path: data/track-*
tags:
- music
---

# Musical Word Embedding

> [**Musical Word Embedding for Music Tagging and Retrieval**](https://arxiv.org/abs/2404.13569)
> SeungHeon Doh, Jongpil Lee, Dasaem Jeong, Juhan Nam
> To appear IEEE Transactions on Audio, Speech and Language Processing (submitted)

<p align = "center">
<img src = "https://i.imgur.com/Yw4UPnM.png">
</p>

Word embedding has become an essential means for text-based information retrieval. Typically, word embeddings are learned from large quantities of general and unstructured text data. However, in the domain of music, the word embedding may have difficulty understanding musical contexts or recognizing music-related entities like artists and tracks. To address this issue, we propose a new approach called Musical Word Embedding (MWE), which involves learning from various types of texts, including both everyday and music-related vocabulary.

### Resources: Using Musical Word Embedding

- [Pre-trained Embedding Vector](https://huggingface.co/datasets/seungheondoh/musical-word-embedding)
- [Paper](https://arxiv.org/abs/2404.13569)
- [Blog](https://seungheondoh.github.io/musical_word_embedding_demo/)
- [**notebook**-query_recommendation](https://github.com/seungheondoh/musical-word-embedding/blob/main/notebook/query_recommendation.ipynb)
- [**notebook**-music_retrieval](https://github.com/seungheondoh/musical-word-embedding/blob/main/notebook/music_retrieval.ipynb)

### Run the download script for embedding vectors

The main embedding vectors (tag, artist, and track) are available from our Hugging Face dataset:

```python
from datasets import load_dataset

# Downloads all three splits: tag, artist, and track.
dataset = load_dataset("seungheondoh/musical-word-embedding")
```

Each row has the following form:

```
{
    "token": "happy",
    "content": "happy",
    "vector": [0.011484057642519474, -0.07818693667650223, -0.02778349258005619, 0.052311971783638, -0.1324823945760727, 0.03757447376847267, 0.007125925272703171, ...]
},{
    "token": "ARYZTJS1187B98C555",
    "content": "Faster Pussycat",
    "vector": [-0.13004058599472046, -1.3509420156478882, -0.3012666404247284, -0.34076201915740967, -0.8142353296279907, 0.3902665972709656, -0.1903497576713562, 0.6163021922111511, ...]
}
```

The other 10M general word vectors can be downloaded with the script below.

```
bash scripts/download.sh
```

### Citation

If you find this work useful, please cite it as:

```
@article{doh2024musical,
  title={Musical Word Embedding for Music Tagging and Retrieval},
  author={Doh, SeungHeon and Lee, Jongpil and Jeong, Dasaem and Nam, Juhan},
  journal={update_soon},
  year={2024}
}

@inproceedings{doh2021million,
  title={Million song search: Web interface for semantic music search using musical word embedding},
  author={Doh, S and Lee, Jongpil and Nam, Juhan},
  booktitle={International Society for Music Information Retrieval Conference, ISMIR},
  year={2021}
}

@article{doh2020musical,
  title={Musical word embedding: Bridging the gap between listening contexts and music},
  author={Doh, Seungheon and Lee, Jongpil and Park, Tae Hong and Nam, Juhan},
  journal={arXiv preprint arXiv:2008.01190},
  year={2020}
}
```

Feel free to reach out for any questions or feedback!

Provided by:

seungheondoh

Summary of original information

Dataset overview

Dataset features

token: string.
content: string.
vector: sequence of float32 values.
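
The three features above can be written down as a small row type. This is a minimal sketch, assuming rows arrive as plain Python dicts; `EmbeddingRow` and `is_valid_row` are hypothetical names introduced here for illustration and are not part of the dataset's own tooling.

```python
from typing import Any, TypedDict


class EmbeddingRow(TypedDict):
    """Hypothetical type mirroring the three declared features."""
    token: str            # word for tags; an identifier for artists in the card's sample
    content: str          # human-readable text, e.g. a tag or an artist name
    vector: list[float]   # the embedding (float32 in the stored dataset)


def is_valid_row(row: Any) -> bool:
    """Return True if `row` carries the three features with the declared types."""
    if not isinstance(row, dict):
        return False
    if not isinstance(row.get("token"), str) or not isinstance(row.get("content"), str):
        return False
    vector = row.get("vector")
    if not isinstance(vector, list) or not vector:
        return False
    # bool is a subclass of int in Python, so exclude it explicitly.
    return all(isinstance(x, (int, float)) and not isinstance(x, bool) for x in vector)
```

A check like this may be useful before indexing rows, since a malformed vector would otherwise only surface later as an arithmetic error.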

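Dataset splits

tag: 2,227 examples, 2,740,380 bytes in total.
artist: 37,002 examples, 46,025,354 bytes in total.
track: 697,812 examples, 898,952,880 bytes in total.

Dataset size

Download size: 1,387,722,409 bytes.
Total dataset size: 947,718,614 bytes.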
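
The per-split figures can be cross-checked against the reported total with a few lines of arithmetic. The sketch below only uses the numbers listed above; the variable names are illustrative.

```python
# Figures copied from the dataset card's `dataset_info` block.
SPLITS = {
    "tag": {"num_bytes": 2_740_380, "num_examples": 2_227},
    "artist": {"num_bytes": 46_025_354, "num_examples": 37_002},
    "track": {"num_bytes": 898_952_880, "num_examples": 697_812},
}
DATASET_SIZE = 947_718_614

total_bytes = sum(s["num_bytes"] for s in SPLITS.values())
total_examples = sum(s["num_examples"] for s in SPLITS.values())

# Average stored size of one row in each split, in bytes.
bytes_per_row = {name: s["num_bytes"] / s["num_examples"] for name, s in SPLITS.items()}

print("split sizes sum to dataset_size:", total_bytes == DATASET_SIZE)
print("total examples:", total_examples)
for name, avg in bytes_per_row.items():
    print(f"{name}: {avg:.1f} bytes/row")
```

The three split sizes add up exactly to the reported dataset size, and the average row size is similar across splits, which is consistent with every split storing vectors of the same length plus short strings.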

Configuration

default: file path configuration for the tag, artist, and track splits.

Data examples

{ "token": "happy", "content": "happy", "vector": [0.011484057642519474, -0.07818693667650223, -0.02778349258005619, 0.052311971783638, -0.1324823945760727, 0.03757447376847267, 0.007125925272703171, ...] }
{ "token": "ARYZTJS1187B98C555", "content": "Faster Pussycat", "vector": [-0.13004058599472046, -1.3509420156478882, -0.3012666404247284, -0.34076201915740967, -0.8142353296279907, 0.3902665972709656, -0.1903497576713562, 0.6163021922111511, ...] }
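
Rows of this shape lend themselves to nearest-neighbour lookup. The following is a minimal sketch, assuming cosine similarity as the ranking score; that choice is an assumption made here, and the paper's own retrieval setup may differ. The helper names and the toy 3-dimensional rows are invented for illustration.

```python
import math
from typing import Iterable


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


def nearest(query: list[float], rows: Iterable[dict], k: int = 5) -> list[tuple[str, float]]:
    """Return the k rows most similar to `query` as (content, score) pairs."""
    scored = [(row["content"], cosine_similarity(query, row["vector"])) for row in rows]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy rows, invented for illustration; the real vectors have more dimensions.
toy_rows = [
    {"token": "happy", "content": "happy", "vector": [1.0, 0.0, 0.0]},
    {"token": "joyful", "content": "joyful", "vector": [0.9, 0.1, 0.0]},
    {"token": "sad", "content": "sad", "vector": [-1.0, 0.0, 0.0]},
]
top = nearest(toy_rows[0]["vector"], toy_rows, k=2)
print([name for name, _ in top])
```

With the real data, `rows` could be a split obtained from `load_dataset("seungheondoh/musical-word-embedding", split="tag")`. For the 697,812-row track split, a vectorised or approximate index would likely be preferable to this pure-Python loop.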
