seungheondoh/music-wiki
Source: Hugging Face. Updated 2023-08-19; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/seungheondoh/music-wiki
Resource description:
---
license: mit
language:
- en
tags:
- music
- wiki
size_categories:
- 100K<n<1M
---
# Dataset Card for "music-wiki"
📚🎵 Introducing **music-wiki**
📊🎶 Our data collection process unfolds as follows:
1) Starting with a seed page from Wikipedia's music section, we navigate through a referenced page graph, employing recursive crawling up to a depth of 20 levels.
2) In parallel, we draw on the MusicBrainz dump, which contains 11 million unique music entities spanning 10 distinct categories. These entities serve as lookup keys for crawling the corresponding pages through the Wikipedia API.
Together, these efforts yield 167k pages from the first method and an additional 193k pages from the second.
Totaling about 361k pages, this compilation provides a substantial groundwork for establishing a Music-Text-Database. 🎵📚🔍
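The first collection step can be illustrated with a minimal sketch of a depth-limited crawl. This is not the authors' crawler: it uses an iterative breadth-first traversal in place of recursion, and the names `crawl`, `get_links`, `TOY_GRAPH`, and `toy_links` are hypothetical. A small in-memory graph stands in for the link structure that a real crawler would fetch from Wikipedia.

```python
from collections import deque


def crawl(seed, get_links, max_depth=20):
    """Visit pages reachable from `seed`, following links up to `max_depth` hops.

    Returns a dict mapping each visited page title to the depth at which
    it was first reached. `get_links` is a hypothetical callable that
    returns the titles linked from a given page.
    """
    depth_of = {seed: 0}
    queue = deque([seed])
    while queue:
        page = queue.popleft()
        depth = depth_of[page]
        if depth >= max_depth:
            # Pages at the depth limit are kept but not expanded further.
            continue
        for linked in get_links(page):
            if linked not in depth_of:  # skip pages already seen
                depth_of[linked] = depth + 1
                queue.append(linked)
    return depth_of


# Toy link graph standing in for real page links (illustrative only).
TOY_GRAPH = {
    "Music": ["Jazz", "Rock music"],
    "Jazz": ["Bebop", "Music"],
    "Rock music": ["Punk rock"],
    "Bebop": ["Charlie Parker"],
    "Punk rock": [],
    "Charlie Parker": [],
}


def toy_links(title):
    return TOY_GRAPH.get(title, [])


shallow = crawl("Music", toy_links, max_depth=2)
full = crawl("Music", toy_links, max_depth=20)
```

With `max_depth=2` the traversal stops before reaching "Charlie Parker", which sits three links from the seed; with the card's limit of 20 the whole toy graph is visited. Tracking visited titles is what keeps cycles such as "Jazz" linking back to "Music" from looping forever.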
- **Repository:** [music-wiki](https://github.com/seungheondoh/music-wiki)
### splits
- wikipedia_music: 167890
- musicbrainz_genre: 1459
- musicbrainz_instrument: 872
- musicbrainz_artist: 7002
- musicbrainz_release: 163068
- musicbrainz_release_group: 15942
- musicbrainz_label: 158
- musicbrainz_work: 4282
- musicbrainz_series: 12
- musicbrainz_place: 49
- musicbrainz_event: 16
- musicbrainz_area: 360
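As a quick consistency check, the split sizes listed above can be summed to recover the rounded totals quoted in the description. The sketch below only restates the numbers from this card; the variable names are illustrative.

```python
# Split sizes copied from the list above.
SPLIT_SIZES = {
    "wikipedia_music": 167890,
    "musicbrainz_genre": 1459,
    "musicbrainz_instrument": 872,
    "musicbrainz_artist": 7002,
    "musicbrainz_release": 163068,
    "musicbrainz_release_group": 15942,
    "musicbrainz_label": 158,
    "musicbrainz_work": 4282,
    "musicbrainz_series": 12,
    "musicbrainz_place": 49,
    "musicbrainz_event": 16,
    "musicbrainz_area": 360,
}

# Pages from the Wikipedia graph crawl (first method).
wikipedia_pages = SPLIT_SIZES["wikipedia_music"]

# Pages found via MusicBrainz entities (second method).
musicbrainz_pages = sum(
    size for name, size in SPLIT_SIZES.items() if name.startswith("musicbrainz_")
)

total_pages = wikipedia_pages + musicbrainz_pages
print(wikipedia_pages, musicbrainz_pages, total_pages)
```

The MusicBrainz splits sum to 193,220 pages and the overall total is 361,110, which matches the 193k and 361k figures in the description. Assuming the Hugging Face `datasets` library is installed, a single split would typically be loaded with a call along the lines of `load_dataset("seungheondoh/music-wiki", split="musicbrainz_genre")`, though that is not verified here.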
Provided by:
seungheondoh
Original information summary
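Dataset overview
Dataset name
- Name: music-wiki
Dataset description
- Description: The dataset is collected in two ways. First, starting from Wikipedia's music section, pages are crawled recursively to a depth of 20 levels. Second, the MusicBrainz data is used, which covers 11 million unique music entities across 10 distinct categories. Together, the two methods collected 361,000 pages for building a music text database.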
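Dataset contents
- Contents:
  - Music pages from Wikipedia: 167,890 pages
  - Pages from MusicBrainz, by category:
    - Genre: 1,459 pages
    - Instrument: 872 pages
    - Artist: 7,002 pages
    - Release: 163,068 pages
    - Release group: 15,942 pages
    - Label: 158 pages
    - Work: 4,282 pages
    - Series: 12 pages
    - Place: 49 pages
    - Event: 16 pages
    - Area: 360 pages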
Dataset size
- Size: 100,000 < n < 1,000,000
Dataset language
- Language: English
Dataset license
- License: MIT
Dataset tags
- Tags: music, wiki



