MorVentura/TRBLLmaker

Name: MorVentura/TRBLLmaker
Creator: MorVentura
Published: 2022-03-26 18:44:51
License: 暂无描述

Hugging Face2022-03-26 更新2024-03-04 收录

下载链接：

https://hf-mirror.com/datasets/MorVentura/TRBLLmaker

下载链接

链接失效反馈

官方服务：

资源简介：

--- TODO: Add YAML tags here. --- name: **TRBLLmaker** annotations_creators: found language_creators: found languages: en-US licenses: Genius-Ventura-Toker multilinguality: monolingual source_datasets: original task_categories: sequence-modeling task_ids: sequence-modeling-seq2seq_generate # Dataset Card for TRBLLmaker Dataset ## Table of Contents - [Table of Contents](#table-of-contents) - [Dataset Description](#dataset-description) - [Dataset Summary](#dataset-summary) - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards) - [Languages](#languages) - [Dataset Structure](#dataset-structure) - [Data Fields](#data-fields) - [Data Splits](#data-splits) - [Split info](#Split-info) - [Dataset Creation](#dataset-creation) - [Source Data](#source-data) - [Annotations](#annotations) - [Personal and Sensitive Information](#personal-and-sensitive-information) - [Considerations for Using the Data](#considerations-for-using-the-data) - [Social Impact of Dataset](#social-impact-of-dataset) - [Other Known Limitations](#other-known-limitations) - [Additional Information](#additional-information) - [Dataset Curators](#dataset-curators) - [Contributions](#contributions) ## Dataset Description - **Repository:** https://github.com/venturamor/TRBLLmaker-NLP - **Paper:** in git ### Dataset Summary TRBLLmaker - To Read Between Lyrics Lines. Dataset used in order to train a model to get as an input - several lines of song's lyrics and generate optional interpretation / meaning of them or use the songs' metdata for various tasks such as classification. This dataset is based on 'Genius' website's data, which contains global collection of songs lyrics and provides annotations and interpretations to songs lyrics and additional music knowledge. We used 'Genius' API, created private client and extracted the relevant raw data from Genius servers. We extracted the songs by the most popular songs in each genre - pop, rap, rock, country and r&b. Afterwards, we created a varied pool of 150 artists that associated with different music styles and periods, and extracted maximum of 100 samples from each. We combined all the data, without repetitions, into one final database. After preforming a cleaning of non-English lyrics, we got our final corpus that contains 8,808 different songs with over all of 60,630 samples, while each sample is a specific sentence from the song's lyrics and its top rated annotation. ### Supported Tasks and Leaderboards Seq2Seq ### Languages [En] - English ## Dataset Structure ### Data Fields We stored each sample in a 'SongInfo' structure with the following attributes: title, genre, annotations and song's meta data. The meta data contains the artist's name, song id in the server, lyrics and statistics such page views. ### Data Splits train train_songs test test_songs validation validation songs ## Split info - songs - samples train [0.64 (0.8 * 0.8)], test[0.2], validation [0.16 (0.8 * 0.2)] ## Dataset Creation ### Source Data Genius - https://genius.com/ ### Annotations #### Who are the annotators? top-ranked annotations by users in Genoius websites / Official Genius annotations ## Considerations for Using the Data ### Social Impact of Dataset We are excited about the future of applying attention-based models on task such as meaning generation. We hope this dataset will encourage more NLP researchers to improve the way we understand and enjoy songs, since achieving artistic comprehension is another step that progress us to the goal of robust AI. ### Other Known Limitations The artists list can be found here. ## Additional Information ### Dataset Curators This Dataset created by Mor Ventura and Michael Toker. ### Licensing Information All source of data belongs to Genius. ### Contributions Thanks to [@venturamor, @tokeron](https://github.com/venturamor/TRBLLmaker-NLP) for adding this dataset.

提供机构：

MorVentura

原始信息汇总

数据集概述

名称: TRBLLmaker
语言: 英语（en-US）
许可证: Genius-Ventura-Toker
多语言性: 单语种
数据来源: 原创数据
任务类别: 序列建模
任务ID: 序列建模-seq2seq_generate

数据集描述

数据集总结

TRBLLmaker用于训练模型，输入为歌曲歌词的几行，生成其可选的解释或含义，或使用歌曲元数据进行分类等任务。该数据集基于Genius网站的数据，包含全球歌曲歌词的收集，并提供歌词的注释和解释。

支持的任务和排行榜

任务: 序列到序列（Seq2Seq）

数据集结构

数据字段

每个样本存储在SongInfo结构中，包含以下属性：标题、流派、注释和歌曲元数据。元数据包括艺术家名称、服务器中的歌曲ID、歌词和统计信息如页面浏览量。

数据分割

训练集: train, train_songs
测试集: test, test_songs
验证集: validation, validation_songs

分割信息

比例: 训练集[0.64 (0.8 * 0.8)], 测试集[0.2], 验证集[0.16 (0.8 * 0.2)]

数据集创建

源数据

来源: Genius网站（https://genius.com/）

注释

注释者: Genius网站上的用户排名靠前的注释或官方Genius注释

使用数据的考虑

数据集的社会影响

该数据集鼓励NLP研究者改进我们对歌曲的理解和欣赏方式，推动艺术理解的进步。

其他已知限制

艺术家列表: 可在此处找到。

5,000+

优质数据集

54 个

任务类型

进入经典数据集