
Dr3dre/Genius-song-lyrics-cleaned

Hugging Face | Updated 2026-01-08 | Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/Dr3dre/Genius-song-lyrics-cleaned
Resource description:
---
dataset_info:
  features:
  - name: title
    dtype: string
  - name: tag
    dtype: string
  - name: artist
    dtype: string
  - name: year
    dtype: int64
  - name: views
    dtype: int64
  - name: features
    dtype: string
  - name: lyrics
    dtype: string
  - name: id
    dtype: int64
  - name: language_cld3
    dtype: string
  - name: language_ft
    dtype: string
  - name: language
    dtype: string
  - name: lyrics_clean
    dtype: string
  - name: char_len
    dtype: int64
  - name: word_len
    dtype: int64
  splits:
  - name: train
    num_bytes: 15892159664
    num_examples: 5134856
  download_size: 9356490174
  dataset_size: 15892159664
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
license: cc-by-4.0
task_categories:
- text-classification
- sentence-similarity
language:
- en
- it
- ia
pretty_name: Genius Song Lyrics cleaned
size_categories:
- 1M<n<10M
---

# 🎵 Genius Song Lyrics cleaned Dataset

## Dataset Description

This dataset is derived from [Genius Song Lyrics](https://www.kaggle.com/datasets/carlosgdcj/genius-song-lyrics-with-language-information). It contains **cleaned and normalized song lyrics** for more than **5 million songs** and is designed for **large-scale topic modeling, clustering, and semantic analysis**.

The dataset was preprocessed specifically for compatibility with **embedding-based models** (e.g. Sentence Transformers, BERTopic) while preserving lyrical meaning and thematic content. Repetitive structures typical of song lyrics (e.g. choruses) were handled so that repeated lines do not dominate a song's representation and introduce semantic bias.

---

## 🧾 Dataset Structure

Each example corresponds to **one song**.
### Fields

| Field | Type | Description |
|------|----|------------|
| `lyrics_clean` | `string` | Cleaned full lyrics |
| `char_len` | `int64` | Number of characters in `lyrics_clean` |
| `word_len` | `int64` | Number of whitespace-separated words in `lyrics_clean` |
| `artist` | `string` | Artist name *(optional)* |
| `title` | `string` | Song title *(optional)* |
| `tag` | `string` | Genre label *(if available)* |
| `language` | `string` | Language code (ISO 639-1) |

> Some metadata fields may be missing depending on source availability.

---

## 🧼 Preprocessing

The following preprocessing steps were applied:

1. **Removed structural tags**
   - `[Chorus]`, `[Verse 1]`, `[Bridge]`, etc.
2. **Removed parenthetical repetition markers**
   - `(x2)`, `(repeat)`, etc.
3. **Deduplicated repeated lyric lines**
   - Common in choruses and hooks
4. **Lowercased text**
5. **Whitespace normalization**

No stemming, lemmatization, or stopword removal was applied, in order to preserve semantic meaning for embedding-based models.
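
The five steps above can be sketched in a few lines of Python. This is only an illustration, not the pipeline actually used to build the dataset: the regular expressions, the function names `clean_lyrics` and `length_fields`, the choice to deduplicate lines across the whole song (rather than only consecutive repeats), the choice to lowercase before deduplicating, and the choice to keep one lyric line per text line are all assumptions, since the card does not publish those details.

```python
import re

# Assumed patterns: the dataset card does not publish the exact rules.
SECTION_TAG = re.compile(r"\[[^\]]*\]")  # [Chorus], [Verse 1], [Bridge], ...
REPEAT_MARKER = re.compile(
    r"\(\s*(?:x\s*\d+|\d+\s*x|repeat[^)]*)\s*\)",  # (x2), (2x), (repeat), ...
    re.IGNORECASE,
)


def clean_lyrics(raw: str) -> str:
    """Hypothetical re-implementation of the card's preprocessing steps."""
    text = SECTION_TAG.sub(" ", raw)       # step 1: drop structural tags
    text = REPEAT_MARKER.sub(" ", text)    # step 2: drop repetition markers
    text = text.lower()                    # step 4: lowercase (done before
                                           # dedup so case variants collapse)
    seen = set()
    kept = []
    for line in text.splitlines():
        line = " ".join(line.split())      # step 5: normalize whitespace
        if not line or line in seen:       # step 3: keep first occurrence only
            continue
        seen.add(line)
        kept.append(line)
    return "\n".join(kept)


def length_fields(lyrics_clean: str) -> dict:
    """Compute char_len and word_len as described in the Fields table."""
    return {
        "char_len": len(lyrics_clean),
        "word_len": len(lyrics_clean.split()),  # whitespace-separated words
    }


raw = "[Verse 1]\nHello   World\n[Chorus]\nSing it loud (x2)\nSing it loud\n(repeat)\nGoodbye"
cleaned = clean_lyrics(raw)
print(cleaned.split("\n"))     # ['hello world', 'sing it loud', 'goodbye']
print(length_fields(cleaned))  # {'char_len': 32, 'word_len': 6}
```

Whether the published `lyrics_clean` field keeps line breaks or joins lines with spaces is not stated in the card; `word_len` is unaffected by that choice, and `char_len` is the same as long as lines are joined by a single character.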
Provider:
Dr3dre