EhsanShahbazi/goodreads-quotes
Source: Hugging Face · Updated 2025-12-02 · Indexed 2025-12-20
Download link:
https://hf-mirror.com/datasets/EhsanShahbazi/goodreads-quotes
Resource description:
---
pretty_name: Goodreads Quotes
language:
- en
license: other
license_name: mit
tags:
- text
- quotes
- authors
- scraping
task_categories:
- text-generation
- text-retrieval
---
# 💡 Goodreads Quotes Dataset
**Goodreads Quotes** is a dataset of quotes, authors, tags, and like counts scraped from publicly available pages on [Goodreads.com](https://www.goodreads.com). It pairs literary quotes with metadata that may be useful for NLP, recommendation systems, quote-popularity analysis, author-centric studies, and more.
## 📚 Dataset Description
- **What is the dataset?**
  A collection of quotes from Goodreads authors, with associated metadata: author name, tags (topics), and number of likes.
- **What does it contain?**
  Each record has:
  - `quote` (text): the content of the quote
  - `author` (text): the quote's author name
  - `tags` (text): comma-separated tags/topics
  - `likes` (integer): number of likes the quote had on Goodreads
- **Supported language(s):**
  English (Goodreads quotes are predominantly in English)
- **Motivation / Recommended Use Cases:**
  - Training or fine-tuning language models with quote-style data
  - Generating or analyzing literary / motivational quotes
  - Recommendation or retrieval systems (e.g. quote search, author similarity)
  - Sentiment or thematic analysis via tags
  - Sociolinguistic or cultural analysis of quote popularity
## 📂 Data Structure
The dataset is stored as a single SQLite database file: `goodreads-quotes.db`. The main table is:
```
quotes
├ id — INTEGER (primary key)
├ quote — TEXT (quote content)
├ author — TEXT (author name)
├ tags — TEXT (comma-separated tags/topics)
└ likes — INTEGER (number of likes)
```
Uniqueness is enforced on `(quote, author)`, preventing duplicate quotes.
There are no explicit train/validation/test splits.
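As a rough illustration of how the `(quote, author)` uniqueness rule behaves, here is a minimal sketch. The `CREATE TABLE` statement is reconstructed from the column list above and is an assumption, not the scraper's actual DDL. The `insert_quote` helper and the sample rows are hypothetical.

```python
import sqlite3

# Assumed DDL, reconstructed from the documented columns (not copied from the scraper).
SCHEMA = """
CREATE TABLE IF NOT EXISTS quotes (
    id     INTEGER PRIMARY KEY,
    quote  TEXT,
    author TEXT,
    tags   TEXT,
    likes  INTEGER,
    UNIQUE (quote, author)
);
"""


def insert_quote(conn, quote, author, tags, likes):
    """Insert one row; return True if it was stored, False if it was a duplicate."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO quotes (quote, author, tags, likes) VALUES (?, ?, ?, ?)",
        (quote, author, tags, likes),
    )
    return cur.rowcount == 1


conn = sqlite3.connect(":memory:")  # throwaway database for the demonstration
conn.executescript(SCHEMA)

# Hypothetical sample rows: the second one repeats the same (quote, author) pair.
first = insert_quote(conn, "Example quote text.", "Example Author", "life, hope", 10)
second = insert_quote(conn, "Example quote text.", "Example Author", "life", 12)
row_count = conn.execute("SELECT COUNT(*) FROM quotes").fetchone()[0]
print(first, second, row_count)
```

With a constraint like this, the second insert is silently skipped, so re-running a scrape over the same author pages should not create duplicate rows.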
## 🧠 How It Was Created / Scraping Methodology
- Data scraped from publicly visible author quote pages on Goodreads.
- Custom Python scraper using `requests` and `BeautifulSoup4`, with randomized user-agents and polite delays.
- Concurrency via `concurrent.futures.ThreadPoolExecutor`, scraping up to **100 pages per author** and stopping early once a page returns no quotes.
- Extracted quote text, author name, tags (topics), and like counts.
- Cleaned text: removed smart quotes and newlines, stripped whitespace, and normalized author names by removing commas.
- Stored data in a SQLite database. Duplicates are ignored via database constraints.
See the repository’s scraper code for full details.
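The cleaning step listed above could look roughly like the sketch below. It is not the scraper's code: the function names are hypothetical, and it assumes that "smart quotes" means the curly double quotes that wrap a quote on the page and that newlines are replaced by single spaces.

```python
import re

# Assumption: only the curly double quotes wrapping a quote are removed,
# so curly apostrophes inside words are left untouched.
SMART_DOUBLE_QUOTES = "\u201c\u201d"


def clean_quote(text):
    """Drop curly double quotes, fold newlines and repeated whitespace into single spaces, then strip."""
    text = text.translate({ord(ch): None for ch in SMART_DOUBLE_QUOTES})
    text = re.sub(r"\s+", " ", text)
    return text.strip()


def clean_author(name):
    """Remove commas from an author name and normalize surrounding whitespace."""
    return re.sub(r"\s+", " ", name.replace(",", "")).strip()


# Hypothetical raw values, shaped like text pulled out of an HTML page.
print(clean_quote("\u201cAn example quote,\n split over two lines.\u201d  "))
print(clean_author("  Jane Doe,\n"))
```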
## ⚠️ Considerations & Limitations
- **Copyright & Licensing:** All quotes originate from Goodreads and may be subject to copyright. Use the data for research, educational, or non-commercial purposes, and check Goodreads' Terms of Service before any commercial use.
- **Biases:**
  - Quotes are limited to the authors listed in the scraper's `unique_author_links.txt`, which likely skews toward English-language authors with an active Goodreads presence.
  - Like counts may reflect Goodreads user-base biases (popularity, recency, social influence), not objective "quality."
- **Completeness:** Some authors may not have been scraped fully (depending on the page limit, empty-page detection, or scraper errors).
- **Data Format:** The dataset is in SQLite. While flexible, users may want to convert it (e.g. to CSV, JSON, Parquet) for easier ingestion in ML pipelines.
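For the format conversion mentioned above, one possible approach uses only the Python standard library to write CSV and JSON Lines files. The `export_quotes` function and the output file names are hypothetical. The column names are taken from the table description in this card.

```python
import csv
import json
import sqlite3

COLUMNS = ["id", "quote", "author", "tags", "likes"]


def export_quotes(db_path, csv_path, jsonl_path):
    """Dump the quotes table to a CSV file and a JSON Lines file; return the row count."""
    conn = sqlite3.connect(db_path)
    conn.row_factory = sqlite3.Row  # allows access to columns by name
    try:
        rows = conn.execute(
            "SELECT id, quote, author, tags, likes FROM quotes ORDER BY id"
        ).fetchall()
    finally:
        conn.close()

    # CSV: one header row, then one line per quote.
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(COLUMNS)
        writer.writerows([row[c] for c in COLUMNS] for row in rows)

    # JSON Lines: one JSON object per quote, keeping non-ASCII characters readable.
    with open(jsonl_path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(dict(row), ensure_ascii=False) + "\n")

    return len(rows)
```

A call such as `export_quotes("goodreads-quotes.db", "quotes.csv", "quotes.jsonl")` would write both files next to the database. For Parquet, loading the table into pandas and calling `DataFrame.to_parquet` is a common route, though it needs an extra engine such as `pyarrow`.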
## ✅ Example Usage
Load the dataset with Python (using Pandas):
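```python
import sqlite3
import pandas as pd

# Open the SQLite file and read the whole `quotes` table into a DataFrame.
conn = sqlite3.connect("goodreads-quotes.db")
df = pd.read_sql_query("SELECT * FROM quotes", conn)
conn.close()  # release the file handle once the data is in memory

print(df.head())
```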
Filter quotes by tag:
```sql
SELECT * FROM quotes WHERE tags LIKE '%life%';
```
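Note that `LIKE '%life%'` is a substring match, so it would also return rows tagged `afterlife` or `lifestyle`. If exact tag matches are needed, one option is to split the `tags` string in Python and expose that check to SQL. The sketch below runs on a throwaway in-memory table with hypothetical rows. The `has_tag` helper is not part of the dataset and assumes tags are separated by commas, optionally followed by spaces.

```python
import sqlite3


def has_tag(tags, wanted):
    """Return True if `wanted` is one of the comma-separated tags (case-insensitive)."""
    if not tags:
        return False
    return wanted.lower() in {t.strip().lower() for t in tags.split(",")}


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE quotes (id INTEGER PRIMARY KEY, quote TEXT, author TEXT, tags TEXT, likes INTEGER)"
)
# Hypothetical rows: only the first one carries the exact tag "life".
conn.executemany(
    "INSERT INTO quotes (quote, author, tags, likes) VALUES (?, ?, ?, ?)",
    [
        ("q1", "a1", "life, hope", 5),
        ("q2", "a2", "afterlife, faith", 7),
        ("q3", "a3", "lifestyle", 1),
    ],
)

# Register the Python helper so it can be called from SQL as has_tag(tags, 'life').
conn.create_function("has_tag", 2, has_tag)

substring_hits = [r[0] for r in conn.execute(
    "SELECT quote FROM quotes WHERE tags LIKE '%life%' ORDER BY id")]
exact_hits = [r[0] for r in conn.execute(
    "SELECT quote FROM quotes WHERE has_tag(tags, 'life') ORDER BY id")]
print(substring_hits, exact_hits)
```

The same `conn.create_function` call works on a connection to `goodreads-quotes.db`. A Python-side filter is slower than a plain `LIKE`, which may matter on large tables.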
Find most-liked quotes:
```sql
SELECT quote, author, likes
FROM quotes
ORDER BY likes DESC
LIMIT 10;
```
## 📄 License & Terms of Use
Use this dataset for **research and educational purposes**.
Please respect the original content’s copyright. For any commercial use or redistribution, check with Goodreads and appropriate copyright holders.
Licensed under **MIT**.
## 🙏 Credits
Scraper and dataset assembly by: **Ehsan Shahbazi**
Based on publicly available Goodreads data scraped using Python, BeautifulSoup4, and SQLite.
Provided by:
EhsanShahbazi



