
EhsanShahbazi/goodreads-quotes

Source: Hugging Face. Updated 2025-12-02; indexed 2025-12-20.
Download link:
https://hf-mirror.com/datasets/EhsanShahbazi/goodreads-quotes
Resource description:
---
pretty_name: Goodreads Quotes
language:
- en
license: other
license_name: mit
tags:
- text
- quotes
- authors
- scraping
task_categories:
- text-generation
- text-retrieval
---

# 💡 Goodreads Quotes Dataset

**Goodreads Quotes** is a dataset of quotes, authors, tags, and like-counts scraped from publicly available pages on Goodreads.com. It captures literary quotes along with metadata that may be useful for NLP, recommendation systems, analysis of quote popularity, author-centric studies, and more.

## 📚 Dataset Description

- **What is the dataset?** A collection of quotes from Goodreads authors, with associated metadata: author name, tags (topics), and number of likes.
- **What does it contain?** Each record has:
  - `quote` (text): the content of the quote
  - `author` (text): the quote's author name
  - `tags` (text): comma-separated tags/topics
  - `likes` (integer): number of likes the quote had on Goodreads
- **Supported language(s):** English (since Goodreads quotes are predominantly in English)
- **Motivation / Recommended Use Cases:**
  - Training or fine-tuning language models with quote-style data
  - Generating or analyzing literary / motivational quotes
  - Recommendation or retrieval systems (e.g. quote search, author similarity)
  - Sentiment or thematic analysis via tags
  - Sociolinguistic or cultural analysis of quote popularity

## 📂 Data Structure

The dataset is stored as a single SQLite database file: `goodreads-quotes.db`. The main table is:

```
quotes
├ id: INTEGER (primary key)
├ quote: TEXT (quote content)
├ author: TEXT (author name)
├ tags: TEXT (comma-separated tags/topics)
└ likes: INTEGER (number of likes)
```

Uniqueness is enforced on `(quote, author)`, preventing duplicate quotes. There are no explicit train/validation/test splits.

## 🧠 How It Was Created / Scraping Methodology

- Data scraped from publicly visible author quote pages on Goodreads.
- Custom Python scraper using `requests` and `BeautifulSoup4`, with randomized user-agents and polite delays.
- Concurrency using `concurrent.futures.ThreadPoolExecutor`, scraping up to **100 pages per author** (stopping when no quotes are found).
- Extracted quote text, author name, tags (topics), and like counts.
- Cleaned text: removed smart quotes, newlines; stripped whitespace; normalized author names (removed commas).
- Stored data in a SQLite database. Duplicates are ignored via database constraints.

See the repository's scraper code for full details.

## ⚠️ Considerations & Limitations

- **Copyright & Licensing:** All quotes originate from Goodreads and may be subject to copyright. Use the dataset for research, educational, or non-commercial purposes, and check Goodreads' Terms of Service before any commercial use.
- **Biases:**
  - Quotes are limited to the authors listed in the scraper's `unique_author_links.txt`, which are likely English-language authors with a populated Goodreads page.
  - Like counts may reflect Goodreads user-base biases (popularity, recency, social influence), not objective "quality."
- **Completeness:** Some authors may not have been scraped fully (depending on the page limit, empty-page detection, or scraper errors).
- **Data Format:** The dataset is in SQLite; while flexible, users may want to convert it (e.g. to CSV, JSON, Parquet) for easier ingestion in ML pipelines.

## ✅ Example Usage

Load the dataset with Python (using Pandas):

```python
import sqlite3

import pandas as pd

# Open the SQLite file and load the whole quotes table into a DataFrame.
conn = sqlite3.connect("goodreads-quotes.db")
df = pd.read_sql_query("SELECT * FROM quotes", conn)
conn.close()

print(df.head())
```

Filter quotes by tag:

```sql
SELECT * FROM quotes WHERE tags LIKE '%life%';
```

Find most-liked quotes:

```sql
SELECT quote, author, likes FROM quotes ORDER BY likes DESC LIMIT 10;
```

## 📄 License & Terms of Use

Use this dataset for **research and educational purposes**. Please respect the original content's copyright. For any commercial use or redistribution, check with Goodreads and appropriate copyright holders. Licensed under **MIT**.

## 🙏 Credits

Scraper and dataset assembly by: **Ehsan Shahbazi**

Based on publicly available Goodreads data scraped using Python, BeautifulSoup4, and SQLite.
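The Data Structure section says uniqueness is enforced on `(quote, author)` and that duplicates are ignored via database constraints, but the card does not show the actual table definition. The sketch below is one way that behaviour could be reproduced. The `CREATE TABLE` statement, the use of `INSERT OR IGNORE`, the helper name `insert_quotes`, and the sample rows are all assumptions made for illustration, not the author's scraper code.

```python
import sqlite3

# Hypothetical DDL: the card lists these columns and the (quote, author)
# uniqueness rule, but the real CREATE TABLE statement is not shown.
SCHEMA = """
CREATE TABLE IF NOT EXISTS quotes (
    id     INTEGER PRIMARY KEY,
    quote  TEXT,
    author TEXT,
    tags   TEXT,
    likes  INTEGER,
    UNIQUE (quote, author)
)
"""


def count_quotes(conn):
    """Return the number of rows currently in the quotes table."""
    return conn.execute("SELECT COUNT(*) FROM quotes").fetchone()[0]


def insert_quotes(conn, rows):
    """Insert (quote, author, tags, likes) rows; return how many were new."""
    before = count_quotes(conn)
    # OR IGNORE silently skips rows that would violate the UNIQUE constraint.
    conn.executemany(
        "INSERT OR IGNORE INTO quotes (quote, author, tags, likes) "
        "VALUES (?, ?, ?, ?)",
        rows,
    )
    conn.commit()
    return count_quotes(conn) - before


# Made-up sample rows, used only to exercise the constraint.
conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
sample_rows = [
    ("Example quote one.", "Author A", "life, hope", 10),
    ("Example quote one.", "Author A", "life", 12),  # same pair: skipped
    ("Example quote one.", "Author B", "life", 3),   # other author: kept
]
added = insert_quotes(conn, sample_rows)
print(added)
```

With this approach the first occurrence of a `(quote, author)` pair wins, so a later scrape of the same quote would not refresh its like count. Whether the original scraper behaves this way is not stated in the card.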
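One caveat about the tag filter in Example Usage: `LIKE '%life%'` is a substring match, so it also returns rows tagged `lifestyle` or `afterlife`. Because `tags` is a comma-separated string, an exact match needs the string to be split first. The following is a minimal sketch of that idea. The helper names `split_tags` and `quotes_with_tag` are hypothetical, and it assumes tags are separated by commas with optional surrounding whitespace and that case can be ignored.

```python
import sqlite3


def split_tags(tags):
    """Split a comma-separated tag string into clean, lowercase tags."""
    if not tags:
        return []
    return [t.strip().lower() for t in tags.split(",") if t.strip()]


def quotes_with_tag(conn, tag):
    """Return (quote, author) pairs whose tag list contains `tag` exactly."""
    wanted = tag.strip().lower()
    # LIKE narrows the candidates inside SQLite; the Python check below
    # then drops partial matches such as 'lifestyle' for 'life'.
    cur = conn.execute(
        "SELECT quote, author, tags FROM quotes WHERE tags LIKE ? ORDER BY id",
        (f"%{wanted}%",),
    )
    return [(q, a) for q, a, tags in cur if wanted in split_tags(tags)]
```

For repeated tag queries it may be simpler to normalize once, for example by loading the table into a DataFrame and storing the split tags as a list column, rather than filtering the raw string every time.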
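The Data Format limitation notes that users may want to convert the SQLite file to another format. As a rough sketch using only the standard library, the function below streams the `quotes` table into a CSV file. The function name `export_quotes_csv` and the output path are placeholders, and the column list is taken from the table description in the card.

```python
import csv
import sqlite3

# Column names as listed in the card's table description.
COLUMNS = ["id", "quote", "author", "tags", "likes"]


def export_quotes_csv(conn, out_path):
    """Write every row of the quotes table to a CSV file; return the row count."""
    cur = conn.execute(f"SELECT {', '.join(COLUMNS)} FROM quotes ORDER BY id")
    count = 0
    # newline="" lets the csv module control line endings, as its docs advise.
    with open(out_path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(COLUMNS)  # header row
        for row in cur:
            writer.writerow(row)  # quotes containing commas are escaped
            count += 1
    return count
```

It could be called as `export_quotes_csv(sqlite3.connect("goodreads-quotes.db"), "goodreads-quotes.csv")`. If pandas is already in use, `DataFrame.to_csv` or `DataFrame.to_parquet` (the latter needs a Parquet engine such as pyarrow) would do the same job in one line.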
Provider:
EhsanShahbazi