pszemraj/goodreads-bookgenres
Source: Hugging Face. Updated 2023-10-04. Indexed 2024-06-15.
Download link:
https://hf-mirror.com/datasets/pszemraj/goodreads-bookgenres
Summary:
This dataset targets text classification: each record pairs a book description with its genre labels. It ships in three configurations (default, initial-aggregated-genres, and original-genres), each containing the book title, the description, and the genre labels. It holds between 1K and 10K examples, split into training, validation, and test sets. The labels span areas such as history, health, science fiction, and literature.
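Provider:
pszemraj
Summary of original information
Dataset overview
Basic information
- Language: English
- License: Apache 2.0
- Size: 1K < n < 10K
- Task category: text classification
Dataset configurations
Configuration name: default
- Features:
- Book: string
- Description: string
- Genres: sequence of class labels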
- 0: History & Politics
- 1: Health & Medicine
- 2: Mystery & Thriller
- 3: Arts & Design
- 4: Self-Help & Wellness
- 5: Sports & Recreation
- 6: Non-Fiction
- 7: Science Fiction & Fantasy
- 8: Countries & Geography
- 9: Other
- 10: Nature & Environment
- 11: Business & Finance
- 12: Romance
- 13: Philosophy & Religion
- 14: Literature & Fiction
- 15: Science & Technology
- 16: Children & Young Adult
- 17: Food & Cooking
- Splits:
- train: 9082425 bytes, 7914 examples
- validation: 1113236 bytes, 989 examples
- test: 1125038 bytes, 990 examples
- Download size: 6785302 bytes
- Dataset size: 11320699 bytes
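The split figures listed for the default configuration can be sanity-checked with a few lines of arithmetic. The sketch below uses only the numbers from the listing; the variable names are illustrative and not part of the dataset.

```python
# Split statistics copied from the listing for the default configuration.
splits = {
    "train": {"bytes": 9082425, "examples": 7914},
    "validation": {"bytes": 1113236, "examples": 989},
    "test": {"bytes": 1125038, "examples": 990},
}
reported_dataset_size = 11320699  # "Dataset size" in bytes, as listed

# Sum the per-split byte counts and example counts.
total_bytes = sum(s["bytes"] for s in splits.values())
total_examples = sum(s["examples"] for s in splits.values())

# Share of examples in each split, rounded to three decimals.
fractions = {
    name: round(s["examples"] / total_examples, 3) for name, s in splits.items()
}

print(total_bytes == reported_dataset_size)
print(total_examples)
print(fractions)
```

The per-split byte counts add up to the reported dataset size, and the 9893 examples are divided in roughly an 80/10/10 ratio.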
Configuration name: initial-aggregated-genres
- Features:
- Book: string
- Description: string
- Genres: sequence of class labels
- 0: History & Politics
- 1: Health & Medicine
- 2: Mystery & Thriller
- 3: Arts & Design
- 4: Self-Help & Wellness
- 5: Sports & Recreation
- 6: Non-Fiction
- 7: Science Fiction & Fantasy
- 8: Countries & Geography
- 9: Other
- 10: Nature & Environment
- 11: Business & Finance
- 12: Romance
- 13: Philosophy & Religion
- 14: Literature & Fiction
- 15: Science & Technology
- 16: Children & Young Adult
- 17: Food & Cooking
- Splits:
- train: 9082425 bytes, 7914 examples
- validation: 1113236 bytes, 989 examples
- test: 1125038 bytes, 990 examples
- Download size: 6784892 bytes
- Dataset size: 11320699 bytes
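Because Genres is a sequence of class ids, a common way to prepare it for a multi-label classifier is a fixed-length 0/1 vector. The following is a minimal sketch, assuming the 18 aggregated labels in the index order listed above; the helper names and the example record are hypothetical.

```python
# The 18 aggregated genre names, in index order, as listed above.
GENRE_NAMES = [
    "History & Politics", "Health & Medicine", "Mystery & Thriller",
    "Arts & Design", "Self-Help & Wellness", "Sports & Recreation",
    "Non-Fiction", "Science Fiction & Fantasy", "Countries & Geography",
    "Other", "Nature & Environment", "Business & Finance", "Romance",
    "Philosophy & Religion", "Literature & Fiction", "Science & Technology",
    "Children & Young Adult", "Food & Cooking",
]


def multi_hot(label_ids, num_labels=len(GENRE_NAMES)):
    """Turn a list of class ids into a fixed-length 0/1 vector."""
    vector = [0] * num_labels
    for i in label_ids:
        vector[i] = 1
    return vector


def decode(label_ids):
    """Map class ids back to their genre names."""
    return [GENRE_NAMES[i] for i in label_ids]


example_ids = [7, 12]  # hypothetical record, not taken from the dataset
print(decode(example_ids))
print(multi_hot(example_ids))
```

The multi-hot vector can then serve as the target for a per-label sigmoid output, which is one usual choice for multi-label classification.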
Configuration name: original-genres
- Features:
- Book: string
- Description: string
- Genres: sequence of class labels
- 0: Superheroes
- 1: The United States Of America
- 2: Read For School
- 3: Asia
- 4: Romanticism
- 5: Technical
- 6: Journal
- 7: American Revolution
- 8: Collections
- 9: Police
- 10: Angels
- 11: Historical Mystery
- 12: Chinese Literature
- 13: International
- 14: Sierra Leone
- 15: African American Literature
- 16: Sword and Planet
- 17: Graphic Novels Comics
- 18: Urbanism
- 19: Research
- 20: Polish Literature
- 21: Transgender
- 22: Russian Literature
- 23: Canada
- 24: Young Adult Fantasy
- 25: Counselling
- 26: Pakistan
- 27: LGBT
- 28: Liberia
- 29: Science Fiction Fantasy
- 30: Star Trek
- 31: Basketball
- 32: Parenting
- 33: Lds
- 34: Dinosaurs
- 35: Prostitution
- 36: Americana
- 37: Danish
- 38: Law
- 39: Alternate History
- 40: Short Stories
- 41: Crafts
- 42: Comedian
- 43: Womens Fiction
- 44: Alchemy
- 45: Rabbits
- 46: Teaching
- 47: Womens Studies
- 48: Christian Fantasy
- 49: Journaling
- 50: Light Novel
- 51: Nigeria
- 52: Poetry
- 53: School
- 54: Astronomy
- 55: 15th Century
- 56: Government
- 57: Poland
- 58: Media Tie In
- 59: Theatre
- 60: Communication
- 61: Steampunk
- 62: Us Presidents
- 63: Time Travel
- 64: Ghost Stories
- 65: Art Design
- 66: Horses
- 67: Urban Planning
- 68: Dutch Literature
- 69: Soccer
- 70: Emotion
- 71: Drawing
- 72: Jewish
- 73: Christian Romance
- 74: Witches
- 75: Political Science
- 76: Musicals
- 77: New Adult
- 78: Romania
- 79: Tea
- 80: Travel
- 81: Money
- 82: Irish Literature
- 83: Genetics
- 84: Epic Fantasy
- 85: Latin American Literature
- 86: Mermaids
- 87: Sports
- 88: Gay
- 89: Japanese Literature
- 90: Clean Romance
- 91: Comedy
- 92: Ghana
- 93: Productivity
- 94: Bande Dessinée
- 95: Dungeons and Dragons
- 96: Social Issues
- 97: Biblical Fiction
- 98: Design
- 99: Chick Lit
- 100: Christian Historical Fiction
- 101: Skepticism
- 102: Fostering
- 103: Romanian Literature
- 104: Geology
- 105: Hungary
- 106: M M F
- 107: Nutrition
- 108: Japan
- 109: Juvenile
- 110: International Development
- 111: Thriller
- 112: Disability
- 113: Transport
- 114: Africa
- 115: Erotic Romance
- 116: Satanism
- 117: Engineering
- 118: Travelogue
- 119: Tarot
- 120: Poverty
- 121: Anthropology
- 122: Kenya
- 123: Family
- 124: Lovecraftian
- 125: Criticism
- 126: Christian Non Fiction
- 127: Fantasy Romance
- 128: China
- 129: Portugal
- 130: Hip Hop
- 131: Amazon
- 132: Drama
- 133: Presidents
- 134: Divination
- 135: World War I
- 136: Rock N Roll
- 137: Italy
- 138: Unicorns
- 139: Gardening
- 140: Queer
- 141: Halloween
- 142: Taoism
- 143: Lesbian Romance
- 144: Shapeshifters
- 145: Spirituality
- 146: Paranormal
- 147: Foodie
- 148: Westerns
- 149: Young Adult Paranormal
- 150: Greece
- 151: 19th Century
- 152: Childrens
- 153: Space
- 154: Fiction
- 155: Tudor Period
- 156: Comics
- 157: Military History
- 158: Agriculture
- 159: Animals
- 160: Batman
- 161: Civil War
- 162: French Literature
- 163: South Africa
- 164: Historical
- 165: Outdoors
- 166: Fighters
- 167: Coming Of Age
- 168: Eugenics
- 169: Regency Romance
- 170: Counting
- 171: Fat Studies
- 172: Asexual
- 173: Internet
- 174: Literary Criticism
- 175: Sword and Sorcery
- 176: Horse Racing
- 177: Art
- 178: Naval History
- 179: Holocaust
- 180: Czech Literature
- 181: Mystery Thriller
- 182: Birds
- 183: Inspirational
- 184: Death
- 185: 21st Century
- 186: Ancient
- 187: Spy Thriller
- 188: Theology
- 189: Climate Change
- 190: Far Right
- 191: Psychiatry
- 192: Romantic
- 193: Faith
- 194: Christian Fiction
- 195: Technology
- 196: Chapter Books
- 197: Lesbian
- 198: Historical Romance
- 199: Archaeology
- 200: New York
- 201: Surreal
- 202: Israel
- 203: Adventure
- 204: Reference
- 205: Science Fiction Romance
- 206: International Relations
- 207: Folklore
- 208: Flash Fiction
- 209: Ukrainian Literature
- 210: Health Care
- 211: Neuroscience
- 212: Supernatural
- 213: Language
- 214: Management
- 215: Climate Change Fiction
- 216: Science Fiction
- 217: Young Readers
- 218: Aliens
- 219: Mystery
- 220: Medical
- 221: Alternate Universe
- 222: Menage
- 223: How To
- 224: 16th Century
- 225: Gay Fiction
- 226: Occult
- 227: Buisness
- 228: Military Romance
- 229: Fairy Tales
- 230: Book Club
- 231: Self Help
- 232: Murder Mystery
- 233: Church
- 234: Sweden
- 235: France
- 236: Serbian Literature
- 237: Gender Studies
- 238: Modern
- 239: War
- 240: Academia
- 241: Prehistory
- 242: Erotica
- 243: Picture Books
- 244: Gods
- 245: Noir
- 246: Ethiopia
- 247: Mountaineering
- 248: Indian Literature
- 249: Russian History
- 250: Textbooks
- 251: Urban
- 252: Hockey
- 253: Adult
- 254: Short Story Collection
- 255: Futurism
- 256: Computer Science
- 257: Gaming
- 258: Psychoanalysis
- 259: Punk
- 260: Werewolves
- 261: Psychological Thriller
- 262: High School
- 263: Cities
- 264: Robots
- 265: Love
- 266: Writing
- 267: Denmark
- 268: Mental Illness
- 269: Iran
- 270: Monsters
- 271: Cyberpunk
- 272: Manga
- 273: Tasmania
- 274: Love Inspired
- 275: Turkish Literature
- 276: Anti Racist
- 277: 17th Century
- 278: Adhd
- 279: Mental Health
- 280: Atheism
- 281: Polygamy
- 282: Mauritius
- 283: Indonesian Literature
- 284: Film
- 285: Abuse
- 286: Logic
- 287: Terrorism
- 288: New Adult Romance
- 289: Counter Culture
- 290: Post Apocalyptic
- 291: Christianity
- 292: 12th Century
- 293: Gothic Horror
- 294: Superman
- 295: Medieval
- 296: Rwanda
- 297: Realistic Fiction
- 298: Womens
- 299: Religion
- 300: Prayer
- 301: Splatterpunk
- 302: Classic Literature
- 303: Crime
- 304: Dragonlance
- 305: Hungarian Literature
- 306: Chemistry
- 307: Video Games
- 308: Ghosts
- 309: American Civil War
- 310: Thelema
- 311: Boarding School
- 312: Autistic Spectrum Disorder
- 313: Romantic Suspense
- 314: Microhistory
- 315: Romance
- 316: Folk Tales
- 317: Vegetarian
- 318: Food and Drink
- 319: American Revolutionary War
- 320: Music
- 321: Illness
- 322: Star Wars
- 323: Figure Skating
- 324: Theory
- 325: Amish
- 326: Adoption
- 327: Shojo
- 328: Health
- 329: Literary Fiction
- 330: Cults
- 331: Futuristic
- 332: Programming
- 333: Social Work
- 334: M F Romance
- 335: Economics
- 336: British Literature
- 337: Aviation
- 338: World History
- 339: Food
- 340: Nursery Rhymes
- 341: Islam
- 342: Zombies
- 343: Maritime
- 344: Military Fiction
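Collected summary
Dataset introduction

Construction
Sitting at the intersection of book classification and natural language processing, the pszemraj/goodreads-bookgenres dataset draws its core data from the book community Goodreads and covers the titles and description texts of nearly ten thousand works. Construction involves aggregating and mapping the original fine-grained tags, which yields two main label schemes: an aggregated version with 18 broad categories, and a full version that this summary describes as retaining more than 600 original fine-grained tags (the label list reproduced above runs from 0 to 344). The data is split into training, validation, and test sets so that models can be trained and evaluated on separate data.
Usage
The dataset is suited to training and evaluating text classification models, particularly for exploring multi-label and hierarchical classification. Researchers can load the configuration that matches their goal: "default" or "initial-aggregated-genres" for broad-category classification, or "original-genres" for fine-grained tag prediction. A typical workflow uses the description text as input and the corresponding "Genres" sequence as the target, fine-tuning a pretrained language model such as BERT. Because the data is already split into training, validation, and test sets, it can be used directly for training, validation, and performance testing of automated book-metadata labeling.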
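The loading step described above might look like the sketch below. It assumes the Hugging Face `datasets` library is installed and that the repository id matches the page title; the helper names `load_config` and `to_model_input` and the sample record are hypothetical.

```python
def load_config(config_name="default"):
    """Load one configuration from the Hugging Face Hub (needs network access)."""
    # Imported lazily so the helper below also works without the library.
    from datasets import load_dataset

    return load_dataset("pszemraj/goodreads-bookgenres", config_name)


def to_model_input(record):
    """Split one record into (input text, target label ids)."""
    return record["Description"], list(record["Genres"])


# Hypothetical record shaped like the features listed above.
sample = {
    "Book": "Example Title",
    "Description": "A detective hunts a killer.",
    "Genres": [2, 14],
}
text, labels = to_model_input(sample)
print(text)
print(labels)
```

With network access, `load_config("original-genres")` should return the train, validation, and test splits for the fine-grained configuration; nothing is downloaded until that function is called.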
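Background and challenges
Background overview
In natural language processing, text classification has long depended on high-quality, fine-grained labeled datasets. The pszemraj/goodreads-bookgenres dataset was built in recent years by the independent researcher pszemraj to study a multi-label text classification problem: mapping book descriptions from Goodreads to a complex set of genres. It combines book titles, description texts, and the corresponding genre labels, spanning categories from literary fiction to specialized non-fiction, and can serve as a benchmark for semantic understanding and classification in areas such as recommender systems, content analysis, and digital library management.
Current challenges
The task the dataset addresses is multi-label genre classification of book descriptions. The difficulty lies in the hierarchical and overlapping nature of genre labels: a single book may belong to both 'historical fiction' and 'fantasy', for example, so a model must capture these semantic relationships. On the construction side, the main challenges were noise handling and label standardization. User-generated Goodreads tags contain many synonymous, ambiguous, and non-standard expressions, which have to be cleaned, merged, and mapped through a combination of manual and automated work to arrive at a unified set of distinct genre categories; this places high demands on data consistency.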
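The merge-and-map step can be illustrated with a small lookup table. This is a sketch only: the tag and genre names are taken from the label lists above, but the pairings in `FINE_TO_BROAD` are assumptions made for illustration and are not the mapping actually used to build the dataset.

```python
# Hypothetical mapping from fine-grained tags to aggregated genres.
# Names come from the label lists above; the pairing itself is an assumption.
FINE_TO_BROAD = {
    "Epic Fantasy": "Science Fiction & Fantasy",
    "Cyberpunk": "Science Fiction & Fantasy",
    "Murder Mystery": "Mystery & Thriller",
    "Spy Thriller": "Mystery & Thriller",
    "Regency Romance": "Romance",
    "Nutrition": "Health & Medicine",
}


def aggregate(tags, mapping=FINE_TO_BROAD, fallback="Other"):
    """Map fine-grained tags to broad genres, dropping duplicates but keeping order."""
    broad = []
    for tag in tags:
        genre = mapping.get(tag, fallback)  # unmapped tags fall back to "Other"
        if genre not in broad:
            broad.append(genre)
    return broad


print(aggregate(["Epic Fantasy", "Cyberpunk", "Regency Romance", "Tea"]))
# prints ['Science Fiction & Fantasy', 'Romance', 'Other']
```

Two fantasy-related tags collapse into one broad genre, and a tag without a mapping lands in the fallback category, which shows how several fine-grained tags can reduce to a shorter multi-label target.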
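Common scenarios
Typical use cases
In text classification, the dataset provides a basis for experiments on multi-label classification. Because each book description is paired with multiple category labels, researchers can build models and evaluate how they perform when a text carries several meanings at once. The dataset covers topics ranging from literary fiction to science and technology, which requires a model to understand a description well enough to assign several relevant categories. This multi-label setup is well suited to modeling real-world content that has more than one attribute.
Academic problems addressed
The dataset is a testbed for two recurring problems in multi-label text classification: class imbalance and semantic overlap between categories. With detailed book descriptions and label schemes at two levels of granularity, it supports work on feature extraction, loss function design, and evaluation metrics. It also encourages modeling of semantic ambiguity in text and of dependencies between categories, and serves as a benchmark resource for making classifiers more robust in realistic settings.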
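Class imbalance shows up clearly in the choice of evaluation metric. The sketch below computes micro- and macro-averaged F1 for multi-hot label vectors in plain Python; the toy labels and predictions are made up for illustration and do not come from the dataset.

```python
def f1(tp, fp, fn):
    """F1 from counts; defined as 0.0 when there are no positives at all."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def multilabel_f1(y_true, y_pred):
    """Return (micro_f1, macro_f1) for lists of equal-length 0/1 vectors."""
    num_labels = len(y_true[0])
    counts = [[0, 0, 0] for _ in range(num_labels)]  # tp, fp, fn per label
    for truth, pred in zip(y_true, y_pred):
        for j, (t, p) in enumerate(zip(truth, pred)):
            if t and p:
                counts[j][0] += 1
            elif p:
                counts[j][1] += 1
            elif t:
                counts[j][2] += 1
    # Micro: pool counts over all labels. Macro: average the per-label F1.
    micro = f1(*[sum(c[k] for c in counts) for k in range(3)])
    macro = sum(f1(*c) for c in counts) / num_labels
    return micro, macro


# Toy data: label 0 is frequent, labels 1 and 2 are rare and never predicted.
y_true = [[1, 0, 0], [1, 1, 0], [1, 0, 1]]
y_pred = [[1, 0, 0], [1, 0, 0], [1, 0, 0]]
micro, macro = multilabel_f1(y_true, y_pred)
print(round(micro, 3), round(macro, 3))
# prints 0.75 0.333
```

A model that only ever predicts the frequent label still reaches a micro F1 of 0.75, while the macro F1 of 0.333 exposes the missed rare labels. Reporting both is a reasonable default for imbalanced genre sets.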
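Practical applications
The dataset has direct application in digital content management and recommender systems. Automatically generating or verifying a book's genre tags from its description can make catalog organization and content discovery more efficient for online libraries, e-book platforms, and retail sites. Automated classification can assist editors with metadata labeling and give personalized recommendation engines a more precise understanding of content, improving how users explore and engage with it.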
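Recent research on the dataset
Latest research directions
Multi-label classification of books is attracting growing research interest in natural language processing, and the pszemraj/goodreads-bookgenres dataset, with its detailed book descriptions and broad genre labels, offers material for this work. Current research focuses on fine-grained genre prediction with pretrained language models; in cross-domain, multi-label settings the dataset lets models learn semantic hierarchy and relationships between genres. With the rise of personalized recommender systems, it is also valuable for helping recommendation algorithms understand book content in more depth, and the summary links it to work on cross-modal analysis and explainable AI in digital reading.
The content above was collected and summarized by 遇见数据集.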



