
ACOSharma/literature

Source: Hugging Face. Updated 2024-05-31. Indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/ACOSharma/literature
Resource description:
---
license: cc-by-sa-4.0
---

# Literature Dataset

## Files

A dataset containing novels, epics and essays. The files are as follows:

- main.txt, a file with all the texts, one text per line, all in English
- vocab.txt, a file with the trained (BERT) vocab, one word per line
- DatasetDistribution.png, a file with all the texts and a plot of their character lengths

There are some 7 million tokens in total.

## Texts

The texts used are these:

- Wuthering Heights
- Ulysses
- Treasure Island
- The War of the Worlds
- The Republic
- The Prophet
- The Prince
- The Picture of Dorian Gray
- The Odyssey
- The Great Gatsby
- The Brothers Karamazov
- Second Treatise of Government
- Pride and Prejudice
- Peter Pan
- Moby Dick
- Metamorphosis
- Little Women
- Les Misérables
- Japanese Girls and Women
- Iliad
- Heart of Darkness
- Grimms' Fairy Tales
- Great Expectations
- Frankenstein
- Emma
- Dracula
- Don Quixote
- Crime and Punishment
- Christmas Carol
- Beyond Good and Evil
- Anna Karenina
- Adventures of Sherlock Holmes
- Adventures of Huckleberry Finn
- Adventures in Wonderland
- A Tale of Two Cities
- A Room with A View
Provided by:
ACOSharma
Summary of original information

Literature Dataset Overview

Dataset contents

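  • Files:
    • main.txt: a file containing all the texts, one text per line, all in English.
    • vocab.txt: the trained BERT vocabulary, one word per line.
    • train.csv: token sequences of length 129 in CSV format, integer type; 48,758 samples (6,289,782 tokens).
    • test.csv: the test set, in the same format as the training set; 5,417 samples (698,793 tokens).
    • DatasetDistribution.png: a chart showing all the texts and their character-length distribution.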
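
The sample and token counts listed for train.csv and test.csv are consistent with a fixed sequence length of 129: each token total equals the sample count times 129, and the two splits together come to just under 7 million tokens. The sketch below checks that arithmetic and shows one way such rows might be parsed. The helper names `total_tokens` and `read_token_rows` are hypothetical, and the assumed layout (no header row, one comma-separated sequence per line) is a guess rather than something the dataset card states.

```python
import csv
import io

SEQ_LEN = 129  # sequence length stated in the file list above

# Sample counts stated in the file list above
SPLITS = {"train": 48_758, "test": 5_417}


def total_tokens(num_samples: int, seq_len: int = SEQ_LEN) -> int:
    """Tokens in a split where every sample holds exactly seq_len tokens."""
    return num_samples * seq_len


def read_token_rows(handle, seq_len: int = SEQ_LEN) -> list[list[int]]:
    """Parse CSV rows of integers and check each row has seq_len values.

    Assumption: no header row and one sequence per row; the real
    files may be laid out differently.
    """
    rows = []
    for line_no, row in enumerate(csv.reader(handle), start=1):
        if not row:
            continue  # skip blank lines
        if len(row) != seq_len:
            raise ValueError(
                f"row {line_no}: expected {seq_len} values, got {len(row)}"
            )
        rows.append([int(value) for value in row])
    return rows


if __name__ == "__main__":
    for name, count in SPLITS.items():
        print(name, total_tokens(count))  # train 6289782, test 698793
    print("all", sum(total_tokens(c) for c in SPLITS.values()))  # all 6988575
```

With a real copy of the data, `read_token_rows(open("train.csv", newline=""))` would be the natural call, provided the file matches the assumed layout.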

Text list

  • Includes the following literary works:
    • Wuthering Heights
    • Ulysses
    • Treasure Island
    • The War of the Worlds
    • The Republic
    • The Prophet
    • The Prince
    • The Picture of Dorian Gray
    • The Odyssey
    • The Great Gatsby
    • The Brothers Karamazov
    • Second Treatise of Government
    • Pride and Prejudice
    • Peter Pan
    • Moby Dick
    • Metamorphosis
    • Little Women
    • Les Misérables
    • Japanese Girls and Women
    • Iliad
    • Heart of Darkness
    • Grimms' Fairy Tales
    • Great Expectations
    • Frankenstein
    • Emma
    • Dracula
    • Don Quixote
    • Crime and Punishment
    • Christmas Carol
    • Beyond Good and Evil
    • Anna Karenina
    • Adventures of Sherlock Holmes
    • Adventures of Huckleberry Finn
    • Adventures in Wonderland
    • A Tale of Two Cities
    • A Room with A View
The content above was collected and summarized by 遇见数据集.