
RyokoExtra/TvTroper

Source: Hugging Face. Updated 2023-06-29; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/RyokoExtra/TvTroper
Resource description:
---
license: apache-2.0
language:
- en
tags:
- training
- text
task_categories:
- text-classification
- text-generation
pretty_name: TvTroper
size_categories:
- 100K<n<1M
---

# Dataset Card for TvTroper

*TvTroper is a public raw dataset of TvTropes.org pages.*

## Dataset Description

- **Homepage:** (TODO)
- **Repository:** N/A
- **Paper:** N/A
- **Leaderboard:** N/A
- **Point of Contact:** KaraKaraWitch

### Dataset Summary

TvTroper is a raw dataset dump consisting of text from at most 651,522 wiki pages (excluding namespaces and date-grouped pages) from tvtropes.org.

### Supported Tasks and Leaderboards

This dataset is primarily intended for unsupervised training of text generation models; however, it may be useful for other purposes.

- text-classification
- text-generation

### Languages

- English

## Dataset Structure

All the data is stored in jsonl files that have been compressed into a 20GB .zip archive.

### Data Instances

```json
["https://tvtropes.org/pmwiki/pmwiki.php/HaruhiSuzumiya/TropesJToN","<!DOCTYPE html>\n\t<html>\n\t\t<head lang=\"en\">\n...<TRUNCATED>"]
```

### Data Fields

Each record is a list with only 2 fields: the URL and the content retrieved. The content retrieved may contain errors. If the page does not exist, the 404 error page is scraped.

One specific URL, `https://tvtropes.org/pmwiki/pmwiki.php/JustForFun/RedirectLoop`, endlessly redirects to itself. As such, we have used the following HTML as a placeholder for such occurrences:

```html
<!DOCTYPE html><html><head lang=\"en\"><title>Error: URL Exceeds maximum allowed redirects.</title></head><body class=\"\"><div>Error: URL Exceeds maximum allowed redirects.</div></body></html>
```

URLs may not match the final URL from which the page was retrieved, as redirects may have been followed while scraping.

#### Q-Score Distribution

Not Applicable

### Data Splits

The jsonl files are split by their namespaces.
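The structure described above (jsonl files inside a .zip archive, each line a two-item JSON list of URL and retrieved HTML) can be read without unpacking the archive. The following is a minimal sketch, not the curator's tooling: the function name `iter_records`, the `.jsonl` member suffix, and UTF-8 encoding are assumptions not stated in the card.

```python
import io
import json
import zipfile
from typing import Iterator, Tuple


def iter_records(zip_path: str) -> Iterator[Tuple[str, str]]:
    """Yield (url, html) pairs from every .jsonl member of the archive.

    Assumptions (not confirmed by the dataset card): members end in
    ".jsonl" and are UTF-8 encoded.
    """
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            if not name.endswith(".jsonl"):
                continue  # skip directories and any non-jsonl members
            with archive.open(name) as raw:
                # Read line by line so a large member is never fully in memory.
                for line in io.TextIOWrapper(raw, encoding="utf-8"):
                    line = line.strip()
                    if not line:
                        continue  # tolerate blank lines
                    url, html = json.loads(line)  # each line is a 2-item list
                    yield url, html
```

Iterating over `iter_records(path)`, where `path` points to wherever the archive was saved locally, yields one page at a time, which matters for an archive of this size.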
## Dataset Creation

### Curation Rationale

We have curated TvTropes.org as it serves as one of the best resources for common themes, narrative devices, and character archetypes that shape our various stories around the world.

### Source Data

#### Initial Data Collection and Normalization

None. No normalization is performed as this is a raw dump of the dataset.

#### Who are the source language producers?

The related editors/users of TvTropes.org

### Annotations

#### Annotation process

No annotations are present.

#### Who are the annotators?

No human annotators.

### Personal and Sensitive Information

We are certain there is no PII included in the dataset.

## Considerations for Using the Data

### Social Impact of Dataset

This dataset is intended to be useful for anyone who wishes to train a model to generate "more entertaining" content. It may also be useful for other languages depending on your language model.

### Discussion of Biases

This dataset contains mainly TV Tropes used in media.

### Other Known Limitations

N/A

## Additional Information

### Dataset Curators

KaraKaraWitch

### Licensing Information

Apache 2.0, for all parts of which KaraKaraWitch may be considered the author. All other material is distributed under fair use principles.

Ronsor Labs is additionally allowed to relicense the dataset as long as it has gone through processing.

### Citation Information

```
@misc{tvtroper,
  title = {TvTroper: Tropes & Others.},
  author = {KaraKaraWitch},
  year = {2023},
  howpublished = {\url{https://huggingface.co/datasets/RyokoExtra/TvTroper}},
}
```

### Name Etymology

N/A

### Contributions

- [@KaraKaraWitch (Twitter)](https://twitter.com/KaraKaraWitch) for gathering this dataset.
Provided by:
RyokoExtra
Summary of original information

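Dataset overview

Dataset name

TvTroper

Dataset description

TvTroper is a raw dataset of at most 651,522 wiki pages from tvtropes.org, excluding namespaces and date-grouped pages.

Dataset uses

The dataset is primarily intended for unsupervised training of text generation models; it may also be useful for other purposes, such as text classification.

Languages

  • English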

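Dataset structure

All of the dataset's files are stored as jsonl files, which are compressed into a 20GB .zip archive.

Data instances

Each data instance consists of a URL and the retrieved content, which may contain errors. If a page does not exist, the 404 error page is scraped instead.

Data fields

The dataset has only two fields: the URL and the retrieved content.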
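Because each record is just a URL and the retrieved content, and the content may be an error page, it can help to validate the two-field shape and flag the redirect placeholder quoted in the dataset card. This is a minimal sketch under assumptions: `Record`, `parse_record`, and `is_redirect_placeholder` are hypothetical names, and matching on the placeholder's `<title>` element is a heuristic rather than a documented rule. Scraped 404 pages are not detected here, since the card does not describe their markup.

```python
import json
from typing import NamedTuple

# Title of the placeholder page quoted in the dataset card.
REDIRECT_PLACEHOLDER_TITLE = (
    "<title>Error: URL Exceeds maximum allowed redirects.</title>"
)


class Record(NamedTuple):
    url: str
    html: str


def parse_record(line: str) -> Record:
    """Parse one jsonl line and check that it is a 2-item list."""
    fields = json.loads(line)
    if not isinstance(fields, list) or len(fields) != 2:
        raise ValueError("expected a JSON list of [url, content]")
    return Record(*fields)


def is_redirect_placeholder(record: Record) -> bool:
    """Heuristic: True if the content carries the placeholder's title."""
    return REDIRECT_PLACEHOLDER_TITLE in record.html
```

Matching on the title rather than the whole page keeps the check independent of how the quotes in the placeholder were escaped.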

Dataset creation

Data collection

The dataset is raw data collected directly from TvTropes.org, with no normalization applied.
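Since the content is raw, unnormalized HTML, anyone training on text will likely need an extraction step of their own. The sketch below shows one possible approach using only Python's standard-library `html.parser`. It is not part of the dataset or its tooling: `html_to_text` is a hypothetical helper, and skipping `script` and `style` content is an assumption about what counts as useful text.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text nodes, ignoring the contents of script/style elements."""

    _SKIPPED = {"script", "style"}

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self._SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self._SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self) -> str:
        # Join the chunks, then collapse all runs of whitespace.
        return " ".join(" ".join(self._chunks).split())


def html_to_text(html: str) -> str:
    """Return the visible text of an HTML page as a single line."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

This keeps navigation text and other page chrome, so a real pipeline would probably narrow extraction to the article body; the card does not say which element holds it, so that step is left open.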

Data sources

The data comes from the related editors and users of TvTropes.org.

Data annotations

The dataset contains no annotations.

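Considerations for using the dataset

Social impact

The dataset is intended to help train models that generate "more entertaining" content; it may also be useful for other languages, depending on the language model.

Discussion of biases

The dataset mainly contains TV Tropes used in media.

Other known limitations

None.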

Additional information

Dataset curators

KaraKaraWitch

Licensing information

The dataset is released under the Apache 2.0 license.

Citation information

@misc{tvtroper, title = {TvTroper: Tropes & Others.}, author = {KaraKaraWitch}, year = {2023}, howpublished = {\url{https://huggingface.co/datasets/RyokoExtra/TvTroper}}, }
