Qqcf16426/mangaupdates
Source: Hugging Face · Updated 2023-12-16 · Indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/Qqcf16426/mangaupdates
Resource description:
---
language:
- en
tags:
- manga
- tags
- genres
- scraped
size_categories:
- 100K<n<1M
---
I scraped [mangaupdates](https://www.mangaupdates.com) for a project and I am sharing the data. There is a tar file which contains the JSON response for every info entry.
I parsed it and added it to a Postgres database. The database dump (pgdump) was uploaded too. Some entries no longer exist; their ids can be found in the removed ids JSON file.
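
A minimal sketch of reading the raw responses, assuming the tar archive holds one JSON document per entry and that the removed ids file is a JSON array of integers. The member names, the `series_id` field, and the small in-memory demo archive are hypothetical stand-ins, not the dataset's actual layout. Whether removed entries still appear in the archive is also an assumption.

```python
import io
import json
import tarfile


def iter_entries(tar_bytes: bytes):
    """Yield one parsed JSON object per regular .json member of a tar archive."""
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r:*") as archive:
        for member in archive:
            if not member.isfile() or not member.name.endswith(".json"):
                continue
            handle = archive.extractfile(member)
            if handle is not None:
                yield json.load(handle)


def build_demo_tar(entries: dict) -> bytes:
    """Build a small in-memory tar so the sketch runs without the real file."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as archive:
        for name, payload in entries.items():
            data = json.dumps(payload).encode("utf-8")
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            archive.addfile(info, io.BytesIO(data))
    return buffer.getvalue()


# Hypothetical sample data; the real field names may differ.
demo_tar = build_demo_tar({
    "1.json": {"series_id": 1, "title": "Example A"},
    "2.json": {"series_id": 2, "title": "Example B"},
    "3.json": {"series_id": 3, "title": "Example C"},
})
removed_ids = set(json.loads("[2]"))  # stand-in for the removed ids file

# Keep only entries whose id is not listed as removed.
live = [e for e in iter_entries(demo_tar) if e["series_id"] not in removed_ids]
live_titles = [e["title"] for e in live]
```

For the downloaded archive, `tarfile.open(path)` on the file itself would replace the in-memory bytes used here.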
<details>
<summary>SQL structure</summary>
I didn't try to make it an optimal structure, but I tried to remove the redundancy of strings.
### Info
```sql
create table info
(
    id serial primary key,
    private_id int,
    public_id bigint not null,
    forum_id bigint not null,
    url_key text not null,
    url_name text,
    titles text[] not null,
    description text,
    image_name text,
    typ int not null,
    year int,
    latest_chapter integer not null,
    rating integer not null,
    bayesian_rating float,
    genres int[] not null,
    tags int[] not null,
    tags_upvotes int[] not null,
    tags_downvotes int[] not null,
    tags_uploader bigint[] not null,
    status text,
    licensed boolean not null,
    completed boolean not null,
    author int[] not null,
    artist int[] not null,
    publisher_original int[] not null,
    publisher_english int[] not null,
    publication text[] not null,
    publication_publisher int[] not null,
    relations text[] not null,
    anime_start text,
    anime_end text,
    last_updated_mu TIMESTAMP,
    last_updated TIMESTAMP not null,
    created TIMESTAMP not null
);
```
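
The four `tags*` columns look like parallel arrays with one position per tag. The source does not state this, so it is an assumption. Under that assumption, a row's tag data could be recombined roughly as follows; the sample row is made up.

```python
from typing import NamedTuple


class TagVote(NamedTuple):
    tag_id: int
    upvotes: int
    downvotes: int
    uploader: int


def tag_votes(row: dict) -> list:
    """Zip the tag columns of one info row into per-tag records."""
    columns = (
        row["tags"],
        row["tags_upvotes"],
        row["tags_downvotes"],
        row["tags_uploader"],
    )
    # Parallel arrays only make sense if all four have the same length.
    if len({len(c) for c in columns}) != 1:
        raise ValueError("tag columns have different lengths")
    return [TagVote(*values) for values in zip(*columns)]


sample_row = {  # hypothetical values, not taken from the dataset
    "tags": [12, 7, 30],
    "tags_upvotes": [5, 2, 9],
    "tags_downvotes": [1, 4, 0],
    "tags_uploader": [1001, 1002, 1001],
}

# Rank tags by net votes (upvotes minus downvotes), highest first.
ranked = sorted(
    tag_votes(sample_row),
    key=lambda t: t.upvotes - t.downvotes,
    reverse=True,
)
ranked_ids = [t.tag_id for t in ranked]
```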
### Types
```sql
create table if not exists mtypes
(
    id serial primary key,
    name text not null
);
```
### Genres
```sql
create table if not exists genres
(
    id serial primary key,
    name text not null
);
```
### Tags
```sql
create table if not exists tags
(
    id serial primary key,
    name text not null
);
```
### People
```sql
create table if not exists ppl
(
    id serial primary key,
    mu_id bigint,
    name text not null
);
);
```
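
The lookup tables suggest that the integer arrays in `info` (`genres`, `tags`, `author`, `artist`) hold ids from `genres`, `tags`, and `ppl`, and that `typ` points at `mtypes`. The source does not spell this mapping out, so treat it as an assumption. Under that assumption, a minimal sketch of how strings might be interned into ids on load and resolved back on read; the `LookupTable` class is hypothetical and only mimics a `serial` id column in memory.

```python
class LookupTable:
    """In-memory stand-in for an (id serial, name text) table."""

    def __init__(self):
        self._id_by_name = {}
        self._name_by_id = {}

    def intern(self, name: str) -> int:
        """Return the id for name, assigning the next id on first sight."""
        if name not in self._id_by_name:
            new_id = len(self._id_by_name) + 1  # serial columns start at 1
            self._id_by_name[name] = new_id
            self._name_by_id[new_id] = name
        return self._id_by_name[name]

    def resolve(self, ids):
        """Map a list of ids back to their names."""
        return [self._name_by_id[i] for i in ids]


genres = LookupTable()
# Each distinct string is stored once; rows keep only small integers.
row_a = [genres.intern(g) for g in ["Action", "Comedy"]]
row_b = [genres.intern(g) for g in ["Comedy", "Drama"]]
names_b = genres.resolve(row_b)
```

In Postgres itself, the same resolution would typically be a join between `unnest(genres)` and the `genres` table rather than application code.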
</details>
Provided by:
Qqcf16426
Summary of original information
Dataset overview
Data source
- The data was scraped from the mangaupdates website.
Data contents
- A tar file containing the JSON response for every info entry.
- The data has been parsed and loaded into a PostgreSQL database; the pgdump file was uploaded as well.
- Some entries have been removed; their ids are recorded in the removed ids JSON file.
Database structure
- The schema is not optimized, but string redundancy was reduced where possible.
Table structure
- Info table
  - Fields: id, private_id, public_id, forum_id, url_key, url_name, titles, description, image_name, typ, year, latest_chapter, rating, bayesian_rating, genres, tags, tags_upvotes, tags_downvotes, tags_uploader, status, licensed, completed, author, artist, publisher_original, publisher_english, publication, publication_publisher, relations, anime_start, anime_end, last_updated_mu, last_updated, created.
- Types table
  - Fields: id, name.
- Genres table
  - Fields: id, name.
- Tags table
  - Fields: id, name.
- People table
  - Fields: id, mu_id, name.
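Collected summary
Dataset introduction

Background and challenges
Background overview
This dataset is a collection of manga information scraped from the MangaUpdates website. It contains roughly 100K to 1M records and covers structured data such as manga titles, authors, types, tags, and ratings. The data has been parsed and stored in a PostgreSQL database, but generation errors exist that may leave some records incomplete or malformed. It is suited to manga content analysis and recommender-system research.
The above content was collected and summarized by 遇见数据集.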



