dfdddddddd/archiveofourown-meta
Source: Hugging Face. Updated 2025-12-04; indexed 2025-12-20.
Download link:
https://hf-mirror.com/datasets/dfdddddddd/archiveofourown-meta
Resource overview:
---
license: cc-by-nc-4.0
tags:
- art
pretty_name: Archive Of Our Own (metadata only)
size_categories:
- 10M<n<100M
---
# Archive of Our Own (AO3) Fanfiction Metadata Archive Dataset
## Introduction
This dataset contains metadata associated with approximately 13 million publicly available works hosted on Archive of Our Own (AO3). It has been compiled for research purposes, including natural language processing, data analysis, understanding trends within online creative communities, and the **training of generative AI language models**.
**Crucially, this dataset contains *only* metadata and identifiers. It does *not* contain the full text or content of the referenced works from AO3.**
To locate the original work for a row, take its `id` value and substitute it for `{ID}` in `https://archiveofourown.org/works/{ID}`.
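As a minimal sketch of that lookup, the helper below builds the work URL from an `id` value. The function name `work_url` is hypothetical, and the check that ids are purely numeric is an assumption based on the URL pattern, not something the dataset card states.

```python
AO3_WORK_URL = "https://archiveofourown.org/works/{id}"


def work_url(work_id: str) -> str:
    """Return the AO3 page URL for an id taken from a dataset row."""
    work_id = str(work_id).strip()
    # Assumption: ids are plain ASCII digits; reject anything else
    # instead of building a malformed URL.
    if not (work_id.isascii() and work_id.isdigit()):
        raise ValueError(f"unexpected work id: {work_id!r}")
    return AO3_WORK_URL.format(id=work_id)
```

The function only formats a string; it does not request anything from AO3.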
## Data Format
The data is provided in JSONL format. Each line in the `combined_metadata.jsonl` file represents a single record and is a valid JSON object.
Each JSON object has the following top-level structure:
* `id`: (String) A unique identifier referencing the original work on AO3.
* `title`: (String) The title of the work.
* `metadata`: (Object) A nested JSON object containing various metadata fields extracted or associated with the work from AO3.
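Because every line is an independent JSON object, the file can be streamed one record at a time rather than loaded whole, which matters at this size. The following is a minimal sketch using only the standard library; the name `iter_records` is hypothetical, and skipping blank lines is a defensive assumption rather than a documented property of the file.

```python
import json
from typing import Iterator


def iter_records(path: str) -> Iterator[dict]:
    """Yield one dict per non-empty line of a JSONL file."""
    with open(path, encoding="utf-8") as fh:
        for line_number, line in enumerate(fh, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines (assumption: none are expected)
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                # Report where the bad line is so it can be inspected.
                raise ValueError(f"{path}:{line_number}: invalid JSON") from exc
            yield record
```

Calling `iter_records("combined_metadata.jsonl")` would then yield dicts whose `id`, `title`, and `metadata` keys follow the structure listed above.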
The `metadata` object contains key-value pairs describing the work. Common fields include, but are not limited to:
* `author`: (String) The author's listed name.
* `Fandom`/`Fandoms`: (String or Array of Strings) The source material fandom(s).
* `Rating`: (String) The content rating assigned.
* `Category`: (String) The primary relationship category (e.g., M/M, F/F, Gen).
* `Relationship`: (String or Array of Strings) Key relationship pairings.
* `Characters`: (String or Array of Strings) Key characters featured.
* `Archive Warning`: (String or Array of Strings) Content warnings provided by the author.
* `Additional Tags`: (String or Array of Strings) Freeform tags added by the author.
* `Language`: (String) The language of the work.
* `published`: (String) The publication date.
* `completed`: (String) The completion date, if applicable.
* `chapters`: (String) Chapter count (e.g., "1/1", "5/10").
* `words`: (String) The word count.
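Several of the fields above may hold either a single string or an array, and the counts are stored as strings, so some normalization is usually needed before analysis. The helpers below are a sketch under stated assumptions: all three function names are hypothetical, the comma handling in `parse_count` assumes word counts may carry thousands separators, and `parse_chapters` assumes a non-numeric total (for example `?`) means the planned chapter count is unknown. None of this is confirmed by the dataset card.

```python
from typing import Optional, Tuple


def as_list(value) -> list:
    """Normalize a field that may be missing, a string, or a list of strings."""
    if value is None:
        return []
    if isinstance(value, str):
        return [value]
    return list(value)


def parse_count(text) -> Optional[int]:
    """Parse a numeric string such as "1234" or "1,234"; None if not numeric."""
    cleaned = str(text).replace(",", "").strip()
    if cleaned.isascii() and cleaned.isdigit():
        return int(cleaned)
    return None


def parse_chapters(text) -> Tuple[Optional[int], Optional[int]]:
    """Split "5/10" into (5, 10); a non-numeric part becomes None."""
    posted, _, planned = str(text).partition("/")
    return parse_count(posted), parse_count(planned)
```

For example, `as_list(metadata.get("Fandoms") or metadata.get("Fandom"))` gives a list regardless of which key and shape a record uses, and `parse_chapters("5/10")` returns `(5, 10)`.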
## Legal Considerations and Copyright
This dataset consists solely of metadata and ID numbers that reference works publicly available on the internet via Archive of Our Own (AO3). **No copyrighted creative works are stored or distributed in this dataset.**
The collection and distribution of this metadata is based on established legal principles regarding hyperlinking and indexing under copyright law, particularly within the United States.
1. **Linking vs. Copying:** Providing a reference, identifier, or hyperlink to publicly accessible content on Archive of Our Own (AO3) is fundamentally different from reproducing or distributing the copyrighted content itself. US courts have generally recognized that linking does not constitute direct copyright infringement, as the entity providing the link does not host or transmit a copy of the work. This principle is exemplified by the "Server Test" articulated in cases like *Perfect 10, Inc. v. Amazon.com, Inc.*, which distinguishes providing navigational instructions (like a link or ID) from the infringing act of hosting and displaying the content itself from one's own server. Distributing this dataset does not implicate the exclusive rights of reproduction or distribution under 17 U.S.C. § 106.
2. **No Distribution of Works:** This dataset contains factual metadata and identifiers associated with works on AO3, not the expressive content of the referenced works themselves. The legal analysis applicable to distributing copies of copyrighted material does not apply here.
3. **Purpose:** This dataset is intended as a resource for research, analysis, and potentially non-consumptive uses like training AI models on metadata patterns or language modeling based on publicly accessible data, consistent with principles of fair use where applicable.
**Regarding Copyright Complaints:**
This project operates under the well-established legal understanding that providing metadata and references (akin to links or citations) to publicly available online content, specifically works hosted on Archive of Our Own (AO3), does not constitute copyright infringement. Providing only metadata and identifiers, not the works themselves, is a deliberate structural choice to facilitate research while respecting copyright boundaries as understood under prevailing law.
Therefore, communications asserting copyright infringement based *solely* on the inclusion of metadata or an identifier referencing a work publicly hosted on Archive of Our Own (AO3) in this dataset will be considered legally unfounded. Such assertions demonstrate a misunderstanding of the distinction between linking/indexing and the distribution of copyrighted works. **We will not engage in correspondence regarding claims based on the mere presence of metadata or an identifier for a work publicly hosted on Archive of Our Own (AO3) within this index.**
Users who utilize the identifiers in this dataset to access the referenced works directly from Archive of Our Own (AO3) are solely responsible for ensuring their own subsequent use of those works complies with copyright law and the terms of service of Archive of Our Own (AO3).
## Contact
For technical questions regarding the dataset format or structure, please open a discussion thread. Please review the legal considerations outlined above before initiating contact regarding copyright matters related to linking or indexing content from Archive of Our Own (AO3).
Provider:
dfdddddddd



