FelipeRochaMartins/SoulsWikiChunks
Hugging Face · Updated 2025-11-26 · Indexed 2025-12-20
Download link:
https://hf-mirror.com/datasets/FelipeRochaMartins/SoulsWikiChunks
---
license: cc-by-nc-sa-4.0
task_categories:
- text-generation
- feature-extraction
- text-retrieval
- text-ranking
- sentence-similarity
language:
- en
tags:
- soulslike
- game
- rpg
- dark-fantasy
- lore
- rag
- knowledge-base
- nlp
- web-scraping
- text-chunking
- passages
- elden-ring
- shadow-of-the-erdtree
- nightreign
- dark-souls
- bloodborne
- sekiro
- demons-souls
- fextralife
pretty_name: Soulslike Wiki Chunks (RAG Corpus)
size_categories:
- 100K<n<1M
---
# Soulslike Wiki Chunks (RAG Corpus)
## Dataset Description
This dataset contains **chunked passages** derived from the raw Soulslike wiki scrapes
published in [`FelipeRochaMartins/SoulsWikiScrapping`](https://huggingface.co/datasets/FelipeRochaMartins/SoulsWikiScrapping).
Whereas the original dataset stores **full pages** as JSON documents, this repository
focuses on **RAG-friendly text chunks** (passages) that are ready to be fed into:
- Vector stores / embedding pipelines
- Retrieval-Augmented Generation (RAG) systems
- Fine-tuning / continual pretraining workflows that operate on passages instead of full pages
The current release is stored as a hierarchical JSON snapshot with the following shape:
```json
{
  "<Project/Game>": {
    "<ChunkUUID>": {
      "content": "...",
      "metadata": {
        "raw_path": "raw/<Project>/<SourcePage>.json",
        "project": "<Project>",
        "source_url": "https://...",
        "model": "<model-id-used-to-generate-the-chunk>",
        "category": "<domain-tag>",
        "chunk_headline": "<short title or heading for the chunk>"
      }
    }
  }
}
```
- **Upstream dependency:** All chunks are derived from `SoulsWikiScrapping` raw pages.
- **Format:** JSON snapshots uploaded under `FelipeRochaMartins/SoulsWikiChunks`
(compatible with Hugging Face `datasets`).
- **Games Covered:** Elden Ring (including Shadow of the Erdtree & Nightreign),
Dark Souls Trilogy, Bloodborne, Sekiro, Demon's Souls.
> The exact set of metadata fields (e.g., `model`, `category`, `chunk_headline`)
> is defined by the chunking pipeline in this repository. The snapshot preserves
> them as-is so downstream consumers can decide how to use or ignore them.
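
As a rough illustration of the layout above, the following sketch walks a loaded snapshot and reports chunks that do not follow the project → chunk ID → `content`/`metadata` shape. The helper name `iter_invalid_chunks` and the decision to treat every documented metadata field as required are assumptions made for this example; they are not part of the dataset tooling.

```python
import uuid

# Metadata fields documented in the snapshot layout. Treating all of them
# as required is an assumption of this sketch.
EXPECTED_METADATA_FIELDS = {
    "raw_path", "project", "source_url", "model", "category", "chunk_headline",
}


def iter_invalid_chunks(payload):
    """Yield (project, chunk_id, reason) for chunks that break the layout."""
    for project, chunks in payload.items():
        for chunk_id, chunk in chunks.items():
            try:
                uuid.UUID(chunk_id)
            except ValueError:
                yield project, chunk_id, "chunk ID is not a UUID"
                continue
            if not isinstance(chunk.get("content"), str):
                yield project, chunk_id, "missing or non-string content"
                continue
            missing = EXPECTED_METADATA_FIELDS - set(chunk.get("metadata") or {})
            if missing:
                yield project, chunk_id, f"missing metadata: {sorted(missing)}"


# Small sample: the first entry mirrors the documented example, the second
# is a deliberately malformed placeholder.
sample = {
    "DemonsSouls": {
        "00499e9a-c16a-41bf-a795-3d03f7bbba4b": {
            "content": "Official's Leggings | Demons Souls Wiki...",
            "metadata": {
                "raw_path": "raw/DemonsSouls/Official's_Leggings.json",
                "project": "DemonsSouls",
                "source_url": "https://demonssouls.wiki.fextralife.com/Official's+Leggings",
                "model": "llama3.1:8b",
                "category": "ARMOR",
                "chunk_headline": "Official's Leggings | Demons Souls Wiki",
            },
        },
        "not-a-uuid": {"content": "placeholder", "metadata": {}},
    }
}

problems = list(iter_invalid_chunks(sample))
for project, chunk_id, reason in problems:
    print(f"{project}/{chunk_id}: {reason}")
```

Running a check like this once after loading can catch a truncated or partially written snapshot before it reaches an indexing job.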
---
## Dataset Structure
### Snapshot Layout
At the top level, the JSON snapshot preserves the same project-level
organization as the raw dataset:
- Top-level keys: game / project identifiers
  - e.g. `"EldenRing"`, `"Bloodborne"`, `"DarkSouls3"`, etc.
- Second level: **chunk IDs** (UUIDs) that uniquely identify each passage.
Example (simplified, similar to the real export structure):
```json
{
  "DemonsSouls": {
    "00499e9a-c16a-41bf-a795-3d03f7bbba4b": {
      "content": "Official's Leggings | Demons Souls Wiki...",
      "metadata": {
        "raw_path": "raw\\DemonsSouls\\Official's_Leggings.json",
        "project": "DemonsSouls",
        "source_url": "https://demonssouls.wiki.fextralife.com/Official's+Leggings",
        "model": "llama3.1:8b",
        "category": "ARMOR",
        "chunk_headline": "Official's Leggings | Demons Souls Wiki"
      }
    },
    "008d5f1d-3bab-4175-9100-38cd9cbc1dbe": {
      "content": "Phosphorescent Slug Drops and Strategies...",
      "metadata": {
        "raw_path": "raw\\DemonsSouls\\Phosphorescent_Slug.json",
        "project": "DemonsSouls",
        "source_url": "https://demonssouls.wiki.fextralife.com/Phosphorescent+Slug",
        "model": "llama3.1:8b",
        "category": "ENEMY",
        "chunk_headline": "Phosphorescent Slug Drops and Strategies"
      }
    }
  }
}
```
> **Note:** Each UUID key corresponds to exactly one chunk, containing its
> textual `content` and a `metadata` object with pointers back to the source
> page plus additional annotations.
Typical `metadata` fields:
- `raw_path`: Relative path to the original raw JSON file under `data/raw`,
useful for debugging or re-processing. As the example above shows, the path
may use Windows-style backslash separators.
- `project`: Game identifier (e.g., `DemonsSouls`, `EldenRing`).
- `source_url`: Original wiki URL from which this chunk was derived.
- `model`: Identifier of the model used to generate or rewrite this chunk
(e.g., `llama3.1:8b`).
- `category`: Semantic category for the source page/chunk. This can be
  used to build filtered indices (e.g., only enemies, only weapons).
  - The full set of categories:
    - LORE
    - LOCATION
    - NPC
    - BOSS
    - ENEMY
    - WEAPON
    - ARMOR
    - ACCESSORY
    - MAGIC_ABILITY
    - ITEM
    - MECHANIC
    - QUEST_GUIDE
    - BUILD_CLASS
    - OTHER
- `chunk_headline`: Short human-readable headline or section title that
summarizes what the chunk is about.
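
One way to use the `category` field for filtered indices is to bucket chunks by category before indexing. The sketch below is a minimal example under stated assumptions: the helper name `group_by_category` is hypothetical, and falling back to `OTHER` when a chunk has no category is a choice made for this example rather than documented behavior.

```python
from collections import defaultdict


def group_by_category(payload):
    """Map each category to a list of (project, chunk_id, content) tuples."""
    groups = defaultdict(list)
    for project, chunks in payload.items():
        for chunk_id, chunk in chunks.items():
            metadata = chunk.get("metadata") or {}
            # Assumption of this sketch: uncategorized chunks go to "OTHER".
            category = metadata.get("category") or "OTHER"
            groups[category].append((project, chunk_id, chunk.get("content", "")))
    return dict(groups)


# Sample mirroring the two documented example chunks (content truncated).
sample = {
    "DemonsSouls": {
        "00499e9a-c16a-41bf-a795-3d03f7bbba4b": {
            "content": "Official's Leggings | Demons Souls Wiki...",
            "metadata": {"category": "ARMOR"},
        },
        "008d5f1d-3bab-4175-9100-38cd9cbc1dbe": {
            "content": "Phosphorescent Slug Drops and Strategies...",
            "metadata": {"category": "ENEMY"},
        },
    }
}

groups = group_by_category(sample)
enemy_chunks = groups.get("ENEMY", [])
print({category: len(items) for category, items in groups.items()})
```

Each bucket can then be embedded and indexed separately, for example to serve an enemies-only or weapons-only search.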
### Relationship to the Raw Dataset
- `SoulsWikiScrapping`: stores **full pages** under `meta` + `data`.
- `SoulsWikiChunks`: stores **passages/chunks** derived from those pages,
preserving enough metadata to trace back to the original source page.
This separation allows you to:
- Rebuild / re-chunk from raw data if you change the strategy.
- Consume a stable, RAG-ready corpus directly from this dataset.
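
To trace chunks back to their raw pages, one option is to group chunk IDs by `raw_path`. The sketch below assumes that paths may appear with either backslash or forward-slash separators (both forms occur in the examples on this card) and normalizes them first. The helper names and the short chunk IDs in the sample are placeholders, not real dataset values.

```python
from collections import defaultdict
from pathlib import PureWindowsPath


def normalize_raw_path(raw_path):
    """Return the path with forward slashes, whichever separator it used."""
    # PureWindowsPath accepts both "\\" and "/" as separators.
    return PureWindowsPath(raw_path).as_posix()


def chunks_by_source_page(payload):
    """Map each normalized raw_path to the chunk IDs derived from that page."""
    pages = defaultdict(list)
    for chunks in payload.values():
        for chunk_id, chunk in chunks.items():
            raw_path = (chunk.get("metadata") or {}).get("raw_path")
            if raw_path:
                pages[normalize_raw_path(raw_path)].append(chunk_id)
    return dict(pages)


# Two placeholder chunks that point at the same raw page with different
# separator styles.
sample = {
    "DemonsSouls": {
        "chunk-a": {
            "content": "...",
            "metadata": {"raw_path": "raw\\DemonsSouls\\Official's_Leggings.json"},
        },
        "chunk-b": {
            "content": "...",
            "metadata": {"raw_path": "raw/DemonsSouls/Official's_Leggings.json"},
        },
    }
}

pages = chunks_by_source_page(sample)
print(pages)
```

The resulting mapping makes it straightforward to find every chunk that must be regenerated when a single raw page is re-scraped or re-chunked.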
---
## Chunk Length Statistics
To help you understand the typical passage size in this corpus, here are
summary statistics over the `content` field lengths (in characters).
### Chunk content length statistics (characters)
| Project | count | mean | std | min | 25% | 50% (median) | 75% | 90% | 99% | max |
|------------------------|--------|-------|-------|-----|------|--------------|------|------|-------|-------|
| Bloodborne | 2,085 | 500.7 | 179.8 | 137 | 406 | 480 | 550 | - | - | 2,044 |
| DarkSouls | 3,581 | 489.7 | 168.9 | 72 | 397 | 472 | 549 | - | - | 2,185 |
| DarkSouls2 | 4,945 | 496.1 | 142.2 | 93 | 417 | 484 | 554 | - | - | 2,022 |
| DarkSouls3 | 4,029 | 471.9 | 163.2 | 101 | 381 | 456 | 533 | - | - | 2,029 |
| DemonsSouls | 1,430 | 498.2 | 128.8 | 107 | 431 | 493 | 560 | - | - | 1,670 |
| EldenRing | 15,219 | 470.5 | 177.4 | 78 | 376 | 455 | 529 | - | - | 2,214 |
| EldenRingNightreign | 10,623 | 477.7 | 227.5 | 105 | 348 | 440 | 539 | - | - | 2,350 |
| SekiroShadowsDieTwice | 1,536 | 526.3 | 181.2 | 102 | 428 | 507 | 587 | - | - | 2,112 |
| Global | 43,448 | 481.2 | 185.0 | 72 | 379 | 463 | 541 | 632 | 1,305 | 2,350 |
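
Statistics of this kind can be recomputed from a loaded snapshot. The following is a minimal sketch using only the standard library; it assumes the sample standard deviation, which may or may not match how the table above was produced, and the helper name `content_length_stats` is hypothetical.

```python
import statistics


def content_length_stats(payload):
    """Return per-project summary statistics of `content` length in characters."""
    stats = {}
    for project, chunks in payload.items():
        lengths = [len(chunk.get("content", "")) for chunk in chunks.values()]
        if not lengths:
            continue
        stats[project] = {
            "count": len(lengths),
            "mean": statistics.mean(lengths),
            # Sample standard deviation; undefined for a single value, so use 0.0.
            "std": statistics.stdev(lengths) if len(lengths) > 1 else 0.0,
            "min": min(lengths),
            "median": statistics.median(lengths),
            "max": max(lengths),
        }
    return stats


# Synthetic sample with contents of 100, 200, and 300 characters.
sample = {
    "Demo": {
        "a": {"content": "x" * 100},
        "b": {"content": "x" * 200},
        "c": {"content": "x" * 300},
    }
}

stats = content_length_stats(sample)
print(stats["Demo"])
```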
---
## Usage
### How to Load in Python
You can load the latest snapshot using `datasets` and then access the nested
JSON payload:
```python
from datasets import load_dataset

dataset = load_dataset(
    "FelipeRochaMartins/SoulsWikiChunks",
    data_files="latest.json"
)
payload = dataset["train"][0]

# Access a project and iterate over all its chunks
demons_souls = payload["DemonsSouls"]
for chunk_id, chunk_payload in demons_souls.items():
    text = chunk_payload.get("content", "")
    meta = chunk_payload.get("metadata", {})
    print(chunk_id, meta.get("chunk_headline"))
    print(text[:200])
    break
```
You can then feed `chunk_payload["content"]` to an embedding model or a RAG
indexer of your choice.
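
As a sketch of that hand-off, the code below batches chunk contents and passes each batch to an embedding callable. `embed_batch` is a hypothetical stand-in for whatever model or client you use (it takes a list of strings and returns one vector per string); the toy implementation here only exists so the example runs without external dependencies.

```python
from itertools import islice


def iter_batches(items, batch_size):
    """Yield successive lists of at most `batch_size` items."""
    iterator = iter(items)
    while batch := list(islice(iterator, batch_size)):
        yield batch


def embed_chunks(chunks, embed_batch, batch_size=64):
    """Return {chunk_id: vector} for a mapping of chunk_id -> chunk payload."""
    vectors = {}
    for batch in iter_batches(chunks.items(), batch_size):
        ids = [chunk_id for chunk_id, _ in batch]
        texts = [chunk.get("content", "") for _, chunk in batch]
        for chunk_id, vector in zip(ids, embed_batch(texts)):
            vectors[chunk_id] = vector
    return vectors


def toy_embed_batch(texts):
    """Placeholder embedder: a one-dimensional 'vector' holding the text length."""
    return [[float(len(text))] for text in texts]


# Placeholder chunks; replace with payload["DemonsSouls"] or another project.
sample_chunks = {
    "chunk-a": {"content": "Official's Leggings"},
    "chunk-b": {"content": "Phosphorescent Slug"},
    "chunk-c": {"content": ""},
}

vectors = embed_chunks(sample_chunks, toy_embed_batch, batch_size=2)
print(vectors)
```

Keeping the embedder as a plain callable means the same loop works with a local model or a remote API without changes to the batching logic.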
### Example: Building a Passage List
```python
all_passages = []

for project, project_chunks in payload.items():
    for chunk_id, chunk_payload in project_chunks.items():
        meta = chunk_payload.get("metadata", {})
        all_passages.append({
            "project": project,
            "chunk_id": chunk_id,
            "source_url": meta.get("source_url"),
            "category": meta.get("category"),
            "headline": meta.get("chunk_headline"),
            "content": chunk_payload.get("content", ""),
        })

print("Total passages:", len(all_passages))
```
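
A flat passage list like this is easy to export as JSON Lines, a format many indexing tools accept. The following sketch assumes passages shaped like the dictionaries built above; the helper name `write_jsonl` and the decision to skip passages with empty content are choices made for this example.

```python
import json
import os
import tempfile


def write_jsonl(passages, path):
    """Write one JSON object per line; return the number of passages written."""
    written = 0
    with open(path, "w", encoding="utf-8") as handle:
        for passage in passages:
            if not passage.get("content"):
                continue  # skip empty passages (a choice made for this sketch)
            handle.write(json.dumps(passage, ensure_ascii=False) + "\n")
            written += 1
    return written


# Placeholder passages; in practice, pass the `all_passages` list instead.
sample_passages = [
    {"project": "DemonsSouls", "chunk_id": "chunk-a", "content": "Official's Leggings..."},
    {"project": "DemonsSouls", "chunk_id": "chunk-b", "content": ""},
]

# Write to a temporary directory so the example leaves the working tree alone.
output_path = os.path.join(tempfile.mkdtemp(), "passages.jsonl")
written = write_jsonl(sample_passages, output_path)

with open(output_path, encoding="utf-8") as handle:
    lines = handle.read().splitlines()

print(f"Wrote {written} passage(s) to {output_path}")
```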
---
## Source & Licensing
**Source:** The chunks are derived from the raw wiki scrapes in
[`SoulsWikiScrapping`](https://huggingface.co/datasets/FelipeRochaMartins/SoulsWikiScrapping),
which in turn were scraped from [Fextralife Wiki](https://wiki.fextralife.com/).
**License:** This dataset is distributed under the **CC BY-NC-SA 4.0** license
(Creative Commons Attribution-NonCommercial-ShareAlike), consistent with the
licensing terms of the source wikis.
- **Attribution:** All narrative content belongs to **FromSoftware** and the
respective wiki contributors.
- **Non-Commercial:** This dataset should not be used for commercial purposes
without permission from the copyright holders.
---
## Intended Use
- **RAG Systems:**
Build retrieval-augmented QA/chatbots about Elden Ring, Bloodborne, Dark Souls,
Sekiro, and Demon's Souls on top of chunked passages instead of full pages.
- **Embeddings & Vector Stores:**
Generate dense embeddings per chunk and index them in a vector database for
semantic search and context retrieval.
- **Lore & Narrative Analysis:**
Analyze relationships between entities, NPCs, locations, and items at a
passage-level granularity.
- **Evaluation / Benchmarking:**
Use the chunked corpus to build custom retrieval benchmarks or to evaluate
LLM grounding on Soulslike lore.
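
For the benchmarking use case, a retrieval evaluation can be as simple as measuring recall@k over (query, relevant chunk) pairs. The sketch below uses a toy token-overlap scorer purely as a placeholder for a real retriever; the function names, chunk IDs, and queries are illustrative assumptions, and the passage texts reuse the example headlines from this card.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def rank_chunks(query, passages):
    """Return chunk IDs sorted by token overlap with the query (toy scorer)."""
    query_tokens = tokenize(query)
    scored = [
        (len(query_tokens & tokenize(passage["content"])), passage["chunk_id"])
        for passage in passages
    ]
    # Highest overlap first; ties are broken by chunk ID for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [chunk_id for _, chunk_id in scored]


def recall_at_k(queries, passages, k=1):
    """Fraction of queries whose relevant chunk appears in the top k results."""
    hits = sum(
        1
        for text, relevant_id in queries
        if relevant_id in rank_chunks(text, passages)[:k]
    )
    return hits / len(queries)


# Placeholder passages and hand-written queries for illustration only.
passages = [
    {"chunk_id": "c1", "content": "Official's Leggings | Demons Souls Wiki"},
    {"chunk_id": "c2", "content": "Phosphorescent Slug Drops and Strategies"},
]
queries = [
    ("slug drops", "c2"),
    ("official's leggings", "c1"),
]

score = recall_at_k(queries, passages, k=1)
print(f"recall@1 = {score:.2f}")
```

Swapping `rank_chunks` for a call into a vector store keeps the evaluation loop unchanged, which makes it easy to compare retrievers on the same query set.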
Provider: FelipeRochaMartins



