bloomsirenix/dcxbible
Source: Hugging Face | Updated: 2026-04-16 | Indexed: 2026-04-26
Download link:
https://hf-mirror.com/datasets/bloomsirenix/dcxbible
Description:
---
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: validation
    path: data/validation-*
dataset_info:
  features:
  - name: text
    dtype: string
  - name: source
    dtype: string
  - name: type
    dtype: string
  - name: author
    dtype: string
  - name: author_id
    dtype: string
  - name: timestamp
    dtype: string
  - name: timestamp_unix
    dtype: float64
  - name: message_id
    dtype: string
  - name: channel_id
    dtype: string
  - name: attachments_count
    dtype: int64
  - name: embeds_count
    dtype: int64
  splits:
  - name: train
    num_bytes: 37866886
    num_examples: 182916
  - name: validation
    num_bytes: 4193825
    num_examples: 20325
  download_size: 12067643
  dataset_size: 42060711
---
# Discord & Stories Dataset
A cleaned, privacy-preserving dataset combining Discord messages, creative writing, and the King James Bible for language model training.
## Dataset Description
This dataset contains text from multiple sources:
- Discord message exports (sanitized for privacy)
- Creative writing and stories
- The Bible (King James Version)
## Data Sources
| Source | Type | Description |
|--------|------|-------------|
| Discord Data Packages | Chat | Sanitized Discord messages from multiple servers |
| Raw Chat Exports | Chat | Additional Discord conversation data |
| Story Files | Creative | Original creative writing and character guides |
| Bible | Religious | Complete King James Version |
## Privacy & Sanitization
All Discord content has been aggressively sanitized to protect privacy:
### Discord-Specific
- **User mentions**: `<@123456789012345678>` → `@user`
- **Role mentions**: `<@&123456789012345678>` → `@role`
- **Channel mentions**: `<#123456789012345678>` → `#channel`
- **Custom emotes**: `<:name:123456789012345678>` → `:name:`
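The four Discord-specific rules above can be sketched as regular-expression substitutions. This is an illustration only, not the script that produced the dataset: the name `sanitize_discord_markup` and the patterns are assumptions, and the optional `!` (legacy nickname mentions) and `a` (animated emotes) prefixes go beyond what the list states.

```python
import re

# Hypothetical patterns mirroring the four rules listed above.
# Each entry is (compiled pattern, replacement).
_DISCORD_RULES = [
    (re.compile(r"<@&\d+>"), "@role"),         # role mention
    (re.compile(r"<@!?\d+>"), "@user"),        # user mention ("!" form is an assumption)
    (re.compile(r"<#\d+>"), "#channel"),       # channel mention
    (re.compile(r"<a?:(\w+):\d+>"), r":\1:"),  # custom emote, keeps only the name
]


def sanitize_discord_markup(text: str) -> str:
    """Replace Discord mention and emote markup with generic placeholders."""
    for pattern, replacement in _DISCORD_RULES:
        text = pattern.sub(replacement, text)
    return text
```

For example, `sanitize_discord_markup("hi <@123456789012345678>")` returns `"hi @user"`.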
### Links & URLs (all replaced with placeholders)
- **HTTP/HTTPS URLs**: `https://...` → `[link]`
- **WWW URLs**: `www.example.com` → `[link]`
- **Discord invites**: `discord.gg/xxx` → `[link]`
- **Discord CDN**: `cdn.discordapp.com/...` → `[link]`
- **Media embeds**: Tenor, Imgur, Giphy, YouTube, etc. → `[link]`
- **Social media**: Twitter/X, Reddit, etc. → `[link]`
- **Email addresses**: `user@example.com` → `[email]`
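A minimal sketch of the link and email rules above, assuming simple regular expressions are enough. The names `URL_RE`, `EMAIL_RE`, and `sanitize_links` are hypothetical, and the real pipeline may treat edge cases (such as trailing punctuation after a URL) differently.

```python
import re

# Assumed patterns: anything starting with a scheme, "www.", or one of the
# Discord hosts named above is treated as a link up to the next whitespace.
URL_RE = re.compile(
    r"\b(?:https?://|www\.|discord\.gg/|cdn\.discordapp\.com/)\S+",
    re.IGNORECASE,
)
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")


def sanitize_links(text: str) -> str:
    """Replace URLs with [link] and email addresses with [email]."""
    text = URL_RE.sub("[link]", text)
    return EMAIL_RE.sub("[email]", text)
```

Media and social links (Tenor, Imgur, Twitter/X, and so on) need no separate rule in this sketch because they already match the generic `https://` branch.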
### Personal Information
- **Phone numbers**: Replaced with `[phone]`
- **Discord IDs**: 18-20 digit IDs → `[id]`
- **IP addresses**: Replaced with `[ip]`
- **Crypto addresses**: ETH/BTC addresses → `[crypto-address]`
- **API keys/tokens**: Long hex strings → `[hex]`
- **@username mentions**: Replaced with generic `@user`
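The pattern-based rules in this list could look roughly like the sketch below. Everything here is an assumption made for illustration: the 32-character threshold for "long" hex strings, IPv4-only matching, legacy Base58 BTC addresses only, a North American style phone pattern, and the order in which rules run. The `@username` rule is left out.

```python
import re

# Illustrative rules only; thresholds, formats, and ordering are assumptions.
# More specific patterns run first so later ones do not split their matches.
PII_RULES = [
    (re.compile(r"\b0x[0-9a-fA-F]{40}\b"), "[crypto-address]"),  # ETH address
    (re.compile(r"\b[0-9a-fA-F]{32,}\b"), "[hex]"),              # "long" assumed >= 32 chars
    (re.compile(r"\b[13][a-km-zA-HJ-NP-Z1-9]{25,34}\b"), "[crypto-address]"),  # legacy BTC
    (re.compile(r"\b\d{18,20}\b"), "[id]"),                      # Discord IDs, 18-20 digits
    (re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"), "[ip]"),        # IPv4 only
    (
        re.compile(
            r"(?<!\w)(?:\+?\d{1,3}[\s.-]?)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}(?!\w)"
        ),
        "[phone]",
    ),
]


def sanitize_pii(text: str) -> str:
    """Apply each placeholder rule in order and return the cleaned text."""
    for pattern, placeholder in PII_RULES:
        text = pattern.sub(placeholder, text)
    return text
```

A phone pattern this loose also matches other 10-digit numbers, which is consistent with an aggressive sanitization policy but costs some context.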
## Dataset Structure
```python
from datasets import load_from_disk
dataset = load_from_disk("hf_dataset/dataset_split")
# Access train/validation splits
train = dataset["train"]
val = dataset["validation"]
# Each example contains:
{
"text": "The message or document content",
"source": "Source identifier (e.g., 'discord:server:channel')",
"type": "discord_message | story | transcript | bible_verse",
"author": "Author name (sanitized)",
"author_id": "Hashed user ID",
"timestamp": "ISO timestamp (Discord only)",
"timestamp_unix": "Unix timestamp as float",
"message_id": "Truncated message ID",
"channel_id": "[redacted]",
"attachments_count": 0,
"embeds_count": 0
}
```
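To check that a record matches the structure above, a small validator can compare each field against the types declared in the card's front matter (`string`, `float64`, `int64`). This is a sketch: `EXPECTED_FIELDS` and `schema_errors` are hypothetical names, and allowing `None` for any field is an assumption made because the timestamp is described as Discord-only.

```python
# Field names and Python types mirroring the feature list in the front matter
# (string -> str, float64 -> float, int64 -> int).
EXPECTED_FIELDS = {
    "text": str,
    "source": str,
    "type": str,
    "author": str,
    "author_id": str,
    "timestamp": str,
    "timestamp_unix": float,
    "message_id": str,
    "channel_id": str,
    "attachments_count": int,
    "embeds_count": int,
}


def schema_errors(record: dict) -> list[str]:
    """Return a list of problems; an empty list means the record fits."""
    errors = []
    for name, expected in EXPECTED_FIELDS.items():
        if name not in record:
            errors.append(f"missing field: {name}")
            continue
        value = record[name]
        # Assumption: missing values may be stored as None (e.g. non-Discord rows).
        if value is not None and not isinstance(value, expected):
            errors.append(
                f"{name}: expected {expected.__name__}, got {type(value).__name__}"
            )
    return errors
```

Running `schema_errors(dataset["train"][0])` on a loaded split should return an empty list if the record follows the documented schema.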
## Statistics
- **Total examples**: ~203K
- **Train split**: 182,916 (90%)
- **Validation split**: 20,325 (10%)
- **Typical text length**: ~50-200 characters
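The split figures can be cross-checked against the `num_examples` and `num_bytes` values in the front matter. The snippet below is plain arithmetic on those published numbers; the variable names are illustrative.

```python
# Values copied from the dataset card's front matter.
train_examples, validation_examples = 182_916, 20_325
train_bytes, validation_bytes = 37_866_886, 4_193_825
dataset_size = 42_060_711

total_examples = train_examples + validation_examples
train_share = train_examples / total_examples
validation_share = validation_examples / total_examples

print(f"total examples: {total_examples:,}")
print(f"train share: {train_share:.1%}, validation share: {validation_share:.1%}")
print(f"byte totals agree: {train_bytes + validation_bytes == dataset_size}")
```

The total comes to 203,241 examples, which matches the "~203K" figure, and the two splits round to 90% and 10%.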
### By Source
| Source | Examples | Type |
|--------|----------|------|
| Discord Data Packages | ~178K | Chat |
| Raw Chat Exports | ~84 | Chat |
| Bible | ~25K | Religious text |
| Stories/Transcripts | ~103 | Creative writing |
## Usage
### Loading with Hugging Face Datasets
```python
from datasets import load_from_disk, load_dataset
# From local directory
dataset = load_from_disk("./hf_dataset/dataset_split")
# Or from Parquet
dataset = load_dataset("parquet", data_files="hf_dataset/dataset.parquet")
```
### For Language Model Training
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

def tokenize_function(examples):
    # Truncate each text to at most 512 tokens
    return tokenizer(examples["text"], truncation=True, max_length=512)

# `dataset` is the object loaded in the previous step
tokenized = dataset.map(tokenize_function, batched=True)
```
## License
- Bible text: Public Domain (King James Version)
- Discord content: Personal data, used with permission, heavily anonymized
- Stories: Original content
## Citation
```bibtex
@dataset{discord_stories_dataset,
title = {Discord and Stories Dataset},
author = {Anonymous},
year = {2024},
url = {https://huggingface.co/datasets/USERNAME/dataset-name}
}
```
## Limitations
- Discord content may contain informal language, slang, and typos
- Some context may be lost due to sanitization
- Not all Discord messages are high-quality prose
- Dataset is biased toward the specific servers and channels included
## Contributing
To regenerate this dataset:
```bash
pip install -r requirements.txt
python convert_to_hf_dataset.py
```
## Contact
For questions about this dataset, please open an issue on the Hugging Face Hub.
Provider:
bloomsirenix



