bloomsirenix/dcxbible

Source: Hugging Face · Updated: 2026-04-16 · Indexed: 2026-04-26
Download link:
https://hf-mirror.com/datasets/bloomsirenix/dcxbible
Resource description:
---
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: validation
    path: data/validation-*
dataset_info:
  features:
  - name: text
    dtype: string
  - name: source
    dtype: string
  - name: type
    dtype: string
  - name: author
    dtype: string
  - name: author_id
    dtype: string
  - name: timestamp
    dtype: string
  - name: timestamp_unix
    dtype: float64
  - name: message_id
    dtype: string
  - name: channel_id
    dtype: string
  - name: attachments_count
    dtype: int64
  - name: embeds_count
    dtype: int64
  splits:
  - name: train
    num_bytes: 37866886
    num_examples: 182916
  - name: validation
    num_bytes: 4193825
    num_examples: 20325
  download_size: 12067643
  dataset_size: 42060711
---

# Discord & Stories Dataset

A cleaned, privacy-preserving dataset combining Discord messages and creative writing for language model training.

## Dataset Description

This dataset contains text from multiple sources:

- Discord message exports (sanitized for privacy)
- Creative writing and stories
- The Bible (King James Version)

## Data Sources

| Source | Type | Description |
|--------|------|-------------|
| Discord Data Packages | Chat | Sanitized Discord messages from multiple servers |
| Raw Chat Exports | Chat | Additional Discord conversation data |
| Story Files | Creative | Original creative writing and character guides |
| Bible | Religious | Complete King James Version |

## Privacy & Sanitization

All Discord content has been aggressively sanitized to protect privacy:

### Discord-Specific

- **User mentions**: `<@123456789012345678>` → `@user`
- **Role mentions**: `<@&123456789012345678>` → `@role`
- **Channel mentions**: `<#123456789012345678>` → `#channel`
- **Custom emotes**: `<:name:123456789012345678>` → `:name:`

### Links & URLs (ALL REMOVED)

- **HTTP/HTTPS URLs**: `https://...` → `[link]`
- **WWW URLs**: `www.example.com` → `[link]`
- **Discord invites**: `discord.gg/xxx` → `[link]`
- **Discord CDN**: `cdn.discordapp.com/...` → `[link]`
- **Media embeds**: Tenor, Imgur, Giphy, YouTube, etc. → `[link]`
- **Social media**: Twitter/X, Reddit, etc. → `[link]`
- **Email addresses**: `user@example.com` → `[email]`

### Personal Information

- **Phone numbers**: Replaced with `[phone]`
- **Discord IDs**: 18-20 digit IDs → `[id]`
- **IP addresses**: Replaced with `[ip]`
- **Crypto addresses**: ETH/BTC addresses → `[crypto-address]`
- **API keys/tokens**: Long hex strings → `[hex]`
- **@username mentions**: Generic `@user`

## Dataset Structure

```python
from datasets import load_from_disk

dataset = load_from_disk("hf_dataset/dataset_split")

# Access train/validation splits
train = dataset["train"]
val = dataset["validation"]

# Each example contains:
{
    "text": "The message or document content",
    "source": "Source identifier (e.g., 'discord:server:channel')",
    "type": "discord_message | story | transcript | bible_verse",
    "author": "Author name (sanitized)",
    "author_id": "Hashed user ID",
    "timestamp": "ISO timestamp (Discord only)",
    "timestamp_unix": "Unix timestamp as float",
    "message_id": "Truncated message ID",
    "channel_id": "[redacted]",
    "attachments_count": 0,
    "embeds_count": 0
}
```

## Statistics

- **Total examples**: ~203K
- **Train split**: 182,916 (90%)
- **Validation split**: 20,325 (10%)
- **Average text length**: ~50-200 characters

### By Source

| Source | Examples | Type |
|--------|----------|------|
| Discord Data Packages | ~178K | Chat |
| Raw Chat Exports | ~84 | Chat |
| Bible | ~25K | Religious text |
| Stories/Transcripts | ~103 | Creative writing |

## Usage

### Loading with Hugging Face Datasets

```python
from datasets import load_from_disk, load_dataset

# From local directory
dataset = load_from_disk("./hf_dataset/dataset_split")

# Or from Parquet
dataset = load_dataset("parquet", data_files="hf_dataset/dataset.parquet")
```

### For Language Model Training

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

def tokenize_function(examples):
    return tokenizer(examples["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize_function, batched=True)
```

## License

- Bible text: Public Domain (King James Version)
- Discord content: Personal data, used with permission, heavily anonymized
- Stories: Original content

## Citation

```bibtex
@dataset{discord_stories_dataset,
  title = {Discord and Stories Dataset},
  author = {Anonymous},
  year = {2024},
  url = {https://huggingface.co/datasets/USERNAME/dataset-name}
}
```

## Limitations

- Discord content may contain informal language, slang, and typos
- Some context may be lost due to sanitization
- Not all Discord messages are high-quality prose
- Dataset is biased toward the specific servers and channels included

## Contributing

To regenerate this dataset:

```bash
pip install -r requirements.txt
python convert_to_hf_dataset.py
```

## Contact

For questions about this dataset, please open an issue on the Hugging Face Hub.
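
The replacement rules listed under "Privacy & Sanitization" can be illustrated with a small regex pass. This is only a sketch covering a subset of the rules (mentions, emotes, links, emails, IPs, bare IDs); the actual logic in `convert_to_hf_dataset.py` is not shown in this card, so the patterns, the `RULES` list, and the `sanitize` function name are all assumptions made for illustration.

```python
import re

# Hypothetical rule table: (compiled pattern, replacement), applied in order.
# Order matters: Discord markup is rewritten before the bare-ID rule runs,
# otherwise the IDs inside mentions would be replaced first.
RULES = [
    (re.compile(r"<@&\d+>"), "@role"),                  # role mention
    (re.compile(r"<@\d+>"), "@user"),                   # user mention
    (re.compile(r"<#\d+>"), "#channel"),                # channel mention
    (re.compile(r"<:(\w+):\d+>"), r":\1:"),             # custom emote keeps its name
    (re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE), "[link]"),
    (re.compile(r"\b(?:discord\.gg|cdn\.discordapp\.com)/\S+", re.IGNORECASE), "[link]"),
    (re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"), "[email]"),
    (re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"), "[ip]"),   # loose IPv4 match (assumption)
    (re.compile(r"\b\d{18,20}\b"), "[id]"),             # bare 18-20 digit Discord ID
]


def sanitize(text: str) -> str:
    """Apply every rule in order and return the sanitized text."""
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text


if __name__ == "__main__":
    sample = "<@123456789012345678> posted https://tenor.com/x in <#123456789012345678>"
    print(sanitize(sample))
```

Note that the link patterns consume everything up to the next whitespace, so trailing punctuation attached to a URL is swallowed into `[link]`; a production pipeline would likely need more careful boundaries, plus the phone, crypto-address, and hex-token rules omitted here.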
Provider:
bloomsirenix