thwrhrt/MDFiles
Source: Hugging Face. Updated 2026-04-01; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/thwrhrt/MDFiles
Description:
---
license: bsd-2-clause
---
---
# Markdown Pretraining Dataset
A synthetic dataset of **2,400 prompt/completion pairs** designed to teach language models to produce clean, well-structured **Markdown output**.
---
## Dataset Summary
This dataset targets a specific and common failure mode in small language models: producing plain, unformatted prose when structured Markdown would be more appropriate. Every completion in this dataset is rich with Markdown syntax, making it suitable as a formatting signal for pretraining or fine-tuning.
Each entry follows the standard chat format with a `user` prompt and an `assistant` completion. Completions are dense with real Markdown — not just occasional bold words, but full documents with headings, tables, code blocks, lists, blockquotes, and horizontal rules used naturally and contextually.
---
## Format
The dataset is in **JSONL** format. Each line is a JSON object:
```json
{
  "messages": [
    { "role": "user", "content": "Write a Markdown note on binary search." },
    { "role": "assistant", "content": "# Binary Search\n\n## Definition\n\n..." }
  ]
}
```
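As a minimal sketch of how a record in this format might be read, the snippet below parses one JSONL line and checks its role sequence. The helper name `parse_record` is hypothetical, and the strict one-user-turn, one-assistant-turn check is an assumption based on the example above rather than a documented guarantee.

```python
import json


def parse_record(line: str) -> tuple[str, str]:
    """Return (prompt, completion) from one JSONL line in the messages format."""
    record = json.loads(line)
    messages = record["messages"]
    roles = [message["role"] for message in messages]
    # Assumption: each record holds exactly one user turn followed by one assistant turn.
    if roles != ["user", "assistant"]:
        raise ValueError(f"unexpected role sequence: {roles}")
    return messages[0]["content"], messages[1]["content"]


# The sample mirrors the record shown above.
sample = json.dumps({
    "messages": [
        {"role": "user", "content": "Write a Markdown note on binary search."},
        {"role": "assistant", "content": "# Binary Search\n\n## Definition\n\n..."},
    ]
})

prompt, completion = parse_record(sample)
print(prompt)
print(completion.splitlines()[0])
```

To process a whole file, the same function can be applied to each non-empty line of the JSONL file.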
Compatible with:
- **Unsloth** (`train_on_responses_only`)
- **Hugging Face TRL** (`SFTTrainer`)
- **LLaMA-Factory**
- Any trainer that accepts the `messages` chat format
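
Response-only training, as referenced by `train_on_responses_only` above, generally means the loss is computed on the assistant completion and not on the user prompt. The sketch below illustrates that general idea with plain lists; it is not Unsloth's implementation, and `mask_prompt_labels` and the token IDs are made up for illustration.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's CrossEntropyLoss


def mask_prompt_labels(token_ids: list[int], prompt_length: int) -> list[int]:
    """Copy token_ids into labels, hiding the first prompt_length positions from the loss."""
    if not 0 <= prompt_length <= len(token_ids):
        raise ValueError("prompt_length out of range")
    return [IGNORE_INDEX] * prompt_length + token_ids[prompt_length:]


# Made-up token IDs: the first three stand for the user turn, the rest for the reply.
token_ids = [101, 7, 42, 900, 901, 902]
labels = mask_prompt_labels(token_ids, prompt_length=3)
print(labels)  # [-100, -100, -100, 900, 901, 902]
```

For a formatting dataset like this one, masking the prompt keeps the training signal focused on the Markdown-rich completions.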
---
## Statistics
| Property | Value |
|---|---|
| Total examples | 2,400 |
| Format | JSONL (chat/messages) |
| Language | English |
| Avg completion length | ~400–800 tokens |
| License | BSD-2-Clause |
### Markdown Symbol Coverage
| Symbol | Total Occurrences |
|---|---|
| `#` Headings (H1–H4) | 31,782 |
| `**bold**` | 13,799 |
| ` ``` ` Fenced code blocks | 9,224 |
| `\|` Table pipes | 49,794 |
| `>` Blockquotes | 1,653 |
| `- [ ]` Task checklists | 7,061 |
| `---` Horizontal rules | 35,514 |
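
The card does not say how these counts were produced. One plausible way to gather similar statistics is a set of line-anchored regular expressions, sketched below. The patterns and the name `count_markdown` are assumptions: they count fence lines rather than whole code blocks, and they do not exclude `#` comments inside code, so totals may differ from the table.

```python
import re

FENCE = "`" * 3  # three backticks, built this way to keep the example readable

# Hypothetical patterns; adjust them to match the definitions you need.
PATTERNS = {
    "headings": re.compile(r"^#{1,4} ", re.MULTILINE),
    "bold": re.compile(r"\*\*[^*\n]+\*\*"),
    "code_fences": re.compile("^" + FENCE, re.MULTILINE),
    "table_pipes": re.compile(r"\|"),
    "blockquotes": re.compile(r"^> ", re.MULTILINE),
    "task_items": re.compile(r"^- \[[ xX]\] ", re.MULTILINE),
    "horizontal_rules": re.compile(r"^---[ \t]*$", re.MULTILINE),
}


def count_markdown(text: str) -> dict[str, int]:
    """Count occurrences of each Markdown construct in one completion."""
    return {name: len(pattern.findall(text)) for name, pattern in PATTERNS.items()}


sample = "\n".join([
    "# Title",
    "",
    "Some **bold** text.",
    "",
    "| a | b |",
    "|---|---|",
    "| 1 | 2 |",
    "",
    "> Note",
    "",
    "- [ ] todo",
    "",
    FENCE + "python",
    "x = 1",
    FENCE,
    "",
    "---",
])

print(count_markdown(sample))
```

Summing the per-completion dictionaries over all records would give dataset-level totals comparable in spirit to the table above.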
---
## Topic Coverage
The dataset spans **30+ technical topic areas** across 2,400 examples:
**Computer Science & Algorithms**
- Big-O notation, binary search, sorting algorithms, dynamic programming
- Graph theory, BFS/DFS, Dijkstra's algorithm
- Data structures: linked lists, hash tables, binary trees, stacks/queues
**Programming Languages**
- Python (decorators, generators, asyncio, type hints, dataclasses, itertools, gotchas)
- Rust (ownership, error handling)
- Go (goroutines, channels, error handling)
- JavaScript / TypeScript (promises, closures, event loop)
- C (pointers, memory allocation, structs)
- Bash scripting (loops, string ops, awk/sed)
**Security & Reverse Engineering**
- OWASP Top 10, XSS, SQLi, CSRF, SSRF, directory traversal
- Buffer overflows, ROP, format string vulnerabilities
- Malware analysis (static/dynamic, PE format, persistence)
- Cryptography: AES, RSA, ECC, TLS 1.3, Diffie-Hellman, ZKPs
- Kerberos, LDAP/AD, JWT, OAuth 2.0, password hashing
- Tools: Nmap, Wireshark, tcpdump, GDB, Ghidra
**Systems & OS**
- Linux boot process, FHS, file permissions, signals, syscalls
- Virtual memory, processes vs threads, mutexes, semaphores
- Windows internals: registry, handles, DLL injection
- x86-64 assembly, registers, call stack, NASM vs AT&T syntax
**Networking**
- TCP/IP, OSI model, DNS, subnetting, BGP, VPN
- HTTP methods, HTTPS, CORS, WebSockets, SSH, SMTP
- Firewalls, load balancing, iptables
**Databases**
- SQL: indexing, ACID, transactions, isolation levels
- NoSQL vs SQL, CAP theorem
- Query optimization
**Cloud & Infrastructure**
- Docker, Docker Compose, Kubernetes, Terraform, CI/CD
- Message queues, caching strategies, microservices, gRPC, REST API design
**Machine Learning**
- Supervised/unsupervised learning, gradient descent, overfitting
- Neural networks, transformers, embeddings, vector databases
**Hardware & Embedded**
- Logic gates, Boolean algebra, electronic components
- UART, SPI, I2C, single-board computer comparison
**Markdown Format Types Used**
- Full README documents
- API specification docs
- Changelogs
- Obsidian-style wiki notes with `[[cross-links]]`
- Map of Content (MOC) notes
- Study notes and cheat sheets
- Algorithm walkthroughs
---
## Intended Use
### Fine-tuning (recommended)
Train a model to default to Markdown formatting in its outputs:
```python
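from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Placeholder repo ID from the original card; replace it with the dataset's actual path.
dataset = load_dataset("your-username/markdown-pretraining", split="train")

# Any causal LM can be used; this model ID is an assumption chosen to match the Unsloth section.
model = "Qwen/Qwen2.5-1.5B-Instruct"

trainer = SFTTrainer(
    model=model,  # SFTTrainer accepts a model ID string or a loaded model
    train_dataset=dataset,
    args=SFTConfig(
        max_seq_length=2048,  # called `max_length` in newer TRL releases
        num_train_epochs=2,
        per_device_train_batch_size=4,
    ),
)
trainer.train()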
```
### With Unsloth
```python
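from unsloth import FastLanguageModel
from trl import SFTTrainer

# Load a 4-bit quantized base model together with its tokenizer.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen2.5-1.5B-Instruct",
    max_seq_length=2048,
    load_in_4bit=True,
)

# Attach LoRA adapters to the attention and MLP projection layers.
model = FastLanguageModel.get_peft_model(
    model,
    r=64,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=64,
)

# Pass `model` to SFTTrainer as in the fine-tuning example above.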
```
---
## What This Dataset Teaches
Models trained on this dataset learn to:
- Open responses with an appropriate **H1 heading**
- Use **H2/H3** to structure multi-part answers
- Wrap all code in **fenced code blocks** with language tags
- Use **tables** for comparisons, references, and structured data
- Apply **bold** to key terms and important concepts
- Use **blockquotes** for warnings, tips, and callouts
- Add **task checklists** for procedural content
- Include `[[wiki-links]]` in note-style outputs
- Use `---` to separate major sections
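
A few of these behaviors can be checked mechanically on model output. The sketch below tests only three of them (an opening H1, language tags on opening fences, and balanced fences). The function name `check_markdown_conventions` is hypothetical, and this is a rough heuristic rather than an evaluation method described by the card.

```python
FENCE = "`" * 3  # three backticks


def check_markdown_conventions(text: str) -> dict[str, bool]:
    """Report whether a response follows a few of the conventions listed above."""
    lines = text.strip().splitlines()
    opens_with_h1 = bool(lines) and lines[0].startswith("# ")

    inside_block = False
    fences_tagged = True
    for line in lines:
        if line.startswith(FENCE):
            # An opening fence with nothing after the backticks has no language tag.
            if not inside_block and line.strip() == FENCE:
                fences_tagged = False
            inside_block = not inside_block

    return {
        "opens_with_h1": opens_with_h1,
        "code_fences_tagged": fences_tagged,
        "fences_balanced": not inside_block,
    }


good = "# Binary Search\n\n" + FENCE + "python\nprint('hi')\n" + FENCE + "\n"
bad = "Binary search halves the range.\n\n" + FENCE + "\nx = 1\n" + FENCE

print(check_markdown_conventions(good))
print(check_markdown_conventions(bad))
```

Running a check like this on outputs before and after fine-tuning gives a quick, if coarse, signal of whether the formatting behavior was picked up.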
---
## Limitations
- Completions are English-only
- Not suitable as a sole training signal — best combined with a general instruction dataset
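
The card does not recommend a mixing ratio. As a rough sketch, assuming both datasets are already loaded as lists of records, the helper below keeps every general-purpose row and adds enough Markdown rows to reach a target share. The name `mix_datasets`, the `markdown_fraction` parameter, and its 20% default are all hypothetical.

```python
import random


def mix_datasets(markdown_rows, general_rows, markdown_fraction=0.2, seed=0):
    """Return a shuffled list in which about markdown_fraction of rows are Markdown examples."""
    if not 0 < markdown_fraction < 1:
        raise ValueError("markdown_fraction must be strictly between 0 and 1")
    rng = random.Random(seed)
    # Number of Markdown rows needed so they make up the requested share of the mix.
    wanted = round(len(general_rows) * markdown_fraction / (1 - markdown_fraction))
    n_markdown = min(wanted, len(markdown_rows))
    mixed = list(general_rows) + rng.sample(list(markdown_rows), n_markdown)
    rng.shuffle(mixed)
    return mixed


# Toy records standing in for real dataset rows.
markdown_rows = [{"source": "markdown", "id": i} for i in range(100)]
general_rows = [{"source": "general", "id": i} for i in range(80)]

mixed = mix_datasets(markdown_rows, general_rows, markdown_fraction=0.2)
share = sum(row["source"] == "markdown" for row in mixed) / len(mixed)
print(len(mixed), share)
```

With the Hugging Face `datasets` library, `interleave_datasets` with a `probabilities` argument offers a comparable way to blend sources.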
---
Provider: thwrhrt



