
thwrhrt/MDFiles

Source: Hugging Face · Updated 2026-04-01 · Indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/thwrhrt/MDFiles
Description:
---
license: bsd-2-clause
---

---

# Markdown Pretraining Dataset

A synthetic dataset of **2,400 prompt/completion pairs** designed to teach language models to produce clean, well-structured **Markdown output**.

---

## Dataset Summary

This dataset targets a specific and common failure mode in small language models: producing plain, unformatted prose when structured Markdown would be more appropriate. Every completion in this dataset is rich with Markdown syntax, making it suitable as a formatting signal for pretraining or fine-tuning.

Each entry follows the standard chat format with a `user` prompt and an `assistant` completion. Completions are dense with real Markdown: not just occasional bold words, but full documents with headings, tables, code blocks, lists, blockquotes, and horizontal rules used naturally and contextually.

---

## Format

The dataset is in **JSONL** format. Each line is a JSON object (pretty-printed here for readability):

```json
{
  "messages": [
    {
      "role": "user",
      "content": "Write a Markdown note on binary search."
    },
    {
      "role": "assistant",
      "content": "# Binary Search\n\n## Definition\n\n..."
    }
  ]
}
```

Compatible with:

- **Unsloth** (`train_on_responses_only`)
- **HuggingFace TRL** (`SFTTrainer`)
- **LLaMA-Factory**
- Any trainer that accepts the `messages` chat format

---

## Statistics

| Property | Value |
|---|---|
| Total examples | 2,400 |
| Format | JSONL (chat/messages) |
| Language | English |
| Avg completion length | ~400–800 tokens |
| License | BSD-2-Clause |

### Markdown Symbol Coverage

| Symbol | Total Occurrences |
|---|---|
| `#` Headings (H1–H4) | 31,782 |
| `**bold**` | 13,799 |
| ` ``` ` Fenced code blocks | 9,224 |
| `\|` Table pipes | 49,794 |
| `>` Blockquotes | 1,653 |
| `- [ ]` Task checklists | 7,061 |
| `---` Horizontal rules | 35,514 |

---

## Topic Coverage

The dataset spans **30+ technical topic areas** across 2,400 examples:

**Computer Science & Algorithms**

- Big-O notation, binary search, sorting algorithms, dynamic programming
- Graph theory, BFS/DFS, Dijkstra's algorithm
- Data structures: linked lists, hash tables, binary trees, stacks/queues

**Programming Languages**

- Python (decorators, generators, asyncio, type hints, dataclasses, itertools, gotchas)
- Rust (ownership, error handling)
- Go (goroutines, channels, error handling)
- JavaScript / TypeScript (promises, closures, event loop)
- C (pointers, memory allocation, structs)
- Bash scripting (loops, string ops, awk/sed)

**Security & Reverse Engineering**

- OWASP Top 10, XSS, SQLi, CSRF, SSRF, directory traversal
- Buffer overflows, ROP, format string vulnerabilities
- Malware analysis (static/dynamic, PE format, persistence)
- Cryptography: AES, RSA, ECC, TLS 1.3, Diffie-Hellman, ZKPs
- Kerberos, LDAP/AD, JWT, OAuth 2.0, password hashing
- Tools: Nmap, Wireshark, tcpdump, GDB, Ghidra

**Systems & OS**

- Linux boot process, FHS, file permissions, signals, syscalls
- Virtual memory, processes vs threads, mutexes, semaphores
- Windows internals: registry, handles, DLL injection
- x86-64 assembly, registers, call stack, NASM vs AT&T syntax

**Networking**

- TCP/IP, OSI model, DNS, subnetting, BGP, VPN
- HTTP methods, HTTPS, CORS, WebSockets, SSH, SMTP
- Firewalls, load balancing, iptables

**Databases**

- SQL: indexing, ACID, transactions, isolation levels
- NoSQL vs SQL, CAP theorem
- Query optimization

**Cloud & Infrastructure**

- Docker, Docker Compose, Kubernetes, Terraform, CI/CD
- Message queues, caching strategies, microservices, gRPC, REST API design

**Machine Learning**

- Supervised/unsupervised learning, gradient descent, overfitting
- Neural networks, transformers, embeddings, vector databases

**Hardware & Embedded**

- Logic gates, Boolean algebra, electronic components
- UART, SPI, I2C, single-board computer comparison

**Markdown Format Types Used**

- Full README documents
- API specification docs
- Changelogs
- Obsidian-style wiki notes with `[[cross-links]]`
- Map of Content (MOC) notes
- Study notes and cheat sheets
- Algorithm walkthroughs

---

## Intended Use

### Fine-tuning (recommended)

Train a model to default to Markdown formatting in its outputs:

```python
from trl import SFTTrainer, SFTConfig
from datasets import load_dataset

# Placeholder repo ID from the original card; substitute the actual dataset path.
dataset = load_dataset("your-username/markdown-pretraining", split="train")

# `model` is assumed to be defined already: either a loaded model
# or a Hub model ID string, both of which SFTTrainer accepts.
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    args=SFTConfig(
        # Some newer TRL releases name this `max_length`; check your installed version.
        max_seq_length=2048,
        num_train_epochs=2,
        per_device_train_batch_size=4,
    ),
)
trainer.train()
```

### With Unsloth

```python
from unsloth import FastLanguageModel
from trl import SFTTrainer

# Load a 4-bit quantized base model together with its tokenizer.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen2.5-1.5B-Instruct",
    max_seq_length=2048,
    load_in_4bit=True,
)

# Attach LoRA adapters to the attention and MLP projection layers.
model = FastLanguageModel.get_peft_model(
    model,
    r=64,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=64,
)
```

---

## What This Dataset Teaches

Models trained on this dataset learn to:

- Open responses with an appropriate **H1 heading**
- Use **H2/H3** to structure multi-part answers
- Wrap all code in **fenced code blocks** with language tags
- Use **tables** for comparisons, references, and structured data
- Apply **bold** to key terms and important concepts
- Use **blockquotes** for warnings, tips, and callouts
- Add **task checklists** for procedural content
- Include `[[wiki-links]]` in note-style outputs
- Use `---` to separate major sections

---

## Limitations

- Completions are English-only
- Not suitable as a sole training signal; best combined with a general instruction dataset

---
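
As a rough illustration of the JSONL chat format and the kind of Markdown symbol counting summarized in this card, the sketch below parses one record and tallies a few features in its completion. The helper names `validate_record` and `count_markdown_features`, the sample record, and the regular expressions are all assumptions made for illustration; the card does not say how its coverage figures were computed, and these patterns are approximate (for example, a `# comment` line inside a code block would be counted as a heading).

```python
import json
import re


def validate_record(line: str) -> dict:
    """Parse one JSONL line and check it is a single user/assistant exchange.

    Hypothetical helper: assumes exactly one user turn followed by one
    assistant turn, as in the example shown in the card.
    """
    record = json.loads(line)
    messages = record["messages"]
    roles = [message["role"] for message in messages]
    if roles != ["user", "assistant"]:
        raise ValueError(f"unexpected roles: {roles}")
    if not all(isinstance(message["content"], str) for message in messages):
        raise ValueError("message content must be a string")
    return record


def count_markdown_features(text: str) -> dict:
    """Approximate per-completion counts of a few Markdown constructs."""
    multiline = re.MULTILINE
    return {
        "headings": len(re.findall(r"^#{1,4} ", text, flags=multiline)),
        "bold": len(re.findall(r"\*\*[^*\n]+\*\*", text)),
        "fences": len(re.findall(r"^```", text, flags=multiline)),
        "blockquotes": len(re.findall(r"^> ", text, flags=multiline)),
        "task_items": len(re.findall(r"^- \[[ xX]\] ", text, flags=multiline)),
        "rules": len(re.findall(r"^---$", text, flags=multiline)),
    }


# Made-up sample record in the same shape as the card's example.
sample_line = json.dumps({
    "messages": [
        {"role": "user", "content": "Write a Markdown note on binary search."},
        {
            "role": "assistant",
            "content": (
                "# Binary Search\n\n## Definition\n\n"
                "**Binary search** halves a sorted range.\n\n"
                "---\n\n- [ ] Implement it\n"
            ),
        },
    ]
})

record = validate_record(sample_line)
counts = count_markdown_features(record["messages"][1]["content"])
print(counts)
```

To scan a whole file, the same two functions could be applied to each non-empty line of the JSONL file and the per-record counts summed.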
Provider:
thwrhrt