stackexchange-markdown

Name: stackexchange-markdown
Creator: maas
Published: 2026-01-06 16:51:02
License: No description available

ModelScope community. Updated: 2026-01-06. Indexed: 2025-12-06.

Download link:

https://modelscope.cn/datasets/marin-community/stackexchange-markdown

Description:

# Marin Markdownified StackExchange

Markdownified Stack Exchange converts Stack Exchange question-answer threads into Markdown, totaling **20.4B tokens**. The dataset preserves the content of technical discussions while organizing it into a thread format for language model training.

|                     | Value |
|---------------------|-------|
| Tokens              | 20 413 785 853 |
| Primary source      | https://archive.org/details/stackexchange |
| File format         | JSONL |
| License             | CC (mirrors upstream SE licenses) |

## Processing and Cleaning Pipeline

Our conversion pipeline combines several techniques to transform raw Stack Exchange HTML into high-quality Markdown:

1. **HTML Preprocessing:** We start with the Stack Exchange dump, which provides XML representations of posts.
2. **Structured Q&A Format:** Each thread is formatted with clear section headings:
   - "# Question" with title and body
   - "# Answer" for each response
   - Vote counts are preserved next to each answer
   - Tags are appended at the bottom with a separator
3. **Template Variations:**
   - Answer templates are randomly varied
   - Vote counts may appear either before or after answer content
   - This randomization is seeded deterministically based on the question ID
4. **DOM Simplification:** We employ a [custom-enhanced version of Resiliparse](https://github.com/stanford-crfm/chatnoir-resiliparse) that preserves semantic HTML structure. Rather than flattening to plain text, we retain important elements such as headings, paragraphs, and lists while removing scripts, tracking code, and boilerplate.
5. **Markdown Conversion:** Our [custom Markdownify](https://github.com/marin-community/marin/blob/main/marin/markdown/markdown.py#L145-L650) implementation transforms the simplified DOM into clean Markdown.

The final output stores each thread as a JSON object containing the Markdown text and essential metadata.

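The structured format and the deterministic template variation in steps 2 and 3 can be sketched roughly as follows. This is a minimal illustration, not the Marin implementation: the function name `render_thread`, its arguments, the exact template strings, and the 50/50 choice between layouts are all assumptions made for the example.

```python
import random


def render_thread(question_id, title, body, answers, tags):
    """Render one Q&A thread as Markdown (hypothetical helper, not Marin's code)."""
    # Seeding a private RNG with the question ID makes the layout choice
    # reproducible: the same thread always renders the same way.
    rng = random.Random(question_id)
    votes_first = rng.random() < 0.5  # assumed 50/50 split between layouts

    parts = [f"# Question\nTitle: {title}\n{body}"]
    for text, votes in answers:
        vote_line = f"> {votes} votes"
        if votes_first:
            parts.append(f"# Answer\n{vote_line}\n{text}")
        else:
            parts.append(f"# Answer\n{text}\n{vote_line}")
    # Tags go at the bottom, set off by separators.
    parts.append(f"---\nTags: {', '.join(tags)}\n---")
    return "\n".join(parts)
```

Calling `render_thread` twice with the same `question_id` yields identical output, which is the property the seeded randomization is meant to provide.
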
## Template Structure

Each entry in the dataset contains a complete question-answer thread with:

- Original question title
- Question body in full Markdown
- Multiple answers (when available) with vote counts
- Original tags
- Creation date
- URL reference

Example template:

```
# Question
Title: What is the h-index exactly and how does it work?
What is the h-index, and how does it work ?

# Answer
The h-index is a measure of the impact of someone's publication list. An h-index of 10 for example means that the person has published 10 papers with at least 10 citations. The total number of papers published may be higher, but only 10 will have 10 or more citations.

Critics argue that this measure disadvantages young researchers who did not have time to publish a lot and whose work has not been published for long and thus may not have attracted many citations. Other criticisms include that it makes a researcher focus on how to increase the citation count for a paper that may be not that good but would increase the h-index.

For more explanation, see for example the Wikipedia article.
> 35 votes
---
Tags: bibliometrics, methodology, ranking
---
```

## Usage Example

```python
from datasets import load_dataset

# Stream the dataset so the full corpus is not downloaded up front.
ds = load_dataset(
    "marin-community/stackexchange-markdown",
    split="train",
    streaming=True,
)

# Print the Markdown text of the first three threads.
for article in ds.take(3):
    print(article["text"])
```

## Citation

If you use this dataset in your research, please cite both the original Stack Exchange contributors and our work:

```
@misc{markdownified_ar5iv_2024,
  title = {Markdownified StackExchange},
  author = {The Marin Community},
  year = {2024},
  url = {https://huggingface.co/datasets/marin-community/stackexchange-markdown}
}
```

## License

All content inherits Stack Exchange's licensing: CC. Our conversion tools and pipeline are released under Apache 2.0.

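Returning to the template structure shown earlier, a thread's text can be pulled apart again with a few regular expressions. The sketch below is illustrative only: `parse_thread` is a hypothetical helper, and it assumes that the title line, the `# Answer` headings, the `> N votes` lines, and the `Tags:` line each sit on their own line, as in the example template. Real entries may deviate from this layout.

```python
import re


def parse_thread(text):
    """Extract title, answer count, votes, and tags from one thread (illustrative)."""
    title = re.search(r"^Title:\s*(.+)$", text, flags=re.MULTILINE)
    tags = re.search(r"^Tags:\s*(.+)$", text, flags=re.MULTILINE)
    # Vote lines may sit before or after the answer body, so scan the whole text.
    votes = re.findall(r"^>\s*(-?\d+) votes?$", text, flags=re.MULTILINE)
    return {
        "title": title.group(1).strip() if title else None,
        "num_answers": len(re.findall(r"^# Answer\s*$", text, flags=re.MULTILINE)),
        "votes": [int(v) for v in votes],
        "tags": [t.strip() for t in tags.group(1).split(",")] if tags else [],
    }


sample = """# Question
Title: What is the h-index exactly and how does it work?
What is the h-index, and how does it work ?
# Answer
The h-index is a measure of the impact of someone's publication list.
> 35 votes
---
Tags: bibliometrics, methodology, ranking
---
"""

info = parse_thread(sample)
print(info["votes"], info["tags"])
```

Because the vote line can precede or follow the answer body, the sketch collects votes in document order without trying to attach each one to a specific answer.
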
## Acknowledgement

We extend our gratitude to:

- The Stack Exchange network and its many contributors
- Janek Bevendorff for the [Resiliparse project](https://github.com/chatnoir-eu/chatnoir-resiliparse)
- Matthew Dapena-Tretter for [Markdownify](https://github.com/matthewwithanm/python-markdownify)


Provider:

maas

Created:

2025-10-30
