arena-stackexchange
ModelScope community · Updated 2025-11-12 · Indexed 2024-09-07
Download link:
https://modelscope.cn/datasets/MTEB/arena-stackexchange
# Dataset used for Stackexchange in MTEB/arena
## Overview
The `mteb/arena-stackexchange` dataset is a curated collection of Stack Exchange questions and answers, designed for use in the MTEB (Massive Text Embedding Benchmark) Arena. This dataset allows various embedding models to compete and be ranked based on their performance on Stack Exchange content.
## What is Stack Exchange?
Stack Exchange is a network of question-and-answer (Q&A) websites on topics in diverse fields, each site covering a specific topic, where questions, answers, and users are subject to a reputation award process. The most well-known of these sites is Stack Overflow, which focuses on computer programming questions.
## Dataset Structure
Each instance in the dataset represents a question-answer pair from Stack Exchange and contains the following fields:
1. **id** (string): A unique identifier for the question-answer pair.
2. **text** (string): The processed content, including the question and the top-scoring answer.
3. **original_text** (string): The original, unprocessed content of the question.
4. **subdomain** (string): The specific Stack Exchange site the question came from (e.g., "apple" for Apple Stack Exchange).
5. **metadata** (dict): Additional information about the post, including language, length, provenance, and question score.
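The field list above can be written down as a typed record with a small shape check. This is a minimal sketch rather than part of the dataset tooling: the names `ArenaRecord`, `FIELD_TYPES`, and `validate_record` are hypothetical, and only the five field names and their types come from the list above.

```python
from typing import Any, TypedDict


class ArenaRecord(TypedDict):
    """Hypothetical type name; the field names come from the dataset card."""

    id: str
    text: str
    original_text: str
    subdomain: str
    metadata: dict[str, Any]


# Expected Python type for each top-level field.
FIELD_TYPES = {
    "id": str,
    "text": str,
    "original_text": str,
    "subdomain": str,
    "metadata": dict,
}


def validate_record(record: dict[str, Any]) -> list[str]:
    """Return a list of problems; an empty list means the shape looks right."""
    problems = []
    for name, expected in FIELD_TYPES.items():
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], expected):
            problems.append(f"{name} should be {expected.__name__}")
    return problems
```

Running `validate_record` over a few rows after loading is a cheap way to catch a schema mismatch before embedding anything.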
## Dataset Creation Process
1. The dataset is derived from the Stack Exchange data dump available on the Internet Archive.
2. Only posts from the 25 largest Stack Exchange sites are included.
3. HTML tags are removed from the content.
4. Questions and answers are grouped into pairs.
5. Only questions with a score of 3 or higher are retained.
6. Only the top-scoring answer for each question is included.
7. Non-English Stack Exchange sites are excluded.
8. The subdomain (Stack Exchange site name) is added to the beginning of each document.
9. Questions and answers longer than 200 words or 2,000 characters are excluded.
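The per-question steps above (3, 5, 6, 8, and 9) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the code in `create_index_chunks.py`: the input keys `body` and `score`, the `Q: ... A: ...` joining format, the regex-based tag stripping, and applying the length limit to the combined pair are all assumptions. The `"<Site> Stackexchange "` prefix mirrors the example instance in this card.

```python
import re
from html import unescape

MIN_QUESTION_SCORE = 3  # step 5
MAX_WORDS = 200         # step 9
MAX_CHARS = 2000        # step 9

TAG_RE = re.compile(r"<[^>]+>")


def strip_html(html: str) -> str:
    """Step 3: drop tags with a simple regex, decode entities, collapse whitespace."""
    text = unescape(TAG_RE.sub(" ", html))
    return " ".join(text.split())


def build_document(subdomain, question, answers):
    """Return the processed text for one question, or None if it is filtered out.

    `question` and each item in `answers` are dicts with the hypothetical
    keys "body" (HTML string) and "score" (int).
    """
    if question["score"] < MIN_QUESTION_SCORE or not answers:
        return None
    top_answer = max(answers, key=lambda a: a["score"])  # step 6
    pair = f"Q: {strip_html(question['body'])} A: {strip_html(top_answer['body'])}"
    # Step 9, read here as a limit on the combined pair (an assumption).
    if len(pair.split()) > MAX_WORDS or len(pair) > MAX_CHARS:
        return None
    # Step 8: prepend the site name, as in "Apple Stackexchange Q: ...".
    return f"{subdomain.capitalize()} Stackexchange {pair}"
```

A regex is enough for a sketch, but a real pipeline would more likely use an HTML parser so that code blocks and nested markup are handled predictably.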
## Example Instance
Here's an example of what a single instance in the dataset might look like:
```json
{
  "id": "69fa4eabe8a1513845e0d82f945947dedba685d0",
  "text": "Apple Stackexchange Q: Why doesn't Microsoft Office/2008(& later) support RTL languages? I have Microsoft Office/2008 on my...",
  "original_text": "Q: Why doesn't Microsoft Office/2008(& later) support RTL languages? I have Microsoft Office/2008 on my...",
  "subdomain": "apple",
  "metadata": {
    "language": "en",
    "length": 304,
    "provenance": "stackexchange_00000.jsonl.gz:3",
    "question_score": 5
  }
}
```
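The example instance can be read with the standard `json` module. The sketch below parses it, checks that `text` is `original_text` with the site prefix prepended, and splits the `provenance` string. The helper `split_provenance` is hypothetical, and treating the part after the colon as a record index within the shard file is an assumption based only on the shape of the example value.

```python
import json

# The example instance, with the long strings left truncated as shown above.
RAW = """
{
  "id": "69fa4eabe8a1513845e0d82f945947dedba685d0",
  "text": "Apple Stackexchange Q: Why doesn't Microsoft Office/2008(& later) support RTL languages? I have Microsoft Office/2008 on my...",
  "original_text": "Q: Why doesn't Microsoft Office/2008(& later) support RTL languages? I have Microsoft Office/2008 on my...",
  "subdomain": "apple",
  "metadata": {
    "language": "en",
    "length": 304,
    "provenance": "stackexchange_00000.jsonl.gz:3",
    "question_score": 5
  }
}
"""


def split_provenance(provenance: str) -> tuple[str, int]:
    """Split "file:index" on the last colon.

    Reading the suffix as a record index inside the shard is an assumption.
    """
    shard, _, index = provenance.rpartition(":")
    return shard, int(index)


record = json.loads(RAW)
shard, index = split_provenance(record["metadata"]["provenance"])

# In this example the processed text is the original text plus a site prefix.
prefix = f"{record['subdomain'].capitalize()} Stackexchange "
has_site_prefix = record["text"] == prefix + record["original_text"]
```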
## Ethical Considerations
When using this dataset, please be aware of potential biases, including:
1. Selection bias due to the inclusion criteria (score ≥ 3, English-only).
2. Domain bias, as only the 25 largest Stack Exchange sites are represented.
3. Temporal bias, as the dataset represents Stack Exchange content up to a specific date, as released by RedPajama/Dolma.
4. Possible biases in the original Stack Exchange communities themselves.
## Updates and Maintenance
This dataset is based on a specific snapshot of Stack Exchange data. For instructions on recreating it with newer data, refer to the [create_index_chunks.py script](https://github.com/embeddings-benchmark/arena/blob/main/retrieval/create_index_chunks.py#L107) in the embeddings-benchmark/arena repository.
## License and Citation
The dataset is subject to Stack Exchange's licensing terms. Users should comply with these terms when using the dataset.
This dataset is derived from the RedPajama dataset. To cite RedPajama, please use:
```bibtex
@software{together2023redpajama,
author = {Together Computer},
title = {RedPajama: An Open Source Recipe to Reproduce LLaMA training dataset},
month = April,
year = 2023,
url = {https://github.com/togethercomputer/RedPajama-Data}
}
```
This dataset was also included in Dolma. To cite Dolma, please use:
```bibtex
@article{dolma,
title = {{Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research}},
author={
Luca Soldaini and Rodney Kinney and Akshita Bhagia and Dustin Schwenk and David Atkinson and
Russell Authur and Ben Bogin and Khyathi Chandu and Jennifer Dumas and Yanai Elazar and
Valentin Hofmann and Ananya Harsh Jha and Sachin Kumar and Li Lucy and Xinxi Lyu and
Nathan Lambert and Ian Magnusson and Jacob Morrison and Niklas Muennighoff and Aakanksha Naik and
Crystal Nam and Matthew E. Peters and Abhilasha Ravichander and Kyle Richardson and Zejiang Shen and
Emma Strubell and Nishant Subramani and Oyvind Tafjord and Pete Walsh and Luke Zettlemoyer and
Noah A. Smith and Hannaneh Hajishirzi and Iz Beltagy and Dirk Groeneveld and Jesse Dodge and Kyle Lo
},
year = {2024},
journal={arXiv preprint},
}
```
Provider: maas
Created: 2024-09-06



