mteb/arena-stackexchange
Source: Hugging Face | Updated: 2024-07-29 | Indexed: 2024-12-14
Download link:
https://hf-mirror.com/datasets/mteb/arena-stackexchange
Description:
The `mteb/arena-stackexchange` dataset is a curated collection of Stack Exchange questions and answers, built for the MTEB (Massive Text Embedding Benchmark) Arena, where embedding models compete and are ranked on Stack Exchange content. Each instance is one question-answer pair and contains a unique identifier, the processed content (the question together with its top-scoring answer), the original unprocessed content, the subdomain (the specific Stack Exchange site), and metadata about the post (language, length, provenance, and question score).

The dataset was created from the Stack Exchange data dump on the Internet Archive. Only the 25 largest Stack Exchange sites are included, and non-English sites are excluded. HTML tags are removed, questions and answers are grouped into pairs, only questions with a score of 3 or higher are kept, and only the top-scoring answer is retained for each question. The subdomain (Stack Exchange site name) is prepended to each document, and questions and answers longer than 200 words or 2000 characters are excluded.

Users of the dataset should be aware of potential biases, including selection bias, domain bias, temporal bias, and biases present in the original Stack Exchange communities themselves.
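
The filtering steps in the description can be illustrated with a minimal Python sketch. This is not the maintainers' actual pipeline: the input field names (`id`, `parent_id`, `score`, `body`), the output keys, the regex-based tag stripping, and the exact layout of the prepended subdomain are all assumptions made for illustration. Site selection (the 25 largest, English-only sites) is assumed to happen before this function is called.

```python
import re
from html import unescape

# Thresholds taken from the dataset description.
MIN_QUESTION_SCORE = 3
MAX_WORDS = 200
MAX_CHARS = 2000

TAG_RE = re.compile(r"<[^>]+>")


def strip_html(text):
    """Crude tag removal (assumption: the real HTML handling is not specified)."""
    no_tags = TAG_RE.sub(" ", text)
    # Decode entities such as &amp; and collapse runs of whitespace.
    return " ".join(unescape(no_tags).split())


def too_long(text):
    """True if the text exceeds either length limit from the description."""
    return len(text.split()) > MAX_WORDS or len(text) > MAX_CHARS


def build_pairs(questions, answers, subdomain):
    """Pair each kept question with its top-scoring answer.

    `questions` and `answers` are lists of dicts with hypothetical keys:
    questions have id/score/body, answers have parent_id/score/body.
    """
    # Keep only the highest-scoring answer per question.
    best = {}
    for answer in answers:
        current = best.get(answer["parent_id"])
        if current is None or answer["score"] > current["score"]:
            best[answer["parent_id"]] = answer

    pairs = []
    for question in questions:
        if question["score"] < MIN_QUESTION_SCORE:
            continue  # drop low-scoring questions
        answer = best.get(question["id"])
        if answer is None:
            continue  # unanswered questions cannot form a pair
        q_text = strip_html(question["body"])
        a_text = strip_html(answer["body"])
        if too_long(q_text) or too_long(a_text):
            continue  # drop pairs where either side is over the limits
        pairs.append({
            "id": f"{subdomain}-{question['id']}",
            # Subdomain is prepended to the document, as described.
            "text": f"{subdomain}\n\n{q_text}\n\n{a_text}",
            "subdomain": subdomain,
            "question_score": question["score"],
        })
    return pairs
```

Calling `build_pairs(questions, answers, "vi.stackexchange.com")` on posts parsed from one site's dump would return the filtered question-answer documents for that site. A production pipeline would likely use a proper HTML parser rather than a regular expression.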
Provider:
mteb



