
arena-wikipedia-7-15-24

ModelScope community | Updated: 2025-12-05 | Indexed: 2024-09-07
Download link:
https://modelscope.cn/datasets/MTEB/arena-wikipedia-7-15-24
Description:
# Dataset used for mteb/Arena Wikipedia

## Overview

The `mteb/arena-wikipedia-7-15-24` dataset is a comprehensive collection of Wikipedia articles up to July 15, 2024. It is designed for use in the MTEB (Massive Text Embedding Benchmark) Arena, where various embedding models compete and are ranked based on their performance.

## What is Wikipedia?

Wikipedia is a free online encyclopedia created and maintained as an open collaboration project by a community of volunteer editors. It is the largest and most popular general reference work on the World Wide Web. Wikipedia contains articles on a vast array of topics, making it an invaluable resource for general knowledge and research.

## Dataset Structure

Each instance in the dataset represents a chunk of a Wikipedia article and contains the following fields:

1. **title** (string): The title of the Wikipedia article.
2. **id** (string): A unique identifier for the chunk, typically in the format "XXXXXXXX-Y" where XXXXXXXX is a number and Y is the chunk number.
3. **text** (string): The content of the article chunk, including headings and paragraphs. Note that tables are currently not present.

## Dataset Creation Process

1. The dataset is created from a Wikipedia dump in CirrusSearch format.
2. The content is parsed using the `mwparserfromhell` library to extract clean text.
3. Articles are chunked into segments of approximately 200 words, with flexibility to expand up to 400 words to preserve paragraph boundaries and keep headings intact.
4. Only the top 500,000 most popular articles are included, based on a popularity score derived from page view data in the CirrusSearch file.

## Example Instance

Here's an example of what a single instance in the dataset might look like:

```json
{
  "title": "Albert Einstein",
  "id": "10000123-0",
  "text": "Albert Einstein was a German-born theoretical physicist who is widely held to be one of the most influential and best-known scientists ..."
}
```

## Ethical Considerations

When using this dataset, please be aware of potential biases in Wikipedia content, including:

1. Cultural and linguistic biases, as Wikipedia's coverage may vary across different languages and cultures.
2. Temporal biases, as the dataset represents Wikipedia at a specific point in time (July 15, 2024).
3. Popularity biases, as only the top 500,000 articles are included based on page views.

Users should also be mindful of Wikipedia's own policies regarding neutral point of view and verifiability.

## Updates and Maintenance

This dataset represents Wikipedia articles up to July 15, 2024. For instructions on how to create this dataset again with newer data, please refer to the [create_index_chunks.py script](https://github.com/embeddings-benchmark/arena/blob/main/retrieval/create_index_chunks.py#L107) in the embeddings-benchmark/arena repository.

## License

The dataset is subject to Wikipedia's license terms. As of the dataset creation date, Wikipedia content is generally available under the Creative Commons Attribution-ShareAlike License (CC-BY-SA). Users of this dataset should comply with the terms of this license.
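
The chunking rule from the Dataset Creation Process (about 200 words per chunk, allowed to grow up to 400 words so that paragraphs are not split) can be illustrated with a minimal sketch. This is not the logic of `create_index_chunks.py`; the function name `chunk_paragraphs`, its parameters, and the greedy strategy are assumptions made purely for illustration, and heading handling is left out.

```python
def chunk_paragraphs(paragraphs, target_words=200, max_words=400):
    """Greedily group whole paragraphs into chunks (illustrative sketch only).

    Assumed rule: keep adding paragraphs until the chunk holds at least
    `target_words`; start a new chunk early if the next paragraph would
    push the chunk past `max_words`. Paragraphs are never split, so a
    single paragraph longer than `max_words` becomes its own oversized chunk.
    """
    chunks, current, count = [], [], 0
    for para in paragraphs:
        n = len(para.split())
        # Close the running chunk once it is "full enough" or would overflow.
        if current and (count >= target_words or count + n > max_words):
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(para)
        count += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks


# Synthetic paragraphs of 150, 100, 300 and 50 words.
paras = [" ".join(["w"] * n) for n in (150, 100, 300, 50)]
sizes = [len(c.split()) for c in chunk_paragraphs(paras)]
print(sizes)
```

With these synthetic inputs the first two paragraphs are merged because together they stay under the 400-word cap, while the 300-word paragraph starts a chunk of its own.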
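
Because each `id` typically follows the "XXXXXXXX-Y" pattern described under Dataset Structure, chunks can be grouped back into their source articles. The sketch below assumes that pattern holds and raises an error otherwise; the helper names `split_chunk_id` and `regroup` and the sample records are hypothetical and not part of the dataset tooling.

```python
from collections import defaultdict


def split_chunk_id(chunk_id):
    """Split an id such as "10000123-0" into (article_id, chunk_number)."""
    article_id, sep, chunk_no = chunk_id.rpartition("-")
    if not sep or not article_id or not chunk_no.isdigit():
        # The format is only described as "typical", so fail loudly on surprises.
        raise ValueError(f"unexpected chunk id: {chunk_id!r}")
    return article_id, int(chunk_no)


def regroup(records):
    """Join chunk texts per article, ordered by chunk number."""
    by_article = defaultdict(list)
    for rec in records:
        article_id, chunk_no = split_chunk_id(rec["id"])
        by_article[article_id].append((chunk_no, rec["text"]))
    return {
        article_id: "\n\n".join(text for _, text in sorted(parts, key=lambda p: p[0]))
        for article_id, parts in by_article.items()
    }


# Made-up records that follow the documented field layout.
records = [
    {"title": "Example", "id": "10000123-1", "text": "second chunk"},
    {"title": "Example", "id": "10000123-0", "text": "first chunk"},
]
print(regroup(records))
```

Splitting on the last hyphen with `rpartition` keeps the sketch tolerant of article identifiers that might themselves contain a hyphen.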

Provider:
maas
Created:
2024-09-06