wikipedia_en

Name: wikipedia_en
Creator: maas
Published: 2025-12-05 12:06:02
License: No description available

ModelScope community; updated 2025-12-05; indexed 2025-05-24

Download link:

https://modelscope.cn/datasets/Intelligent-Internet/wikipedia_en

Resource description:

# `wikipedia_en`

This is a curated English Wikipedia dataset for use with the [II-Commons](https://github.com/Intelligent-Internet/II-Commons) project.

## Dataset Details

### Dataset Description

This dataset comprises curated English Wikipedia pages, sourced directly from the official English Wikipedia database dump. We extract the pages, chunk them into smaller pieces, and embed them using [Snowflake/snowflake-arctic-embed-m-v2.0](https://huggingface.co/Snowflake/snowflake-arctic-embed-m-v2.0). All vector embeddings are 16-bit half-precision vectors optimized for `cosine` indexing with [vectorchord](https://github.com/tensorchord/vectorchord).

### Dataset Sources

Based on the [Wikipedia dumps](https://dumps.wikimedia.org/). Please check this page for the [LICENSE](https://dumps.wikimedia.org/legal.html) of the page data.

## Dataset Structure

1. Metadata Table
   - `id`: A unique identifier for the page.
   - `revid`: The revision ID of the page.
   - `url`: The URL of the page.
   - `title`: The title of the page.
   - `ignored`: Whether the page is ignored.
   - `created_at`: The creation time of the page.
   - `updated_at`: The update time of the page.
2. Chunking Table
   - `id`: A unique identifier for the chunk.
   - `title`: The title of the page.
   - `url`: The URL of the page.
   - `source_id`: The source ID of the page.
   - `chunk_index`: The index of the chunk.
   - `chunk_text`: The text of the chunk.
   - `vector`: The vector embedding of the chunk.
   - `created_at`: The creation time of the chunk.
   - `updated_at`: The update time of the chunk.

## Prerequisite

PostgreSQL 17 with extensions: [vectorchord](https://github.com/tensorchord/VectorChord) and [pg_search](https://github.com/paradedb/paradedb/tree/dev/pg_search).

The easiest way is to use our [Docker image](https://github.com/Intelligent-Internet/II-Commons/tree/main/examples/db), or build your own. Then load the [psql_basebackup](https://huggingface.co/datasets/Intelligent-Internet/wikipedia_en/tree/psql_basebackup) branch, following the [Quick Start](https://github.com/Intelligent-Internet/II-Commons?tab=readme-ov-file#quick-start).

To ensure the extensions are enabled, connect to the database with psql and run the following SQL:

```sql
CREATE EXTENSION IF NOT EXISTS vchord CASCADE;
CREATE EXTENSION IF NOT EXISTS pg_search CASCADE;
```

## Uses

This dataset is available for a wide range of applications.

Here is a demo of how to use the dataset with [II-Commons](https://github.com/Intelligent-Internet/II-Commons).

### Create the metadata and chunking tables in PostgreSQL

```sql
-- Page-level metadata: one row per Wikipedia page.
CREATE TABLE IF NOT EXISTS ts_wikipedia_en (
    id BIGSERIAL PRIMARY KEY,
    revid BIGINT NOT NULL,
    url VARCHAR NOT NULL,
    title VARCHAR NOT NULL DEFAULT '',
    created_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
    ignored BOOLEAN NOT NULL DEFAULT FALSE
);

-- Chunk-level text and embeddings: many rows per page.
CREATE TABLE IF NOT EXISTS ts_wikipedia_en_embed (
    id BIGSERIAL PRIMARY KEY,
    title VARCHAR NOT NULL,
    url VARCHAR NOT NULL,
    chunk_index BIGINT NOT NULL,
    chunk_text VARCHAR NOT NULL,
    source_id BIGINT NOT NULL,
    vector halfvec(768) DEFAULT NULL,
    created_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP
);
```

### Load CSV files into the database

1. Load the dataset from the local file system to a remote PostgreSQL server (psql `\copy` reads the files on the client):

   ```sql
   \copy ts_wikipedia_en FROM 'data/meta/ts_wikipedia_en.csv' CSV HEADER;
   \copy ts_wikipedia_en_embed FROM 'data/chunks/0000000.csv' CSV HEADER;
   \copy ts_wikipedia_en_embed FROM 'data/chunks/0000001.csv' CSV HEADER;
   \copy ts_wikipedia_en_embed FROM 'data/chunks/0000002.csv' CSV HEADER;
   ...
   ```

2. Load the dataset from the PostgreSQL server's file system (`COPY` reads the files on the server):

   ```sql
   COPY ts_wikipedia_en FROM 'data/meta/ts_wikipedia_en.csv' CSV HEADER;
   COPY ts_wikipedia_en_embed FROM 'data/chunks/0000000.csv' CSV HEADER;
   COPY ts_wikipedia_en_embed FROM 'data/chunks/0000001.csv' CSV HEADER;
   COPY ts_wikipedia_en_embed FROM 'data/chunks/0000002.csv' CSV HEADER;
   ...
   ```

### Create Indexes

Create the following indexes for the best performance.

The `vector` column is a `halfvec(768)` column, a 16-bit half-precision vector optimized for `cosine` indexing with [vectorchord](https://github.com/tensorchord/vectorchord). You can find more information about the vector index in the [vectorchord](https://docs.vectorchord.ai/vectorchord/usage/indexing.html) documentation.

1. Create the metadata table indexes:

   ```sql
   CREATE INDEX IF NOT EXISTS ts_wikipedia_en_revid_index ON ts_wikipedia_en (revid);
   CREATE INDEX IF NOT EXISTS ts_wikipedia_en_url_index ON ts_wikipedia_en (url);
   CREATE INDEX IF NOT EXISTS ts_wikipedia_en_title_index ON ts_wikipedia_en (title);
   CREATE INDEX IF NOT EXISTS ts_wikipedia_en_ignored_index ON ts_wikipedia_en (ignored);
   CREATE INDEX IF NOT EXISTS ts_wikipedia_en_created_at_index ON ts_wikipedia_en (created_at);
   CREATE INDEX IF NOT EXISTS ts_wikipedia_en_updated_at_index ON ts_wikipedia_en (updated_at);
   ```

2. Create the chunking table indexes:

   ```sql
   CREATE INDEX IF NOT EXISTS ts_wikipedia_en_embed_source_id_index ON ts_wikipedia_en_embed (source_id);
   CREATE INDEX IF NOT EXISTS ts_wikipedia_en_embed_chunk_index_index ON ts_wikipedia_en_embed (chunk_index);

   -- BM25 full-text index (pg_search) over title and chunk text.
   CREATE INDEX IF NOT EXISTS ts_wikipedia_en_embed_chunk_text_index ON ts_wikipedia_en_embed
       USING bm25 (id, title, chunk_text) WITH (key_field='id');

   CREATE UNIQUE INDEX IF NOT EXISTS ts_wikipedia_en_embed_source_index ON ts_wikipedia_en_embed (source_id, chunk_index);

   -- Cosine vector index (vectorchord); the options string is TOML.
   CREATE INDEX IF NOT EXISTS ts_wikipedia_en_embed_vector_index ON ts_wikipedia_en_embed
       USING vchordrq (vector halfvec_cosine_ops) WITH (options = $$
   [build.internal]
   lists = [20000]
   build_threads = 6
   spherical_centroids = true
   $$);

   -- Partial index to find chunks that have not been embedded yet.
   CREATE INDEX IF NOT EXISTS ts_wikipedia_en_embed_vector_null_index ON ts_wikipedia_en_embed (vector) WHERE vector IS NULL;

   -- Load the vector index into memory.
   SELECT vchordrq_prewarm('ts_wikipedia_en_embed_vector_index');
   ```

### Query with II-Commons

See [II-Commons](https://github.com/Intelligent-Internet/II-Commons) to learn how to query the dataset.
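
For readers who want to query the chunking table directly, here is a minimal sketch of a cosine nearest-neighbour lookup on the `vector` column. It is not the II-Commons query path. It assumes the query embedding was produced by the same model as the dataset (768 dimensions), that `<=>` is the pgvector-style cosine distance operator for `halfvec`, and that the SQL will be run through a driver that accepts `%(q)s` named placeholders (psycopg style). The helper names `to_halfvec_literal` and `build_cosine_query` are hypothetical.

```python
EMBED_DIM = 768  # matches the halfvec(768) column in ts_wikipedia_en_embed


def to_halfvec_literal(embedding):
    """Format a sequence of numbers as a vector text literal such as '[0.1,0.2]'."""
    values = [float(x) for x in embedding]
    if len(values) != EMBED_DIM:
        raise ValueError(f"expected {EMBED_DIM} dimensions, got {len(values)}")
    return "[" + ",".join(str(v) for v in values) + "]"


def build_cosine_query(top_k=5):
    """Return SQL that ranks chunks by cosine distance to the parameter `q`.

    Assumption: `q` is bound by the database driver to the string produced
    by to_halfvec_literal(); the cast turns it into a halfvec(768) value.
    """
    return (
        "SELECT id, title, url, chunk_text, "
        "vector <=> %(q)s::halfvec(768) AS distance "
        "FROM ts_wikipedia_en_embed "
        "ORDER BY vector <=> %(q)s::halfvec(768) "
        f"LIMIT {int(top_k)}"
    )
```

With a driver such as psycopg, the call would look roughly like `cursor.execute(build_cosine_query(5), {"q": to_halfvec_literal(query_embedding)})`. Ordering by the distance expression is what lets the `vchordrq` index be used for the search.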

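The CSV loading step lists only the first few chunk files and elides the rest with `...`. As a convenience, the sketch below generates one `\copy` line per chunk file found in a directory, so the commands do not have to be typed by hand. The directory layout (`data/chunks/*.csv`) is taken from the examples in the loading step; the helper names `copy_commands` and `chunk_copy_script` are hypothetical.

```python
from pathlib import Path


def copy_commands(csv_paths, table="ts_wikipedia_en_embed"):
    """Return one psql \\copy command per CSV path, in sorted order."""
    ordered = sorted(Path(p).as_posix() for p in csv_paths)
    return [f"\\copy {table} FROM '{path}' CSV HEADER;" for path in ordered]


def chunk_copy_script(chunk_dir="data/chunks"):
    """Build a psql script that loads every *.csv file under chunk_dir."""
    return "\n".join(copy_commands(Path(chunk_dir).glob("*.csv")))
```

Writing the output of `chunk_copy_script()` to a file and running it with `psql -f` would load all chunks in file-name order. The metadata CSV still needs its own `\copy ts_wikipedia_en ...` line.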

Provider:

maas

Created:

2025-05-20
