Download link: https://modelscope.cn/datasets/Intelligent-Internet/wikipedia_en (ModelScope; updated 2025-12-05, indexed 2025-05-24)
# `wikipedia_en`
This is a curated English Wikipedia dataset for use with the [II-Commons](https://github.com/Intelligent-Internet/II-Commons) project.
## Dataset Details
### Dataset Description
This dataset comprises curated English Wikipedia pages, sourced directly from the official English Wikipedia database dump. We extract the pages, chunk them into smaller pieces, and embed each chunk using [Snowflake/snowflake-arctic-embed-m-v2.0](https://huggingface.co/Snowflake/snowflake-arctic-embed-m-v2.0). All vector embeddings are 16-bit half-precision vectors optimized for `cosine` indexing with [vectorchord](https://github.com/tensorchord/vectorchord).
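To make the `cosine` wording concrete, here is a minimal sketch of the cosine-distance operator on half-precision vectors. It assumes the extensions from the Prerequisite section are already installed (the `halfvec` type and the `<=>` operator come from pgvector, which `vchord` builds on), and it uses toy 3-dimensional vectors rather than the 768-dimensional embeddings stored in this dataset.

```sql
-- Cosine distance is 1 - cosine similarity, so it ranges from 0 to 2.
-- The vectors below are illustrative only; they are not taken from the dataset.
SELECT
    '[1,0,0]'::halfvec(3) <=> '[1,0,0]'::halfvec(3)  AS same_direction,  -- 0
    '[1,0,0]'::halfvec(3) <=> '[0,1,0]'::halfvec(3)  AS orthogonal,      -- 1
    '[1,0,0]'::halfvec(3) <=> '[-1,0,0]'::halfvec(3) AS opposite;        -- 2
```

Smaller values mean more similar chunks, which is why nearest-neighbor queries order by this distance in ascending order.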
### Dataset Sources
This dataset is based on the [Wikipedia dumps](https://dumps.wikimedia.org/). See the [LICENSE](https://dumps.wikimedia.org/legal.html) page for the terms that apply to the page data.
## Dataset Structure
1. Metadata Table
   - `id`: A unique identifier for the page.
   - `revid`: The revision ID of the page.
   - `url`: The URL of the page.
   - `title`: The title of the page.
   - `ignored`: Whether the page is ignored.
   - `created_at`: The creation time of the page.
   - `updated_at`: The update time of the page.
2. Chunking Table
   - `id`: A unique identifier for the chunk.
   - `title`: The title of the page.
   - `url`: The URL of the page.
   - `source_id`: The source ID of the page.
   - `chunk_index`: The index of the chunk.
   - `chunk_text`: The text of the chunk.
   - `vector`: The vector embedding of the chunk.
   - `created_at`: The creation time of the chunk.
   - `updated_at`: The update time of the chunk.
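The relationship between the two tables can be sketched with a join. This is an illustration under two assumptions that the field list does not state outright: that `source_id` in the chunking table refers to `id` in the metadata table, and that the tables are named `ts_wikipedia_en` and `ts_wikipedia_en_embed` as in the Uses section of this README. The page title in the filter is a hypothetical example.

```sql
-- List the chunks of one page in reading order.
-- Assumption: ts_wikipedia_en_embed.source_id references ts_wikipedia_en.id.
SELECT
    m.id                   AS page_id,
    m.title,
    c.chunk_index,
    left(c.chunk_text, 80) AS preview      -- first 80 characters of the chunk
FROM ts_wikipedia_en AS m
JOIN ts_wikipedia_en_embed AS c
  ON c.source_id = m.id
WHERE m.title = 'PostgreSQL'               -- hypothetical title, replace as needed
  AND NOT m.ignored
ORDER BY c.chunk_index;
```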
## Prerequisite
PostgreSQL 17 with the [vectorchord](https://github.com/tensorchord/VectorChord) and [pg_search](https://github.com/paradedb/paradedb/tree/dev/pg_search) extensions.
The easiest way to get started is to use our [Docker image](https://github.com/Intelligent-Internet/II-Commons/tree/main/examples/db), or build your own. Then load the [psql_basebackup](https://huggingface.co/datasets/Intelligent-Internet/wikipedia_en/tree/psql_basebackup) branch, following the [Quick Start](https://github.com/Intelligent-Internet/II-Commons?tab=readme-ov-file#quick-start).
To make sure the extensions are enabled, connect to the database with `psql` and run the following SQL:
```sql
CREATE EXTENSION IF NOT EXISTS vchord CASCADE;
CREATE EXTENSION IF NOT EXISTS pg_search CASCADE;
```
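As a quick check that the statements above took effect, the standard `pg_extension` catalog can be queried. This is a minimal sketch; the `vector` entry is listed on the assumption that `CASCADE` pulls in pgvector as a dependency of `vchord`.

```sql
-- Show which of the expected extensions are installed, with their versions.
-- 'vector' (pgvector) is assumed to be installed as a dependency of vchord.
SELECT extname, extversion
FROM pg_extension
WHERE extname IN ('vector', 'vchord', 'pg_search')
ORDER BY extname;
```

If a row is missing, the corresponding extension is not enabled in the current database.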
## Uses
This dataset can support a range of retrieval applications, such as vector similarity search and BM25 full-text search over Wikipedia chunks.
The steps below show how to set up the dataset for use with [II-Commons](https://github.com/Intelligent-Internet/II-Commons).
### Create the metadata and chunking tables in PostgreSQL
```sql
CREATE TABLE IF NOT EXISTS ts_wikipedia_en (
    id BIGSERIAL PRIMARY KEY,
    revid BIGINT NOT NULL,
    url VARCHAR NOT NULL,
    title VARCHAR NOT NULL DEFAULT '',
    created_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
    ignored BOOLEAN NOT NULL DEFAULT FALSE
);
CREATE TABLE IF NOT EXISTS ts_wikipedia_en_embed (
    id BIGSERIAL PRIMARY KEY,
    title VARCHAR NOT NULL,
    url VARCHAR NOT NULL,
    chunk_index BIGINT NOT NULL,
    chunk_text VARCHAR NOT NULL,
    source_id BIGINT NOT NULL,
    vector halfvec(768) DEFAULT NULL,
    created_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP
);
```
### Load the CSV files into the database
1. Load the dataset from the local file system into a remote PostgreSQL server, using the client-side `\copy` command in `psql`:
```sql
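-- \copy reads the files on the machine where psql runs; paths are relative to psql's working directory.
\copy ts_wikipedia_en FROM 'data/meta/ts_wikipedia_en.csv' CSV HEADER;
\copy ts_wikipedia_en_embed FROM 'data/chunks/0000000.csv' CSV HEADER;
\copy ts_wikipedia_en_embed FROM 'data/chunks/0000001.csv' CSV HEADER;
\copy ts_wikipedia_en_embed FROM 'data/chunks/0000002.csv' CSV HEADER;
-- ... repeat for the remaining chunk files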
```
2. Load the dataset from the PostgreSQL server's own file system, using the server-side `COPY` command:
```sql
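-- COPY reads the files on the database server. A relative path is resolved against the
-- server process's working directory, so an absolute path is usually safer.
COPY ts_wikipedia_en FROM 'data/meta/ts_wikipedia_en.csv' CSV HEADER;
COPY ts_wikipedia_en_embed FROM 'data/chunks/0000000.csv' CSV HEADER;
COPY ts_wikipedia_en_embed FROM 'data/chunks/0000001.csv' CSV HEADER;
COPY ts_wikipedia_en_embed FROM 'data/chunks/0000002.csv' CSV HEADER;
-- ... repeat for the remaining chunk files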
```
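After loading, a few sanity checks may be worth running. The sketch below assumes that the CSV files include the `id` column (in which case the `BIGSERIAL` sequences are not advanced by the load and should be reset before any new rows are inserted) and that `source_id` refers to `ts_wikipedia_en.id`. Neither assumption is confirmed by this README.

```sql
-- Row counts, and how many chunks actually carry an embedding.
SELECT count(*) AS pages FROM ts_wikipedia_en;
SELECT count(*) AS chunks, count(vector) AS chunks_with_vector
FROM ts_wikipedia_en_embed;

-- Chunks whose source_id has no matching page (expected to be 0 under the assumption above).
SELECT count(*) AS orphan_chunks
FROM ts_wikipedia_en_embed AS c
LEFT JOIN ts_wikipedia_en AS m ON m.id = c.source_id
WHERE m.id IS NULL;

-- Move each id sequence past the highest loaded id so later inserts do not collide.
SELECT setval(pg_get_serial_sequence('ts_wikipedia_en', 'id'), max(id))
FROM ts_wikipedia_en;
SELECT setval(pg_get_serial_sequence('ts_wikipedia_en_embed', 'id'), max(id))
FROM ts_wikipedia_en_embed;
```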
### Create Indexes
Create the following indexes for the best query performance.
The `vector` column has type `halfvec(768)`, a 16-bit half-precision vector optimized for `cosine` indexing with [vectorchord](https://github.com/tensorchord/vectorchord). See the [vectorchord](https://docs.vectorchord.ai/vectorchord/usage/indexing.html) documentation for more information about the vector index.
1. Create the metadata table indexes:
```sql
CREATE INDEX IF NOT EXISTS ts_wikipedia_en_revid_index ON ts_wikipedia_en (revid);
CREATE INDEX IF NOT EXISTS ts_wikipedia_en_url_index ON ts_wikipedia_en (url);
CREATE INDEX IF NOT EXISTS ts_wikipedia_en_title_index ON ts_wikipedia_en (title);
CREATE INDEX IF NOT EXISTS ts_wikipedia_en_ignored_index ON ts_wikipedia_en (ignored);
CREATE INDEX IF NOT EXISTS ts_wikipedia_en_created_at_index ON ts_wikipedia_en (created_at);
CREATE INDEX IF NOT EXISTS ts_wikipedia_en_updated_at_index ON ts_wikipedia_en (updated_at);
```
2. Create the chunking table indexes:
```sql
-- B-tree indexes for lookups by page and by chunk position.
CREATE INDEX IF NOT EXISTS ts_wikipedia_en_embed_source_id_index ON ts_wikipedia_en_embed (source_id);
CREATE INDEX IF NOT EXISTS ts_wikipedia_en_embed_chunk_index_index ON ts_wikipedia_en_embed (chunk_index);
-- BM25 full-text index (pg_search) over the title and chunk text.
CREATE INDEX IF NOT EXISTS ts_wikipedia_en_embed_chunk_text_index ON ts_wikipedia_en_embed USING bm25 (id, title, chunk_text) WITH (key_field='id');
-- Each (page, chunk position) pair must be unique.
CREATE UNIQUE INDEX IF NOT EXISTS ts_wikipedia_en_embed_source_index ON ts_wikipedia_en_embed (source_id, chunk_index);
-- Vector index (vectorchord) for cosine-distance search on the embeddings.
CREATE INDEX IF NOT EXISTS ts_wikipedia_en_embed_vector_index ON ts_wikipedia_en_embed USING vchordrq (vector halfvec_cosine_ops) WITH (options = $$
[build.internal]
lists = [20000]
build_threads = 6
spherical_centroids = true
$$);
-- Partial index covering only the rows that have no embedding.
CREATE INDEX IF NOT EXISTS ts_wikipedia_en_embed_vector_null_index ON ts_wikipedia_en_embed (vector) WHERE vector IS NULL;
-- Prewarm the vector index.
SELECT vchordrq_prewarm('ts_wikipedia_en_embed_vector_index');
```
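Once the indexes are built, they can be exercised directly from `psql`. The following is a hedged sketch rather than the II-Commons query path: the `vchordrq.probes` value, the placeholder chunk `id = 1`, and the search phrase are illustrative assumptions, and the exact pg_search query syntax may differ between versions. A real semantic query would pass an embedding produced by the same model instead of reusing a stored vector.

```sql
-- Number of lists to scan per query (illustrative value; trades recall for speed).
SET vchordrq.probes = 100;

-- Vector search: the 5 chunks closest, by cosine distance, to an existing chunk.
-- id = 1 is a placeholder for any chunk that has a non-NULL vector.
SELECT id, title, chunk_index
FROM ts_wikipedia_en_embed
WHERE id <> 1
ORDER BY vector <=> (SELECT vector FROM ts_wikipedia_en_embed WHERE id = 1)
LIMIT 5;

-- BM25 search: the 5 chunks whose text best matches a phrase (hypothetical example phrase).
SELECT id, title, paradedb.score(id) AS bm25_score
FROM ts_wikipedia_en_embed
WHERE chunk_text @@@ 'solar eclipse'
ORDER BY bm25_score DESC
LIMIT 5;
```

Ordering by the `<=>` expression with a `LIMIT` is the form that lets the planner use the `vchordrq` index for the nearest-neighbor search.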
### Query with II-Commons
See the [II-Commons](https://github.com/Intelligent-Internet/II-Commons) repository to learn how to query the dataset.
Provider: maas
Created: 2025-05-20