
wikipedia_en

ModelScope community · Updated 2025-12-05 · Indexed 2025-05-24
Download link:
https://modelscope.cn/datasets/Intelligent-Internet/wikipedia_en
Resource description:
# `wikipedia_en`

This is a curated English Wikipedia dataset for use with the [II-Commons](https://github.com/Intelligent-Internet/II-Commons) project.

## Dataset Details

### Dataset Description

This dataset comprises curated English Wikipedia pages, sourced directly from the official English Wikipedia database dump. We extract the pages, chunk them into smaller pieces, and embed them using [Snowflake/snowflake-arctic-embed-m-v2.0](https://huggingface.co/Snowflake/snowflake-arctic-embed-m-v2.0). All vector embeddings are 16-bit half-precision vectors optimized for `cosine` indexing with [vectorchord](https://github.com/tensorchord/vectorchord).

### Dataset Sources

Based on the [Wikipedia dumps](https://dumps.wikimedia.org/). Please check this page for the [LICENSE](https://dumps.wikimedia.org/legal.html) of the page data.

## Dataset Structure

1. Metadata Table
   - `id`: A unique identifier for the page.
   - `revid`: The revision ID of the page.
   - `url`: The URL of the page.
   - `title`: The title of the page.
   - `ignored`: Whether the page is ignored.
   - `created_at`: The creation time of the page.
   - `updated_at`: The update time of the page.
2. Chunking Table
   - `id`: A unique identifier for the chunk.
   - `title`: The title of the page.
   - `url`: The URL of the page.
   - `source_id`: The source ID of the page.
   - `chunk_index`: The index of the chunk.
   - `chunk_text`: The text of the chunk.
   - `vector`: The vector embedding of the chunk.
   - `created_at`: The creation time of the chunk.
   - `updated_at`: The update time of the chunk.

## Prerequisite

PostgreSQL 17 with the extensions [vectorchord](https://github.com/tensorchord/VectorChord) and [pg_search](https://github.com/paradedb/paradedb/tree/dev/pg_search).

The easiest way is to use our [Docker image](https://github.com/Intelligent-Internet/II-Commons/tree/main/examples/db), or build your own.
Then load the [psql_basebackup](https://huggingface.co/datasets/Intelligent-Internet/wikipedia_en/tree/psql_basebackup) branch, following the [Quick Start](https://github.com/Intelligent-Internet/II-Commons?tab=readme-ov-file#quick-start).

To ensure the extensions are enabled, connect to the database using psql and run the following SQL:

```sql
CREATE EXTENSION IF NOT EXISTS vchord CASCADE;
CREATE EXTENSION IF NOT EXISTS pg_search CASCADE;
```

## Uses

This dataset is available for a wide range of applications.

Here is a demo of how to use the dataset with [II-Commons](https://github.com/Intelligent-Internet/II-Commons).

### Create the metadata and chunking tables in PostgreSQL

```sql
CREATE TABLE IF NOT EXISTS ts_wikipedia_en (
    id BIGSERIAL PRIMARY KEY,
    revid BIGINT NOT NULL,
    url VARCHAR NOT NULL,
    title VARCHAR NOT NULL DEFAULT '',
    created_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
    ignored BOOLEAN NOT NULL DEFAULT FALSE
);

CREATE TABLE IF NOT EXISTS ts_wikipedia_en_embed (
    id BIGSERIAL PRIMARY KEY,
    title VARCHAR NOT NULL,
    url VARCHAR NOT NULL,
    chunk_index BIGINT NOT NULL,
    chunk_text VARCHAR NOT NULL,
    source_id BIGINT NOT NULL,
    vector halfvec(768) DEFAULT NULL,
    created_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP
);
```

### Load CSV files into the database

1. Load the dataset from the local file system to a remote PostgreSQL server (psql's `\copy` reads the files on the client side):

```sql
\copy ts_wikipedia_en FROM 'data/meta/ts_wikipedia_en.csv' CSV HEADER;
\copy ts_wikipedia_en_embed FROM 'data/chunks/0000000.csv' CSV HEADER;
\copy ts_wikipedia_en_embed FROM 'data/chunks/0000001.csv' CSV HEADER;
\copy ts_wikipedia_en_embed FROM 'data/chunks/0000002.csv' CSV HEADER;
...
```

2. Load the dataset from the PostgreSQL server's file system (`COPY` reads the files on the server side):

```sql
COPY ts_wikipedia_en FROM 'data/meta/ts_wikipedia_en.csv' CSV HEADER;
COPY ts_wikipedia_en_embed FROM 'data/chunks/0000000.csv' CSV HEADER;
COPY ts_wikipedia_en_embed FROM 'data/chunks/0000001.csv' CSV HEADER;
COPY ts_wikipedia_en_embed FROM 'data/chunks/0000002.csv' CSV HEADER;
...
```

### Create Indexes

You need to create the following indexes for the best performance.

The `vector` column is a `halfvec(768)` column, which is a 16-bit half-precision vector optimized for `cosine` indexing with [vectorchord](https://github.com/tensorchord/vectorchord). You can get more information about the vector index from the [vectorchord](https://docs.vectorchord.ai/vectorchord/usage/indexing.html) documentation.

1. Create the metadata table indexes:

```sql
CREATE INDEX IF NOT EXISTS ts_wikipedia_en_revid_index ON ts_wikipedia_en (revid);
CREATE INDEX IF NOT EXISTS ts_wikipedia_en_url_index ON ts_wikipedia_en (url);
CREATE INDEX IF NOT EXISTS ts_wikipedia_en_title_index ON ts_wikipedia_en (title);
CREATE INDEX IF NOT EXISTS ts_wikipedia_en_ignored_index ON ts_wikipedia_en (ignored);
CREATE INDEX IF NOT EXISTS ts_wikipedia_en_created_at_index ON ts_wikipedia_en (created_at);
CREATE INDEX IF NOT EXISTS ts_wikipedia_en_updated_at_index ON ts_wikipedia_en (updated_at);
```

2. Create the chunking table indexes:

```sql
CREATE INDEX IF NOT EXISTS ts_wikipedia_en_embed_source_id_index ON ts_wikipedia_en_embed (source_id);
CREATE INDEX IF NOT EXISTS ts_wikipedia_en_embed_chunk_index_index ON ts_wikipedia_en_embed (chunk_index);
CREATE INDEX IF NOT EXISTS ts_wikipedia_en_embed_chunk_text_index ON ts_wikipedia_en_embed USING bm25 (id, title, chunk_text) WITH (key_field='id');
CREATE UNIQUE INDEX IF NOT EXISTS ts_wikipedia_en_embed_source_index ON ts_wikipedia_en_embed (source_id, chunk_index);
CREATE INDEX IF NOT EXISTS ts_wikipedia_en_embed_vector_index ON ts_wikipedia_en_embed USING vchordrq (vector halfvec_cosine_ops) WITH (options = $$
[build.internal]
lists = [20000]
build_threads = 6
spherical_centroids = true
$$);
CREATE INDEX IF NOT EXISTS ts_wikipedia_en_embed_vector_null_index ON ts_wikipedia_en_embed (vector) WHERE vector IS NULL;
SELECT vchordrq_prewarm('ts_wikipedia_en_embed_vector_index');
```

### Query with II-Commons

See the [II-Commons](https://github.com/Intelligent-Internet/II-Commons) repository to learn how to query the dataset.
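
For readers who want to see the rough shape of a vector lookup against the chunking table before opening the II-Commons repository, here is a minimal Python sketch. It is not the II-Commons implementation. It assumes the query embedding has already been computed with the same model used for the dataset, and that `<=>` (the cosine-distance operator from pgvector, whose `halfvec` type the `vector` column uses) is the operator served by the `halfvec_cosine_ops` index. The helper names `to_halfvec_literal` and `build_vector_query` and the `%(q)s` placeholder style are illustrative choices, not part of any library.

```python
from typing import Sequence

EMBED_DIM = 768                    # matches the halfvec(768) column
TABLE = "ts_wikipedia_en_embed"    # chunking table from this README


def to_halfvec_literal(vec: Sequence[float]) -> str:
    """Serialize a vector into the bracketed text form, e.g. '[0.1,0.2]'.

    Hypothetical helper: it only formats the numbers; the database
    performs the actual cast to half precision.
    """
    if len(vec) != EMBED_DIM:
        raise ValueError(f"expected {EMBED_DIM} dimensions, got {len(vec)}")
    return "[" + ",".join(f"{float(x):.6g}" for x in vec) + "]"


def build_vector_query(limit: int = 10) -> str:
    """Build a parameterized nearest-neighbour query (hypothetical helper).

    The distance expression is repeated in ORDER BY so that the planner
    can match it against the vector index.
    """
    if limit < 1:
        raise ValueError("limit must be positive")
    distance = f"vector <=> %(q)s::halfvec({EMBED_DIM})"
    return (
        f"SELECT id, title, url, chunk_index, chunk_text, {distance} AS distance "
        f"FROM {TABLE} "
        f"ORDER BY {distance} "
        f"LIMIT {int(limit)}"
    )
```

With a PostgreSQL driver that accepts named `%(name)s` parameters (psycopg is one), this could be run as `cursor.execute(build_vector_query(5), {"q": to_halfvec_literal(embedding)})`, where `embedding` is a 768-element list produced by the embedding model. Passing the vector as a parameter rather than interpolating it keeps the SQL text constant across queries.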

Provider:
maas
Created:
2025-05-20
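Background overview
This is a curated English Wikipedia dataset designed for the II-Commons project, with data sourced from the official English Wikipedia dump. Pages are chunked and embedded as 16-bit half-precision vectors optimized for cosine indexing, which makes the dataset suitable for applications such as vector search. A complete PostgreSQL integration guide is provided.
The summary above was collected and generated by 遇见数据集.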