karmiq/wikipedia-embeddings-cs-e5-base
Source: Hugging Face. Updated 2024-01-22; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/karmiq/wikipedia-embeddings-cs-e5-base
Description:
---
dataset_info:
  features:
  - name: id
    dtype: string
  - name: url
    dtype: string
  - name: title
    dtype: string
  - name: chunks
    sequence: string
  - name: embeddings
    sequence:
      sequence: float32
  splits:
  - name: train
    num_bytes: 5021489124
    num_examples: 534044
  download_size: 4750515911
  dataset_size: 5021489124
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
language:
- cs
size_categories:
- 100K<n<1M
task_categories:
- text-generation
- fill-mask
license:
- cc-by-sa-3.0
- gfdl
---
This dataset contains the Czech subset of the [`wikimedia/wikipedia`](https://huggingface.co/datasets/wikimedia/wikipedia) dataset. Each page is divided into paragraphs, stored as a list in the `chunks` column. For every paragraph, embeddings are created using the [`intfloat/multilingual-e5-base`](https://huggingface.co/intfloat/multilingual-e5-base) model.
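As a minimal sketch of the layout described above, the following check confirms that a record carries exactly one embedding per paragraph. The record is a hand-made stand-in with 3-dimensional vectors, and `check_alignment` is a hypothetical helper, not part of the dataset or of the `datasets` library. Real records should carry 768-dimensional vectors, the output size of `multilingual-e5-base`.

```python
def check_alignment(record, dim):
    """Return the number of paragraphs after verifying chunks and embeddings line up."""
    chunks, embeddings = record["chunks"], record["embeddings"]
    if len(chunks) != len(embeddings):
        raise ValueError(f"{len(chunks)} chunks but {len(embeddings)} embeddings")
    for i, vector in enumerate(embeddings):
        if len(vector) != dim:
            raise ValueError(f"embedding {i} has {len(vector)} dimensions, expected {dim}")
    return len(chunks)


# Hand-made stand-in record (assumption: toy 3-dimensional vectors for brevity).
toy_record = {
    "id": "1",
    "url": "https://cs.wikipedia.org/wiki/Astronomie",
    "title": "Astronomie",
    "chunks": ["first paragraph", "second paragraph"],
    "embeddings": [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]],
}

num_paragraphs = check_alignment(toy_record, dim=3)
print(num_paragraphs)
```

For a record loaded from the dataset itself, the same call with `dim=768` would be the natural check.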
## Usage
Load the dataset:
```python
from datasets import load_dataset
ds = load_dataset("karmiq/wikipedia-embeddings-cs-e5-base", split="train")
ds[1]
```
```
{
  'id': '1',
  'url': 'https://cs.wikipedia.org/wiki/Astronomie',
  'title': 'Astronomie',
  'chunks': [
    'Astronomie, řecky αστρονομία z άστρον ( astron ) hvězda a νόμος ( nomos )...',
    'Myšlenky Aristotelovy rozvinul ve 2. století našeho letopočtu Klaudios Ptolemaios...',
    ...,
  ],
  'embeddings': [
    [0.09006806463003159, -0.009814552962779999, ...],
    [0.10767366737127304, ...],
    ...
  ]
}
```
The structure makes it easy to use the dataset for implementing semantic search.
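To illustrate why this layout suits semantic search, here is a rough sketch of the core operation: ranking paragraph embeddings by cosine similarity to a query embedding. It uses plain NumPy on a toy 2-dimensional corpus, and `top_k_cosine` is a hypothetical helper introduced only for this example. In practice the query vector would come from the same `intfloat/multilingual-e5-base` model that produced the stored embeddings.

```python
import numpy as np


def top_k_cosine(query_vec, corpus_vecs, k=3):
    """Return (index, score) pairs for the k corpus vectors most similar to the query."""
    query = np.asarray(query_vec, dtype=np.float32)
    corpus = np.asarray(corpus_vecs, dtype=np.float32)
    # Normalize so that the dot product equals cosine similarity.
    query = query / np.linalg.norm(query)
    corpus = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    scores = corpus @ query
    order = np.argsort(-scores)[:k]
    return [(int(i), float(scores[i])) for i in order]


# Toy 2-dimensional corpus (assumption: stands in for flattened paragraph embeddings).
toy_corpus = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
results = top_k_cosine([1.0, 0.1], toy_corpus, k=2)
print(results)
```

Each returned index points back into the flattened list of paragraphs, so the matching text, title, and URL can be looked up directly.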
<details>
<summary>Load the data in Elasticsearch</summary>
```python
from elasticsearch import Elasticsearch
from elasticsearch.helpers import parallel_bulk
from tqdm import tqdm

# Assumption: a local cluster; point this at your own deployment.
es = Elasticsearch("http://localhost:9200")

def doc_generator(data, batch_size=1000):
    # Yield one document per page, pairing each chunk with its embedding.
    for batch in data.with_format("numpy").iter(batch_size):
        for i, id in enumerate(batch["id"]):
            output = {"id": id}
            output["title"] = batch["title"][i]
            output["url"] = batch["url"][i]
            output["parts"] = [
                { "chunk": chunk, "embedding": embedding }
                for chunk, embedding in zip(batch["chunks"][i], batch["embeddings"][i])
            ]
            yield output

num_indexed, num_failed = 0, 0
progress = tqdm(total=ds.num_rows, unit="doc", desc="Indexing")

for ok, info in parallel_bulk(
    es,
    index="wikipedia-search",
    actions=doc_generator(ds),
    raise_on_error=False,
):
    if ok:
        num_indexed += 1
    else:
        num_failed += 1
        print(f"ERROR {info['index']['status']}: "
              f"{info['index']['error']['type']}: {info['index']['error']['caused_by']['type']}: "
              f"{info['index']['error']['caused_by']['reason'][:250]}")
    progress.update(1)
```
</details>
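The Elasticsearch example emits each page with a `parts` list of `{chunk, embedding}` objects but does not show an index mapping. One possible mapping for that shape is sketched below. It is an assumption, not the author's configuration: it treats `parts` as a `nested` field and stores each embedding as a `dense_vector` with 768 dimensions, the output size of `multilingual-e5-base`. Field options vary between Elasticsearch versions, so check them against your cluster before use.

```python
# Hypothetical mapping for the "wikipedia-search" index; adjust to your cluster version.
EMBEDDING_DIMS = 768  # output size of intfloat/multilingual-e5-base

index_mapping = {
    "properties": {
        "id": {"type": "keyword"},
        "url": {"type": "keyword"},
        "title": {"type": "text"},
        "parts": {
            "type": "nested",
            "properties": {
                "chunk": {"type": "text"},
                "embedding": {
                    "type": "dense_vector",
                    "dims": EMBEDDING_DIMS,
                    "index": True,
                    "similarity": "cosine",
                },
            },
        },
    }
}

# With a connected client `es`, the index could then be created with:
# es.indices.create(index="wikipedia-search", mappings=index_mapping)
```

Keeping chunk and embedding together in one nested object means a vector match can be traced back to the exact paragraph that produced it.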
<details>
<summary>Use <code>sentence_transformers.util.semantic_search</code></summary>
```python
import os
import textwrap

import sentence_transformers

model = sentence_transformers.SentenceTransformer("intfloat/multilingual-e5-base")

ds.set_format(type="torch", columns=["embeddings"], output_all_columns=True)

# Flatten the dataset: one row per (page, chunk) pair
def explode_sequence(batch):
    output = { "id": [], "url": [], "title": [], "chunk": [], "embedding": [] }
    for id, url, title, chunks, embeddings in zip(
        batch["id"], batch["url"], batch["title"], batch["chunks"], batch["embeddings"]
    ):
        # Repeat the page-level fields once per chunk
        output["id"].extend([id for _ in range(len(chunks))])
        output["url"].extend([url for _ in range(len(chunks))])
        output["title"].extend([title for _ in range(len(chunks))])
        output["chunk"].extend(chunks)
        output["embedding"].extend(embeddings)
    return output

ds_flat = ds.map(
    explode_sequence,
    batched=True,
    remove_columns=ds.column_names,
    num_proc=min(os.cpu_count(), 32),
    desc="Flatten")
ds_flat

query = "Čím se zabývá fyzika?"

hits = sentence_transformers.util.semantic_search(
    query_embeddings=model.encode(query),
    corpus_embeddings=ds_flat["embedding"],
    top_k=10)

for hit in hits[0]:
    title = ds_flat[hit['corpus_id']]['title']
    chunk = ds_flat[hit['corpus_id']]['chunk']
    print(f"[{hit['score']:0.2f}] {textwrap.shorten(chunk, width=100, placeholder='…')} [{title}]")

# [0.90] Fyzika částic ( též částicová fyzika ) je oblast fyziky, která se zabývá částicemi. V širším smyslu… [Fyzika částic]
# [0.89] Fyzika ( z řeckého φυσικός ( fysikos ): přírodní, ze základu φύσις ( fysis ): příroda, archaicky… [Fyzika]
# ...
```
</details>
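Because every page contributes several paragraphs, the top hits of a chunk-level search can include multiple paragraphs from the same page. The sketch below shows one way to collapse them to the best-scoring paragraph per page. `best_hit_per_page` is a hypothetical helper, and the toy hits only mimic the `corpus_id` and `score` keys that `semantic_search` returns.

```python
def best_hit_per_page(hits, titles):
    """Keep only the highest-scoring hit for each page title, sorted by score."""
    best = {}
    for hit in hits:
        title = titles[hit["corpus_id"]]
        if title not in best or hit["score"] > best[title]["score"]:
            best[title] = {**hit, "title": title}
    return sorted(best.values(), key=lambda h: h["score"], reverse=True)


# Toy data (assumption): three chunk-level hits, two of them from the same page.
toy_titles = ["Fyzika", "Fyzika", "Fyzika částic"]
toy_hits = [
    {"corpus_id": 2, "score": 0.90},
    {"corpus_id": 0, "score": 0.89},
    {"corpus_id": 1, "score": 0.85},
]

pages = best_hit_per_page(toy_hits, toy_titles)
print([(p["title"], p["score"]) for p in pages])
```

With the flattened dataset, `titles` could be the `title` column, so that `corpus_id` indexes into it directly.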
Generating the embeddings took about 2 hours on an NVIDIA A100 80GB GPU.
## License
See the license of the original dataset: <https://huggingface.co/datasets/wikimedia/wikipedia>.
Provider: karmiq
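## Summary
This dataset contains paragraphs from Czech Wikipedia pages together with their embeddings, which makes it suitable for semantic search. It has 534,044 rows and a download size of about 4.75 GB. The embeddings were generated with the `intfloat/multilingual-e5-base` model.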



