Laz4rz/wikipedia_stem_small_rag_embeddings
Source: Hugging Face. Updated 2024-06-13. Indexed 2024-06-29.
Download link:
https://hf-mirror.com/datasets/Laz4rz/wikipedia_stem_small_rag_embeddings
Resource description:
---
license: cc-by-sa-3.0
dataset_info:
  features:
  - name: text
    dtype: string
  - name: category
    dtype: string
  - name: url
    dtype: string
  - name: title
    dtype: string
  - name: embeddings
    sequence: float64
  splits:
  - name: train
    num_bytes: 4949572549
    num_examples: 518092
  download_size: 3787534362
  dataset_size: 4949572549
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
language:
- en
pretty_name: STEMWikiSmallRAG
tags:
- RAG
- Retrieval Augmented Generation
- Small Chunks
- Wikipedia
- Science
- Scientific
- Scientific Wikipedia
- Science Wikipedia
- 512 tokens
- STEM
task_categories:
- text-generation
- text-classification
- question-answering
---
# STEMWikiSmallRAG with embeddings
This dataset contains Wikipedia entries from STEM fields. Unfortunately, it also includes Business & Economics, but I kept it because it may contain some useful data as well, even if by accident.
It is a processed version of millawell/wikipedia_field_of_science, prepared for use in small-context-length RAG systems. Chunk length is tokenizer-dependent, but each chunk should be around 512 tokens. Longer Wikipedia pages have been split into smaller entries, with the title added as a prefix. The chunks were embedded using mixedbread-ai/mxbai-embed-large-v1, with truncation to 512 tokens.
Versions without embeddings are also available, chunked to 256 and 512 tokens:
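To illustrate how the stored `embeddings` column could be used for retrieval, here is a minimal sketch that ranks rows by cosine similarity to a query vector. It assumes the query has already been embedded with the same model as the dataset; the helper names `cosine` and `top_k` and the toy three-dimensional vectors are illustrative only and do not come from the dataset card.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, rows, k=3):
    """Return the k (score, text) pairs most similar to query_vec.

    `rows` is any iterable of dicts with 'embeddings' and 'text' keys,
    which are the column names listed in the dataset card.
    """
    scored = [(cosine(query_vec, row["embeddings"]), row["text"]) for row in rows]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


# Toy vectors standing in for real embeddings (assumption for illustration).
rows = [
    {"text": "Photosynthesis: light-dependent reactions", "embeddings": [1.0, 0.0, 0.0]},
    {"text": "Graph theory: spanning trees", "embeddings": [0.0, 1.0, 0.0]},
    {"text": "Photosynthesis: Calvin cycle", "embeddings": [0.9, 0.1, 0.0]},
]

hits = top_k([1.0, 0.0, 0.0], rows, k=2)
for score, text in hits:
    print(f"{score:.3f}  {text}")
```

A linear scan like this is only practical for small subsets; for all 518,092 rows a vector index would likely be a better fit.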
- Laz4rz/wikipedia_science_chunked_small_rag_512
- Laz4rz/wikipedia_science_chunked_small_rag_256
If you wish to prepare a different chunk length:
- start from millawell/wikipedia_field_of_science
- adapt the chunker function:
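```python
import re


def chunker_clean(results, example, length=512, approx_token=3, prefix=""):
    """Recursively split `example` into chunks of about `length` tokens, appended to `results`."""
    if len(results) == 0:
        # First call only: collapse newline runs into single spaces and
        # strip the prefix (title) from the text before re-adding it per chunk.
        regex_pattern = r'[\n\s]*\n[\n\s]*'
        example = re.sub(regex_pattern, " ", example).strip().replace(prefix, "")
    # Character budget: roughly `approx_token` characters per token.
    chunk_length = length * approx_token
    if len(example) > chunk_length:
        first = example[:chunk_length]
        # Cut at the last full stop inside the window (the full stop itself is dropped).
        chunk = ".".join(first.split(".")[:-1])
        if len(chunk) == 0:
            # No sentence boundary found: hard cut, keeping every character.
            chunk = first
            rest = example[len(chunk):]
        else:
            rest = example[len(chunk)+1:]
        results.append(prefix+chunk.strip())
        if len(rest) > chunk_length:
            chunker_clean(results, rest.strip(), length=length, approx_token=approx_token, prefix=prefix)
        else:
            results.append(prefix+rest.strip())
    else:
        results.append(prefix+example.strip())
    return results
```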
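The chunker above recurses once per chunk, so a very long page could in principle approach Python's recursion limit. The following is a minimal iterative sketch of the same sentence-boundary idea, not the function used to build this dataset. The name `chunk_iterative` is hypothetical, and unlike `chunker_clean` it keeps the closing full stop with each chunk and does not strip the prefix from the body text.

```python
import re


def chunk_iterative(text, length=512, approx_token=3, prefix=""):
    """Split text into chunks of at most length * approx_token characters.

    The character budget applies to the chunk body; `prefix` is added on top.
    """
    # Collapse newline runs (with surrounding whitespace) into single spaces.
    text = re.sub(r"\s*\n\s*", " ", text).strip()
    budget = length * approx_token
    chunks = []
    while len(text) > budget:
        window = text[:budget]
        cut = window.rfind(".")
        if cut <= 0:
            # No usable sentence boundary in the window: hard cut at the budget.
            cut = budget
        else:
            cut += 1  # keep the full stop with the chunk
        chunks.append(prefix + text[:cut].strip())
        text = text[cut:].strip()
    if text:
        chunks.append(prefix + text)
    return chunks


# Tiny budget (5 "tokens" * 3 characters = 15 characters) to make the split visible.
sample = "A" * 10 + ".\n\n" + "B" * 10 + ". " + "C" * 10 + "."
chunks = chunk_iterative(sample, length=5, approx_token=3, prefix="Title: ")
for c in chunks:
    print(c)
```

Passing the page title as `prefix` mirrors the card's description of title-prefixed chunks; the exact prefix format used for the published dataset is not stated, so `"Title: "` here is only a placeholder.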
Provider:
Laz4rz
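Summary of the original information
STEMWikiSmallRAG dataset overview
Dataset information
- License: cc-by-sa-3.0
- Language: English (en)
- Features:
  - text: text data, type string
  - category: category, type string
  - url: URL, type string
  - title: title, type string
  - embeddings: embedding vector, type sequence of float64
- Splits:
  - train: training split with 518,092 examples, 4,949,572,549 bytes in total
- Download size: 3,787,534,362 bytes
- Dataset size: 4,949,572,549 bytes
- Configs:
  - default: default configuration, with data files at data/train-*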
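As a rough check on the sizes listed above, the arithmetic below estimates how much of the dataset is taken up by the `float64` embeddings. The embedding width of 1024 is an assumption about mixedbread-ai/mxbai-embed-large-v1 and is not stated in the card; the row count and dataset size are taken from the metadata.

```python
NUM_EXAMPLES = 518_092        # from the dataset card
DATASET_SIZE = 4_949_572_549  # bytes, from the dataset card
EMBED_DIM = 1024              # assumption: output width of mxbai-embed-large-v1


def embedding_bytes(rows, dim, bytes_per_value):
    """Raw storage needed for a dense rows x dim embedding matrix."""
    return rows * dim * bytes_per_value


f64 = embedding_bytes(NUM_EXAMPLES, EMBED_DIM, 8)  # as stored (float64)
f32 = embedding_bytes(NUM_EXAMPLES, EMBED_DIM, 4)  # if cast to float32

print(f"float64 embeddings: {f64 / 1e9:.2f} GB ({f64 / DATASET_SIZE:.0%} of dataset_size)")
print(f"float32 embeddings: {f32 / 1e9:.2f} GB")
```

Under that assumption the embeddings dominate the dataset size, so casting to `float32` after loading would roughly halve the memory needed for the vectors.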
Dataset description
- Name: STEMWikiSmallRAG
- Tags:
  - RAG
  - Retrieval Augmented Generation
  - Small Chunks
  - Wikipedia
  - Science
  - Scientific
  - Scientific Wikipedia
  - Science Wikipedia
  - 512 tokens
  - STEM
- Task categories:
  - text generation
  - text classification
  - question answering
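Dataset contents
- Contains Wikipedia entries from STEM fields; some entries cover Business & Economics.
- The dataset has been processed for use in small-context-length RAG systems.
- Chunk length depends on the tokenizer, but each chunk is around 512 tokens.
- Longer Wikipedia pages have been split into smaller entries, with the title added as a prefix.
- Embedded using mixedbread-ai/mxbai-embed-large-v1, with truncation to 512 tokens.
Other related datasets
- Laz4rz/wikipedia_science_chunked_small_rag_512: STEM Wikipedia dataset with 512-token chunks.
- Laz4rz/wikipedia_science_chunked_small_rag_256: STEM Wikipedia dataset with 256-token chunks.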
Data processing method
- Processed from the millawell/wikipedia_field_of_science dataset.
- Chunks can be produced with the `chunker_clean` function listed in the dataset card above.



