
Laz4rz/wikipedia_stem_small_rag_embeddings

Source: Hugging Face · Updated: 2024-06-13 · Indexed: 2024-06-29
Download link:
https://hf-mirror.com/datasets/Laz4rz/wikipedia_stem_small_rag_embeddings
Resource description:
---
license: cc-by-sa-3.0
dataset_info:
  features:
  - name: text
    dtype: string
  - name: category
    dtype: string
  - name: url
    dtype: string
  - name: title
    dtype: string
  - name: embeddings
    sequence: float64
  splits:
  - name: train
    num_bytes: 4949572549
    num_examples: 518092
  download_size: 3787534362
  dataset_size: 4949572549
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
language:
- en
pretty_name: STEMWikiSmallRAG
tags:
- RAG
- Retrieval Augmented Generation
- Small Chunks
- Wikipedia
- Science
- Scientific
- Scientific Wikipedia
- Science Wikipedia
- 512 tokens
- STEM
task_categories:
- text-generation
- text-classification
- question-answering
---

# STEMWikiSmallRAG with embeddings

This dataset contains Wikipedia entries from STEM fields. It also includes Business & Economics entries, which were left in because they may contain some useful data as well.

It is a processed version of millawell/wikipedia_field_of_science, prepared for use in RAG systems with a small context length. Chunk length is tokenizer dependent, but each chunk should be around 512 tokens. Longer Wikipedia pages have been split into smaller entries, with the title added as a prefix.

Embedded using mixedbread-ai/mxbai-embed-large-v1, with truncation to 512 tokens.

Versions without embeddings are also available, in 256- and 512-token chunk sizes:

- Laz4rz/wikipedia_science_chunked_small_rag_512
- Laz4rz/wikipedia_science_chunked_small_rag_256

If you wish to prepare some other chunk length:

- use millawell/wikipedia_field_of_science
- adapt the chunker function:

```
import re

def chunker_clean(results, example, length=512, approx_token=3, prefix=""):
    # On the first call only: collapse newline runs into single spaces
    # and strip any existing copy of the prefix from the text.
    if len(results) == 0:
        regex_pattern = r'[\n\s]*\n[\n\s]*'
        example = re.sub(regex_pattern, " ", example).strip().replace(prefix, "")
    # Character budget per chunk: tokens * approximate characters per token.
    chunk_length = length * approx_token
    if len(example) > chunk_length:
        first = example[:chunk_length]
        # Cut at the last period inside the budget; hard cut if there is none.
        chunk = ".".join(first.split(".")[:-1])
        if len(chunk) == 0:
            chunk = first
        # The +1 skips the character right after the chunk (normally the period).
        rest = example[len(chunk)+1:]
        results.append(prefix+chunk.strip())
        if len(rest) > chunk_length:
            chunker_clean(results, rest.strip(), length=length, approx_token=approx_token, prefix=prefix)
        else:
            results.append(prefix+rest.strip())
    else:
        results.append(prefix+example.strip())
    return results
```
Provider:
Laz4rz
Summary of the original information

STEMWikiSmallRAG Dataset Overview

Dataset information

  • License: cc-by-sa-3.0
  • Language: English (en)
  • Features:
    • text: text content, type string
    • category: category, type string
    • url: URL, type string
    • title: title, type string
    • embeddings: embedding vector, type sequence of float64
  • Splits:
    • train: training split with 518,092 examples, 4,949,572,549 bytes in total
  • Download size: 3,787,534,362 bytes
  • Dataset size: 4,949,572,549 bytes
  • Configs:
    • default: default config, data files at data/train-*
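The size figures above can be sanity-checked with a little arithmetic. The sketch below estimates how much of the dataset is taken up by the embeddings column. The embedding width of 1024 is an assumption about mixedbread-ai/mxbai-embed-large-v1 and is not stated on this page, so the resulting share is only a rough estimate.

```python
# Figures taken from the dataset information listed above.
NUM_EXAMPLES = 518_092
DATASET_BYTES = 4_949_572_549
DOWNLOAD_BYTES = 3_787_534_362

# Assumption (not stated on this page): each embedding has 1024 dimensions.
EMBED_DIM = 1024
BYTES_PER_FLOAT64 = 8

# Average stored size of one row (text, metadata and embedding together).
avg_row_bytes = DATASET_BYTES / NUM_EXAMPLES

# Raw size of the embeddings column under the assumed dimension.
embedding_bytes = NUM_EXAMPLES * EMBED_DIM * BYTES_PER_FLOAT64
embedding_share = embedding_bytes / DATASET_BYTES

# Ratio of the compressed download to the full dataset size.
download_ratio = DOWNLOAD_BYTES / DATASET_BYTES

print(f"average row size:  {avg_row_bytes:,.0f} bytes")
print(f"embeddings column: {embedding_bytes:,} bytes ({embedding_share:.1%})")
print(f"download / dataset: {download_ratio:.1%}")
```

Under that assumption most of the storage goes to the float64 vectors rather than the text, which is worth knowing before downloading the embedded variant instead of the text-only ones.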

Dataset description

  • Name: STEMWikiSmallRAG
  • Tags:
    • RAG
    • Retrieval Augmented Generation
    • Small Chunks
    • Wikipedia
    • Science
    • Scientific
    • Scientific Wikipedia
    • Science Wikipedia
    • 512 tokens
    • STEM
  • Task categories:
    • Text generation
    • Text classification
    • Question answering

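Dataset content

  • Contains Wikipedia entries from STEM fields; some entries cover Business and Economics.
  • The data has been processed for use in RAG systems with a small context length.
  • Chunk length depends on the tokenizer, but each chunk is roughly 512 tokens.
  • Longer Wikipedia pages have been split into smaller entries, with the title added as a prefix.
  • Embedded with mixedbread-ai/mxbai-embed-large-v1, truncated to 512 tokens.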
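Because every row already carries an embedding, retrieval comes down to comparing a query vector against the stored vectors. The following is a minimal sketch of cosine-similarity ranking in pure Python. The three-dimensional vectors and the titles are made-up toy data, and `cosine` and `top_k` are hypothetical helper names; in practice the query vector would have to come from the same embedding model as the dataset.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, rows, k=3):
    """Return the k (score, title) pairs most similar to query_vec, best first."""
    scored = [(cosine(query_vec, row["embeddings"]), row["title"]) for row in rows]
    return sorted(scored, reverse=True)[:k]

# Toy rows shaped like the dataset's features (illustrative values only).
rows = [
    {"title": "Entropy", "embeddings": [0.9, 0.1, 0.0]},
    {"title": "Graph theory", "embeddings": [0.0, 1.0, 0.0]},
    {"title": "Thermodynamics", "embeddings": [0.8, 0.2, 0.1]},
]

hits = top_k([1.0, 0.0, 0.0], rows, k=2)
for score, title in hits:
    print(f"{score:.3f}  {title}")
```

A linear scan like this is only meant to show the idea; for the full set of roughly half a million rows a vector index would be the more practical choice.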

Other related datasets

  • Laz4rz/wikipedia_science_chunked_small_rag_512: STEM Wikipedia dataset in 512-token chunks, without embeddings.
  • Laz4rz/wikipedia_science_chunked_small_rag_256: STEM Wikipedia dataset in 256-token chunks, without embeddings.
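Any of these repositories can be read with the Hugging Face `datasets` library. The sketch below streams the first few rows instead of downloading the full archive. It assumes the `datasets` package is installed and the Hub is reachable; `stream_rows` is a hypothetical helper name, and the repository ID can be swapped for one of the text-only variants listed above.

```python
REPO_ID = "Laz4rz/wikipedia_stem_small_rag_embeddings"

def stream_rows(repo_id=REPO_ID, limit=3):
    """Return the first `limit` rows of the train split without a full download."""
    # Imported lazily so that defining this helper does not require the package.
    from datasets import load_dataset  # third-party: pip install datasets

    ds = load_dataset(repo_id, split="train", streaming=True)
    rows = []
    for i, row in enumerate(ds):
        if i >= limit:
            break
        rows.append(row)
    return rows

# Example call (needs network access):
# for row in stream_rows(limit=2):
#     print(row["title"], row["url"])
```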

Data processing method

  • The source data is the millawell/wikipedia_field_of_science dataset.
  • Chunks can be produced with the following Python function:

        import re

        def chunker_clean(results, example, length=512, approx_token=3, prefix=""):
            if len(results) == 0:
                regex_pattern = r'[\n\s]*\n[\n\s]*'
                example = re.sub(regex_pattern, " ", example).strip().replace(prefix, "")
            chunk_length = length * approx_token
            if len(example) > chunk_length:
                first = example[:chunk_length]
                chunk = ".".join(first.split(".")[:-1])
                if len(chunk) == 0:
                    chunk = first
                rest = example[len(chunk)+1:]
                results.append(prefix+chunk.strip())
                if len(rest) > chunk_length:
                    chunker_clean(results, rest.strip(), length=length, approx_token=approx_token, prefix=prefix)
                else:
                    results.append(prefix+rest.strip())
            else:
                results.append(prefix+example.strip())
            return results
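For anyone adapting the chunker to another chunk length, here is a loop-based variant of the same idea. It is a sketch and not the author's function: `chunk_text` is a hypothetical name, it avoids recursion, and unlike `chunker_clean` it keeps the period that ends each chunk. The three-characters-per-token estimate is carried over from the `approx_token` default.

```python
import re

def chunk_text(text, length=512, approx_token=3, prefix=""):
    """Split text into chunks of roughly `length` tokens, cutting at sentence ends."""
    # Collapse newline runs (with surrounding whitespace) into single spaces.
    text = re.sub(r"\s*\n\s*", " ", text).strip()
    if prefix:
        # Drop copies of the prefix already present in the text.
        text = text.replace(prefix, "")
    max_chars = length * approx_token  # character budget per chunk
    chunks = []
    while len(text) > max_chars:
        window = text[:max_chars]
        cut = window.rfind(".")
        # Cut just after the last period in the window; hard cut if there is none.
        end = cut + 1 if cut > 0 else max_chars
        chunks.append(prefix + text[:end].strip())
        text = text[end:].strip()
    if text:
        chunks.append(prefix + text)
    return chunks

sample = "Alpha beta gamma. Delta epsilon zeta. Eta theta iota."
chunks = chunk_text(sample, length=10, approx_token=3, prefix="Greek: ")
for chunk in chunks:
    print(chunk)
```

With a budget of 10 tokens (30 characters) each sentence in the sample lands in its own chunk, and every chunk starts with the prefix, mirroring how the dataset prepends the page title.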