LumberChunker/GutenQA
GutenQA Dataset Overview
Basic dataset information
- License: MIT
- Task category: question-answering
- Language: English (en)
Dataset configurations
- Config name: gutenqa
  - Data files:
    - Split: gutenqa_chunks
    - Path: gutenqa_chunks.parquet
- Config name: questions
  - Data files:
    - Split: gutenqa_questions
    - Path: questions.parquet
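As a minimal sketch of how the two configurations map to files, the snippet below builds an `hf://` URI for each config. The `CONFIGS` table and the `parquet_uri` helper are hypothetical names introduced here for illustration, and the sketch assumes each parquet file sits at the root of the `LumberChunker/GutenQA` repository, following the URI pattern used in the loading example further down.

```python
# Config name -> (split name, parquet file), copied from the configuration list.
CONFIGS = {
    "gutenqa": ("gutenqa_chunks", "gutenqa_chunks.parquet"),
    "questions": ("gutenqa_questions", "questions.parquet"),
}

REPO_ID = "LumberChunker/GutenQA"


def parquet_uri(config_name: str) -> str:
    """Build an hf:// URI for a config's parquet file.

    Assumption: the file lives at the root of the dataset repository.
    """
    _split, filename = CONFIGS[config_name]
    return f"hf://datasets/{REPO_ID}/{filename}"


for name in CONFIGS:
    print(name, "->", parquet_uri(name))
```

Each resulting string can then be passed to `pandas.read_parquet`, provided `huggingface_hub` is installed so that the `hf://` protocol can be resolved.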
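Dataset contents
GutenQA consists of passages from 100 public-domain narrative books, manually extracted from Project Gutenberg and segmented with LumberChunker. Each book comes with 30 question-answer pairs.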
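Dataset columns
- Book Name: Title of the book.
- Book ID: Unique integer identifier of the book.
- Chunk ID: Integer identifier of the chunk, numbered in the order the chunks appear in the book.
- Chapter: Name of the chapter the chunk belongs to. If LumberChunker merged paragraphs from several chapters, all of the relevant chapter names are listed.
- Chunk: The book passage itself. Each row holds one chunk, that is, a group of semantically similar paragraphs assembled by LumberChunker.
- Question: A question about the specific chunk. Not every chunk has an associated question.
- Answer: The answer to the question.
- Chunk Must Contain: A substring that the correct chunk must contain; it is used to verify that a retrieved chunk is the right one.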
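The role of the Chunk Must Contain column can be sketched as a case-insensitive substring test, mirroring the check used in the evaluation example later in this card. The rows below are invented toy data, not text from the dataset, and `chunk_is_correct` and `gold_chunk_ids` are hypothetical helper names.

```python
# Toy rows that mimic the "Chunk ID" and "Chunk" columns (invented text).
rows = [
    {"Chunk ID": 0, "Chunk": "The old clerk locked the office and walked home through the fog."},
    {"Chunk ID": 1, "Chunk": "At midnight a pale figure rattled its chains outside the door."},
    {"Chunk ID": 2, "Chunk": "By morning the snow had stopped and the bells were ringing."},
]

# Toy question row with the three question-related columns.
question = {
    "Question": "What did the figure rattle at midnight?",
    "Answer": "Its chains",
    "Chunk Must Contain": "rattled its chains",
}


def chunk_is_correct(chunk: str, must_contain: str) -> bool:
    """Return True if the chunk contains the required substring (case-insensitive)."""
    return must_contain.lower() in chunk.lower()


def gold_chunk_ids(rows, must_contain):
    """Collect the IDs of all chunks that contain the required substring."""
    return [r["Chunk ID"] for r in rows if chunk_is_correct(r["Chunk"], must_contain)]


gold = gold_chunk_ids(rows, question["Chunk Must Contain"])
print(gold)
```

A retriever is then judged by whether, and how high up, a chunk passing this test appears among its top results.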
Compatibility
GutenQA is designed to test retrieval, so it is compatible with retrieval and embedding models such as:
- DPR
- Sentence Transformers
- Contriever
- OpenAI Embeddings
Dataset loading and evaluation example
The following Python example loads the dataset and evaluates retrieval performance:
import numpy as np
import pandas as pd
import torch
from transformers import AutoTokenizer, AutoModel

# Load the dataset
dataset = pd.read_parquet("hf://datasets/LumberChunker/GutenQA/GutenQA.parquet", engine="pyarrow")

# Keep only the chunks of one book
single_book_chunks = dataset[dataset["Book Name"] == "A_Christmas_Carol_-_Charles_Dickens"].reset_index(drop=True)

# Keep only the rows of that book that carry a question and answer
single_book_qa = single_book_chunks.dropna(subset=["Question", "Answer", "Chunk Must Contain"]).reset_index(drop=True)

# Load the retrieval model
tokenizer = AutoTokenizer.from_pretrained("facebook/contriever")
model = AutoModel.from_pretrained("facebook/contriever")

# Compute the embeddings
def mean_pooling(token_embeddings, mask):
    # Zero out padding positions, then average over the real tokens
    token_embeddings = token_embeddings.masked_fill(~mask[..., None].bool(), 0.)
    sentence_embeddings = token_embeddings.sum(dim=1) / mask.sum(dim=1)[..., None]
    return sentence_embeddings

inputs_chunks = tokenizer(single_book_chunks["Chunk"].tolist(), padding=True, truncation=True, return_tensors="pt")
inputs_questions = tokenizer(single_book_qa["Question"].tolist(), padding=True, truncation=True, return_tensors="pt")

# Inference only, so gradient tracking is not needed
with torch.no_grad():
    outputs_chunks = model(**inputs_chunks)
    outputs_questions = model(**inputs_questions)

embeddings_chunks = mean_pooling(outputs_chunks[0], inputs_chunks["attention_mask"]).detach().cpu().numpy()
embeddings_questions = mean_pooling(outputs_questions[0], inputs_questions["attention_mask"]).detach().cpu().numpy()

# Compute relevance
def find_index_of_match(answers, gold_label):
    # Binary relevance list: 1 at the first retrieved chunk that contains gold_label, 0 elsewhere
    relevance = []
    gold_label = gold_label.lower()
    for item in answers:
        if gold_label in item.lower():
            relevance.append(1)
            # Pad the remaining positions with zeros and stop at the first match
            relevance = relevance + (len(answers) - len(relevance)) * [0]
            break
        else:
            relevance.append(0)
    return relevance

def compute_DCG(rel):
    aux = 0
    for i in range(1, len(rel) + 1):
        aux = aux + (np.power(2, rel[i - 1]) - 1) / (np.log2(i + 1))
    return aux

def get_top_k(top_k, query_individual_embedding_numpy):
    # Dot-product similarity between every chunk and the query
    similarity = np.dot(embeddings_chunks, np.transpose(query_individual_embedding_numpy))
    top_indices = np.argsort(similarity, axis=0)[-top_k:]
    top_indices = top_indices[::-1]  # highest similarity first
    answers = []
    for i in range(len(top_indices)):
        answers.append(single_book_chunks.at[top_indices[i], "Chunk"])
    return answers

# Compute DCG@k
DCG_k_sweep = []
for j in [1, 2, 5, 10, 20]:
    DCG_list = []
    for k in range(len(single_book_qa)):
        query_embedding = embeddings_questions[k]
        answers = get_top_k(top_k=j, query_individual_embedding_numpy=query_embedding)
        gold_label = single_book_qa.loc[k, "Chunk Must Contain"]
        rel = find_index_of_match(answers=answers, gold_label=gold_label)
        DCG_list.append(compute_DCG(rel))
    DCG_k_sweep.append(np.mean(DCG_list))

# Print the DCG_k_sweep list
print(DCG_k_sweep)
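To make the DCG@k score in the example above concrete, here is a small dependency-free sketch that does not require downloading a model. It assumes the same binary relevance scheme (1 for the first retrieved chunk containing the required substring, 0 otherwise), so with a single match at rank r the score reduces to 1 / log2(r + 1). The ranked list is invented toy data, and `dcg` and `relevance_at_k` are hypothetical names.

```python
import math


def dcg(relevance):
    """Discounted cumulative gain: sum of (2^rel - 1) / log2(rank + 1), ranks starting at 1."""
    return sum(
        (2 ** rel - 1) / math.log2(rank + 1)
        for rank, rel in enumerate(relevance, start=1)
    )


def relevance_at_k(retrieved_chunks, must_contain, k):
    """Binary relevance for the top-k results: 1 at the first match, 0 elsewhere."""
    needle = must_contain.lower()
    top = retrieved_chunks[:k]
    relevance = [0] * len(top)
    for i, chunk in enumerate(top):
        if needle in chunk.lower():
            relevance[i] = 1
            break  # only the first match counts
    return relevance


# Toy ranked results (best first); the matching chunk sits at rank 3.
ranked = [
    "The old clerk locked the office and walked home through the fog.",
    "By morning the snow had stopped and the bells were ringing.",
    "At midnight a pale figure rattled its chains outside the door.",
]

for k in (1, 2, 3):
    score = dcg(relevance_at_k(ranked, "rattled its chains", k))
    print(f"DCG@{k} = {score}")
```

In this toy run the score stays at 0 for k = 1 and k = 2 and becomes 0.5 at k = 3, since the match is at rank 3 and 1 / log2(4) = 0.5. A match at rank 1 would give the maximum of 1.0.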




