Muennighoff/babi

Hugging Face | Updated 2023-02-12 | Indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/Muennighoff/babi
Resource description:
Creation (copied and adapted from https://github.com/stanford-crfm/helm/blob/0eaaa62a2263ddb94e9850ee629423b010f57e4a/src/helm/benchmark/scenarios/babi_qa_scenario.py):
```python
# Notebook shell commands: download and unpack the bAbI archive.
!wget http://www.thespermwhale.com/jaseweston/babi/tasks_1-20_v1-2.tar.gz
!tar -xf tasks_1-20_v1-2.tar.gz

import json
from typing import List

# NOTE: range(1, 20) yields tasks 1 through 19; range(1, 21) would be needed to include task 20.
tasks = list(range(1, 20))
splits = ["train", "valid", "test"]

def process_path(path: str) -> str:
    """Turn a path string (task 19) from the original format 's,w' to a verbal model-friendly format 'south west'"""
    steps: List[str] = path.split(",")
    directions = {"s": "south", "n": "north", "e": "east", "w": "west"}
    path = " ".join([directions[step] for step in steps])
    return path

for split in splits:
    with open(f"babi_{split}.jsonl", "w") as f_base:
        for task in tasks:
            split_path: str = f"./tasks_1-20_v1-2/en-valid/qa{task}_{split}.txt"
            with open(split_path, "r") as f:
                facts = list(f)
                story: List[str] = []
                for fact in facts:
                    # Each line starts with its index within the story; index 1 begins a new story.
                    fid = int(fact.split(" ")[0])
                    if fid == 1:
                        story = []
                    fact = " ".join(fact.split(" ")[1:])
                    is_question = "?" in fact
                    if is_question:
                        question, answer = fact.split("\t")[:2]
                        question, answer = question.strip(), answer.strip()
                        # All tasks except task 19 have a verbal single-word answer (e.g. kitchen, apple, yes).
                        # Task 19 (path finding) has a non verbal answer format (
                        if task == 19:
                            answer = process_path(answer)
                        f_base.write(json.dumps({
                            "passage": "".join(story),
                            "question": question,
                            "answer": answer,
                            "task": task,
                        }) + "\n")
                        # Debug check; story is a list of lines, so this only matches a line that is exactly "?".
                        if "?" in story:
                            print("STORY", "".join(story))
                    else:
                        story.append(fact)
```
Provider:
Muennighoff
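Summary of original information

Dataset overview

Dataset source

  • The dataset is built from the archive tasks_1-20_v1-2.tar.gz, which is downloaded with wget from the URL given in the creation script and then unpacked with tar.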
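
The creation script relies on notebook shell commands (`!wget`, `!tar`). Outside a notebook, the same download-and-unpack step could be sketched with the Python standard library. The helper names `extract_archive` and `download_and_extract` are hypothetical and not part of the original script, and the URL is simply the one from the script; its availability is not guaranteed.

```python
import tarfile
import urllib.request
from pathlib import Path

# URL copied from the creation script; it may no longer be reachable.
BABI_URL = "http://www.thespermwhale.com/jaseweston/babi/tasks_1-20_v1-2.tar.gz"


def extract_archive(archive_path: str, dest_dir: str = ".") -> None:
    """Unpack a .tar.gz archive into dest_dir (roughly what `tar -xf` does)."""
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(dest_dir)


def download_and_extract(url: str = BABI_URL, dest_dir: str = ".") -> Path:
    """Download the archive into dest_dir, unpack it there, and return the archive path."""
    archive_path = Path(dest_dir) / url.rsplit("/", 1)[-1]
    urllib.request.urlretrieve(url, archive_path)
    extract_archive(str(archive_path), dest_dir)
    return archive_path
```

Calling `download_and_extract()` with no arguments would be expected to leave a `tasks_1-20_v1-2/` directory in the working directory, matching the path the creation script reads from.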

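Dataset structure

  • The source archive contains 20 tasks, numbered 1 to 20. Note that the creation script iterates over range(1, 20), which covers tasks 1 through 19.
  • Each task is split into three parts: training (train), validation (valid), and test (test).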
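
To make the task and split layout concrete, the sketch below enumerates the input files for every task and split, using the path pattern from the creation script. The function name `iter_split_paths` is hypothetical, and `range(1, 21)` is an assumption chosen here to cover all 20 tasks (the creation script itself uses `range(1, 20)`).

```python
from typing import Iterator, List, Tuple

SPLITS: List[str] = ["train", "valid", "test"]


def iter_split_paths(
    tasks: range, root: str = "./tasks_1-20_v1-2/en-valid"
) -> Iterator[Tuple[str, int, str]]:
    """Yield (split, task, path) for every split and task, in the script's loop order."""
    for split in SPLITS:
        for task in tasks:
            yield split, task, f"{root}/qa{task}_{split}.txt"


# Assumption: range(1, 21) so that task 20 is included.
paths = list(iter_split_paths(range(1, 21)))
print(len(paths))  # 60 (20 tasks x 3 splits)
print(paths[0])
```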

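Data processing

  • The original text files are converted to JSON records, one per question, and written to babi_{split}.jsonl (one file per split).
  • For task 19, answers are direction sequences (e.g. s,w); the process_path function converts them to words (e.g. south west).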
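
The conversion described above can be tried on a few in-memory lines without downloading anything. The sketch below mirrors the parsing logic of the creation script; the function name `parse_babi_lines` and both sample stories are made up for illustration and are not taken from the dataset.

```python
from typing import Dict, List, Union

Record = Dict[str, Union[str, int]]


def process_path(path: str) -> str:
    """Convert a task-19 answer such as 's,w' into 'south west'."""
    directions = {"s": "south", "n": "north", "e": "east", "w": "west"}
    return " ".join(directions[step] for step in path.split(","))


def parse_babi_lines(lines: List[str], task: int) -> List[Record]:
    """Turn raw bAbI lines into one record per question, as the creation script does."""
    records: List[Record] = []
    story: List[str] = []
    for line in lines:
        index, _, text = line.partition(" ")
        if int(index) == 1:  # index 1 marks the start of a new story
            story = []
        if "?" in text:
            # Question lines are tab-separated: question, answer, supporting fact ids.
            question, answer = (part.strip() for part in text.split("\t")[:2])
            if task == 19:
                answer = process_path(answer)
            records.append({
                "passage": "".join(story),
                "question": question,
                "answer": answer,
                "task": task,
            })
        else:
            story.append(text)
    return records


# Hypothetical sample lines in the bAbI layout (not copied from the dataset).
qa_lines = [
    "1 Mary moved to the bathroom.\n",
    "2 John went to the hallway.\n",
    "3 Where is Mary? \tbathroom\t1\n",
]
path_lines = [
    "1 The office is east of the hallway.\n",
    "2 The kitchen is north of the office.\n",
    "3 How do you go from the kitchen to the hallway?\ts,w\t2 1\n",
]

qa_records = parse_babi_lines(qa_lines, task=1)
path_records = parse_babi_lines(path_lines, task=19)
print(qa_records[0]["answer"])    # bathroom
print(path_records[0]["answer"])  # south west
```

Questions do not reset the story, so a later question in the same story sees all facts accumulated so far; only a line with index 1 starts a fresh passage.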

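Data content

  • Each JSON record contains the following fields:
    • passage: the story text preceding the question
    • question: the question
    • answer: the answer
    • task: the task number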
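
A minimal sketch of reading records in this format back from a JSON Lines file follows. The helper `read_jsonl`, the field check, and the two sample records are assumptions for illustration; the demo writes a temporary file rather than assuming that babi_train.jsonl exists locally.

```python
import json
import os
import tempfile
from collections import Counter
from typing import Dict, Iterator, Union

Record = Dict[str, Union[str, int]]
EXPECTED_FIELDS = {"passage", "question", "answer", "task"}


def read_jsonl(path: str) -> Iterator[Record]:
    """Yield one record per non-empty line, checking that the four documented fields are present."""
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            record = json.loads(line)
            missing = EXPECTED_FIELDS - record.keys()
            if missing:
                raise ValueError(f"record is missing fields: {sorted(missing)}")
            yield record


# Hypothetical sample records in the documented schema.
sample = [
    {"passage": "Mary moved to the bathroom.\n", "question": "Where is Mary?",
     "answer": "bathroom", "task": 1},
    {"passage": "The office is east of the hallway.\n",
     "question": "How do you go from the office to the hallway?",
     "answer": "west", "task": 19},
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "babi_sample.jsonl")
    with open(path, "w", encoding="utf-8") as f:
        for rec in sample:
            f.write(json.dumps(rec) + "\n")
    records = list(read_jsonl(path))

# Count how many questions belong to each task.
task_counts = Counter(rec["task"] for rec in records)
print(len(records))  # 2
```

Because newlines inside `passage` are escaped by `json.dumps`, each record stays on a single line, which is what makes line-by-line reading safe here.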