instruction-pretrain/medicine-instruction-augmented-corpora

Name: instruction-pretrain/medicine-instruction-augmented-corpora
Creator: instruction-pretrain
Published: 2024-07-15 08:41:01
License: No description available

Hugging Face | Updated 2024-07-15 | Indexed 2024-06-25

Download link:

https://hf-mirror.com/datasets/instruction-pretrain/medicine-instruction-augmented-corpora

Resource overview:

This dataset contains instruction-augmented corpora in the biomedicine domain, used for pre-training language models. The dataset converts the PubMed subset into instruction-response pairs using an instruction synthesizer and mixes it with general instruction corpora. The dataset is suitable for text classification, table question answering, question answering, and zero-shot classification tasks.

Provider:

instruction-pretrain

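Summary of original information

Dataset overview

Basic information

License: other
Task categories:
- Text classification
- Table question answering
- Question answering
- Zero-shot classification
Language: English
Tags:
- Chemistry
- Biology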

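Dataset description

Dataset name: Instruction-Augmented Corpora in the Biomedicine Domain
Dataset source: A subset of PubMed, converted into instruction-augmented corpora with instruction-synthesizer.
Dataset purpose: Used for instruction pre-training, to validate the effectiveness of the instruction pre-training framework.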

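Dataset contents

Instruction synthesizer: instruction-synthesizer
Pre-trained models:
- General models:
  - InstructLM-500M
  - InstructLM-1.3B
- Domain-specific models:
  - Finance-Llama3-8B
  - Biomedicine-Llama3-8B
Instruction-augmented corpora:
- General instruction-augmented corpora: general-instruction-augmented-corpora
- Domain-specific instruction-augmented corpora: medicine-instruction-augmented-corpora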

Dataset usage

Set up dependencies:

    git clone https://github.com/microsoft/LMOps.git
    cd LMOps/instruction_pretrain
    pip install tqdm
    pip install huggingface_hub
Load and templify the data (run from LMOps/instruction_pretrain so that utils.read_compre can be imported):

    from huggingface_hub import snapshot_download
    from utils.read_compre import cook_pt_entries
    import glob
    from tqdm import tqdm
    import json

    local_dir = "/tmp/hf_files/"  # local directory where the files are saved

    # 1. Load the data entries (only files matching *00000.jsonl are downloaded)
    snapshot_download(
        repo_id="instruction-pretrain/medicine-instruction-augmented-corpora",
        allow_patterns=["*00000.jsonl"],
        local_dir=local_dir,
        repo_type="dataset",
    )

    # The wildcards in this pattern were lost in extraction; "*" is assumed here.
    data_paths = sorted(glob.glob(f"{local_dir}/part-*/shard/*"))

    all_entries = []
    for path in tqdm(data_paths):
        with open(path, "r", encoding="utf8") as f:
            jsonls = f.read().strip().split("\n")
        for jsonl in jsonls:
            all_entries.append(json.loads(jsonl))

    # 2. Templify the data for subsequent pre-training
    instruction_augmented_texts = []
    for idx, entry in enumerate(all_entries):
        # Use a different random seed for each entry
        texts = cook_pt_entries(read_collection=entry, random_seed=idx)
        instruction_augmented_texts.extend(texts)

    # 3. Print the results (only the last two texts are shown)
    for idx, text in enumerate(instruction_augmented_texts[-2:]):
        print(f"## Instruction-augmented Text {idx+1}\n{text}\n")
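
The loading step above reads each shard into memory in one call and then splits it on newlines. For larger shards, reading line by line may be gentler on memory. The following is a minimal sketch of that alternative, not part of the original dataset card; the helper name `iter_jsonl` is hypothetical, and it assumes each non-empty line of a shard is one complete JSON object.

```python
import json
from typing import Any, Iterator


def iter_jsonl(path: str) -> Iterator[Any]:
    """Yield one parsed JSON object per non-empty line of a JSONL file."""
    with open(path, "r", encoding="utf8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines, such as a trailing newline at end of file
                yield json.loads(line)
```

Under these assumptions, the inner loop of the loading step could become `all_entries.extend(iter_jsonl(path))`, with the rest of the pipeline unchanged.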

License information

PubMed Central: MIT License
