LongMIT-128K

Hugging Face, updated 2024-10-09, indexed 2024-12-12

Download link:

https://huggingface.co/datasets/donmaclean/LongMIT-128K

Resource overview:

LongMIT is a multi-hop instruction dataset for question answering and text generation. It supports Chinese and English and contains between 10K and 100K samples. High-quality question-answer pairs are generated through complex instruction formats and reasoning procedures; the construction pipeline comprises text embedding, document graph construction, document graph traversal, and multi-agent-driven data synthesis.

Created:

2024-09-27

Original Information Summary

LongMIT: Essential Factors in Crafting Effective Long Context Multi-Hop Instruction Datasets

Dataset Overview

License: Apache 2.0
Task categories:
- Question answering
- Text generation
Languages:
- English
- Chinese
Dataset size: 10K<n<100K
Pretty name: Long Context

Dataset Download

```python
import json
import os

from datasets import load_dataset

# Assumption: the original snippet uses HFCACHEDATASETS without defining it.
# None falls back to the default Hugging Face cache; set a path to override.
HFCACHEDATASETS = None


def download_longmit_datasets(dataset_name: str, save_dir: str):
    qa_pairs = []
    dataset = load_dataset(dataset_name, split='train', cache_dir=HFCACHEDATASETS, trust_remote_code=True)
    for d in dataset:
        all_docs = d['all_docs']

        if d['type'] in ['inter_doc', 'intra_doc']:
            # Multi-hop records: ask for a complete reasoning process.
            if d['language'] == 'en':
                content_key = 'Passage {pi}:\n'
                instruction_format = 'Answer the question based on the given passages.\n\nThe following are given passages.\n{concat_content}\n\nAnswer the question based on the given passages and provide a complete reasoning process.\nQuestion:{q}\nAnswer:'
            else:
                content_key = '文章 {pi}：\n'
                instruction_format = '根据给定的段落回答问题。\n\n以下是给定的段落。\n{concat_content}\n\n请结合上面材料回答以下问题，并且给出完整的推理过程。\n问题：{q}\n答案：'
        else:
            # All other records: ask for the answer only.
            if d['language'] == 'en':
                content_key = 'Passage {pi}:\n'
                instruction_format = 'Answer the question based on the given passages. Only give me the answer and do not output any other words.\n\nThe following are given passages.\n{concat_content}\n\nAnswer the question based on the given passages. Only give me the answer and do not output any other words.\nQuestion:{q}\nAnswer:'
            else:
                content_key = '文章 {pi}：\n'
                instruction_format = '根据给定的段落回答问题。只给答案，不要输出任何其他单词。\n\n以下是给定的段落。\n{concat_content}\n\n请结合上面材料回答以下问题。只给答案，不要输出任何其他单词。\n问题：{q}\n答案：'

        # Prefix each document with its 1-based passage label and join them.
        concat_content = '\n'.join([content_key.format(pi=di+1)+doc['content'] for di, doc in enumerate(all_docs)])
        question = d['question']
        answer = d['answer']

        # One JSON object per line (JSONL).
        qa_pairs.append(json.dumps(
            {
                'prompt': instruction_format.format(concat_content=concat_content, q=question),
                'output': answer
            }
        )+'\n')

    if not os.path.exists(save_dir):
        os.makedirs(save_dir)

    with open(os.path.join(save_dir, 'train.jsonl'), 'w') as fw:
        fw.write(''.join(qa_pairs))
```
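To see what a single output line looks like without downloading anything, the sketch below applies the English reasoning template to one hand-made record. The record values are invented for illustration; only the field names (`all_docs`, `content`, `question`, `answer`) come from the script above, and the line breaks inside the template are reconstructed from the page layout, so treat them as an assumption.

```python
import json

# Hypothetical record: field names follow the download script, values are made up.
record = {
    "type": "inter_doc",
    "language": "en",
    "all_docs": [
        {"content": "Marie Curie was born in Warsaw."},
        {"content": "Warsaw is the capital of Poland."},
    ],
    "question": "In which country was Marie Curie born?",
    "answer": "Warsaw is in Poland, so she was born in Poland.",
}

content_key = "Passage {pi}:\n"
instruction_format = (
    "Answer the question based on the given passages.\n\n"
    "The following are given passages.\n{concat_content}\n\n"
    "Answer the question based on the given passages and provide "
    "a complete reasoning process.\n"
    "Question:{q}\nAnswer:"
)


def build_pair(d: dict) -> dict:
    """Label each passage with a 1-based index, join them, fill the template."""
    concat_content = "\n".join(
        content_key.format(pi=i + 1) + doc["content"]
        for i, doc in enumerate(d["all_docs"])
    )
    return {
        "prompt": instruction_format.format(concat_content=concat_content, q=d["question"]),
        "output": d["answer"],
    }


pair = build_pair(record)
print(json.dumps(pair, ensure_ascii=False))
```

Each line of `train.jsonl` is one such object with a `prompt` and an `output` key.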

Building a Custom Dataset

Reference: GitHub Repo

1. Organize the private text corpus with embedding models

Step 1: Embed the source text corpus

```shell
python doc_process/embed_doc.py --config doc_process/config/embedding/embedding_example.yaml --num_process_nodes 8
```

Configuration:

```yaml
data:
  domain: wiki
  input_dir: assets/example_datasets
  doc_glob: "*_text_corpus.jsonl"
  embed_output_dir: your_local_path
```
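The `doc_glob` pattern presumably selects which files under `input_dir` are embedded. One quick way to sanity-check a pattern before launching the job is to test it against a few file names. The file names below are hypothetical, and whether `embed_doc.py` matches names in exactly this way is an assumption.

```python
from fnmatch import fnmatch

doc_glob = "*_text_corpus.jsonl"  # pattern taken from the example config

# Hypothetical file names, for illustration only.
candidates = [
    "wiki_text_corpus.jsonl",
    "news_text_corpus.jsonl",
    "wiki_text_corpus.json",
    "readme.md",
]

# Keep only the names that the shell-style pattern accepts.
matched = [name for name in candidates if fnmatch(name, doc_glob)]
print(matched)
```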

Step 2: Build the document graph with approximate kNN

```shell
python doc_process/build_doc_graph.py --command train_index --config doc_process/config/faiss/example_knn.yaml --xb example
wait

python doc_process/build_doc_graph.py --command index_shard --config doc_process/config/faiss/example_knn.yaml --xb example
wait

python doc_process/build_doc_graph.py --command search --config doc_process/config/faiss/example_knn.yaml --xb example
wait
```
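Judging by the command names and the `faiss` config directory, these commands appear to train an index, add the embeddings to it shard by shard, and then search it for each document's nearest neighbors. The idea of a kNN document graph can be sketched with an exact, brute-force search over toy vectors. This is a stand-in for illustration, not the repository's approximate search; the vectors, the value of `k`, and the function names are all assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def build_knn_graph(embeddings, k):
    """Map every document id to the ids of its k most similar documents."""
    graph = {}
    for doc_id, vec in embeddings.items():
        scored = [
            (cosine(vec, other_vec), other_id)
            for other_id, other_vec in embeddings.items()
            if other_id != doc_id
        ]
        # Highest similarity first; ties broken by id for determinism.
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        graph[doc_id] = [other_id for _, other_id in scored[:k]]
    return graph


# Toy 2-dimensional embeddings (made up): two clusters of two documents each.
embeddings = {
    "doc_a": [1.0, 0.0],
    "doc_b": [0.9, 0.1],
    "doc_c": [0.0, 1.0],
    "doc_d": [0.1, 0.9],
}
graph = build_knn_graph(embeddings, k=1)
print(graph)
```

Brute force compares every pair of documents, which is why an approximate index is the practical choice for a large corpus.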

Step 3: Traverse the document graph

```shell
python doc_process/traverse_doc_graph.py
```
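The page does not describe how the traversal works. As one plausible reading, the sketch below performs a greedy walk over a neighbor graph and groups related documents into paths of bounded length, which could then serve as the passages of one long-context sample. The strategy, the `max_docs` parameter, and the toy graph are assumptions, not the behavior of `traverse_doc_graph.py`.

```python
def walk_paths(graph, max_docs):
    """Greedy walk: each unvisited document starts a path that keeps moving
    to the first unvisited neighbor until max_docs documents are collected."""
    visited = set()
    paths = []
    for start in graph:
        if start in visited:
            continue
        path = [start]
        visited.add(start)
        current = start
        while len(path) < max_docs:
            nxt = next((n for n in graph[current] if n not in visited), None)
            if nxt is None:
                break  # dead end: every neighbor is already used
            path.append(nxt)
            visited.add(nxt)
            current = nxt
        paths.append(path)
    return paths


# Toy neighbor lists (made up), in the same shape as a kNN graph.
graph = {
    "doc_a": ["doc_b", "doc_c"],
    "doc_b": ["doc_a", "doc_d"],
    "doc_c": ["doc_a", "doc_d"],
    "doc_d": ["doc_b", "doc_c"],
    "doc_e": ["doc_a"],
}
paths = walk_paths(graph, max_docs=3)
print(paths)
```

Every document lands in exactly one path, so no passage is reused across samples in this variant.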

2. Multi-agent-driven LongMIT data synthesis

```shell
python agent/distribute_run_agents.py --config agent/configs/longqa_example.yaml
```

Citation

```bibtex
@article{chen2024essential,
  title={What are the Essential Factors in Crafting Effective Long Context Multi-Hop Instruction Datasets? Insights and Best Practices},
  author={Chen, Zhi and Chen, Qiguang and Qin, Libo and Guo, Qipeng and Lv, Haijun and Zou, Yicheng and Che, Wanxiang and Yan, Hang and Chen, Kai and Lin, Dahua},
  journal={arXiv preprint arXiv:2409.01893},
  year={2024}
}
```

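Collected Summary

Dataset Introduction

Construction Method

LongMIT-128K is built with a multi-step pipeline. First, an embedding model processes the private text corpus and produces embedding files. Next, an approximate nearest-neighbor algorithm builds a document graph, and traversing that graph yields structured paths of related documents. Finally, a multi-agent data synthesis method generates the multi-hop question-answer pairs. The pipeline thus combines efficient retrieval algorithms with cooperating agents, with the aim of producing high-quality, diverse data.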

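Features

LongMIT-128K is distinguished by its long contexts and multi-hop question answering. It covers both Chinese and English and suits question answering and text generation tasks. Its multi-hop design requires a model to integrate information across several documents to reach the correct answer. The dataset is moderate in size, between 10K and 100K samples, which keeps it rich enough to be useful without imposing an excessive computational burden.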

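Usage

Users can download LongMIT-128K directly from Hugging Face and process it with the provided Python script. The dataset supports multiple task types, and users can choose whether prompts ask for the reasoning process. Users can also follow the detailed guide in the GitHub repository to build their own long-context datasets and extend the approach to other scenarios.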
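Once the script has produced `train.jsonl`, the pairs can be read back line by line. The sketch below writes a two-line stand-in file to a temporary directory so that it runs on its own; the file contents are made up, and only the `prompt` and `output` keys follow the download script.

```python
import json
import os
import tempfile


def read_jsonl(path):
    """Parse one JSON object per non-empty line."""
    with open(path, encoding="utf-8") as fr:
        return [json.loads(line) for line in fr if line.strip()]


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "train.jsonl")
    # Stand-in for the real file written by the download script.
    with open(path, "w", encoding="utf-8") as fw:
        fw.write(json.dumps({"prompt": "Question:Q1\nAnswer:", "output": "A1"}) + "\n")
        fw.write(json.dumps({"prompt": "Question:Q2\nAnswer:", "output": "A2"}) + "\n")
    pairs = read_jsonl(path)

print(len(pairs), pairs[0]["output"])
```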

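Background and Challenges

Background Overview

LongMIT-128K was proposed in 2024 by Zhi Chen and colleagues to address the problem of constructing long-context multi-hop instruction datasets. Its core research question is how to integrate information from multiple documents effectively in order to generate complex multi-hop question-answer pairs. By combining embedding models with document graph traversal, LongMIT-128K provides high-quality long-context question-answering data for natural language processing, supporting the development of question answering systems and text generation models. The accompanying paper is available on arXiv (arXiv:2409.01893).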

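Current Challenges

Building LongMIT-128K involves several challenges. First, long-context question answering requires a model to understand and integrate information from multiple documents, which places very high demands on how the dataset is constructed. Second, traversing the document graph and generating multi-hop question-answer pairs requires sophisticated algorithms to keep the data logically coherent and informationally complete. Third, supporting multiple languages (Chinese and English) adds further complexity to data processing. These challenges affect not only dataset construction but also set a higher bar for subsequent model training and evaluation.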

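Common Scenarios

Typical Use Cases

LongMIT-128K is used in natural language processing for multi-hop question answering and long-text generation tasks. By providing both cross-document and intra-document multi-hop question-answer pairs, it helps models learn to extract and integrate information from long texts and produce accurate answers. Typical applications include developing question answering systems in academic research, training and evaluating long-text understanding models, and information retrieval and generation in multilingual settings.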

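Derived Related Work

The release of LongMIT-128K has supported research on multi-hop question answering and long-text generation. According to this summary, researchers have built models on the dataset such as graph-neural-network-based cross-document question answering models and multilingual long-text generation models, which have been applied in both academic research and practical settings to make long-text processing more useful and efficient. The dataset has also prompted work in emerging directions such as multimodal information integration and cross-lingual transfer learning.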

Recent Research on the Dataset