LongMIT-128K
LongMIT: Essential Factors in Crafting Effective Long Context Multi-Hop Instruction Datasets
Dataset Overview
- License: Apache 2.0
- Task categories:
  - Question answering
  - Text generation
- Languages:
  - English
  - Chinese
- Dataset size: 10K<n<100K
- Pretty name: Long Context
Dataset Download
```python
import json
import os

from datasets import load_dataset

# Placeholder: set to your Hugging Face cache directory (None uses the default cache).
HFCACHEDATASETS = None


def download_longmit_datasets(dataset_name: str, save_dir: str):
    qa_pairs = []
    dataset = load_dataset(dataset_name, split='train', cache_dir=HFCACHEDATASETS, trust_remote_code=True)
    for d in dataset:
        all_docs = d['all_docs']

        if d['type'] in ['inter_doc', 'intra_doc']:
            # Multi-hop samples: the prompt asks for a complete reasoning process.
            if d['language'] == 'en':
                content_key = 'Passage {pi}:\n'
                instruction_format = 'Answer the question based on the given passages.\n\nThe following are given passages.\n{concat_content}\n\nAnswer the question based on the given passages and provide a complete reasoning process.\nQuestion:{q}\nAnswer:'
            else:
                content_key = '文章 {pi}:\n'
                instruction_format = '根据给定的段落回答问题。\n\n以下是给定的段落。\n{concat_content}\n\n请结合上面材料回答以下问题,并且给出完整的推理过程。\n问题:{q}\n答案:'
        else:
            # All other samples: the prompt asks for the answer only.
            if d['language'] == 'en':
                content_key = 'Passage {pi}:\n'
                instruction_format = 'Answer the question based on the given passages. Only give me the answer and do not output any other words.\n\nThe following are given passages.\n{concat_content}\n\nAnswer the question based on the given passages. Only give me the answer and do not output any other words.\nQuestion:{q}\nAnswer:'
            else:
                content_key = '文章 {pi}:\n'
                instruction_format = '根据给定的段落回答问题。只给答案,不要输出任何其他单词。\n\n以下是给定的段落。\n{concat_content}\n\n请结合上面材料回答以下问题。只给答案,不要输出任何其他单词。\n问题:{q}\n答案:'

        # Prefix every document with its 1-based passage header and join them.
        concat_content = '\n'.join([content_key.format(pi=di+1)+doc['content'] for di, doc in enumerate(all_docs)])
        question = d['question']
        answer = d['answer']

        qa_pairs.append(json.dumps(
            {
                'prompt': instruction_format.format(concat_content=concat_content, q=question),
                'output': answer
            }
        )+'\n')

    if not os.path.exists(save_dir):
        os.makedirs(save_dir)

    # One JSON object per line (JSON Lines).
    with open(os.path.join(save_dir, 'train.jsonl'), 'w') as fw:
        fw.write(''.join(qa_pairs))
```
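As a quick sanity check, the `train.jsonl` file written by the download function above can be read back one record per line. The following is a minimal sketch that assumes only the `prompt` and `output` keys that function produces; `read_jsonl` is a hypothetical helper name and is not part of the LongMIT repository.

```python
import json
import os
import tempfile


def read_jsonl(path: str) -> list:
    """Read a JSON Lines file into a list of dicts, skipping blank lines."""
    records = []
    with open(path, "r", encoding="utf-8") as fr:
        for line in fr:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


if __name__ == "__main__":
    # Tiny stand-in file with the same {prompt, output} layout as train.jsonl.
    demo_dir = tempfile.mkdtemp()
    demo_path = os.path.join(demo_dir, "train.jsonl")
    with open(demo_path, "w", encoding="utf-8") as fw:
        fw.write(json.dumps({"prompt": "Question:2+2?\nAnswer:", "output": "4"}) + "\n")

    for rec in read_jsonl(demo_path):
        # Print the prompt length and the reference answer for each record.
        print(len(rec["prompt"]), rec["output"])
```

In practice, `demo_path` would be replaced by the `train.jsonl` path inside the `save_dir` passed to the download function.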
Building a Custom Dataset
- Reference: GitHub Repo
1. Organize the private text corpus with an embedding model
Step 1: Embed the source text corpus
```shell
python doc_process/embed_doc.py --config doc_process/config/embedding/embedding_example.yaml --num_process_nodes 8
```
- Configuration:
```yaml
data:
  domain: wiki
  input_dir: assets/example_datasets
  doc_glob: "*_text_corpus.jsonl"
  embed_output_dir: your_local_path
```
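Before launching the embedding job, it can help to check which corpus files the configuration would pick up. The sketch below assumes that `doc_glob` is resolved relative to `input_dir`; that is a guess at how the script interprets the two fields, not confirmed behavior. `list_corpus_files` is a hypothetical helper, and the config is written inline as a dict to avoid a YAML dependency.

```python
import glob
import os

# Values copied from the example configuration above.
config = {
    "data": {
        "domain": "wiki",
        "input_dir": "assets/example_datasets",
        "doc_glob": "*_text_corpus.jsonl",
        "embed_output_dir": "your_local_path",
    }
}


def list_corpus_files(data_cfg: dict) -> list:
    """Return the sorted files in input_dir whose names match doc_glob.

    Assumption: doc_glob is applied relative to input_dir.
    """
    pattern = os.path.join(data_cfg["input_dir"], data_cfg["doc_glob"])
    return sorted(glob.glob(pattern))


if __name__ == "__main__":
    for path in list_corpus_files(config["data"]):
        print(path)
```

An empty result usually means that `input_dir` or `doc_glob` does not match the local directory layout.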
Step 2: Build the document graph with approximate kNN
```shell
python doc_process/build_doc_graph.py --command train_index --config doc_process/config/faiss/example_knn.yaml --xb example
wait

python doc_process/build_doc_graph.py --command index_shard --config doc_process/config/faiss/example_knn.yaml --xb example
wait

python doc_process/build_doc_graph.py --command search --config doc_process/config/faiss/example_knn.yaml --xb example
wait
```
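To illustrate what a kNN document graph contains, the sketch below links each document to its `k` most similar neighbors by cosine similarity. It uses an exact brute-force search in pure Python as a stand-in for the approximate, FAISS-based index that the commands above build, so it is only suitable for toy inputs. The function names and the toy embeddings are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def build_knn_graph(embeddings, k):
    """Map each document index to the indices of its k most similar other documents."""
    graph = {}
    for i, u in enumerate(embeddings):
        scored = [(cosine(u, v), j) for j, v in enumerate(embeddings) if j != i]
        # Highest similarity first; ties broken by the smaller index.
        scored.sort(key=lambda s: (-s[0], s[1]))
        graph[i] = [j for _, j in scored[:k]]
    return graph


if __name__ == "__main__":
    # Toy 2-D embeddings: documents 0 and 1 point one way, 2 and 3 another.
    toy = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
    print(build_knn_graph(toy, k=1))
```

Brute force costs O(n²) similarity computations, which is why a trained and sharded approximate index is the practical choice for a real corpus.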
Step 3: Traverse the document graph
```shell
python doc_process/traverse_doc_graph.py
```
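The traversal strategy used by `traverse_doc_graph.py` is not described here. As one plausible illustration only, the sketch below performs a breadth-first walk over a kNN graph to gather a bounded group of related documents, which could then be concatenated into a single long context. The function name, the graph layout (a dict of neighbor lists), and the `max_docs` parameter are all assumptions.

```python
from collections import deque


def collect_related_docs(graph, start, max_docs):
    """Breadth-first walk from `start`, returning up to `max_docs` document ids.

    The start document is always included in the result.
    """
    visited = [start]
    seen = {start}
    queue = deque([start])
    while queue and len(visited) < max_docs:
        node = queue.popleft()
        for nb in graph.get(node, []):
            if nb not in seen:
                seen.add(nb)
                visited.append(nb)
                queue.append(nb)
                if len(visited) == max_docs:
                    break
    return visited


if __name__ == "__main__":
    # Hypothetical neighbor lists keyed by document id.
    toy_graph = {0: [1, 2], 1: [0, 3], 2: [3], 3: [2]}
    print(collect_related_docs(toy_graph, start=0, max_docs=3))
```

Breadth-first order keeps the group close to the seed document; a depth-first or random walk would produce chains of more loosely related documents instead.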
2. Multi-agent-driven LongMIT data synthesis
```shell
python agent/distribute_run_agents.py --config agent/configs/longqa_example.yaml
```
Citation
```bibtex
@article{chen2024essential,
  title={What are the Essential Factors in Crafting Effective Long Context Multi-Hop Instruction Datasets? Insights and Best Practices},
  author={Chen, Zhi and Chen, Qiguang and Qin, Libo and Guo, Qipeng and Lv, Haijun and Zou, Yicheng and Che, Wanxiang and Yan, Hang and Chen, Kai and Lin, Dahua},
  journal={arXiv preprint arXiv:2409.01893},
  year={2024}
}
```




