
ARC_finetuning

ModelScope community · Updated 2025-12-04 · Indexed 2025-11-03
Download link:
https://modelscope.cn/datasets/Kyutai/ARC_finetuning
Resource description:
# ARC-Encoder finetuning dataset

This dataset gathers the sub-datasets of supervised and synthesized samples needed to fine-tune an ARC-Encoder on context compression tasks, as described in the paper *ARC-Encoder: learning compressed text representations for large language models*, available [here](https://arxiv.org/abs/2510.20535).

## Dataset Details

### Dataset Description

It consists of 12 jsonl files split into 4 task categories: Translation, Question-Answering, Reading Comprehension and Summarization.

To fine-tune your ARC-Encoder from the HF collection [ARC-Encoders](https://huggingface.co/collections/kyutai/arc-encoders-68ee18787301407d60a57047), follow the recipe described in the paper and use the [ARC-Encoder](https://github.com/kyutai-labs/ARC-Encoder/tree/main) codebase. The proportions for sampling among these datasets are described in the paper's Appendix.

### Dataset Sources

We gathered existing datasets, whose sources are listed below:

- [AdversarialQA](https://adversarialqa.github.io), CC BY-SA 3.0
- [FreebaseQA](https://aclanthology.org/N19-1028/)
- [ASQA](https://arxiv.org/abs/2204.06092), Apache 2.0
- [MS MARCO](https://arxiv.org/abs/1611.09268)
- [SciQ](https://arxiv.org/abs/1707.06209), CC BY-NC 3.0
- [DROP](https://arxiv.org/abs/1903.00161), CC BY-SA 4.0
- [ParaSCI](https://github.com/dqxiu/ParaSCI)
- [DialogSum](https://arxiv.org/abs/2105.06762), CC BY-NC-SA 4.0
- [SamSum](https://arxiv.org/abs/1911.12237), CC BY-NC-ND 4.0
- [WikiSum](https://aclanthology.org/2021.acl-short.28/), CC BY-NC-SA 3.0

For the first 5 datasets (QA samples), we retrieved 5 passages from the [KILT](https://huggingface.co/datasets/facebook/kilt_wikipedia) (MIT license) Wikipedia passage chunks using [NVEmbed v.2](https://arxiv.org/abs/2405.17428), CC BY-NC 4.0.
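
The retrieval step for the QA samples can be pictured as a top-k nearest-neighbour search over passage embeddings. The sketch below is only an illustration under stated assumptions: it uses cosine similarity, which the dataset card does not specify, and it operates on toy vectors rather than calling the actual embedding model. The function names `cosine` and `top_k_passages` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k_passages(query_vec, passage_vecs, k=5):
    """Return the indices of the k passages most similar to the query.

    Assumption: similarity is cosine; ties are broken by the lower index.
    """
    scored = [(cosine(query_vec, vec), idx) for idx, vec in enumerate(passage_vecs)]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for _, idx in scored[:k]]
```

With real data, `query_vec` would be the embedding of a question and `passage_vecs` the embeddings of the Wikipedia chunks; `k=5` mirrors the 5 passages mentioned above.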
For the translations, we used passages from [ATLAS](https://github.com/facebookresearch/atlas), CC-BY-SA, and translated them using [Gemma 3 27B](https://huggingface.co/google/gemma-3-27b-it), Gemma licence, into:

- Spanish, French, German and Danish
- Hindi, Russian, Swahili, Arabic, Turkish, Japanese, Finnish and Chinese (simplified)

### Uses

Sub-datasets are kept separate because, at training time, we want to be able to gather in-context examples from each dataset independently to design the final fine-tuning samples.

### Licensing

The ARC-Encoder fine-tuning dataset is licensed under the CC-BY 4.0 license.

## Citations

If you use this dataset, please cite:

```bibtex
@misc{pilchen2025arcencoderlearningcompressedtext,
      title={ARC-Encoder: learning compressed text representations for large language models},
      author={Hippolyte Pilchen and Edouard Grave and Patrick Pérez},
      year={2025},
      eprint={2510.20535},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2510.20535},
}
```
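
As a rough illustration of why the sub-datasets are kept as separate files, the sketch below loads each jsonl file into its own pool, picks one pool according to sampling weights, and then draws the target and its in-context examples from that pool only. It assumes each file holds one JSON object per line. The record fields, the number of shots and the weights are placeholders: the real proportions are those given in the paper's Appendix, and the function names `read_jsonl`, `load_pools` and `build_sample` are hypothetical, not part of the ARC-Encoder codebase.

```python
import json
import random
from pathlib import Path


def read_jsonl(path):
    """Read a JSON Lines file into a list of dicts, skipping blank lines."""
    records = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


def load_pools(directory):
    """Map each *.jsonl file stem to its records, keeping sub-datasets apart."""
    return {p.stem: read_jsonl(p) for p in sorted(Path(directory).glob("*.jsonl"))}


def build_sample(pools, weights, n_shots, rng):
    """Pick one non-empty sub-dataset by weight, then draw a target and up to
    n_shots in-context examples from that same sub-dataset only.

    Assumption: `weights` maps every pool name to a non-negative number.
    """
    names = sorted(name for name in pools if pools[name])
    name = rng.choices(names, weights=[weights[n] for n in names], k=1)[0]
    k = min(n_shots + 1, len(pools[name]))
    picked = rng.sample(pools[name], k)  # sampled without replacement
    return {"dataset": name, "shots": picked[:-1], "target": picked[-1]}
```

A call such as `build_sample(load_pools("data/"), weights, 3, random.Random(0))` would return one target with three in-context examples from a single sub-dataset, where `"data/"` stands for wherever the 12 files were downloaded.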

Provider:
maas
Created:
2025-10-18