Character100
Source: arXiv. Indexed 2025-09-30.
Download link:
https://github.com/nuaa-nlp/Character100
Resource description:
Character100 is a dataset built for the task of constructing characteristic AI agents. It draws on the profiles of the top 100 most-viewed individuals on Wikipedia and consists of two subsets: a background knowledge corpus, used to train large language models (LLMs), and an expression style corpus, used to evaluate the consistency of linguistic style. Both subsets are needed to train and evaluate such agents. In terms of scale, the dataset is reported to cover 106 individuals, with 10,605 entries in the background knowledge corpus and 17,119 sentences in the expression style corpus.



