Chatbot Conversations Corpus
Source: arXiv (indexed 2025-09-30)
Download link:
https://github.com/RaquelFerrando/conversational_tokenizers.git
Description:
The Chatbot Conversations Corpus is a multilingual collection of chatbot conversations for evaluating tokenizer performance in conversational settings. The distribution of languages in the corpus influences how much token reduction is achieved. According to the analysis, languages that perform well on this dataset benefit more from conversation-optimized tokenizers. The core task is evaluating tokenizer performance on chatbot conversations.
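As a rough illustration of what "token reduction" could mean here, the sketch below compares token counts from two tokenizers over the same conversation turns. The metric definition (relative decrease in total token count), the two toy tokenizers, and the sample turns are all assumptions made for illustration; they are not taken from the corpus or its repository.

```python
from typing import Callable, Iterable, List

# A tokenizer is modeled as any function from text to a list of tokens.
Tokenizer = Callable[[str], List[str]]


def count_tokens(tokenize: Tokenizer, texts: Iterable[str]) -> int:
    """Total number of tokens produced over all texts."""
    return sum(len(tokenize(text)) for text in texts)


def token_reduction(baseline: Tokenizer, candidate: Tokenizer,
                    texts: List[str]) -> float:
    """Relative decrease in token count of candidate vs. baseline.

    Assumed definition: (baseline - candidate) / baseline.
    A positive value means the candidate needs fewer tokens.
    """
    base = count_tokens(baseline, texts)
    if base == 0:
        raise ValueError("baseline tokenizer produced no tokens")
    cand = count_tokens(candidate, texts)
    return (base - cand) / base


# Hypothetical stand-ins for real tokenizers, used only to keep the
# example self-contained.
def char_tokenizer(text: str) -> List[str]:
    """One token per non-whitespace character."""
    return [ch for ch in text if not ch.isspace()]


def word_tokenizer(text: str) -> List[str]:
    """One token per whitespace-separated word."""
    return text.split()


# Made-up conversation turns, not drawn from the corpus.
sample_turns = ["hi there", "how are you"]

reduction = token_reduction(char_tokenizer, word_tokenizer, sample_turns)
print(f"token reduction: {reduction:.2%}")
```

In practice the two callables would wrap a general-purpose tokenizer and a conversation-optimized one, and the turns would be grouped by language so that the reduction can be reported per language.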
Provided by:
Raquel Ferrando



