Chinese Simile (CS) dataset

Name: Chinese Simile (CS) dataset
Creator: Xiaomi AI Lab
Published: 2020-12-15 14:39:54
License: No description available

Source: arXiv. Updated 2020-12-15. Indexed 2024-08-06.

Download link:

http://arxiv.org/abs/2012.08117v1

Resource overview:

The Chinese Simile (CS) dataset is a large-scale dataset created by Xiaomi AI Lab that contains approximately 5.5 million simile expressions together with their surrounding context. The samples are extracted from freely accessible online novels spanning genres such as science fiction, urban fiction, and romance, and the dataset is intended as a resource for research on figurative language processing. To build it, long paragraphs are split into sentences and then recombined into simile-containing samples, so that each sample consists of either several consecutive sentences or an entire paragraph. The dataset is particularly suited to research on simile insertion and generation, especially in text polishing tasks that aim to make writing more expressive and engaging.
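The split-and-recombine construction step described above can be sketched roughly as follows. This is only an illustrative sketch, not the creators' pipeline: the comparator word "像", the sentence-ending punctuation set, the `window` parameter, and the function names `split_sentences` and `build_samples` are all assumptions that do not come from the listing.

```python
import re

# Assumption: a single comparator word marks a simile. The listing does not
# say which markers were actually used to detect similes.
COMPARATOR = "像"


def split_sentences(paragraph: str) -> list[str]:
    """Split a paragraph right after Chinese sentence-ending punctuation."""
    parts = re.split(r"(?<=[。！？])", paragraph)
    return [p.strip() for p in parts if p.strip()]


def build_samples(paragraph: str, window: int = 1) -> list[str]:
    """Build one sample per comparator-bearing sentence.

    Each sample joins that sentence with up to `window` neighbouring
    sentences on each side, so it stays a run of consecutive sentences.
    `window` is a hypothetical parameter chosen for illustration.
    """
    sentences = split_sentences(paragraph)
    samples = []
    for i, sentence in enumerate(sentences):
        if COMPARATOR in sentence:
            start = max(0, i - window)
            end = min(len(sentences), i + window + 1)
            samples.append("".join(sentences[start:end]))
    return samples
```

Calling `build_samples` on a paragraph returns a list of context windows, one per sentence that contains the assumed comparator. A real pipeline would need a more careful simile detector, since a bare comparator match also catches non-figurative uses of the word.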

Provider:

Xiaomi AI Lab

Created:

2020-12-15
