Twitter100k
Source: arXiv | Updated: 2017-03-20 | Indexed: 2024-06-21
Download link:
http://ngn.ee.tsinghua.edu.cn/members/yuting-hu/
Description:
Twitter100k is a large-scale cross-media retrieval dataset collected by the Tsinghua National Laboratory for Information Science and Technology. It contains 100,000 image-text pairs randomly crawled from Twitter, covering a wide range of domains such as sports, architecture, and food. The text is written in informal language, reflecting how people actually express themselves online. Twitter100k aims to address the limitations of existing datasets in content diversity and linguistic formality, and so to provide a more realistic benchmark for cross-media analysis. In addition, about one quarter of the images contain text that is highly relevant to the paired tweet, which opens up new application scenarios for OCR-based retrieval methods.
Provided by:
Tsinghua National Laboratory for Information Science and Technology
Created:
2017-03-20
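Summary
Background and Challenges
Background Overview
Twitter100k is a large-scale cross-media retrieval dataset of 100,000 image-text pairs spanning many domains. Its text is written in informal language, and it is intended as a more realistic benchmark for cross-media analysis.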



