
MULTIEURLEX-DOC, WIKI-DOC

Source: arXiv. Updated: 2023-10-25. Indexed: 2024-06-21.
Download link:
https://huggingface.co/datasets/AmazonScience/MultilingualMultiModalClassification
Overview:
MULTIEURLEX-DOC and WIKI-DOC are two multilingual, multi-label document image classification datasets curated by AWS AI Labs. MULTIEURLEX-DOC consists of EU legal documents in 23 languages; each document may carry one or more labels, so classifying it requires an in-depth understanding of its content. WIKI-DOC contains documents in non-European languages such as Chinese, Japanese, and Arabic, and these documents include rich non-textual content such as tables and images in addition to text. Both datasets were created to support the evaluation of document AI models in multilingual settings, in particular zero-shot cross-lingual transfer, where existing models show limited cross-lingual generalization.
Provided by:
AWS AI Labs
Created:
2023-10-25
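
The overview mentions two ideas that a small sketch may make concrete: multi-label targets (a document can carry several labels at once) and zero-shot cross-lingual transfer (train on one language, evaluate on others). The Python below is only an illustration under stated assumptions: the records, label names, language codes, and helper functions are all hypothetical and do not reflect the actual schema or contents of MULTIEURLEX-DOC or WIKI-DOC.

```python
from collections import defaultdict

# Hypothetical toy records: (language code, set of label names).
# These are NOT real rows from MULTIEURLEX-DOC or WIKI-DOC.
RECORDS = [
    ("en", {"agriculture", "trade"}),
    ("en", {"trade"}),
    ("de", {"agriculture"}),
    ("zh", {"trade", "finance"}),
]


def build_label_index(records):
    """Map every label seen in the records to a fixed column position."""
    labels = sorted({label for _, doc_labels in records for label in doc_labels})
    return {label: i for i, label in enumerate(labels)}


def to_multi_hot(doc_labels, label_index):
    """Encode a set of labels as a 0/1 vector with one slot per known label."""
    vec = [0] * len(label_index)
    for label in doc_labels:
        vec[label_index[label]] = 1
    return vec


def zero_shot_split(records, source_lang):
    """Keep the source language for training; group all other languages for testing."""
    train = [r for r in records if r[0] == source_lang]
    test_by_lang = defaultdict(list)
    for lang, doc_labels in records:
        if lang != source_lang:
            test_by_lang[lang].append((lang, doc_labels))
    return train, dict(test_by_lang)


label_index = build_label_index(RECORDS)
train, test_by_lang = zero_shot_split(RECORDS, "en")
```

In this sketch the label index is shared across languages, so a model fitted on the source-language split can be scored on each held-out language with the same output columns.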