Yusser/CC-LARD-topics

Source: Hugging Face | Updated: 2026-04-27 | Indexed: 2026-05-03
Download link:
https://hf-mirror.com/datasets/Yusser/CC-LARD-topics
Description:
CC-LARD (Common Crawl Language and Region Dataset) is a multilingual dataset derived from Common Crawl, focusing on language and region with cultural topic assignments. It includes 1,239,169 documents from 53 topic-modelled locales, each annotated with locale labels, language, FASTopic topics, and CTO leaf assignments. The dataset supports multiple languages, including Arabic, German, English, Spanish, French, Hindi, Japanese, Korean, Portuguese, Russian, Turkish, and Chinese, among others. It is suitable for tasks such as text classification and text generation, particularly for cultural evaluation, pretraining data analysis, and cross-locale fairness analysis.
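As a rough illustration of the cross-locale analysis mentioned above, the sketch below computes each locale's share of documents. The column name `locale`, the split name `train`, and the helper names are assumptions made for illustration; the listing does not state the actual schema, so check the dataset card before relying on them.

```python
from collections import Counter
from typing import Dict, Iterable, Mapping


def locale_shares(records: Iterable[Mapping], locale_key: str = "locale") -> Dict[str, float]:
    """Return the fraction of documents per locale.

    `locale_key` is a hypothetical column name; adjust it to the real schema.
    """
    counts = Counter(record[locale_key] for record in records)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {locale: n / total for locale, n in counts.items()}


def stream_cc_lard(split: str = "train"):
    """Stream the dataset from the Hugging Face Hub (needs the `datasets` package).

    The split name is an assumption; the repository id comes from the listing URL.
    """
    from datasets import load_dataset  # imported lazily so the rest runs without it

    return load_dataset("Yusser/CC-LARD-topics", split=split, streaming=True)


# Tiny in-memory sample with made-up values, only to show the expected record shape.
sample = [
    {"locale": "de-DE", "language": "de"},
    {"locale": "de-DE", "language": "de"},
    {"locale": "ja-JP", "language": "ja"},
    {"locale": "pt-BR", "language": "pt"},
]

shares = locale_shares(sample)
```

With the real data, `locale_shares(stream_cc_lard())` would iterate over all 1,239,169 documents once; a heavily skewed result would indicate which of the 53 locales dominate the corpus.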
Provided by:
Yusser