GeniL
Source: arXiv; updated 2024-04-09; indexed 2024-08-06
Download link:
http://arxiv.org/abs/2404.05866v1
Description:
GeniL is a multilingual dataset created by Google Research. It comprises over 50,000 sentences across 9 languages and is designed for detecting generalizations in language. The sentences occur naturally in the multilingual mC4 web-crawl corpus and were annotated by native speakers of each language to distinguish sentences that merely mention a generalization from those that promote one. GeniL is intended for natural language processing (NLP) work, in particular evaluating and mitigating social biases in generative language models, in support of more inclusive and responsible language technologies.
Provider:
Google Research
Created:
2024-04-09



