
jaagli/common-words-79k

Source: Hugging Face. Updated 2024-07-12; indexed 2024-07-22.
Download link:
https://hf-mirror.com/datasets/jaagli/common-words-79k
Overview:
The Common Words 79K dataset (common-words-79k) contains 79,059 words and phrases, along with sentences from Wikipedia that include them. It draws on three sources: classes selected from ImageNet-21K (each class has more than 100 usable images, and its name appears at least five times in Wikipedia), words filtered from an English wordlist, and word frequency data collected from English Wikipedia for all the words and phrases.
Provider:
jaagli
Summary of original information

Dataset overview

Dataset name

  • Name: Common Words 79K (common-words-79k)

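Dataset description

  • Content: 79,059 words and phrases, along with Wikipedia sentences that contain them.
  • Sources:
    • Classes selected from ImageNet-21K that meet two conditions: (1) the class has more than 100 usable images, and (2) the class name appears at least five times in Wikipedia.
    • Words from an English wordlist that meet condition (2).
    • Word frequency data collected from English Wikipedia for all of the words and phrases above.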
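
The two ImageNet-21K selection conditions can be sketched as a simple filter. This is a minimal illustration, not the authors' pipeline: the function name `select_classes`, the dictionary inputs, and the toy counts are hypothetical, and "at least five times" is assumed to mean a count of 5 or more.

```python
from typing import Dict, List

MIN_IMAGES = 100      # condition (1): strictly more than 100 usable images
MIN_WIKI_COUNT = 5    # condition (2): name appears at least five times in Wikipedia


def select_classes(image_counts: Dict[str, int], wiki_counts: Dict[str, int]) -> List[str]:
    """Return the class names that satisfy both selection conditions."""
    selected = []
    for name, n_images in image_counts.items():
        n_mentions = wiki_counts.get(name, 0)  # names never seen count as zero
        if n_images > MIN_IMAGES and n_mentions >= MIN_WIKI_COUNT:
            selected.append(name)
    return selected


# Toy, made-up counts for illustration only.
image_counts = {"newborn_infant": 640, "rare_class": 100, "obscure_name": 900}
wiki_counts = {"newborn_infant": 157, "rare_class": 40, "obscure_name": 4}
print(select_classes(image_counts, wiki_counts))
```

In this toy input only `newborn_infant` passes: `rare_class` has exactly 100 images, which is not more than 100, and `obscure_name` appears fewer than five times.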

Data instance

  • Example (JSON):

      {
        "alias": "newborn_infant",
        "frequency": 157,
        "sentences": [
          "It is also recited as a prayer for protection of a newborn infant.",
          "The newborn infant was named Sawai Madhavrao.",
          "Jocasta handed the newborn infant over to Laius.",
          "Spider-Man manages to save them and rescue Lily's newborn infant from the supervillains.",
          "After her newborn infant died, Alison Langdon mutilated herself while deeply depressed.",
          ...,
          "The newborn infant was named Sawai Madhavrao (\"Sawai\" means \"One and a Quarter\")."
        ]
      }
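
In the example record the alias is written with an underscore (`newborn_infant`), while the sentences use the spaced phrase. The sketch below assumes that underscores stand for spaces in every alias, which is inferred from this single example rather than documented. The helper names `alias_to_phrase` and `sentences_mentioning` are hypothetical.

```python
from typing import List


def alias_to_phrase(alias: str) -> str:
    """Turn an alias such as 'newborn_infant' into 'newborn infant' (assumed convention)."""
    return alias.replace("_", " ")


def sentences_mentioning(alias: str, sentences: List[str]) -> List[str]:
    """Keep the sentences that contain the phrase, ignoring case."""
    phrase = alias_to_phrase(alias).lower()
    return [s for s in sentences if phrase in s.lower()]


# A shortened copy of the example record, with two of its sentences.
record = {
    "alias": "newborn_infant",
    "frequency": 157,
    "sentences": [
        "It is also recited as a prayer for protection of a newborn infant.",
        "Jocasta handed the newborn infant over to Laius.",
    ],
}
hits = sentences_mentioning(record["alias"], record["sentences"])
print(len(hits), "of", len(record["sentences"]), "sentences mention the phrase")
```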

Dataset structure

  • Features:
    • alias: string
    • frequency: integer
    • sentences: sequence of strings
  • Splits:
    • whole: 79,059 examples, 83,865,723 bytes in total
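
A minimal sketch of loading the `whole` split with the Hugging Face `datasets` library and checking a record against the three listed features. The repository id is taken from the download link above. The helper names `load_common_words` and `matches_schema` are hypothetical, and the loader needs the `datasets` package and network access.

```python
from typing import Any, Mapping


def load_common_words():
    """Load the single 'whole' split; needs `pip install datasets` and network access."""
    from datasets import load_dataset  # imported lazily so the helper below works offline

    return load_dataset("jaagli/common-words-79k", split="whole")


def matches_schema(record: Mapping[str, Any]) -> bool:
    """Check a record against the listed features: alias, frequency, sentences."""
    return (
        isinstance(record.get("alias"), str)
        and isinstance(record.get("frequency"), int)
        and isinstance(record.get("sentences"), list)
        and all(isinstance(s, str) for s in record["sentences"])
    )
```

Calling `ds = load_common_words()` and then `len(ds)` should report the 79,059 examples listed above, assuming the hosted copy is unchanged. `matches_schema(ds[0])` then checks the first record.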

Dataset size

  • Download size: 54,972,667 bytes
  • Dataset size: 83,865,723 bytes
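
For a rough sense of scale, the listed sizes can be turned into an average record size and a download-to-dataset ratio. This is plain arithmetic on the figures above; the variable names are illustrative only.

```python
NUM_EXAMPLES = 79_059        # examples in the 'whole' split
DATASET_BYTES = 83_865_723   # listed dataset size
DOWNLOAD_BYTES = 54_972_667  # listed download size

avg_bytes_per_example = DATASET_BYTES / NUM_EXAMPLES
download_ratio = DOWNLOAD_BYTES / DATASET_BYTES

print(f"{avg_bytes_per_example:.1f} bytes per example on average")
print(f"download is {download_ratio:.1%} of the dataset size")
```

This works out to roughly 1.06 KB per example, with the download at about two thirds of the dataset size.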

Task category

  • Task: feature extraction

Language

  • Language: English

Citation

  • Citation:

    @misc{li2024visionlanguagemodelsshare,
      title={Do Vision and Language Models Share Concepts? A Vector Space Alignment Study},
      author={Jiaang Li and Yova Kementchedjhieva and Constanza Fierro and Anders Søgaard},
      year={2024},
      eprint={2302.06555},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2302.06555},
    }
