HSDSLab/TwitterMemes
Source: Hugging Face. Updated 2024-07-10; indexed 2025-04-26.
Download link:
https://hf-mirror.com/datasets/HSDSLab/TwitterMemes
Description:
---
language:
- en
size_categories:
- 100K<n<1M
pretty_name: twitter_memes
dataset_info:
  features:
  - name: image
    dtype: image
  - name: id
    dtype: string
  - name: user_id
    dtype: string
  - name: date
    dtype: string
  - name: likes
    dtype: int64
  - name: shares
    dtype: int64
  - name: comments
    dtype: int64
  - name: post_text
    dtype: string
  - name: post_link
    dtype: string
  - name: img_link
    dtype: string
  - name: ocr
    dtype: string
  splits:
  - name: train
    num_bytes: 8879698359.938
    num_examples: 174338
  download_size: 11301489086
  dataset_size: 8879698359.938
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
---
# Dataset Card for Twitter Image Dataset
## Table of Contents
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
## Dataset Description
### Dataset Summary
This dataset contains images scraped from Twitter along with associated metadata. The dataset is intended for research purposes, focusing on image analysis, natural language processing, and social media dynamics studies.
### Supported Tasks and Leaderboards
The dataset can be used for tasks such as image recognition, sentiment analysis, text extraction from images (OCR), and social media trend analysis.
### Languages
The text in the dataset is primarily in English, as extracted from Twitter posts.
## Dataset Structure
### Data Instances
A data instance consists of an image and its metadata. For example:
```json
{
  "id": "12345",
  "user_id": "67890",
  "date": "2023-07-04",
  "likes": 150,
  "shares": 25,
  "comments": 40,
  "post_text": "Here's a great moment captured #fun",
  "post_link": "https://twitter.com/example/status/12345",
  "img_link": "https://example.com/img.jpg",
  "ocr": "Here's a great moment captured",
  "file_name": "12345.jpg"
}
```
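As a rough illustration of how the instance above lines up with the feature types declared in the card metadata, the sketch below checks a record's non-image fields against the expected Python types. The names `EXPECTED_TYPES` and `validate_record` are hypothetical helpers, not part of the dataset or of any library, and the mapping of `string` to `str` and `int64` to `int` is an assumption about how records are decoded.

```python
# Expected Python types for the non-image features declared in the card metadata.
# Assumption: `string` decodes to str and `int64` decodes to int.
EXPECTED_TYPES = {
    "id": str,
    "user_id": str,
    "date": str,
    "likes": int,
    "shares": int,
    "comments": int,
    "post_text": str,
    "post_link": str,
    "img_link": str,
    "ocr": str,
}


def validate_record(record: dict) -> list[str]:
    """Return a list of problems; an empty list means the record passes."""
    problems = []
    for field, expected in EXPECTED_TYPES.items():
        if field not in record:
            problems.append(f"missing field: {field}")
            continue
        value = record[field]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            problems.append(
                f"{field}: expected {expected.__name__}, got {type(value).__name__}"
            )
    return problems


# The example instance from this card.
example = {
    "id": "12345",
    "user_id": "67890",
    "date": "2023-07-04",
    "likes": 150,
    "shares": 25,
    "comments": 40,
    "post_text": "Here's a great moment captured #fun",
    "post_link": "https://twitter.com/example/status/12345",
    "img_link": "https://example.com/img.jpg",
    "ocr": "Here's a great moment captured",
    "file_name": "12345.jpg",
}

problems = validate_record(example)
```

Extra keys such as `file_name` are ignored by this check, so it only reports fields that are missing or have an unexpected type.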
### Data Fields
- `image`: The image itself, stored with the `image` dtype.
- `id`: Unique identifier for each post.
- `user_id`: Twitter user ID of the post author.
- `date`: Date the post was made.
- `likes`: Number of likes the post received.
- `shares`: Number of shares the post received.
- `comments`: Number of comments on the post.
- `post_text`: Text content of the post.
- `post_link`: URL to the original Twitter post.
- `img_link`: URL to the image.
- `ocr`: Text extracted from the image using OCR.
- `file_name`: Name of the file stored locally.
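For social media analyses it can be convenient to carry the identifier, date, and counters listed above in a small typed structure. The following is a minimal sketch, assuming `date` is an ISO 8601 calendar date as in the example instance; `PostMetadata`, `parse_metadata`, and the unweighted `engagement` sum are hypothetical and not defined by the dataset.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class PostMetadata:
    """Typed view of the identifier, date, and counter fields of one record."""

    id: str
    user_id: str
    posted_on: date
    likes: int
    shares: int
    comments: int

    @property
    def engagement(self) -> int:
        # Hypothetical metric: unweighted sum of the three counters.
        return self.likes + self.shares + self.comments


def parse_metadata(record: dict) -> PostMetadata:
    """Build a PostMetadata from a raw record dictionary."""
    return PostMetadata(
        id=record["id"],
        user_id=record["user_id"],
        # Assumption: dates look like "2023-07-04"; other formats raise ValueError.
        posted_on=date.fromisoformat(record["date"]),
        likes=record["likes"],
        shares=record["shares"],
        comments=record["comments"],
    )


meta = parse_metadata(
    {
        "id": "12345",
        "user_id": "67890",
        "date": "2023-07-04",
        "likes": 150,
        "shares": 25,
        "comments": 40,
    }
)
```

If the real `date` strings carry a time component or a different layout, the parsing line would need to be adapted accordingly.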
### Data Splits
The dataset ships as a single `train` split of 174,338 examples; no validation or test splits are predefined. Users can create their own splits as needed for their specific tasks.
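Because no validation or test split is provided, one possible approach is to derive them deterministically from each post's `id`, so that the assignment stays stable across runs and machines. This is only a sketch: `assign_split` and the 80/10/10 default fractions are illustrative assumptions, not a recommendation from the dataset authors.

```python
import hashlib


def assign_split(
    post_id: str, val_fraction: float = 0.1, test_fraction: float = 0.1
) -> str:
    """Map a post id to 'train', 'validation', or 'test' deterministically."""
    digest = hashlib.sha256(post_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer and scale it to [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    if bucket < test_fraction:
        return "test"
    if bucket < test_fraction + val_fraction:
        return "validation"
    return "train"


# Tally the assignment over synthetic ids to see the approximate proportions.
counts = {"train": 0, "validation": 0, "test": 0}
for i in range(10_000):
    counts[assign_split(str(i))] += 1
```

Hashing `user_id` instead of `id` would keep all posts from one author in the same split, which may matter when leakage between authors is a concern. When working with the Hugging Face `datasets` library, `Dataset.train_test_split` offers a seeded random alternative.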
Provider:
HSDSLab



