
NLP-POL/instagram-political-communication-it

Source: Hugging Face. Updated: 2026-01-03. Indexed: 2026-03-29.
Download link:
https://hf-mirror.com/datasets/NLP-POL/instagram-political-communication-it
Resource description:
---
dataset_info:
  - name: instagram-political-communication-it
    version: 1.0.0
license: cc-by-4.0
task_categories:
  - text-classification
  - feature-extraction
  - sentence-similarity
  - text-retrieval
  - text-ranking
  - tabular-classification
language:
  - it
configs:
  - config_name: profiles
    data_files:
      - split: train
        path: "data/profiles/*.parquet"
  - config_name: posts
    data_files:
      - split: train
        path: "data/posts/*.parquet"
  - config_name: comments
    data_files:
      - split: train
        path: "data/comments/*.parquet"
  - config_name: post_sentiment_saliency
    data_files:
      - split: train
        path: "data/post_sentiment_saliency/*.parquet"
  - config_name: post_hate_speech_saliency
    data_files:
      - split: train
        path: "data/post_hate_speech_saliency/*.parquet"
  - config_name: post_keyphrases
    data_files:
      - split: train
        path: "data/post_keyphrases/*.parquet"
  - config_name: comment_sentiment_saliency
    data_files:
      - split: train
        path: "data/comment_sentiment_saliency/*.parquet"
  - config_name: comment_hate_speech_saliency
    data_files:
      - split: train
        path: "data/comment_hate_speech_saliency/*.parquet"
  - config_name: comment_keyphrases
    data_files:
      - split: train
        path: "data/comment_keyphrases/*.parquet"
---

# Instagram Political Communication (Italy) - NLP-POL

## Dataset Summary

This dataset is part of **NLP-POL (NLP for Political Communication)**, a research project focused on the analysis of political communication strategies through Natural Language Processing.

The dataset contains **Instagram posts and comments** collected from **more than 300 Italian political figures**, primarily members of the Italian Parliament (with a strong focus on Deputies). It includes both **content published by political actors** and **public audience reactions** expressed through comments.

The dataset is intended to support research in:

- political discourse and framing
- sentiment and emotional tone in political communication
- public reactions to political messaging
- hate speech and moderation-related analysis
- semantic representations of political language

The dataset is **actively maintained and periodically updated** as new Instagram content is scraped and processed.

## Dataset Structure

The dataset is released as a **multi-table, relational dataset** with flat schemas and stable identifiers. A normalized design is used to ensure scalability, efficient joins, and reproducibility.

### Core Tables (this repository)

| Table | Description |
|-------|-------------|
| `profiles` | Public political figures |
| `posts` | Instagram posts published by political profiles |
| `comments` | Public comments under posts |
| `post_sentiment_saliency` | Salient sentiment terms extracted from posts |
| `post_hate_speech_saliency` | Salient hate-speech-related terms extracted from posts |
| `post_keyphrases` | Keyphrases extracted from posts |
| `comment_sentiment_saliency` | Salient sentiment terms extracted from comments |
| `comment_hate_speech_saliency` | Salient hate-speech-related terms extracted from comments |
| `comment_keyphrases` | Keyphrases extracted from comments |

> 🔗 **Companion embeddings dataset**: `instagram-political-communication-it-embeddings`: [Go to repository](https://huggingface.co/datasets/NLP-POL/instagram-political-communication-it-embeddings)

## Example Usage

The following example shows how to load multiple tables from the core dataset.

```python
import duckdb

# Open an in-memory DuckDB connection; hf:// paths are read over the network.
con = duckdb.connect()

# Preview the first 10 posts.
posts_df = con.execute("""
    SELECT *
    FROM read_parquet('hf://datasets/NLP-POL/instagram-political-communication-it/data/posts/*.parquet')
    LIMIT 10
""").fetch_df()

# Fetch the comments that reference those posts through post_info__id.
# The IDs are bound as query parameters instead of being pasted into the SQL text.
post_ids = posts_df["_id"].tolist()
placeholders = ", ".join("?" for _ in post_ids)
comments_df = con.execute(f"""
    SELECT *
    FROM read_parquet('hf://datasets/NLP-POL/instagram-political-communication-it/data/comments/*.parquet')
    WHERE post_info__id IN ({placeholders})
""", post_ids).fetch_df()

print(posts_df.head())
print(comments_df.head())
```

## Data Fields Overview

### Profiles (`profiles`)

Each row represents a public political figure.

Key fields:

- `_id`: unique profile identifier
- `nome`: full name
- `instagram`: Instagram handle
- `x`: X/Twitter handle (if available)
- `partito`: political party affiliation
- `descriptions`: list of public role descriptions
- `url_name`: normalized URL-friendly name
- `instagram_posts_count`: number of scraped posts
- `dataset_version`

### Posts (`posts`)

Each row represents one Instagram post.

Key fields:

- `_id`: post identifier
- `uri`: public Instagram URL
- `author`: Instagram username
- `datetime`: UTC timestamp of publication
- `caption`: post caption text
- `topics`: high-level topic labels
- sentiment scores (`sentiment_positive`, `neutral`, `negative`)
- hate speech scores (`acceptable`, `inappropriate`, `offensive`, `violent`)
- `comments_ids_count`
- `dataset_version`

### Comments (`comments`)

Each row represents a public comment under a post.

Key fields:

- `_id`: comment identifier
- `username`: commenting user
- `datetime`: UTC timestamp
- `text`: comment text
- `likes`: number of likes (if available)
- `post_info__id`: referenced post identifier
- `post_info_author`: post author username
- `post_info_datetime`: post publication timestamp
- sentiment and hate speech scores
- `dataset_version`

## Data Collection

### Sources

- Public Instagram profiles of Italian political figures
- Publicly available posts and comments only

Data is collected through **periodic scraping** of publicly accessible content.

## Data Processing Pipeline

The dataset is generated through a structured NLP pipeline:

1. Scraping of Instagram content
2. Text normalization and cleaning
3. Topic classification
4. Sentiment analysis
5. Hate speech classification
6. Keyphrase extraction
7. Semantic embedding generation (released separately)

All preprocessing steps are applied consistently across dataset versions.

## Embeddings Dataset

Vector representations are released in a **separate companion dataset**: **`instagram-political-communication-it-embeddings`**

This includes:

- post-level embeddings
- sentence-level embeddings
- comment embeddings
- keyphrase embeddings

This separation enables lighter downloads, independent versioning, and model updates without breaking the core dataset.

## Intended Use

### Primary Use Cases

- Political communication analysis
- Computational social science
- NLP benchmarking on political language
- Sentiment and hate speech research

## Limitations and Biases

- The dataset reflects Instagram usage and engagement patterns
- Audience comments are not representative of the general population
- Automated NLP annotations may introduce bias or errors

Users should assess suitability for their specific research goals.

## License

This dataset is released under the **Creative Commons Attribution 4.0 (CC-BY 4.0)** license.
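
## Sketch: Joining Comments to Posts

The key fields listed for `comments` include `post_info__id`, which references the `_id` of a row in `posts`. The following is a minimal sketch of that join, assuming only those documented column names. It uses Python's built-in `sqlite3` as a stand-in for DuckDB so that it runs without downloading anything, and every row in it is invented for illustration rather than taken from the dataset.

```python
import sqlite3

# In-memory database standing in for the Parquet tables; all rows are invented.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE posts (_id TEXT PRIMARY KEY, author TEXT, caption TEXT);
    CREATE TABLE comments (
        _id TEXT PRIMARY KEY,
        post_info__id TEXT,
        "text" TEXT,
        likes INTEGER
    );
""")
con.executemany("INSERT INTO posts VALUES (?, ?, ?)", [
    ("p1", "example_deputy", "Caption one"),
    ("p2", "example_senator", "Caption two"),
])
con.executemany("INSERT INTO comments VALUES (?, ?, ?, ?)", [
    ("c1", "p1", "First comment", 3),
    ("c2", "p1", "Second comment", 0),
    ("c3", "p2", "Third comment", 1),
])

# Join each comment to its parent post through post_info__id,
# then count comments and sum likes per post.
rows = con.execute("""
    SELECT p._id, p.author, COUNT(c._id) AS n_comments, SUM(c.likes) AS total_likes
    FROM posts AS p
    LEFT JOIN comments AS c ON c.post_info__id = p._id
    GROUP BY p._id, p.author
    ORDER BY p._id
""").fetchall()

for row in rows:
    print(row)
```

The same `SELECT` should carry over to DuckDB with the two table names replaced by `read_parquet(...)` calls on the `posts` and `comments` files, although that variant needs network access and has not been checked against the published schema.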

## Citation

If you use this dataset, please cite:

```bibtex
@dataset{nlp_pol_instagram_political_communication_it_2026,
  title        = {NLP-POL: Instagram Political Communication in Italy},
  author       = {PMG-t and NLP-POL Project},
  year         = {2026},
  publisher    = {Hugging Face},
  url          = {https://huggingface.co/datasets/PMG-t/instagram-political-communication-it},
  note         = {Maintained by PMG-t. Part of the NLP-POL (NLP for Political Communication) project.},
  howpublished = {\url{https://github.com/PMG-t}}
}
```
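
## Sketch: Building Table Paths from Config Names

The `configs` list in the card metadata maps each table name to a Parquet glob of the form `data/<config>/*.parquet`, and the usage example reads those files through `hf://datasets/...` URLs. The sketch below combines the two so that a query can be built for any of the nine tables. The helper names `parquet_glob` and `preview_query` are hypothetical and not part of DuckDB or the `datasets` library, and the code only builds strings, so it runs offline.

```python
REPO_ID = "NLP-POL/instagram-political-communication-it"

# Config names copied from the `configs` list in the dataset card metadata.
CONFIGS = (
    "profiles",
    "posts",
    "comments",
    "post_sentiment_saliency",
    "post_hate_speech_saliency",
    "post_keyphrases",
    "comment_sentiment_saliency",
    "comment_hate_speech_saliency",
    "comment_keyphrases",
)


def parquet_glob(config: str, repo_id: str = REPO_ID) -> str:
    """Return the hf:// Parquet glob for one table (hypothetical helper)."""
    if config not in CONFIGS:
        raise ValueError(f"unknown config {config!r}; expected one of {CONFIGS}")
    return f"hf://datasets/{repo_id}/data/{config}/*.parquet"


def preview_query(config: str, limit: int = 10) -> str:
    """Build a SQL string that previews the first rows of one table."""
    if limit < 1:
        raise ValueError("limit must be a positive integer")
    return f"SELECT * FROM read_parquet('{parquet_glob(config)}') LIMIT {int(limit)}"


print(preview_query("posts"))
```

The returned string can be passed to `duckdb.connect().execute(...)` in the same way as the queries in the usage example. Validating the config name up front gives a clear error for a typo instead of a failed remote read.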

Provider:
NLP-POL