
zennn077/India_budget

Source: Hugging Face. Updated 2024-02-08. Indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/zennn077/India_budget
Resource description:
# In a notebook, install the libraries imported below first:
#   !pip install requests beautifulsoup4
import csv

import requests
from bs4 import BeautifulSoup


# Function to scrape data from the website
def scrape_website(url):
    # Send a GET request to the URL (the timeout avoids hanging forever)
    response = requests.get(url, timeout=30)

    # Check if the request was successful
    if response.status_code != 200:
        print("Failed to retrieve data from the website.")
        return None, None

    # Parse the HTML content
    soup = BeautifulSoup(response.content, 'html.parser')

    # Find the press release content
    press_release_content = soup.find('div', {'id': 'divPressRelease'})
    if press_release_content is None:
        print("Press release container not found on the page.")
        return None, None

    # Extract the title and content, guarding against missing tags
    title_tag = press_release_content.find('h1')
    content_tag = press_release_content.find('div', {'class': 'pressreldetail'})
    if title_tag is None or content_tag is None:
        print("Title or content not found on the page.")
        return None, None

    return title_tag.text.strip(), content_tag.text.strip()


# Main function
def main():
    # URL of the website to scrape
    url = 'https://www.pib.gov.in/PressReleasePage.aspx?PRID=1895315'

    # Scrape data from the website
    title, content = scrape_website(url)

    # Write the scraped data to a CSV file
    if title and content:
        with open('scraped_data.csv', 'w', newline='', encoding='utf-8') as csvfile:
            writer = csv.writer(csvfile)
            writer.writerow(['Title', 'Content'])
            writer.writerow([title, content])
        print("Scraped data has been saved to 'scraped_data.csv'.")
    else:
        print("No data was scraped.")


# Run the scraper when the file is executed as a script
if __name__ == '__main__':
    main()

Provider:
zennn077
Summary of original information

Dataset overview

Data source

  • The data comes from the web page https://www.pib.gov.in/PressReleasePage.aspx?PRID=1895315

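Data content

  • The data consists of the title and body text of a press release.
  • The title is taken from the <h1> element inside <div id="divPressRelease">, and the body text from the <div class="pressreldetail"> element inside that same container.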
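
The extraction described above can be tried offline on an inline HTML sample. This is a minimal sketch, not the dataset author's exact code: the sample markup and the helper name `extract_press_release` are assumptions made for illustration, and only the `divPressRelease` id and the `pressreldetail` class come from the scraper. It also shows one way to return `(None, None)` when the expected tags are absent.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mimics the structure the scraper expects.
SAMPLE_HTML = """
<div id="divPressRelease">
  <h1>  Sample title  </h1>
  <div class="pressreldetail">
    Sample body text.
  </div>
</div>
"""


def extract_press_release(html):
    """Return (title, content), or (None, None) if the expected tags are missing."""
    soup = BeautifulSoup(html, "html.parser")

    # Locate the container that wraps the whole press release.
    container = soup.find("div", {"id": "divPressRelease"})
    if container is None:
        return None, None

    # Look up the title and body inside the container.
    title_tag = container.find("h1")
    body_tag = container.find("div", {"class": "pressreldetail"})
    if title_tag is None or body_tag is None:
        return None, None

    # get_text(strip=True) removes surrounding whitespace from the text.
    return title_tag.get_text(strip=True), body_tag.get_text(strip=True)


title, content = extract_press_release(SAMPLE_HTML)
missing = extract_press_release("<p>no press release here</p>")
```

Checking for `None` before reading the text avoids an `AttributeError` if the page layout changes or the request returns an error page.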

Data storage

  • The extracted data is stored in the CSV file scraped_data.csv.
  • The CSV file has two columns: Title and Content.
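
The two-column layout described above can be written and read back with Python's standard `csv` module. The sketch below uses placeholder values rather than real scraped text, writes to a temporary directory instead of the working directory, and the helper names `write_rows` and `read_rows` are hypothetical.

```python
import csv
import os
import tempfile


def write_rows(path, title, content):
    """Write a header row plus one Title/Content row."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["Title", "Content"])
        writer.writerow([title, content])


def read_rows(path):
    """Read the file back as a list of dicts keyed by the header row."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


# Placeholder values; the body deliberately contains a comma and a newline
# to show that the csv module quotes such fields correctly.
path = os.path.join(tempfile.mkdtemp(), "scraped_data.csv")
write_rows(path, "Sample title", "Line one,\nline two with a comma")
rows = read_rows(path)
```

Opening the file with `newline=""` for both writing and reading is what the `csv` documentation recommends, and it keeps line breaks inside the Content field intact.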