
ro-h/regulatory_comments_api

Hugging Face | Updated 2024-03-21 | Indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/ro-h/regulatory_comments_api
Resource description:
---
language:
- en
tags:
- government
- api
- policy
pretty_name: Regulation.gov Public Comments API
size_categories:
- n<1K
task_categories:
- text-classification
---

# Dataset Card for Regulatory Comments (Direct API Call)

United States governmental agencies often open proposed regulations to the public for comment. Proposed regulations are organized into "dockets". This dataset uses the Regulations.gov public API to aggregate and clean public comments for dockets selected by the user.

Each example consists of one docket and includes metadata such as docket id, docket title, etc. Each docket entry also includes information about the top 10 comments, including comment metadata and comment text.

In this version, the data is called directly from the API, which can result in slow load times. If the user wants to simply load a pre-downloaded dataset, reference https://huggingface.co/datasets/ro-h/regulatory_comments.

For an example of how to use this dataset structure, reference [https://colab.research.google.com/drive/1AiFznbHaDVszcmXYS3Ht5QLov2bvfQFX?usp=sharing].

## Dataset Details

### Dataset Description and Structure

This dataset calls the API to gather docket and comment information individually. The user must input their own API key ([https://open.gsa.gov/api/regulationsgov/]), as well as a list of dockets they want to draw information from. Data collection stops when all input dockets have been gathered, or when the API hits a rate limit.

The government API limit is 1000 calls per hour. Since each comment's text requires an individual call and ~10 comments are collected per docket, only around 100 dockets can be collected without extending the limit. Furthermore, since some dockets have no comment data and are therefore not included, it is realistic to expect approximately 60-70 dockets to be collected. For this number of dockets, expect a load time of around 10 minutes.

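
The rate-limit arithmetic above can be sketched in a few lines. This is only an illustration: the function name `docket_budget` and the `overhead_calls` parameter are hypothetical, and the assumption that each docket might cost a couple of extra calls (for example, one for docket metadata and one for the comment listing) is not stated in the card.

```python
def docket_budget(calls_per_hour=1000, comments_per_docket=10, overhead_calls=0):
    """Return how many dockets fit into one hour of API calls.

    overhead_calls is an assumed number of extra calls per docket
    beyond the one-call-per-comment cost described in the card.
    """
    calls_per_docket = comments_per_docket + overhead_calls
    return calls_per_hour // calls_per_docket


print(docket_budget())                  # 100, matching the card's estimate
print(docket_budget(overhead_calls=2))  # 83 if two extra calls per docket are assumed
```

Either way, the number of dockets actually returned will be lower, since dockets without comment data are dropped.
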
If a larger set of dockets is required, consider requesting a rate-unlimited API key. For more details, visit [https://open.gsa.gov/api/regulationsgov/].

The following information is included in this dataset:

**Docket Metadata**

- id (int): A unique numerical identifier assigned to each regulatory docket.
- agency (str): The abbreviation for the agency posting the regulatory docket (e.g., "FDA").
- title (str): The official title or name of the regulatory docket. This title typically summarizes the main issue or area of regulation covered by the docket.
- update_date (str): The date when the docket was last modified on Regulations.gov.
- update_time (str): The time when the docket was last modified on Regulations.gov.
- purpose (str): Whether the docket was rulemaking, non-rulemaking, or other.
- keywords (list): A list of keywords, as determined by Regulations.gov.

**Comment Metadata**

Note that Hugging Face converts lists of dictionaries to dictionaries of lists.

- comment_id (int): A unique numerical identifier for each public comment submitted on the docket.
- comment_url (str): A URL or web link to the specific comment or docket on Regulations.gov. This allows direct access to the original document or page for replicability purposes.
- comment_date (str): The date when the comment was posted on Regulations.gov. This is important for understanding the timeline of public engagement.
- comment_time (str): The time when the comment was posted on Regulations.gov.
- commenter_fname (str): The first name of the individual or entity that submitted the comment. This could be a person, organization, business, or government entity.
- commenter_lname (str): The last name of the individual or entity that submitted the comment.
- comment_length (int): The length of the comment in characters (spaces included).

**Comment Content**

- text (str): The actual text of the comment submitted.
This is the primary content for analysis, containing the commenter's views, arguments, and feedback on the regulatory matter.

### Dataset Limitations

Commenter name features were phased in later in the system, so some dockets will have no first name/last name entries. Further, some comments were uploaded solely via attachment and are stored in the system as null, since the API has no access to comment attachments.

- **Curated by:** Ro Huang

### Dataset Sources

- **Repository:** [https://huggingface.co/datasets/ro-h/regulatory_comments_api]
- **Original Website:** [https://www.regulations.gov/]
- **API Website:** [https://open.gsa.gov/api/regulationsgov/]

## Uses

This dataset may be used by researchers or policy stakeholders curious about the influence of public comments on regulation development. For example, sentiment analysis may be run on comment text; alternatively, simple descriptive analysis of comment length and agency regulation may prove interesting.

## Dataset Creation

### Curation Rationale

After a law is passed, it may require specific details or guidelines to be practically enforceable or operable. Federal agencies and the Executive branch engage in rulemaking, which specifies the practical ways that legislation can be turned into reality. They then open a Public Comment period in which they receive comments, suggestions, and questions on the regulations they proposed. After taking in the feedback, the agency modifies its regulation and posts a final rule.

As an example, imagine that the legislative branch of the government passes a bill to increase the number of hospitals nationwide. While the Congressman drafting the bill may have provided some general guidelines (e.g., there should be at least one hospital in a zip code), there is oftentimes ambiguity on how the bill’s goals should be achieved.
The Department of Health and Human Services is tasked with implementing this new law, given its relevance to national healthcare infrastructure. The agency would draft and publish a set of proposed rules, which might include criteria for where new hospitals can be built, standards for hospital facilities, and the process for applying for federal funding. During the Public Comment period, healthcare providers, local governments, and the public can provide feedback or express concerns about the proposed rules. The agency will then read through these public comments and modify its regulation accordingly.

While this is a vital part of the United States regulatory process, there is little understanding of how agencies approach public comments and modify their proposed regulations. Further, the data extracted from the API is often unclean and difficult to navigate.

#### Data Collection and Processing

**Filtering Methods:** For each docket, we retrieve relevant metadata such as docket ID, title, context, purpose, and keywords. Additionally, the top 10 comments for each docket are collected, including their metadata (comment ID, URL, date, title, commenter's first and last name) and the comment text itself. The process uses only the first page of 25 comments for each docket, and the top 10 comments are selected based on their order of appearance in the API response. Dockets with no comments are filtered out.

**Data Normalization:** The collected data is normalized into a structured format. Each docket and its associated comments are organized into a nested dictionary structure. This structure includes key information about the docket and a list of comments, each with its detailed metadata.

**Data Cleaning:** HTML text tags are removed from comment text. However, the content of the comment remains unedited, meaning any typos or grammatical errors in the original comment are preserved.

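
The filtering, normalization, and cleaning steps described above might look roughly like the sketch below. It is not the author's script: the helper names `strip_html` and `build_docket_record`, the input field names, and the regex-based tag removal are all assumptions made for illustration.

```python
import html
import re

# Assumed approach: a simple regex that matches anything shaped like an HTML tag.
TAG_RE = re.compile(r"<[^>]+>")


def strip_html(text):
    """Remove HTML tags and decode entities; the wording itself is left untouched."""
    if text is None:  # comments uploaded only as attachments have no text
        return None
    return html.unescape(TAG_RE.sub("", text)).strip()


def build_docket_record(docket, first_page_comments, limit=10):
    """Nest the first `limit` comments, in API order, under their docket.

    Returns None for dockets with no comments so the caller can filter them out.
    """
    if not first_page_comments:
        return None
    comments = [
        {"comment_id": c["id"], "text": strip_html(c.get("text"))}
        for c in first_page_comments[:limit]
    ]
    return {"id": docket["id"], "title": docket["title"], "comments": comments}


# Hypothetical sample data standing in for one page of 25 API results.
page = [{"id": i, "text": f"<p>comment {i}</p>"} for i in range(25)]
record = build_docket_record({"id": "D-1", "title": "Example docket"}, page)
print(len(record["comments"]))        # 10
print(record["comments"][0]["text"])  # comment 0
```

Slicing the first page preserves the API's ordering, which is how the card says the top 10 comments are chosen.
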
**Tools and Libraries Used:**

- Requests Library: Used for making API calls to the Regulations.gov API to fetch dockets and comments data.
- Datasets Library from HuggingFace: Employed for defining and managing the dataset's structure and generation process.
- Python: The entire data collection and processing script is written in Python.

**Error Handling:** In the event of a failed API request (indicated by a non-200 HTTP response status), the data collection process for the current docket is halted, and the process moves to the next docket.
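
The skip-on-failure behavior described under Error Handling can be sketched as follows. The function `collect_dockets` and its `fetch` parameter are hypothetical: `fetch` stands in for whatever performs the HTTP request (for instance a wrapper around `requests.get`) and is assumed to return a status code together with the parsed payload.

```python
def collect_dockets(docket_ids, fetch):
    """Gather docket payloads, skipping any docket whose request fails.

    fetch(docket_id) is an assumed callable returning (status_code, payload).
    """
    collected = []
    for docket_id in docket_ids:
        status, payload = fetch(docket_id)
        if status != 200:
            # Abandon this docket and move on to the next one.
            continue
        collected.append(payload)
    return collected


# Hypothetical stand-in for the network call, so the sketch runs offline.
def fake_fetch(docket_id):
    if docket_id == "BAD-1":
        return 500, None
    return 200, {"id": docket_id}


print(collect_dockets(["A-1", "BAD-1", "C-1"], fake_fetch))
# [{'id': 'A-1'}, {'id': 'C-1'}]
```

Passing the fetch function in as a parameter keeps the loop testable without an API key or network access.
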
Provided by:
ro-h
Summary of original information

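Dataset Card for Regulatory Comments (Direct API Call)

Dataset Details

Dataset Description and Structure

This dataset calls the Regulations.gov public API to collect and clean public comments for dockets selected by the user. Each example contains one docket and includes metadata such as the docket ID and docket title. Each docket entry also includes information about the top 10 comments, including comment metadata and comment text.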

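Dataset Contents

Docket Metadata

  • id (int): A unique numerical identifier for each regulatory docket.
  • agency (str): The abbreviation of the agency posting the regulatory docket (e.g., "FDA").
  • title (str): The official title or name of the regulatory docket.
  • update_date (str): The date when the docket was last modified on Regulations.gov.
  • update_time (str): The time when the docket was last modified on Regulations.gov.
  • purpose (str): Whether the docket is rulemaking, non-rulemaking, or other.
  • keywords (list): A list of keywords, as determined by Regulations.gov.

Comment Metadata

  • comment_id (int): A unique numerical identifier for each public comment submitted on the docket.
  • comment_url (str): A URL or web link to the specific comment or docket on Regulations.gov.
  • comment_date (str): The date when the comment was posted on Regulations.gov.
  • comment_time (str): The time when the comment was posted on Regulations.gov.
  • commenter_fname (str): The first name of the individual or entity that submitted the comment.
  • commenter_lname (str): The last name of the individual or entity that submitted the comment.
  • comment_length (int): The number of characters in the comment (spaces included).

Comment Content

  • text (str): The actual text of the comment submitted.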
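
The dataset card notes that Hugging Face converts lists of dictionaries to dictionaries of lists, so the comment fields listed above come back as parallel lists rather than as one dictionary per comment. A minimal sketch of that reshaping in plain Python follows; the helper name `to_dict_of_lists` and the sample values are hypothetical and are not part of the `datasets` library.

```python
def to_dict_of_lists(comments):
    """Turn [{'a': 1}, {'a': 2}] into {'a': [1, 2]} (all dicts share the same keys)."""
    if not comments:
        return {}
    return {key: [c[key] for c in comments] for key in comments[0]}


# Made-up sample comments using field names from the card.
comments = [
    {"comment_id": 1, "comment_length": 42},
    {"comment_id": 2, "comment_length": 7},
]
print(to_dict_of_lists(comments))
# {'comment_id': [1, 2], 'comment_length': [42, 7]}
```

With this layout, the i-th comment is recovered by taking index i from every list.
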

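Dataset Limitations

Commenter name features were introduced later in the system, so some dockets will have no first name/last name entries. In addition, some comments were uploaded solely via attachment and are stored in the system as null, because the API has no access to comment attachments.

Dataset Sources

Dataset Uses

This dataset may be used by researchers or policy stakeholders interested in the influence of public comments on regulation development. For example, sentiment analysis may be run on comment text, or simple descriptive analysis may be done on comment length and agency regulation.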

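Dataset Creation

Data Collection and Processing

Filtering Methods: For each docket, we retrieve relevant metadata such as docket ID, title, context, purpose, and keywords. In addition, the top 10 comments for each docket are collected, including their metadata (comment ID, URL, date, title, commenter's first and last name) and the comment text itself. The process uses only the first page of 25 comments for each docket, and the top 10 comments are selected based on their order of appearance in the API response. Dockets with no comments are filtered out.

Data Normalization: The collected data is normalized into a structured format. Each docket and its associated comments are organized into a nested dictionary structure. This structure includes key information about the docket and a list of comments, each with its detailed metadata.

Data Cleaning: HTML text tags are removed from comment text. However, the content of the comment remains unedited, meaning any typos or grammatical errors in the original comment are preserved.

Tools and Libraries Used:

  • Requests Library: Used for making API calls to the Regulations.gov API to fetch dockets and comments data.
  • Datasets Library from HuggingFace: Used for defining and managing the dataset's structure and generation process.
  • Python: The entire data collection and processing script is written in Python.

Error Handling: If an API request fails (indicated by a non-200 HTTP response status), data collection for the current docket is halted and the process moves to the next docket.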
