slnader/fcc-comments
Source: Hugging Face. Updated 2022-11-30; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/slnader/fcc-comments
---
annotations_creators:
- expert-generated
language:
- en
language_creators:
- found
license:
- cc-by-nc-sa-4.0
multilinguality:
- monolingual
pretty_name: fcc-comments
size_categories:
- 10M<n<100M
source_datasets:
- original
tags:
- notice and comment
- regulation
- government
task_categories:
- text-retrieval
task_ids:
- document-retrieval
---
# Dataset Card for fcc-comments
## Table of Contents
- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)
## Dataset Description
- **Repository:** https://github.com/slnader/fcc-comments
- **Paper:** https://doi.org/10.1002/poi3.327
### Dataset Summary
Online comment floods during public consultations have posed unique governance challenges for
regulatory bodies seeking relevant information on proposed regulations.
How should regulatory bodies separate spam and fake comments from genuine submissions by the public,
especially when fake comments are designed to imitate ordinary citizens? How can regulatory bodies
achieve both breadth and depth in their citations to the comment corpus? What is the best way to
select comments that represent the average submission and comments that supply highly specialized
information?
`fcc-comments` is an annotated version of the comment corpus from the Federal Communications Commission's
(FCC) 2017 "Restoring Internet Freedom" proceeding. The source data were downloaded directly from the FCC's Electronic
Comment Filing System (ECFS) between January and February of 2019 and include raw comment text and metadata on
comment submissions. The comment data were processed into a consistent format
(machine-readable PDF or plain text) and annotated with three types of information: whether the comment was cited in the
agency's final order, the type of commenter (individual, interest group, business group), and whether the comment was associated with an in-person meeting.
The release also includes query-term and document-term matrices to facilitate keyword searches on the comment corpus.
An example of how these can be used with the BM25 algorithm is available in
[1_score_comments.py](https://github.com/slnader/fcc-comments/blob/main/process_comments/1_score_comments.py).
## Dataset Structure
FCC relational database (fcc.pgsql): The core components of the database include a table for submission metadata,
a table for attachment metadata, a table for filer metadata, and a table that contains comment text if submitted in express format.
In addition to these core tables, there are several derived tables specific to the analyses in the paper,
including which submissions and attachments were cited in the final order, which submissions were associated with in-person meetings,
and which submissions were associated with interest groups. Full documentation of the tables can be found in fcc_database.md.
Attachments (attachments.tar.gz): Attachments to submissions that could be converted to text via OCR and saved in machine-readable pdf format.
The filenames are formatted as [submission_id]_[document_id].pdf, where submission_id and document_id are keys in the relational database.
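
The filename convention above can be used to link an attachment back to its rows in the database. The following is a minimal sketch, not part of the release: the helper name `parse_attachment_name` is hypothetical, the ids in the usage line are made up, and the code assumes that `submission_id` itself never contains an underscore, so the first underscore separates the two keys.

```python
from pathlib import PurePath


def parse_attachment_name(path):
    """Split '[submission_id]_[document_id].pdf' into its two keys.

    Assumption: submission_id contains no underscore, so the first
    underscore in the filename is the separator.
    """
    stem = PurePath(path).stem  # drop directories and the .pdf suffix
    submission_id, sep, document_id = stem.partition("_")
    if not sep or not submission_id or not document_id:
        raise ValueError(f"unexpected attachment filename: {path!r}")
    return submission_id, document_id


# Toy filename with made-up ids, for illustration only.
keys = parse_attachment_name("attachments/12345_abcdef.pdf")
```

If submission ids turn out to contain underscores, the split rule would need to change (for example, by looking the candidate ids up in the `documents` table).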
Search datasets (search.tar.gz): Objects to facilitate prototyping of search algorithms on the comment corpus. Contains the following elements:
| Filename | Description |
| ----------- | ----------- |
| query_dtm.pickle | Query-term matrix (79x3986) in sparse CSR format (rows are queries, columns are bigram keyword counts). |
| query_text.pickle | Dictionary keyed by the paragraph number in the FCC’s Notice of Proposed Rulemaking. Values are the text of the query containing a call for comments. |
| search_dtms_express.pickle | Document-term matrix for express comments (3800691x3986) in sparse CSR format (rows are comment pages, columns are bigram keyword counts). |
| search_index_express.pickle | Pandas dataframe containing unique id and total term length for express comments. |
| search_dtms.pickle | Document-term matrix for standard comment attachments (44655x3986) in sparse CSR format (rows are comment pages, columns are bigram keyword counts). |
| search_index.pickle | Pandas dataframe containing unique id and total term length for standard comment attachments. |
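
To illustrate how a query row and a document-term matrix of this shape can be combined, here is a dependency-free sketch of Okapi BM25 scoring. It is not the author's script (see the repository for that). It works directly on the three arrays of a CSR matrix (`data`, `indices`, `indptr`, the attribute names used by `scipy.sparse.csr_matrix`) and assumes the matrix is in canonical form, with no duplicate or explicit-zero entries. It also assumes that the "total term length" column of the index can serve as the document length, treats query terms as binary, and uses conventional default values for `k1` and `b`. The function name and the toy data are hypothetical.

```python
import math


def bm25_scores(query_terms, data, indices, indptr, doc_len, k1=1.5, b=0.75):
    """Score every document row of a CSR count matrix against a set of term ids."""
    n_docs = len(indptr) - 1
    avg_len = sum(doc_len) / n_docs

    # Document frequency: number of rows in which each term has a positive count.
    df = {}
    for term, count in zip(indices, data):
        if count > 0:
            df[term] = df.get(term, 0) + 1

    wanted = set(query_terms)
    scores = []
    for row in range(n_docs):
        # Length normalisation for this document.
        norm = k1 * (1 - b + b * doc_len[row] / avg_len)
        score = 0.0
        for pos in range(indptr[row], indptr[row + 1]):
            term, tf = indices[pos], data[pos]
            if term in wanted and tf > 0:
                idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
                score += idf * tf * (k1 + 1) / (tf + norm)
        scores.append(score)
    return scores


# Toy corpus: 3 documents over a 4-term vocabulary (made-up counts).
data = [2, 1, 1, 1, 3]
indices = [0, 1, 2, 0, 3]
indptr = [0, 2, 3, 5]
doc_len = [3, 1, 4]
scores = bm25_scores([0], data, indices, indptr, doc_len)
```

With the real files, the same three arrays would come from the unpickled sparse matrices, and a pure-Python loop like this one would be far too slow for the 3.8 million express rows; a vectorised implementation is the practical choice there.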
### Data Fields
The following tables are available in fcc.pgsql:
- comments: plain text comments associated with submissions
| column | type | description |
| ----------- | ----------- | ----------- |
| comment_id | character varying(64) | unique id for plain text comment |
| comment_text | text | raw text of plain text comment |
| row_id | integer | row sequence for plain text comments |
- submissions: metadata for submissions
| column | type | description |
| ----------- | ----------- | ----------- |
| submission_id | character varying(20) | unique id for submission |
| submission_type | character varying(100) | type of submission (e.g., comment, reply, statement) |
| express_comment | numeric | 1 if express comment |
| date_received | date | date submission was received |
| contact_email | character varying(255) | submitter email address |
| city | character varying(255) | submitter city |
| address_line_1 | character varying(255) | submitter address line 1 |
| address_line_2 | character varying(255) | submitter address line 2 |
| state | character varying(255) | submitter state |
| zip_code | character varying(50) | submitter zip |
| comment_id | character varying(64) | unique id for plain text comment |
- filers: names of filers associated with submissions
| column | type | description |
| ----------- | ----------- | ----------- |
| submission_id | character varying(20) | unique id for submission |
| filer_name | character varying(250) | name of filer associated with submission |
- documents: attachments associated with submissions
| column | type | description |
| ----------- | ----------- | ----------- |
| submission_id | character varying(20) | unique id for submission |
| document_name | text | filename of attachment |
| download_status | numeric | status of attachment download |
| document_id | character varying(64) | unique id for attachment |
| file_extension | character varying(4) | file extension for attachment |
- filers_cited: citations from final order
| column | type | description |
| ----------- | ----------- | ----------- |
| point | numeric | paragraph number in final order |
| filer_name | character varying(250) | name of cited filer |
| submission_type | character varying(12) | type of submission as indicated in final order |
| page_numbers | text[] | cited page numbers |
| cite_id | integer | unique id for citation |
| filer_id | character varying(250) | id for cited filer |
- docs_cited: attachments associated with cited submissions
| column | type | description |
| ----------- | ----------- | ----------- |
| cite_id | numeric | unique id for citation |
| submission_id | character varying(20) | unique id for submission |
| document_id | character varying(64) | unique id for attachment |
- near_duplicates: lookup table for comment near-duplicates
| column | type | description |
| ----------- | ----------- | ----------- |
| target_document_id | | unique id for target document |
| duplicate_document_id | | unique id for duplicate of target document |
- exact_duplicates: lookup table for comment exact duplicates
| column | type | description |
| ----------- | ----------- | ----------- |
| target_document_id | character varying(100) | unique id for target document |
| duplicate_document_id | character varying(100) | unique id for duplicate of target document |
- in_person_exparte: submissions associated with ex parte meeting
| column | type | description |
| ----------- | ----------- | ----------- |
| submission_id | character varying(20) | unique id for submission |
- interest_groups: submissions associated with interest groups
| column | type | description |
| ----------- | ----------- | ----------- |
| submission_id | character varying(20) | unique id for submission |
| business | numeric | 1 if business group, 0 otherwise |
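
As a rough illustration of how these tables relate, the sketch below joins `documents`, `submissions`, and `docs_cited` to count citations per attachment. It is only a stand-in: it uses an in-memory SQLite database so that it runs without a PostgreSQL server, creates just the columns it needs, and fills them with made-up toy rows. Against the real `fcc.pgsql` dump the same SELECT would be issued through a PostgreSQL client instead.

```python
import sqlite3

# In-memory stand-in for the PostgreSQL database; the rows are invented toy data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE submissions (submission_id TEXT, submission_type TEXT, express_comment NUMERIC);
CREATE TABLE documents (submission_id TEXT, document_id TEXT, document_name TEXT);
CREATE TABLE docs_cited (cite_id NUMERIC, submission_id TEXT, document_id TEXT);
INSERT INTO submissions VALUES ('s1', 'COMMENT', 0), ('s2', 'REPLY', 0);
INSERT INTO documents VALUES ('s1', 'd1', 'a.pdf'), ('s2', 'd2', 'b.pdf');
INSERT INTO docs_cited VALUES (1, 's1', 'd1');
""")

# One row per attachment, with the number of final-order citations matched to it.
# The LEFT JOIN keeps attachments that were never cited (count 0).
query = """
SELECT d.submission_id, d.document_id, s.submission_type,
       COUNT(c.cite_id) AS n_citations
FROM documents AS d
JOIN submissions AS s ON s.submission_id = d.submission_id
LEFT JOIN docs_cited AS c
       ON c.submission_id = d.submission_id
      AND c.document_id = d.document_id
GROUP BY d.submission_id, d.document_id, s.submission_type
ORDER BY d.submission_id
"""
rows = conn.execute(query).fetchall()
```

Joining on both `submission_id` and `document_id` reflects the fact that `docs_cited` carries both keys; whether `document_id` alone is unique across submissions is not stated here, so the sketch does not rely on it.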
## Dataset Creation
### Curation Rationale
The data were curated to perform information retrieval and summarization tasks as documented in https://doi.org/10.1002/poi3.327.
### Source Data
#### Initial Data Collection and Normalization
The data for this study come from the FCC's Electronic Comment Filing System (ECFS), accessed between January and February of 2019.
I converted the API responses into a normalized, relational database containing information on 23,951,967 submissions.
23,938,686 "express" submissions contained a single plain text comment submitted directly through the comment form.
13,821 "standard" submissions contained one or more comment documents submitted as attachments in various file formats.
While the FCC permitted any file format for attachments, I only considered documents attached in PDF, plain text, rich text,
and Microsoft Word file formats, and I dropped submitted documents that were simply copies of the FCC’s official documents (e.g., the NPRM itself).
Using standard OCR software, I attempted to convert all attachments into plain text and saved them as machine-readable PDFs.
#### Who are the source language producers?
All submitters of public comments during the public comment period (but see the note on fake comments under Considerations for Using the Data).
### Annotations
#### Annotation process
- Citations: I consider citations from the main text of the FCC's final rule. I did not include citations to
supporting documents not available through ECFS (e.g., court decisions), nor did I include citations
to submissions from prior FCC proceedings. The direct citations to filed submissions are included
in a series of 1,186 footnotes. The FCC’s citation format typically followed a relatively standard
pattern: the name of the filer (e.g., Verizon), a description of the document (e.g., Comment), and
at times a page number. I extracted citations from the text using regular expressions. Based on a
random sample of paragraphs from the final order, the regular expressions identified 98% of eligible citations,
while successfully excluding all non-citation text. In total, this produced 1,886 unique citations.
I then identified which of the comments were cited. First, I identified all documents from the cited filer
that had enough pages to contain the page number cited (if provided), and, where applicable, whose filename
contained the moniker from the FCC’s citation (e.g., "Reply"). The majority of citations matched to only one
possible comment submitted, and I identified the remaining cited comments through manual review of the citations.
In this way, I was able to tag documents associated with all but three citations. When the same cited document was
submitted under multiple separate submissions, I tagged all versions of the document as being cited.
- Commenter type: Comments are labeled as mass comments if 10 or more duplicate or near-duplicate copies were
submitted by individual commenters. Near-duplicates were defined as comments with non-zero identical information scores.
To identify the type of commenter for non-mass comments, I take advantage of the fact that the vast majority of organized
groups preferred standard submissions over express submissions. Any non-mass comment submitted as an express comment was
coded as coming from an individual. To distinguish between individuals and organizations that used standard submissions,
I use a first name and surname database from the names dataset Python package to characterize filer names as belonging to
individuals or organizations. I also use the domain of the submitter’s email address to re-categorize comments as coming
from organizations if they were submitted on behalf of organizations by an individual. Government officials were identified by
their .gov email addresses. I manually review this procedure for mischaracterizations. After obtaining a list of organization
names, I manually code each one as belonging to a business group or a non-business group. Government officials writing in
their official capacity were categorized as a non-business group.
- In-person meetings: To identify which commenters held in-person meetings with the agency, I collect all comments labeled
as an ex-parte submission in the ECFS. I manually review these submissions for mention of an in-person meeting. I label
a commenter as having held an in-person meeting if they submitted at least one ex-parte document that mentioned an in-person meeting.
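
The mass-comment rule described under "Commenter type" can be sketched as a simple count over duplicate lookup pairs of the kind stored in `exact_duplicates` and `near_duplicates`. This is an illustration only, not the author's code: the function name is hypothetical, the pairs are toy data, and the sketch assumes that the target document counts as one of the copies, which the description above does not specify.

```python
from collections import Counter

MASS_THRESHOLD = 10  # "10 or more duplicate or near-duplicate copies"


def mass_comment_targets(duplicate_pairs, threshold=MASS_THRESHOLD):
    """Return target ids whose copy count reaches the threshold.

    duplicate_pairs: iterable of (target_document_id, duplicate_document_id).
    Assumption: the target itself counts as one copy, hence the +1.
    """
    duplicates_per_target = Counter(target for target, _dup in duplicate_pairs)
    return {
        target
        for target, n_dups in duplicates_per_target.items()
        if n_dups + 1 >= threshold
    }


# Toy pairs: t1 has nine duplicates (ten copies in total), t2 has one.
pairs = [("t1", f"dup{i}") for i in range(9)] + [("t2", "other")]
mass = mass_comment_targets(pairs)
```

The remaining steps (restricting to individual commenters, name lookups, email-domain checks, and manual review) are not covered by this sketch.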
#### Who are the annotators?
Annotations are a combination of automated and manual review done by the author.
### Personal and Sensitive Information
This dataset may contain personal and sensitive information, as there were no restrictions on what commenters could submit to
the agency. This dataset also contains numerous examples of profanity and spam. These comments represent what the FCC decided was
appropriate to share publicly on their own website.
## Considerations for Using the Data
### Discussion of Biases
This proceeding was famous for the large number of "fake" comments (comments impersonating ordinary citizens) submitted to the
agency (see [this report](https://ag.ny.gov/sites/default/files/oag-fakecommentsreport.pdf) by the NY AG for more information).
As such, this comment corpus contains a mix of computer-generated and natural language, and there is currently no way to reliably separate
mass comments submitted with the approval of the commenter from those submitted on behalf of the commenter without their knowledge.
## Additional Information
### Licensing Information
Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International.
### Citation Information
```
@article{handan2022,
title={Do fake online comments pose a threat to regulatory policymaking? Evidence from Internet regulation in the United States},
author={Handan-Nader, Cassandra},
journal={Policy \& Internet},
year={2022}
}
```
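Provider:
slnader
Summary of Original Information
Dataset overview
Dataset name
- Name: fcc-comments
- Alias: none
Basic dataset information
- Language: English (en)
- License: Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (cc-by-nc-sa-4.0)
- Multilinguality: monolingual
- Size: 10M<n<100M
- Source: original data
- Tags: notice and comment, regulation, government
- Task category: text retrieval
- Task ID: document retrieval
Dataset contents
- Description: The dataset is an annotated version of the comment corpus from the Federal Communications Commission's (FCC) 2017 "Restoring Internet Freedom" proceeding. The data include raw comment text and metadata on comment submissions.
- Structure: The dataset includes a relational database (fcc.pgsql) with tables for submission metadata, attachment metadata, filer metadata, and comment text. It also includes the attachments and search datasets that support prototyping of search algorithms.
- Data fields: The database contains multiple tables, such as comments, submissions, filers, and documents. Each table has several fields, such as comment_id, submission_id, and filer_name, which store comments, submission information, filer information, and attachment information.
Dataset creation
- Curation rationale: To perform information retrieval and summarization tasks.
- Source data: The data come from the FCC's Electronic Comment Filing System (ECFS); the API responses were converted into a normalized relational database.
- Annotations: The annotations record whether a comment was cited in the agency's final order, the type of commenter (individual, interest group, business group), and whether the comment was associated with an in-person meeting.
Considerations for using the data
- Social impact: The dataset contains a large number of "fake" comments, which may affect regulatory policymaking.
- Discussion of biases: The comments mix computer-generated and natural language, and there is currently no reliable way to identify mass comments submitted without the commenter's consent.
Additional information
- Licensing information: The dataset is released under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International license.
- Citation information: Citation information for the dataset is provided.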



