
aboutme

ModelScope community · updated 2025-07-16 · indexed 2025-06-07
Download link:
https://modelscope.cn/datasets/allenai/aboutme
Resource description:
# AboutMe: Self-Descriptions in Webpages

## Dataset description

**Curated by:** Li Lucy, Suchin Gururangan, Luca Soldaini, Emma Strubell, David Bamman, Lauren Klein, Jesse Dodge

**Languages:** English

**License:** AI2 ImpACT License - Low Risk Artifacts

**Paper:** [https://arxiv.org/abs/2401.06408](https://arxiv.org/abs/2401.06408)

## Dataset sources

Common Crawl

## Uses

This dataset was originally created to document the effects of different pretraining data curation practices. It is intended for research use, e.g. AI evaluation and analysis of development pipelines or social scientific research of Internet communities and self-presentation.

## Dataset structure

This dataset consists of three parts:

- `about_pages`: webpages that are self-descriptions and profiles of website creators, or text *about* individuals and organizations on the web. These are zipped files with one JSON object per line, with the following keys:
  - `url`
  - `hostname`
  - `cc_segment` (for tracking where in Common Crawl the page is originally retrieved from)
  - `text`
  - `title` (webpage title)
- `sampled_pages`: random webpages from the same set of websites, or text created or curated *by* individuals and organizations on the web. It has the same keys as `about_pages`.
- `about_pages_meta`: algorithmically extracted information from "About" pages, including:
  - `hn`: hostname of website
  - `country`: the most frequent country of locations on the page, obtained using Mordecai3 geoparsing
  - `roles`: social roles and occupations detected using RoBERTa based on expressions of self-identification, e.g. *I am a **dancer***. Each role is accompanied by sentence number and start/end character offsets.
  - `class`: whether the page is detected to be an individual or organization
  - `cluster`: one of fifty topical labels obtained via tf-idf clustering of "about" pages

Each file contains one JSON entry per line.
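
The layout described above (one JSON object per line, with `hostname` in the page files and `hn` in the metadata file) suggests a simple streaming read and a join on hostname. The following is a minimal sketch, not the authors' tooling. It assumes the files are gzip-compressed; the file names, the `cc_segment` placeholder, and the `class` value `"individual"` are made up for the demo and may not match the released data.

```python
import gzip
import json
import os
import tempfile


def read_jsonl_gz(path):
    """Yield one dict per non-empty line of a gzip-compressed JSON-lines file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def join_pages_with_meta(pages_path, meta_path):
    """Pair each page with its metadata row by matching `hostname` against `hn`."""
    meta_by_host = {row["hn"]: row for row in read_jsonl_gz(meta_path)}
    for page in read_jsonl_gz(pages_path):
        # Pages without a metadata row are paired with None.
        yield page, meta_by_host.get(page["hostname"])


def write_jsonl_gz(path, rows):
    """Demo helper: write dicts as gzip-compressed JSON lines."""
    with gzip.open(path, "wt", encoding="utf-8") as handle:
        for row in rows:
            handle.write(json.dumps(row) + "\n")


def demo():
    """Build two tiny made-up files, join them, and return the list of pairs."""
    pages = [
        {"url": "https://example.com/about/", "hostname": "example.com",
         "cc_segment": "placeholder-segment", "text": "I am a dancer.",
         "title": "About"},
        {"url": "https://example.org/bio.html", "hostname": "example.org",
         "cc_segment": "placeholder-segment", "text": "We build tools.",
         "title": "Bio"},
    ]
    # Hypothetical metadata row; the real label strings may differ.
    meta = [{"hn": "example.com", "class": "individual"}]
    with tempfile.TemporaryDirectory() as tmp:
        pages_path = os.path.join(tmp, "about_pages.jsonl.gz")
        meta_path = os.path.join(tmp, "about_pages_meta.jsonl.gz")
        write_jsonl_gz(pages_path, pages)
        write_jsonl_gz(meta_path, meta)
        return list(join_pages_with_meta(pages_path, meta_path))


if __name__ == "__main__":
    for page, meta in demo():
        label = meta["class"] if meta else "no metadata"
        print(page["hostname"], "->", label)
```

Reading line by line keeps memory use flat for the page files; only the metadata, which is much smaller per entry, is held in a dictionary.
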
Note that the entries in each file are not in random order; they reflect the ordering output by CCNet (e.g. neighboring pages may be similar in Wikipedia-based perplexity).

## Dataset creation

AboutMe is derived from twenty-four snapshots of Common Crawl collected between 2020-05 and 2023-06. We extract text from raw Common Crawl using CCNet, and deduplicate URLs across all snapshots. We only include text that has a fastText English score > 0.5. "About" pages are identified using keywords in URLs (about, about-me, about-us, and bio), and their URLs end in `/keyword/` or `keyword.*`, e.g. `about.html`. We only include pages that have one candidate URL, to avoid ambiguity around which page is actually about the main website creator. If a webpage has both `https` and `http` versions in Common Crawl, we take the `https` version. Each "sampled" page is a single webpage randomly sampled from a website that has an "about" page.

More details on metadata creation can be found in our paper, linked above.

## Bias, Risks, and Limitations

Algorithmic measurement of textual content is scalable, but imperfect. We acknowledge that our dataset and analysis methods (e.g. classification, information retrieval) can also uphold language norms and standards that may disproportionately affect some social groups over others. We hope that future work continues to improve these content analysis pipelines, especially for long-tail or minoritized language phenomena.

We encourage future work using our dataset to minimize the extent to which it infers unlabeled or implicit information about subjects in this dataset, and to assess the risks of inferring various types of information from these pages. In addition, measurements of social identities from AboutMe pages are affected by reporting bias.

Future uses of this data should avoid incorporating personally identifiable information into generative models, report only aggregated results, and paraphrase quoted examples in papers to protect the privacy of subjects.

## Citation

```
@misc{lucy2024aboutme,
  title={AboutMe: Using Self-Descriptions in Webpages to Document the Effects of English Pretraining Data Filters},
  author={Li Lucy and Suchin Gururangan and Luca Soldaini and Emma Strubell and David Bamman and Lauren Klein and Jesse Dodge},
  year={2024},
  eprint={2401.06408},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```

## Dataset contact

lucy3_li@berkeley.edu
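
For readers who want to experiment with the "About" page URL rule described under Dataset creation, the sketch below shows one possible reading of it. It is not the authors' code. It assumes that the keyword must be the last path segment (so `/about` without a trailing slash or extension is rejected), that query strings are ignored, and that `http`/`https` duplicates are collapsed before checking that a host has exactly one candidate; the function names `is_about_url` and `select_about_pages` are hypothetical.

```python
import re
from urllib.parse import urlparse

# Keywords listed in the dataset card.
ABOUT_KEYWORDS = ("about", "about-me", "about-us", "bio")

# Longest keywords first so "about-me" is tried before "about".
_KEYWORD_GROUP = "|".join(
    re.escape(k) for k in sorted(ABOUT_KEYWORDS, key=len, reverse=True)
)

# Assumed reading of the rule: the path ends in "/keyword/" or "/keyword.<ext>".
ABOUT_PATH_RE = re.compile(
    rf"/(?:{_KEYWORD_GROUP})(?:/|\.[^/.]+)$", re.IGNORECASE
)


def is_about_url(url):
    """Return True if the URL path ends in /keyword/ or /keyword.<extension>."""
    return bool(ABOUT_PATH_RE.search(urlparse(url).path))


def select_about_pages(urls):
    """Map hostname -> its single candidate "about" URL.

    Hosts with more than one distinct candidate path are dropped, and an
    https URL replaces an http URL with the same host and path.
    """
    by_host = {}
    for url in urls:
        if not is_about_url(url):
            continue
        parts = urlparse(url)
        candidates = by_host.setdefault(parts.netloc.lower(), {})
        # Keep the first URL seen for a path, unless an https version shows up.
        if parts.path not in candidates or parts.scheme == "https":
            candidates[parts.path] = url
    return {
        host: next(iter(candidates.values()))
        for host, candidates in by_host.items()
        if len(candidates) == 1
    }


if __name__ == "__main__":
    sample = [
        "http://a.example/about/",
        "https://a.example/about/",
        "https://b.example/about/",
        "https://b.example/bio.html",
        "https://c.example/blog/post/",
    ]
    print(select_about_pages(sample))
```

In this sample, `a.example` is kept with its `https` URL, `b.example` is dropped because it has two candidates, and `c.example` has none.
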

Provider:
maas
Created:
2025-05-27