paloma
Source: ModelScope community (updated 2025-11-27, indexed 2025-06-07)
Download link: https://modelscope.cn/datasets/allenai/paloma
# Dataset Card for Paloma
<!-- Provide a quick summary of the dataset. -->
Evaluations of language models (LMs) commonly report perplexity on monolithic data held out from training. Implicitly or explicitly, this data is composed of domains: varying distributions of language. We introduce Perplexity Analysis for Language Model Assessment (Paloma), a benchmark to measure LM fit to 546 English and code domains, instead of assuming perplexity on one distribution extrapolates to others. Among the 16 sources curated in Paloma, we include two new datasets of the top 100 subreddits (e.g., r/depression on Reddit) and programming languages (e.g., Java on GitHub), both sources common in contemporary LMs.
## Dataset Details
### Evaluating with Paloma
In addition to the dataset hosted here, Paloma introduces guidelines for making perplexity results comparable across models and code that implements these guidelines with specific experimental controls.
Whether you are just evaluating an off-the-shelf model or preparing to conduct your own pretraining experiment from scratch, we recommend that you employ as much of our standardized code as possible to ensure the greatest level of comparability with existing results.
[How to conduct fully comparable pretraining experiments with Paloma](https://github.com/allenai/ai2-olmo-eval/blob/main/paloma/README.md)
### Dataset Description
<!-- Provide a longer summary of what this dataset is. -->
Paloma aims to enable research on differences in LM fit over hundreds of domains by curating and standardizing the text datasets with the most fine-grained domains readily available from existing metadata.
We define two terms: *Sources* are existing datasets (or curated subsets thereof) in use for research. *Domains* are fine-grained partitions of sources based on available metadata that attempt to surface a distinct and intuitive distribution of language (e.g., Wikipedia articles about visual arts or a subreddit for advice on PC builds). Paloma is derived from 16 sources. Where we curate previous fine-grained corpora, we inherit their operationalization of domains, ranging from the community-driven Wikipedia ontology to expert curation and automatic classification. Where we build our own fine-grained domains from Reddit and GitHub, we make similar use of metadata about subreddits and file extensions.
**Curated by:** Ian Magnusson, Akshita Bhagia, Valentin Hofmann, Luca Soldaini, Ananya Harsh Jha, Oyvind Tafjord, Dustin Schwenk, Evan Pete Walsh, Yanai Elazar, Kyle Lo, Dirk Groeneveld, Iz Beltagy, Hannaneh Hajishirzi, Noah A. Smith, Kyle Richardson, and Jesse Dodge
**Languages:** We elect to focus just on the language modeling of English and code data.
**License:** The data subsets are licensed under the AI2 ImpACT License - Low Risk Artifacts, except as listed below.
- Wikitext-103 - CC BY-SA
- TwitterAAE - for research purposes only
- Red Pajama - see license details
- M2D2 - CC BY-NC
**Paper:** https://arxiv.org/abs/2312.10523
### Dataset Sources
<!-- Provide the basic links for the dataset. -->
<!-- - [Paper]() -- (TODO update when paper is preprinted) -->
<!-- - [Website](paloma.allen.ai) -->
- [Code](https://github.com/allenai/ai2-olmo-eval/blob/main/paloma/README.md)
- Paloma 1B Baseline Models: [Dolma](https://huggingface.co/allenai/paloma-1b-baseline-dolma), [Pile](https://huggingface.co/allenai/paloma-1b-baseline-pile), [RedPajama](https://huggingface.co/allenai/paloma-1b-baseline-redpajama), [C4](https://huggingface.co/allenai/paloma-1b-baseline-c4), [mC4-en](https://huggingface.co/allenai/paloma-1b-baseline-mc4), [Falcon-RefinedWeb](https://huggingface.co/allenai/paloma-1b-baseline-falcon-refinedweb)
## Uses
<!-- Address questions around how the dataset is intended to be used. -->
This benchmark is intended for use in evaluating language model fit to fine-grained domains.
### Direct Use
<!-- This section describes suitable use cases for the dataset. -->
This dataset should be used for evaluating the likelihood of text from a given domain by a language model.
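As a rough illustration of what evaluating likelihood per domain involves, the sketch below computes perplexity for each domain from per-token log-probabilities. It uses the standard definition (the exponential of the average negative log-likelihood per token) and is not the benchmark's official implementation. The function name `perplexity_by_domain` and the input layout of `(domain, token_log_probs)` pairs are assumptions made for this example; in practice the log-probabilities would come from the model under evaluation.

```python
import math
from collections import defaultdict


def perplexity_by_domain(records):
    """Compute perplexity per domain.

    `records` is an iterable of (domain, token_log_probs) pairs, where
    token_log_probs holds the natural-log probability the model assigned
    to each token of one document. This layout is hypothetical.
    """
    total_nll = defaultdict(float)   # summed negative log-likelihood per domain
    total_tokens = defaultdict(int)  # token count per domain

    for domain, log_probs in records:
        total_nll[domain] -= sum(log_probs)
        total_tokens[domain] += len(log_probs)

    # Perplexity = exp(average negative log-likelihood per token).
    return {
        domain: math.exp(total_nll[domain] / total_tokens[domain])
        for domain in total_nll
        if total_tokens[domain] > 0
    }
```

Aggregating over all tokens in a domain before exponentiating means long documents weigh more than short ones; other aggregation choices are possible and should be stated when reporting results.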
### Out-of-Scope Use
<!-- This section addresses misuse, malicious use, and uses that the dataset will not work well for. -->
Note that the sources contained in this benchmark include varying licenses with differing restrictions (see [License](#dataset-description)).
## Dataset Structure
<!-- This section provides a description of the dataset fields, and additional information about the dataset structure such as criteria used to create the splits, relationships between data points, etc. -->
Each source in this dataset is organized into its own subcorpus, which consists of a `val` and a `test` split. Within each split, data is stored as files of line-delimited JSON, where each line represents a document and its associated metadata. The type of metadata available varies from source to source, but each line contains at least a `'text'` field holding the text of the document.
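A minimal sketch of reading one such file is shown below. It relies only on what is described above (one JSON object per line with a `'text'` field); the function name `iter_documents` is hypothetical, and the gzip handling is an assumption in case the files are distributed compressed.

```python
import gzip
import json


def iter_documents(path):
    """Yield (text, metadata) pairs from a line-delimited JSON file.

    Assumption: files ending in ".gz" are gzip-compressed; anything else
    is read as plain UTF-8 text.
    """
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            # Everything other than 'text' is treated as source-specific metadata.
            metadata = {key: value for key, value in record.items() if key != "text"}
            yield record["text"], metadata
```

Because the metadata fields differ between sources, downstream code should look up metadata keys defensively rather than assume a fixed schema.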
## Dataset Creation
### Curation Rationale
<!-- Motivation for the creation of this dataset. -->
Perplexity is conventionally reported on held out data from a model's training distribution or a small number of traditional test sets. Such monolithic evaluation ignores potential variation of model fit across different domains that LMs implicitly learn to model. We curate sources of fine-grained textual domains in Paloma to enable evaluation of language model fit to specific domains of text.
### Source Data
<!-- This section describes the source data (e.g. news text and headlines, social media posts, translated sentences, ...). -->
#### Standard language modeling benchmarks
Though it is common practice to evaluate on held out data from the pretraining corpus of a given model, we evaluate *across* several major pretraining corpora and standard language modeling benchmarks. We also break down performance per domain within the datasets that have multiple domains.
| Source | Citation | Description |
|-------------------|-----------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| c4-en | Raffel et al (2019) via Dodge et al (2021) | Standard contemporary LM pretraining corpus automatically filtered from the April 2019 Common Crawl scrape |
| mc4-en | Xue et al (2021) | The English language portion of a pretraining corpus automatically filtered from 71 Common Crawl scrapes |
| Wikitext-103 | Merity et al (2016) | A standard collection of verified “Good” and “Featured” articles on Wikipedia |
| Penn Tree Bank | Marcus et al (1999) via Nunes, Davide. (2020) | Classic Wall Street Journal benchmark with linguistic structure annotations omitted |
| RedPajama | Together Computer (2023) | A publicly available reproduction of the LLaMA (Touvron et al., 2023) pretraining source mixture, combining large amounts of webscraped text with smaller curated sources |
| Falcon-RefinedWeb | Penedo et al. (2023) | A corpus of English sampled from all Common Crawl scrapes until June 2023, more aggressively filtered and deduplicated than c4 and mc4-en |
| Dolma v1.5 | Soldaini et al. (2023) | A three trillion token corpus that samples sources commonly used to train LMs in order to enable open research on pretraining data |
#### Fine-grained domain benchmarks
Where typical pretraining corpora offer at most tens of labeled domains, usually based on where the data is sourced, we examine datasets with up to an order of magnitude more domains. Existing datasets (M2D2 and c4 100 Domains) and datasets we curate from Dolma v1.5 use metadata to define hundreds of domains over Wikipedia, Semantic Scholar, Common Crawl, Reddit, and GitHub data. These include diverse domains from *Culture and the arts: Performing arts*, a topic on Wikipedia, to *r/depression*, a forum on Reddit for mental health support.
| Source | Citation | Description |
|---------------------------------|--------------------------------------------------|-----------------------------------------------------------------------------------|
| M2D2 S2ORC | Reid et al (2022) | Papers from Semantic Scholar grouped by hierarchical academic field categories |
| M2D2 Wiki | Reid et al (2022) | Wikipedia articles grouped by hierarchical categories in the Wikipedia ontology |
| c4 100 Domains | Chronopoulou et al (2021) | Balanced samples of the top 100 URL domains in C4 |
| Dolma 100 Subreddits | Soldaini et al. (2023) | Balanced samples of the top 100 Subreddits from the Dolma Reddit subset |
| Dolma 100 Programming Languages | Kocetkov et al. (2022) via Soldaini et al. (2023) | Balanced samples of the top 100 programming languages from the Dolma Stack subset |
#### Disparities between speech communities
LMs today primarily process dominant dialects in countries, such as the US, where they are most often trained and deployed. Even within English, hundreds of millions of people around the world speak other dialects that have been shown to be underserved by existing models. As a starting point for measuring disparities between dialects, we include TwitterAAE, two corpora representing African-American and White-aligned English, automatically classified via geolocation information and demographic census statistics.
| Source | Citation | Description |
|------------|----------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| TwitterAAE | Blodgett et al. (2016) via Liang et al (2022) | Balanced sets of tweets classified as African American or White aligned English |
#### Fringe sources previously studied for problematic discourse
Text from some fringe online communities has been shown to contain larger proportions of hate speech and toxicity than more mainstream sources. Measuring perplexity on Manosphere, Gab, and 4chan characterises model exposure to distinct social contexts in which toxic language arises.
| Source | Citation | Description |
|-------------------|------------------------|---------------------------------------------------------------------------------------------------------------------------------------------|
| Manosphere Corpus | Ribeiro et al (2020) | 9 forums where a set of related masculinist ideologies developed over the 2000s and 2010s |
| Gab Corpus | Zannettou et al (2018) | Data from 2016-18 from an alt-right, free-speech-oriented social media platform shown to contain more hate speech than mainstream platforms |
| 4chan Corpus | Papasavva et al (2020) | Data from 2016-19 from a politics subforum of an anonymity-focused forum found to contain among the highest rates of toxic content |
#### Data Collection and Processing
<!-- This section describes the data collection and processing process such as data selection criteria, filtering and normalization methods, tools and libraries used, etc. -->
The data in Paloma are sampled from existing sources. Most often, perplexity evaluation data is subsampled uniformly over the original distribution of domains in a source, resulting in more or fewer tokens from each domain in the evaluation data depending on how well represented that domain is in the corpus. We instead employ stratified sampling, in which all sources with marked domains are partitioned by domain and a uniform sample of the same size is taken from each partition. Specifically, documents are sampled from each domain until a target number of tokens is reached. This helps ensure that no domains are lost or left very small after subsampling.
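The stratified sampling described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code used to build Paloma: the `'domain'` key, the function names, and the whitespace-based token count (a stand-in for a real tokenizer) are all hypothetical.

```python
import random
from collections import defaultdict


def whitespace_token_count(text):
    """Stand-in token counter; a real pipeline would use a tokenizer."""
    return len(text.split())


def stratified_sample(documents, target_tokens, count_tokens=whitespace_token_count, seed=0):
    """Sample documents from every domain until each reaches target_tokens.

    `documents` is an iterable of dicts with 'text' and 'domain' keys
    (the 'domain' key is an assumption made for this sketch).
    """
    rng = random.Random(seed)

    # Partition the source by domain.
    by_domain = defaultdict(list)
    for doc in documents:
        by_domain[doc["domain"]].append(doc)

    sample = {}
    for domain, docs in by_domain.items():
        rng.shuffle(docs)  # uniform random order within the domain
        picked, n_tokens = [], 0
        for doc in docs:
            if n_tokens >= target_tokens:
                break
            picked.append(doc)
            n_tokens += count_tokens(doc["text"])
        sample[domain] = picked
    return sample
```

With this scheme a domain that is rare in the original corpus contributes roughly as many tokens as a common one, as long as it contains at least `target_tokens` tokens to begin with.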
In social media domains with additional metadata that is typically displayed along with posts, we format metadata such as timestamps into the document `'text'` field. Where information is available about how threads of posts are connected, documents in that domain contain all posts in a given thread.
Additional details on source specific processing are available in our paper.
#### Who are the source data producers?
<!-- This section describes the people or systems who originally created the data. It should also include self-reported demographic or identity information for the source data creators if this information is available. -->
Text data from each of the sources curated in Paloma is created by varying sets of original authors. Some sources are collected from users of specific internet fora such as specific subreddits. Other data is collected on the basis of expert or automated classification of demographic groups. Other data is collected from authors of archival material including scientific preprints, Wikipedia, and code repositories. Lastly, data sampled from standard pretraining corpora comes from authors collected through automatic web scraping and large-scale sampling of archival sources, making it difficult to recover much specific information about these authors.
#### Annotation process
<!-- This section describes the annotation process such as annotation tools used in the process, the amount of data annotated, annotation guidelines provided to the annotators, interannotator statistics, annotation validation, etc. -->
No annotation is done on this data.
#### Who are the annotators?
<!-- This section describes the people or systems who created the annotations. -->
No annotation is done on this data.
#### Personal and Sensitive Information
<!-- State whether the dataset contains data that might be considered personal, sensitive, or private (e.g., data that reveals addresses, uniquely identifiable names or aliases, racial or ethnic origins, sexual orientations, religious beliefs, political opinions, financial or health data, etc.). If efforts were made to anonymize the data, describe the anonymization process. -->
Sources in Paloma may contain personally identifiable information (PII). No attempt is made to measure or remove this information for the following reason: Paloma provides a small subsample of already publicly available data. The small size of this subsample renders this data less useful for aggregation of PII information than the already available public sources which we subsample.
## Bias, Risks, and Limitations
<!-- This section is meant to convey both technical and sociotechnical limitations. -->
It is beyond the scope of any one group of researchers to prescribe an exhaustive set of domains that should be examined for an LM. Rather, Paloma brings together a substantial selection of domains that are identifiable from already available metadata to demonstrate the kinds of analyses possible with hundreds of domains and rigorous experimental controls.
Different research goals will motivate different definitions and selections of domains, but other researchers can apply the guidelines we detail in our paper to novel fine-grained domains suitable for their research questions. One of the key advantages of evaluating a model by its fit to a collection of text representing a domain is that such domains can be identified not just by researchers who study LMs. We hope future work will identify many more domains that no one discipline would think to look at.
Interpreting language model fit to domains also poses challenges. Instead of relying on LM fit to represent alignment to a domain's human-salient features, we examine anomalies in domain fit to deepen understanding of language modeling dynamics and illuminate gaps in existing approaches to evaluation.
Also, some domains in Paloma, such as academic papers, appear in multiple sources. Though Dolma and RedPajama process academic papers differently, the subcorpora on academic papers in each source represent different approximations of the same or very similar domains. However, for the sake of simplicity, we make the reductive assumption of counting all 546 domains in Paloma as fully distinct.
### Recommendations
<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
In our paper we outline guidelines for evaluating language model fit. We encourage users of Paloma to adopt these experimental controls for metric variance when subsampling, benchmark contamination, differing tokenization, training data order, and evaluation data format.
## Citation
<!-- If there is a paper or blog post introducing the dataset, the APA and Bibtex information for that should go in this section. -->
**BibTeX:**
```
@article{Magnusson2023PalomaAB,
title={Paloma: A Benchmark for Evaluating Language Model Fit},
author={Ian Magnusson and Akshita Bhagia and Valentin Hofmann and Luca Soldaini and A. Jha and Oyvind Tafjord and Dustin Schwenk and Pete Walsh and Yanai Elazar and Kyle Lo and Dirk Groeneveld and Iz Beltagy and Hanna Hajishirzi and Noah A. Smith and Kyle Richardson and Jesse Dodge},
journal={ArXiv},
year={2023},
volume={abs/2312.10523},
url={https://api.semanticscholar.org/CorpusID:266348815}
}
```
<!-- [More Information Needed] -->
## Dataset Card Contact
{ianm,jessed}@allenai.org
Provider: maas
Created: 2025-05-27



