
C4 Dataset | Natural Language Processing Dataset | Machine Learning Dataset

Source: github · Updated 2024-10-11 · Indexed 2024-10-12
Natural language processing
Machine learning
Download link:
https://github.com/abx393/llm-pruning-calibration-data
Summary:
C4 is a large text corpus for language-model pretraining, widely used to evaluate and optimize language models.
Created:
2024-10-04
Summary of original information

EMNLP 2024 dataset overview

Dataset name

Is C4 Dataset Optimal for Pruning? An Investigation of Calibration Data for LLM Pruning

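Dataset description

This dataset supports a study of calibration data selection for pruning large language models (LLMs). The study evaluates a range of commonly used datasets as calibration data, including pretraining datasets and downstream-task datasets. The results indicate that C4 is not the optimal choice and that some arithmetic datasets perform better as calibration data.

Dataset contents

Calibration datasets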

  • Text datasets
    • C4
    • Pile
    • Oscar
    • RedPajama
  • Arithmetic QA datasets
    • GSM8K
    • SVAMP
    • MAWPS
  • Natural language inference datasets
    • e-SNLI
    • ANLI R1
    • ANLI R3
  • Commonsense QA datasets
    • CommonSenseQA
    • RACE
    • WinoGrande

Pruning methods

  • Wanda
  • SparseGPT
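
Both methods listed above score weights using activations collected on calibration data, which is why the choice of calibration set can change which weights are removed. As an illustration only, here is a minimal pure-Python sketch of the Wanda criterion as published (score = |weight| times the L2 norm of the matching input feature, compared within each output row). The function name and the toy numbers are invented for this example; this is not the repository's implementation.

```python
import math


def wanda_prune_rows(weights, activations, sparsity_ratio):
    """Zero the lowest-scoring weights in each output row (illustrative sketch).

    weights: list of rows, one per output neuron (out_features x in_features)
    activations: calibration inputs, one list of in_features values per token
    """
    n_in = len(weights[0])
    # L2 norm of each input feature across all calibration tokens.
    norms = [math.sqrt(sum(x[j] ** 2 for x in activations)) for j in range(n_in)]
    n_prune = int(n_in * sparsity_ratio)
    pruned = []
    for row in weights:
        # Wanda-style score: weight magnitude scaled by input activation norm.
        scores = [abs(w) * norms[j] for j, w in enumerate(row)]
        # Indices of the n_prune smallest scores in this row.
        drop = set(sorted(range(n_in), key=lambda j: scores[j])[:n_prune])
        pruned.append([0.0 if j in drop else w for j, w in enumerate(row)])
    return pruned


# Toy data (hypothetical): the same weights pruned with two calibration sets.
weights = [[0.5, -0.1, 0.3, 0.2]]
calib_a = [[1.0, 10.0, 1.0, 1.0]]  # input feature 1 is strongly activated
calib_b = [[1.0, 1.0, 1.0, 10.0]]  # input feature 3 is strongly activated

print(wanda_prune_rows(weights, calib_a, 0.5))
print(wanda_prune_rows(weights, calib_b, 0.5))
```

With the first calibration set the small weight -0.1 survives because its input is strongly activated; with the second it is pruned. Plain magnitude pruning would give the same mask in both cases.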

Models

  • Llama 2-Chat 7B
  • LLaMA 7B

Usage

Parameters

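  • --model: Identifier of the LLaMA model on the Hugging Face model hub.
  • --cache_dir: Directory for loading or storing LLM weights. Default: llm_weights.
  • --prune_method: Pruning method. One of ["magnitude", "wanda", "sparsegpt", "none"].
  • --sparsity_ratio: Proportion of weights to prune (the example below uses 0.5).
  • --sparsity_type: Sparsity type. One of [unstructured, 2:4, 4:8].
  • --save: Directory where results are stored.
  • --calibration: Calibration dataset. One of [c4, oscar, redpajama, pile, gsm8k, svamp, mawps, anli_r1, anli_r2, anli_r3, esnli, rte, boolq, commonsense_qa, race, winogrande, wmt14, ellipses, random].
  • --seed: Seed for sampling the calibration data. Default: 0.
  • --nsamples: Number of calibration samples. Default: 128.
  • --input_format: One of [single, concat, zero]. Default: concat.
  • --seqlen: Length of the context window in tokens. Default: 2048.
  • --data_seqlen: Number of meaningful tokens in each calibration sample; the remainder is filled with padding tokens.
  • --num_incontext: Number of in-context Q-A pairs in each calibration sample.
  • --num_cot_steps: Number of CoT reasoning steps per Q-A pair in a calibration sample. Only takes effect together with --rationale.
  • --rationale: If set, include CoT rationale in the answer portion of each Q-A pair in the calibration samples.
  • --eval_rationale: If set, include CoT rationale in the in-context examples of the prompt at evaluation time.
  • --eval: Evaluation dataset. One of [wikitext, redpajama, oscar, gsm8k, svamp, mawps, anli_r1, anli_r2, anli_r3, esnli, rte, boolq, commonsense_qa, race, winogrande, all]. Default: wikitext.
  • --skip_dense_eval: If set, skip evaluation of the dense (unpruned) model.
  • --verbose: If set, print intermediate results to stdout.
  • --append_to_file: File to append results to.
  • --save_model: Path where the pruned model is saved.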
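
To make the defaults and allowed values in the list above easier to scan, the following sketch mirrors a subset of the documented flags with argparse. It is reconstructed from this page only: the argument types (for example `float` for `--sparsity_ratio`) and the choice of which flags to include are assumptions, and the repository's actual `main.py` may define them differently.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the documented flags (not the repo's code)."""
    p = argparse.ArgumentParser(description="Sketch of the documented pruning flags")
    p.add_argument("--model", type=str)
    p.add_argument("--cache_dir", type=str, default="llm_weights")
    p.add_argument("--prune_method",
                   choices=["magnitude", "wanda", "sparsegpt", "none"])
    p.add_argument("--sparsity_ratio", type=float)  # type is an assumption
    p.add_argument("--sparsity_type", choices=["unstructured", "2:4", "4:8"])
    p.add_argument("--save", type=str)
    p.add_argument("--calibration", type=str)
    p.add_argument("--seed", type=int, default=0)
    p.add_argument("--nsamples", type=int, default=128)
    p.add_argument("--input_format",
                   choices=["single", "concat", "zero"], default="concat")
    p.add_argument("--seqlen", type=int, default=2048)
    p.add_argument("--eval", type=str, default="wikitext")
    # Boolean switches: present means enabled.
    p.add_argument("--rationale", action="store_true")
    p.add_argument("--skip_dense_eval", action="store_true")
    return p


# Parse an explicit argument list so the sketch runs without a command line.
args = build_parser().parse_args([
    "--model", "huggyllama/llama-7b",
    "--prune_method", "wanda",
    "--sparsity_ratio", "0.5",
    "--sparsity_type", "unstructured",
])
print(args.nsamples, args.seqlen, args.input_format)
```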

Example

python main.py --model huggyllama/llama-7b --seed 0 --prune_method wanda --sparsity_ratio 0.5 --sparsity_type unstructured --save out/llama_7b/0/
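
Comparing calibration sets means repeating this command with different `--calibration` values. The sketch below only builds and prints such command lines; it does not run them. The `build_command` helper and the per-dataset output directory layout are hypothetical, while the flag names and values come from the parameter list above.

```python
import shlex

# A few of the documented --calibration choices.
CALIBRATION_SETS = ["c4", "pile", "gsm8k"]


def build_command(calibration, seed=0):
    """Return the argv list for one pruning run (hypothetical helper)."""
    return [
        "python", "main.py",
        "--model", "huggyllama/llama-7b",
        "--seed", str(seed),
        "--prune_method", "wanda",
        "--sparsity_ratio", "0.5",
        "--sparsity_type", "unstructured",
        "--calibration", calibration,
        # Output layout is an assumption made for this example.
        "--save", f"out/llama_7b/{calibration}/{seed}/",
    ]


for name in CALIBRATION_SETS:
    print(shlex.join(build_command(name)))
```

Each printed line can be pasted into a shell, or the list can be passed to `subprocess.run` once the repository and model weights are available.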

Citation

@article{bandari2024c4datasetoptimalpruning,
  title={Is C4 Dataset Optimal for Pruning? An Investigation of Calibration Data for LLM Pruning},
  author={Abhinav Bandari and Lu Yin and Cheng-Yu Hsieh and Ajay Kumar Jaiswal and Tianlong Chen and Li Shen and Ranjay Krishna and Shiwei Liu},
  year={2024},
  eprint={2410.07461},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2410.07461},
}

AI-compiled summary
Dataset introduction
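Construction
To build this collection, the researchers selected datasets widely used for LLM training and evaluation: four pretraining datasets and nine downstream-task datasets. Each is used as calibration data during network pruning, where it supplies the inputs for computing pruning scores. Every downstream dataset is prompted with In-Context Learning (ICL) and Chain-of-Thought (CoT) so that the calibration data stays diverse and broadly applicable.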
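
To make the ICL and CoT prompting concrete, here is a minimal sketch of how one calibration sample could be assembled from question-answer pairs, with the reasoning steps included only when requested. The "Question / Reasoning / Answer" template, the function names, and the toy pairs are all invented for illustration; the actual prompt format used by the authors is not given on this page. The arguments loosely correspond to the `--num_incontext`, `--rationale`, and `--num_cot_steps` flags.

```python
def format_example(question, answer, rationale=None):
    """Render one Q-A pair; the rationale (CoT steps) is optional."""
    text = f"Question: {question}\n"
    if rationale:
        text += "Reasoning: " + " ".join(rationale) + "\n"
    return text + f"Answer: {answer}\n"


def build_calibration_sample(pairs, num_incontext, use_rationale=False,
                             num_cot_steps=None):
    """Concatenate the first num_incontext pairs into one text sample.

    pairs: list of (question, cot_steps, answer) tuples.
    num_cot_steps: keep at most this many reasoning steps (None keeps all).
    """
    parts = []
    for question, steps, answer in pairs[:num_incontext]:
        rationale = steps[:num_cot_steps] if use_rationale else None
        parts.append(format_example(question, answer, rationale))
    return "\n".join(parts)


# Toy pairs written for this example (not taken from any dataset).
pairs = [
    ("What is 2 + 3?", ["2 plus 3 equals 5."], "5"),
    ("What is 4 * 6?", ["4 times 6 is 24.", "So the result is 24."], "24"),
    ("What is 9 - 1?", ["9 minus 1 equals 8."], "8"),
]

plain = build_calibration_sample(pairs, num_incontext=2)
with_cot = build_calibration_sample(pairs, num_incontext=2,
                                    use_rationale=True, num_cot_steps=1)
print(with_cot)
```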
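Features
The collection's distinguishing feature is the breadth of its sources and use cases. It covers the common pretraining datasets C4, Pile, Oscar, and RedPajama, along with downstream-task datasets for arithmetic QA (GSM8K, SVAMP, MAWPS), natural language inference (e-SNLI, ANLI R1, ANLI R3), and commonsense QA (CommonSenseQA, RACE, WinoGrande). This variety makes it well suited to evaluating and tuning LLM pruning.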
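How to use
To configure an experiment, specify the model identifier (for example a LLaMA model), the pruning method (magnitude, wanda, or sparsegpt), the sparsity ratio, the sparsity type, and related parameters. You can also choose the calibration dataset and set a random seed so that runs are reproducible. The provided example script runs experiments under different settings and saves the pruned model and its results.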
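
The sparsity types listed under Parameters (unstructured, 2:4, 4:8) differ in where the zeros may fall. The sketch below illustrates the semi-structured N:M pattern using plain weight magnitude as the score, corresponding to the "magnitude" option rather than Wanda or SparseGPT. It reads N:M as "N weights zeroed in every group of M consecutive weights"; for 2:4 and 4:8 this gives 50% sparsity under either common reading. The function name and numbers are made up for the example.

```python
def prune_n_of_m(row, n, m):
    """Zero the n smallest-magnitude weights in each group of m (sketch)."""
    out = list(row)
    for start in range(0, len(row), m):
        group = range(start, min(start + m, len(row)))
        # Pick the n entries with the smallest absolute value in this group.
        for j in sorted(group, key=lambda j: abs(row[j]))[:n]:
            out[j] = 0.0
    return out


row = [0.9, 0.8, 0.7, 0.6, 0.1, 0.2, 0.3, 0.4]
print(prune_n_of_m(row, 2, 4))  # two zeros forced into each group of four
print(prune_n_of_m(row, 4, 8))  # four zeros anywhere within the group of eight
```

Both masks remove half of the weights, but 2:4 must drop 0.7 and 0.6 to satisfy the per-group constraint, while 4:8 is free to keep them.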
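Background and challenges
Background overview
This collection accompanies a 2024 study by Abhinav Bandari and colleagues on the best choice of calibration data for LLM pruning. The study starts from the observation that existing LLM pruning methods commonly rely on C4 as calibration data, yet whether C4 is optimal had not been examined in depth. By evaluating many commonly used datasets as calibration data, the team shows that the choice significantly affects pruning performance, which offers a new perspective on deploying these models efficiently.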
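Current challenges
Using C4 for LLM pruning raises several challenges. First, existing methods' reliance on C4 has not been thoroughly validated, and its optimality remains to be examined. Second, because calibration data significantly affects pruning performance, finding the best choice among many candidate datasets is difficult. Third, different kinds of downstream tasks call for different calibration data, so an open problem is how to preserve pruning quality while serving diverse tasks.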
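Common scenarios
Classic use cases
C4 is widely used as calibration data in LLM pruning, where it supplies the inputs for computing pruning scores. This study shows, however, that C4 is not the optimal choice, including when compared with other common pretraining datasets such as Pile, Oscar, and RedPajama. Trying different calibration datasets lets researchers assess more precisely how pruning affects model performance, and so improve deployment efficiency.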
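Practical applications
In practice, the study's findings can help lower the deployment cost of large language models. By choosing a better-suited calibration dataset, companies and service providers can cut compute and storage requirements substantially while preserving model performance.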
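Derived and related work
Beyond showing that calibration data selection matters, the study points to follow-up directions: later work may explore more types of calibration datasets or develop methods that select the best calibration data automatically. It also offers ideas for model pruning and optimization in other domains.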
The content above was collected and summarized by AI.