EleutherAI/proof-pile-2 | Machine Learning Dataset | AI Dataset
Dataset Overview
Name: Proof-Pile-2
Size: 55B tokens
Language: English (en)
Task category: text generation (text-generation)
Tags: math
Dataset composition:
- arxiv (29B tokens)
- open-web-math (15B tokens)
- algebraic-stack (11B tokens)
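Dataset Details
Subset descriptions
- arxiv: the ArXiv subset of RedPajama.
- open-web-math: the OpenWebMath dataset, which contains high-quality mathematical text from the internet.
- algebraic-stack: a new dataset of mathematical code, covering numerical computing, computer algebra, and formal mathematics.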
Dataset structure
- Each row has two fields: "text" (the document text) and "meta" (metadata, stored as a JSON string).
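Because "meta" is stored as a JSON string rather than a nested object, it has to be decoded before its fields can be read. The following is a minimal sketch of that step. The example row and the helper name `parse_row` are hypothetical and used only for illustration; only the two top-level keys come from the dataset description.

```python
import json

# Hypothetical row for illustration: the values are made up, only the
# two top-level keys ("text", "meta") come from the dataset description.
row = {
    "text": "Let $x$ be a real number. Then $x^2 \\geq 0$.",
    "meta": json.dumps({"source": "example", "id": 1}),
}


def parse_row(row):
    """Return (text, metadata dict) for one row (hypothetical helper)."""
    meta = row["meta"]
    # "meta" is described as a JSON string, so decode it when it is a str;
    # pass it through unchanged if it has already been decoded.
    meta_dict = json.loads(meta) if isinstance(meta, str) else meta
    return row["text"], meta_dict


text, meta = parse_row(row)
print(meta["source"])
```

The keys inside the decoded metadata are not specified here and may differ between subsets, so it is safer to inspect a few rows before relying on any particular key.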
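License
- The license of the underlying data is left unchanged.
Version history
- v1.1.0: includes an updated version of OpenWebMath with improved filtering, for example the removal of very short documents.
- v1.0.0: the data used to train Llemma 7B and Llemma 34B.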
Citation information
- Entire Proof-Pile-2:
@misc{azerbayev2023llemma,
  title={Llemma: An Open Language Model For Mathematics},
  author={Zhangir Azerbayev and others},
  year={2023},
  eprint={2310.10631},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
- ArXiv subset:
@software{together2023redpajama,
  author={Together Computer},
  title={RedPajama: An Open Source Recipe to Reproduce LLaMA training dataset},
  month={April},
  year={2023},
  url={https://github.com/togethercomputer/RedPajama-Data}
}
- OpenWebMath:
@misc{paster2023openwebmath,
  title={OpenWebMath: An Open Dataset of High-Quality Mathematical Web Text},
  author={Keiran Paster and others},
  year={2023},
  eprint={2310.06786},
  archivePrefix={arXiv},
  primaryClass={cs.AI}
}
