LLM Training

Snowflake | Updated 2023-08-07 | Indexed 2024-05-01
Download link:
https://app.snowflake.com/marketplace/listing/GZTSZ290BUX1X
Description:
The LLM Training data listing includes four text-based datasets covering filings with the SEC, US patent grants, US government contracts, and the global research system catalog from OpenAlex. This product was built to provide users with a set of text-rich data for training, fine-tuning, and inference of large language models (LLMs).

**Topics Covered**

- SEC filings including fiscal calendars, press releases, earnings, annual & quarterly reports, major company events, and quarterly fund holdings
- US patent grants including patent type, invention title, and contributor name/location
- US federal government contract opportunities and awards
- OpenAlex index of scholarly entities (e.g. works, sources, authors, funders, publishers) and how they are connected to one another

**[Documentation](https://data-docs.snowflake.com/foundations/products/llm-training/)**

Refer to [Docs](https://data-docs.snowflake.com/foundations/products/llm-training/) for detailed attributes (e.g. history, release frequency, granularity), source links, release frequencies, notes and methodologies.

**Data Sources**

1. US Securities & Exchange Commission (SEC)
2. United States Patent and Trademark Office (USPTO)
3. OpenAlex
4. System for Award Management (SAM.gov)

Visit the [Data Catalog](https://data-catalog.snowflake.com/) to search all sources Snowflake integrates and the Snowflake Marketplace listings where you can find the data.

**Snowflake's Unified Schema**

All Snowflake products follow the EAV (entity, attributes, value) model with a unified schema. Entities are tangible objects (e.g. geography, company) that Snowflake provides data on. All timeseries' dates and values that refer to the entity are included in a timeseries table. Descriptors of the timeseries are included in an attributes table. Data is joinable across all Snowflake products that have a GEO_ID. Visit [docs](https://data-docs.snowflake.com/intro/concepts/unified-schema/) for more details.
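
The unified schema described above can be illustrated with a minimal sketch. It uses an in-memory SQLite database rather than Snowflake itself, and every table name, column name, and row here (`entity_index`, `attributes`, `timeseries`, `GEO_1`, `example_metric`) is a hypothetical placeholder chosen for illustration, not taken from the actual listing. It only shows the general idea of joining a timeseries table to its entity and attribute descriptors.

```python
import sqlite3

# Hypothetical tables and toy rows that mimic an EAV layout.
# None of these names or values come from the real Snowflake product.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE entity_index (geo_id TEXT PRIMARY KEY, geo_name TEXT);
CREATE TABLE attributes (variable TEXT PRIMARY KEY, variable_name TEXT, unit TEXT);
CREATE TABLE timeseries (geo_id TEXT, variable TEXT, date TEXT, value REAL);

INSERT INTO entity_index VALUES ('GEO_1', 'Example Region');
INSERT INTO attributes VALUES ('example_metric', 'Example metric', 'count');
INSERT INTO timeseries VALUES
    ('GEO_1', 'example_metric', '2023-01-01', 10),
    ('GEO_1', 'example_metric', '2023-02-01', 12);
""")


def load_series(conn, geo_id, variable):
    """Return (entity name, attribute name, unit, date, value) rows
    for one entity and one variable, ordered by date."""
    query = """
        SELECT e.geo_name, a.variable_name, a.unit, t.date, t.value
        FROM timeseries AS t
        JOIN entity_index AS e ON e.geo_id = t.geo_id
        JOIN attributes AS a ON a.variable = t.variable
        WHERE t.geo_id = ? AND t.variable = ?
        ORDER BY t.date
    """
    return conn.execute(query, (geo_id, variable)).fetchall()


rows = load_series(conn, "GEO_1", "example_metric")
for row in rows:
    print(row)
```

Because the entity key (here `geo_id`) is shared, the same join pattern would apply across any tables that carry it, which is what makes the products joinable. The actual table and column names should be taken from the linked documentation.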
*Disclaimer: The data in this app is sourced [here](https://data-docs.snowflake.com/foundations/products/llm-training/). Links to provider license, terms and disclaimers are provided where appropriate. Snowflake is not endorsed by or affiliated with any of these providers.*
Provider:
Snowflake Public Data Products
Created:
2023-08-04
Dataset introduction
Background overview
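This dataset combines four text resources: SEC filings, US patent grants, US government contracts, and the OpenAlex scholarly catalog. It is intended for training and fine-tuning large language models. All data follows Snowflake's unified EAV model schema and can be joined across products via GEO_ID.
The summary above was collected and generated by 遇见数据集.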