AI Ready Data
Source: Databricks. Indexed 2024-10-08.
Download link:
https://marketplace.databricks.com/details/2522a0fa-3d32-45b5-a706-d97ddb5d9ef3/S-P-Global-Energy_AI-Ready-Data
Resource description:
**Overview**
The AI Ready Data dataset covers a broad range of textual content from Energy publications produced by in-house editorial and research teams, including market reports, news articles, rationales, commentaries, fundamentals analyses, outlooks, and more, all in an LLM-friendly format prepared for integration with AI systems.
Customers can use AI Ready Data in their Retrieval-Augmented Generation (RAG) solutions to strengthen their analysis and support informed decision-making. The dataset is not tied to a particular large language model (LLM): you can integrate the model of your choice to uncover patterns, correlations, and insights across commodities. Data processing can be adapted to your organization's needs, and you can use the provided embeddings or generate your own. You can also integrate your own vector database and draw on internal and external data sources to enrich the dataset.
This dataset includes:
- Unstructured data in an AI-ready format broken down into documents and segments with LLM-friendly metadata
- Flexible data delivery
- Easy customization of your own search and relevancy-boosting algorithms
- Ease of discovery of relevant content for your end users
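The retrieval step of a RAG workflow over segment embeddings can be sketched as follows. This is a minimal illustration, not the provider's method: it assumes each segment row is available as a dict whose `SEGMENTEMBEDDINGS` value is a list of floats (the listing also mentions embedding ids, so the actual storage may differ), and the function names `cosine_similarity` and `top_k_segments` are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_segments(
    query_embedding: Sequence[float],
    segments: List[Dict],
    k: int = 3,
) -> List[Tuple[str, float]]:
    """Return (SEGMENTID, score) pairs for the k segments most similar to the query."""
    scored = [
        (seg["SEGMENTID"], cosine_similarity(query_embedding, seg["SEGMENTEMBEDDINGS"]))
        for seg in segments
    ]
    # Highest similarity first; sorted() is stable, so ties keep input order.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy rows with made-up 2-dimensional embeddings, for illustration only.
toy_segments = [
    {"SEGMENTID": "s1", "SEGMENTEMBEDDINGS": [1.0, 0.0]},
    {"SEGMENTID": "s2", "SEGMENTEMBEDDINGS": [0.0, 1.0]},
    {"SEGMENTID": "s3", "SEGMENTEMBEDDINGS": [1.0, 1.0]},
]
best = top_k_segments([1.0, 0.0], toy_segments, k=2)
```

A brute-force scan like this is only practical for small samples; with a vector database, the same ranking would be delegated to its index. The query embedding must come from the same embedding model as the stored segment vectors for the scores to be meaningful.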
**Use cases**
- Machine Learning - Use AI Ready Data from Commodity Insights in a RAG solution
- Pricing Analysis - Uncover insights into pricing assessments using information from assessment summaries, market commentaries, and rationales
- Sentiment Analysis - Perform sentiment analysis on news articles using dedicated libraries
- Fundamental Analysis - Discover insights into fundamental data from content within analyses and research reports
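For the RAG use case, one possible way to turn retrieved segments into a prompt for an LLM is sketched below. It assumes segments arrive as dicts carrying the `SEGMENTID` and `PROCESSEDSEGMENTCONTENT` fields; the function name `build_rag_prompt`, the prompt wording, and the `max_chars` budget are all hypothetical choices, not part of the product.

```python
from typing import Dict, List


def build_rag_prompt(question: str, segments: List[Dict[str, str]], max_chars: int = 2000) -> str:
    """Concatenate retrieved segments into a context block, stopping at a character budget."""
    context_parts: List[str] = []
    used = 0
    for seg in segments:
        # Prefix each segment with its id so an answer can cite its source.
        entry = f"[{seg['SEGMENTID']}] {seg['PROCESSEDSEGMENTCONTENT']}"
        if used + len(entry) > max_chars:
            break  # segments are assumed to be ordered by relevance already
        context_parts.append(entry)
        used += len(entry)
    context = "\n".join(context_parts)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
    )


# Placeholder content, for illustration only.
retrieved = [
    {"SEGMENTID": "s1", "PROCESSEDSEGMENTCONTENT": "alpha"},
    {"SEGMENTID": "s2", "PROCESSEDSEGMENTCONTENT": "b" * 40},
]
prompt = build_rag_prompt("What changed?", retrieved, max_chars=30)
```

The resulting string can be sent to whichever LLM you choose. A character budget is a rough stand-in for a token budget, which depends on the model's tokenizer.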
**Product details**
Sample Tables:
A) DOCUMENT_METADATA
B) SEGMENT_METADATA
Sample Fields:
A) DOCUMENT_METADATA
- PUBLISHED
- UPDATED
- FILETYPE
- FILESIZE
- SOURCEURL
- REPORTINGFREQUENCY
- PRIMARYENTITYTYPE
- PRIMARYENTITYNAME
- DOCUMENT_PRIMARY_ENTITY_IDF
- OTHERDOCUMENTMETADATA
B) SEGMENT_METADATA
- DOCUMENTID
- SEGMENTATIONSTRATEGY
- SEGMENTID
- SEGMENTTYPE
- SEGMENTLOCATION
- RAWSEGMENTCONTENT
- PROCESSEDSEGMENTCONTENT
- LANGUAGE
- SEGMENTOVERLAP
- OTHERSEGMENTMETADATA
- SEGMENTEMBEDDINGS
- SEGMENTORDER
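For typed access in application code, the sample fields above could be mirrored as record types. The sketch below uses the field names from the listing; every Python type is an assumption, since the listing does not state column types, and the helper `from_row` is hypothetical.

```python
from dataclasses import dataclass, fields
from typing import Any, Dict, List, Optional


@dataclass
class DocumentMetadata:
    """One row of DOCUMENT_METADATA; all types are assumptions."""
    PUBLISHED: Optional[str] = None
    UPDATED: Optional[str] = None
    FILETYPE: Optional[str] = None
    FILESIZE: Optional[int] = None
    SOURCEURL: Optional[str] = None  # written in full uppercase here
    REPORTINGFREQUENCY: Optional[str] = None
    PRIMARYENTITYTYPE: Optional[str] = None
    PRIMARYENTITYNAME: Optional[str] = None
    DOCUMENT_PRIMARY_ENTITY_IDF: Optional[str] = None
    OTHERDOCUMENTMETADATA: Optional[Dict[str, Any]] = None


@dataclass
class SegmentMetadata:
    """One row of SEGMENT_METADATA; all types are assumptions."""
    DOCUMENTID: Optional[str] = None
    SEGMENTATIONSTRATEGY: Optional[str] = None
    SEGMENTID: Optional[str] = None
    SEGMENTTYPE: Optional[str] = None
    SEGMENTLOCATION: Optional[str] = None
    RAWSEGMENTCONTENT: Optional[str] = None
    PROCESSEDSEGMENTCONTENT: Optional[str] = None
    LANGUAGE: Optional[str] = None
    SEGMENTOVERLAP: Optional[int] = None
    OTHERSEGMENTMETADATA: Optional[Dict[str, Any]] = None
    SEGMENTEMBEDDINGS: Optional[List[float]] = None
    SEGMENTORDER: Optional[int] = None


def from_row(cls, row: Dict[str, Any]):
    """Build a dataclass instance from a dict, ignoring keys the class does not declare."""
    known = {f.name for f in fields(cls)}
    return cls(**{key: value for key, value in row.items() if key in known})


# Placeholder row, for illustration only.
segment = from_row(SegmentMetadata, {"SEGMENTID": "s1", "SEGMENTORDER": 0, "EXTRA": 1})
```

Defaulting every field to `None` keeps the sketch tolerant of columns that are missing from a given delivery; a stricter schema would drop the defaults once the real column types are known.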
Table Descriptions:
- DOCUMENT_METADATA - Contains document-level metadata such as id, name, file type, size, sourceURL, and reportingFrequency. It also includes related tags such as primary entity, commodity, and geography, plus any additional metadata that helps identify the document.
- SEGMENT_METADATA - Contains chunked segments from documents with metadata such as the related document id, segment id, type, and location, plus the processed and raw content of each segment. It also records the segmentation strategy used to chunk the data and the embedding ids for each segment.
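The relationship between the two tables can be illustrated by grouping segments under their parent document and restoring reading order. This is a sketch under stated assumptions: rows are plain dicts, `DOCUMENTID` links a segment to its document, and `SEGMENTORDER` is a sortable integer. The function names are hypothetical, and overlapping text between neighbouring segments (see `SEGMENTOVERLAP`) is not removed.

```python
from collections import defaultdict
from typing import Dict, List


def group_segments_by_document(segments: List[Dict]) -> Dict[str, List[Dict]]:
    """Map each DOCUMENTID to its segments, sorted by SEGMENTORDER."""
    grouped: Dict[str, List[Dict]] = defaultdict(list)
    for seg in segments:
        grouped[seg["DOCUMENTID"]].append(seg)
    for doc_segments in grouped.values():
        doc_segments.sort(key=lambda s: s["SEGMENTORDER"])
    return dict(grouped)


def document_text(doc_segments: List[Dict], separator: str = "\n") -> str:
    """Join processed segment content in order; overlapping text is kept as-is."""
    return separator.join(s["PROCESSEDSEGMENTCONTENT"] for s in doc_segments)


# Placeholder rows, deliberately out of order, for illustration only.
rows = [
    {"DOCUMENTID": "d1", "SEGMENTID": "s2", "SEGMENTORDER": 1, "PROCESSEDSEGMENTCONTENT": "second"},
    {"DOCUMENTID": "d2", "SEGMENTID": "s3", "SEGMENTORDER": 0, "PROCESSEDSEGMENTCONTENT": "other"},
    {"DOCUMENTID": "d1", "SEGMENTID": "s1", "SEGMENTORDER": 0, "PROCESSEDSEGMENTCONTENT": "first"},
]
by_document = group_segments_by_document(rows)
text_d1 = document_text(by_document["d1"])
```

In a SQL or Spark environment the same result would come from a join on the document id followed by an ordered aggregation; the in-memory version above only shows the shape of the relationship.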
Provider:
S&P Global Energy
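Collected summary
Dataset introduction

Background and challenges
Background overview
This dataset provides AI-ready text content from across the energy sector. It supports RAG solutions and multi-scenario analysis applications, includes structured metadata at the document and segment level, and supports flexible integration and custom processing.
The above content was collected and summarized by 遇见数据集.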



