CEO Pay Ratio and Critical Audit Matters (CAMs) Datasets
Source: arXiv · Updated 2024-12-03 · Indexed 2024-12-05
Download link:
http://arxiv.org/abs/2412.02065v1
Resource overview:
The CEO Pay Ratio and Critical Audit Matters (CAMs) Datasets are two datasets developed by a research team affiliated with University College, University of Oxford. They were built by using Large Language Models (LLMs) to automate data collection from unstructured sources, with the aim of addressing unequal data access in academic research. The datasets contain CEO pay ratios extracted from approximately 10,000 proxy statements and CAMs extracted from more than 12,000 10-K filings. They were created with the GPT-4o-mini model inside a Retrieval-Augmented Generation (RAG) framework, relying on carefully designed prompts to make extraction efficient and accurate. The datasets are intended mainly for academic research in finance and accounting, where they address the high cost of acquiring data and the inefficiency of manual collection, and thereby help democratize research resources.
Provider:
University College, University of Oxford
Created:
2024-12-03
Collected Summary
Dataset Introduction

Construction Method
The CEO Pay Ratio and Critical Audit Matters (CAMs) Datasets were constructed using a novel methodology that leverages Large Language Models (LLMs) within a Retrieval-Augmented Generation (RAG) framework. This approach automates data collection from unstructured corporate disclosures, specifically focusing on CEO pay ratios from approximately 10,000 proxy statements and CAMs from more than 12,000 10-K filings. The methodology involves using GPT-4o-mini to extract targeted information from complex corporate filings, significantly reducing the time and cost associated with manual data collection. The process includes retrieving relevant passages from a large corpus, conditioning the language model for more accurate output, and employing careful prompt engineering to guide the LLM in extracting and structuring complex data from various disclosure formats.
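The retrieve-then-prompt flow described above can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the source does not specify the retriever, chunk size, or prompt wording, so the keyword-overlap scoring (a stand-in for whatever retrieval the study used), the `max_words` value, and all function names here are assumptions. No model is called; the sketch stops at the prompt that would be sent to an LLM such as GPT-4o-mini.

```python
import re
from collections import Counter


def split_into_chunks(text, max_words=40):
    """Split a filing into fixed-size word chunks (chunk size is an assumption)."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def score(chunk, query):
    """Toy lexical relevance: how often the distinct query terms occur in the chunk."""
    tokens = Counter(re.findall(r"[a-z0-9]+", chunk.lower()))
    query_terms = set(re.findall(r"[a-z0-9]+", query.lower()))
    return sum(tokens[term] for term in query_terms)


def retrieve(chunks, query, k=2):
    """Return the k chunks with the highest relevance score."""
    return sorted(chunks, key=lambda c: score(c, query), reverse=True)[:k]


def build_prompt(passages, question):
    """Assemble a prompt that conditions the model on the retrieved passages only."""
    context = "\n---\n".join(passages)
    return (
        "Use only the context below to answer.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer in JSON."
    )


if __name__ == "__main__":
    # Synthetic text standing in for a proxy statement; not taken from any real filing.
    filing = (
        "The company held its annual meeting of shareholders in May. " * 20
        + "The ratio of the annual total compensation of our CEO to the median "
        "employee was 312 to 1. "
        + "Directors are elected annually by a plurality vote. " * 20
    )
    top = retrieve(split_into_chunks(filing), "CEO pay ratio median employee compensation", k=1)
    print(build_prompt(top, "What is the CEO pay ratio?"))
```

Sending only the top-ranked passages keeps the prompt short, which is one plausible reason a retrieval step lowers cost on long filings.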
Usage
Researchers can utilize the CEO Pay Ratio and Critical Audit Matters (CAMs) Datasets by accessing the detailed documentation provided, which includes step-by-step instructions, code snippets, and practical insights. This documentation serves as a roadmap for implementing similar techniques in their own work. The datasets can be used to investigate the impact of regulatory changes on executive compensation, corporate governance, and financial reporting. By making the data publicly available, the study aims to stimulate further research in these critical areas. Researchers can also adapt the methodology for other data collection tasks across various topics and document types, leveraging the efficiency and cost-effectiveness of the LLM-based approach.
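As a rough illustration of how the pay ratio data might feed an analysis, the sketch below computes the median pay ratio per fiscal year. The column names (`company`, `fiscal_year`, `pay_ratio`) and the sample rows are hypothetical placeholders, since this page does not describe the released file layout; the rows are synthetic and do not come from the datasets.

```python
import csv
import io
import statistics
from collections import defaultdict

# Synthetic placeholder rows; column names are assumptions, not the real schema.
SAMPLE = """company,fiscal_year,pay_ratio
A,2018,100
B,2018,200
C,2018,300
A,2019,120
B,2019,180
"""


def median_ratio_by_year(handle):
    """Group pay ratios by fiscal year and return the median for each year."""
    by_year = defaultdict(list)
    for row in csv.DictReader(handle):
        by_year[int(row["fiscal_year"])].append(float(row["pay_ratio"]))
    return {year: statistics.median(values) for year, values in sorted(by_year.items())}


if __name__ == "__main__":
    print(median_ratio_by_year(io.StringIO(SAMPLE)))
```

With a real file, `io.StringIO(SAMPLE)` would be replaced by an opened CSV handle, and the column names adjusted to match the published data.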
Background and Challenges
Background Overview
The CEO Pay Ratio and Critical Audit Matters (CAMs) Datasets were developed by Julian Junyan Wang from University College, University of Oxford, and Victor Xiaoqi Wang from California State University Long Beach in November 2024. These datasets were created to address the long-standing issue of unequal access to costly financial datasets, which has hindered researchers from disadvantaged institutions. The primary goal was to democratize access to these essential datasets by leveraging recent breakthroughs in Large Language Models (LLMs). The researchers utilized GPT-4o-mini within a Retrieval-Augmented Generation (RAG) framework to automate data collection from unstructured sources, achieving human-level accuracy in collecting CEO pay ratios and CAMs from corporate disclosures. This methodology significantly reduced the time and cost associated with manual data collection, making it a scalable and cost-effective solution for researchers with limited resources.
Current Challenges
The primary challenge these datasets address is unequal access to costly financial datasets, which has long been a barrier for researchers at less affluent institutions. Constructing the datasets posed its own challenges, particularly in extracting specific information from unstructured sources such as corporate disclosures. Inconsistent formatting across company documents makes traditional rule-based methods inaccurate and inefficient, and extracting specific items within a section is harder still, so data on emerging issues has typically had to be collected by hand. LLMs and the RAG framework were used to overcome these challenges by automating data collection, reducing processing time, and lowering costs, thereby democratizing access to essential financial data for academic research.
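Common Use Cases
Classic Use Cases
The classic use cases for the CEO Pay Ratio and Critical Audit Matters (CAMs) Datasets center on empirical research in finance and accounting. Researchers can use the datasets to analyze the ratio of executive pay to employee pay and the disclosure of critical audit matters in audit reports. The datasets provide rich quantitative and qualitative data for examining topics such as pay inequality, corporate governance, and the transparency of financial reporting.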
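Academic Problems Addressed
The datasets address a common data-access problem in academic research, particularly for institutions with limited resources. Traditionally, obtaining this data required expensive commercial database subscriptions or time-consuming manual collection. By automating the collection process with Large Language Models (LLMs), the datasets substantially lower the barrier to data access and allow more researchers to conduct high-impact research. This promotes diversity and innovation in academic research and also improves its reproducibility and transparency.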
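Practical Applications
In practice, the CEO Pay Ratio and Critical Audit Matters (CAMs) Datasets can help investors, financial analysts, and regulators assess the quality of corporate governance and financial reporting more effectively. For example, investors can use the ratio of executive pay to employee pay to evaluate a company's pay fairness and potential governance risks. Financial analysts can analyze the disclosure of critical audit matters to better understand a company's financial health and audit risk. Regulators can use the data to monitor and assess whether companies comply with the relevant regulations and disclosure requirements.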
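Recent Research on the Datasets
Latest Research Directions
In finance and accounting, the latest research direction for the CEO Pay Ratio and Critical Audit Matters (CAMs) Datasets centers on automating data collection with Large Language Models (LLMs) to lower the barrier to academic research. The researchers developed a new method based on the GPT-4o-mini model and a Retrieval-Augmented Generation (RAG) framework that collects CEO pay ratios and Critical Audit Matters (CAMs) from corporate disclosure documents efficiently and at low cost. The method substantially reduces the time and cost of manual data collection and gives resource-constrained academic institutions more equal research opportunities. The research also examines how detailed prompt engineering and multi-step validation can improve the accuracy and reliability of data collection, thereby advancing innovation in finance and accounting research.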
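One way to picture a validation step of this kind is a consistency check on each extracted record. The sketch below is an assumption about what such a check could look like, not the procedure used in the study: the JSON field names (`ceo_compensation`, `median_employee_compensation`, `pay_ratio`) and the 5% tolerance are hypothetical. It rejects malformed replies and flags records whose reported ratio disagrees with the ratio implied by the two compensation figures.

```python
import json

REQUIRED_FIELDS = ("ceo_compensation", "median_employee_compensation", "pay_ratio")


def parse_reply(reply):
    """Parse a model reply as JSON; return None if it is malformed or incomplete."""
    try:
        data = json.loads(reply)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    for field in REQUIRED_FIELDS:
        value = data.get(field)
        # bool is a subclass of int, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            return None
    return data


def is_consistent(record, rel_tol=0.05):
    """Check that the reported ratio matches CEO pay divided by median pay."""
    median = record["median_employee_compensation"]
    reported = record["pay_ratio"]
    if median <= 0 or reported <= 0:
        return False
    implied = record["ceo_compensation"] / median
    return abs(implied - reported) <= rel_tol * reported


if __name__ == "__main__":
    # Synthetic reply for illustration; not drawn from any real filing.
    reply = '{"ceo_compensation": 15600000, "median_employee_compensation": 50000, "pay_ratio": 312}'
    record = parse_reply(reply)
    print(record is not None and is_consistent(record))
```

Records that fail either step could be routed to a second extraction attempt or to manual review, depending on how strict the workflow needs to be.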
Related Research Papers
- Leveraging Large Language Models to Democratize Access to Costly Financial Datasets for Academic Research. University College, University of Oxford · 2024
The content above was collected and summarized by 遇见数据集.



