Common Ownership Data: Scraped SEC form 13F filings for 1999-2017
Source: DataCite Commons. Updated 2025-05-12; indexed 2025-04-15.
Download link:
https://dataverse.harvard.edu/citation?persistentId=doi:10.7910/DVN/ZRH3EU
Description:
<h3>Introduction</h3>
In the course of researching the common ownership hypothesis, we found a number of issues with the Thomson Reuters (TR) "S34" dataset, which is used by many researchers and frequently accessed via Wharton Research Data Services (WRDS). WRDS has done <a href="https://wrds-www.wharton.upenn.edu/login/?next=/documents/952/S12_and_S34_Regenerated_Data_2010-2016.pdf">extensive work</a> to improve the database, working with other researchers who have uncovered problems, and specifically fixing a lack of records of BlackRock holdings. However, even with the updated dataset posted in the summer of 2018, we discovered a number of discrepancies when accessing data for constituent firms of the S&P 500 Index. We therefore set out to create a separate dataset of 13(f) holdings from the source documents, which are all public and available electronically from the Securities and Exchange Commission (SEC) website. Coverage is good starting in 1999, when electronic filing became mandatory. Note, however, that the SEC's Inspector General issued a <a href="https://www.sec.gov/files/480.pdf">critical report</a> in 2010 about the information contained in 13(f) filings.
<h3>The process:</h3>
<ul>
<li>We gathered all 13(f) filings from 1999-2017 <a href="https://www.sec.gov/divisions/investment/13flists.htm">here</a>. The corpus is over 318,000 filings and occupies ~25GB of space if unzipped. (We do not include the raw filings here as they can be downloaded from EDGAR).
<li>We wrote code to parse the filings to extract holding information using regular expressions in Perl. Our target list of holdings was all public firms with a market capitalization of at least $10M. From the header of the file, we first extract the filing date, reporting date, and reporting entity (Central Index Key, or CIK, and CIKNAME).
<ul>
<li>Beginning with the September 30 2013 filing date, all filings were in XML format, which made parsing fairly straightforward, as all values are contained in tags.
<li>Prior to that date, the filings are remarkable for their heterogeneous formatting. Several examples are described below. Our approach was to look for any line containing a CUSIP code that we were interested in, and then attempt to determine the "number of shares" field and the "value" field. To help validate the values we extracted, we downloaded stock price data from CRSP for the filing date, which allows for a logic check of (price * shares) = value. <em>We do not claim that this will exhaustively extract all holding information. We can provide examples of filings that are formatted in such a way that we are not able to extract the relevant information.</em>
<li>In both XML and non-XML filings, we attempt to remove any derivative holdings by looking for phrases such as OPT, CALL, PUT, WARR, etc.
</ul>
<li>We then perform some final data cleaning: in the case of amended filings, we keep the amended level of holdings if a) the amended report occurred within 90 days of the reporting date and b) the initial filing fails our logic check described above.
</ul>
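The CUSIP line search, the derivative filter, and the (price * shares) = value logic check described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the authors' Perl code (find_holdings_snp.pl): the function name parse_holding_line, the 5% tolerance, and the assumption that the value column is reported in thousands of dollars are ours, and the sample row is made up.

```python
import re

# Keywords the description mentions for spotting derivative positions.
DERIVATIVE_RE = re.compile(r"\b(OPT|CALL|PUT|WARR)\b", re.IGNORECASE)
# A run of digits with optional thousands separators, e.g. "100,000".
NUMBER_RE = re.compile(r"\d[\d,]*")


def parse_holding_line(line, cusip, price, tol=0.05):
    """Return (shares, value_in_thousands) for one text line, or None.

    Hypothetical helper: tries every ordered pair of numeric fields that
    follow the CUSIP and keeps the first pair that satisfies
    price * shares ~= value * 1000 (value assumed to be in $ thousands).
    """
    if cusip not in line or DERIVATIVE_RE.search(line):
        return None
    tail = line.split(cusip, 1)[1]
    numbers = [int(tok.replace(",", "")) for tok in NUMBER_RE.findall(tail)]
    for i, shares in enumerate(numbers):
        for j, value in enumerate(numbers):
            if i == j or shares == 0 or value == 0:
                continue
            implied = price * shares
            if abs(implied - value * 1000) <= tol * implied:
                return shares, value
    return None


# Made-up row: value column first, then shares, at a price of $25.
sample = "CISCO SYS INC  COM  17275R102   2,500   100,000  SOLE"
print(parse_holding_line(sample, "17275R102", 25.0))
```

Because the price check decides which number is the share count, the same function handles filings that list shares before value. It does not handle comma-separated tables, where the regular expression would fuse adjacent fields, or rows split across lines.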
The resulting dataset has around 48M reported holdings (CIK-CUSIP) for all 76 quarters and between 4,000 and 7,000 CUSIPs and between 1,000 and 4,000 investors per quarter. We do not claim that our dataset is perfect; there are undoubtedly errors. As documented elsewhere, there are often errors in the actual source documents as well. However, our method seemed to produce more reliable data in several cases than the TR dataset, as shown in Online Appendix B of the related paper linked above.
<h3> Included Files </h3>
<ul>
<li>Perl Parsing Code (find_holdings_snp.pl). For reference, only needed if you wish to re-parse original filings.
<li>Investor holdings for 1999-2017: lightly cleaned. Each CIK-CUSIP-rdate is unique. Over 47M records. The fields are
<ul>
<li>CIK: the central index key assigned by the SEC for this investor. Mapping to names is available below.
<li>CUSIP: the identity of the holdings. Consult the SEC's 13(f) listings to identify your CUSIPs of interest.
<li>shares: the number of shares reportedly held. Merging in CRSP data on shares outstanding at the CUSIP-Month level allows one to construct \beta. We make no distinction for the sole/shared/none voting discretion fields. If a researcher is interested, we did collect that starting in mid-2013, when filings are in XML format.
<li>rdate: reporting date (end of quarter). 8 digit, YYYYMMDD.
<li>fdate: filing date. 8 digit, YYYYMMDD.
<li>ftype: the form name.
<li>Notes: we did not consolidate separate BlackRock entities (or any other possibly related entities). If one wants to do so, use the CIK-CIKname mapping file below. We drop any CUSIP-rdate observation where any investor in that CUSIP reports owning greater than 50% of shares outstanding (even though legitimate cases exist - see, for example, Diamond Offshore and Loews Corporation). We also drop any CUSIP-rdate observation where greater than 120% of shares outstanding are reported to be held by 13(f) investors. Cases where the shares held are listed as zero likely mean the investor filing lists a holding for the firm but that our code could not find the number of shares due to the formatting of the file. We leave these in the data so that any researchers that find a zero know to go back to that source filing to manually gather the holdings for the securities they are interested in.
</ul>
<li>Processed 13f holdings (airlines.parquet, cereal.parquet, out_scrape.parquet). These are used in our related AEJ:Microeconomics paper. The files contain all firms within the airline industry, RTE cereal industry, and all large cap firms (a superset of the S&P 500) respectively.
<BR>
These are a merged version of the scrape_parsed.csv file described above that includes the shares outstanding and percent ownership used to calculate measures of common ownership. They are distributed as brotli-compressed Apache Parquet (binary) files, which preserves date information correctly.
<ul>
<li>mgrno: manager number (which is actually CIK in the scraped data)
<li>rdate: reporting date
<li>ncusip: cusip
<li>rrdate: reporting date in Stata format
<li>mgrname: manager name
<li>shares: shares
<li>sole: shares with sole authority
<li>shared: shares with shared authority
<li>none: shares with no authority
<li>isbr/isfi/iss/isba/isvg: indicators for whether the manager is BlackRock, Fidelity, State Street, Barclays, or Vanguard, respectively
<li>numowners: how many owners
<li>prc: price at reporting date
<li>shares_out: shares outstanding at reporting date
<li>value: reported value in 13(f)
<li>beta: shares/shares_out
<li>permno: permno
</ul>
<li>Profit weight values (i.e. \kappa) for all firms in the sample. (public_scrape_kappas_XXXX.parquet). Each file represents one year of data and is around 200MB and distributed as a compressed (brotli) parquet file. Fields are simply CUSIP_FROM, CUSIP_TO, KAPPA, QUARTER.
Note that these have not been adjusted for multi-class share firms, insider holdings, etc. If looking at a particular market, some additional data cleaning on the investor holdings (above) followed by recomputing profit weights is recommended.
<ul>
<li>For this, we did merge the separate BlackRock entities prior to computing \kappa.
</ul>
<li>CIKmap.csv (~250K observations)
<ul>
<li>Mapping is from CIK-rdate to CIKname. Use this if you want to consolidate holdings across reporting entities or explore the identities of reporting firms.
<li>In the case of amended filings that use different names than original ones, we keep the earliest name.
</ul>
</ul>
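The relationship between the fields above (beta = shares/shares_out, the 50% and 120% drop rules, and the profit weights \kappa) can be sketched for a single quarter as follows. This is a hedged illustration, not the code used to build the distributed files: the function names and the toy inputs are ours, and the \kappa formula assumes proportional control (control weights equal to ownership shares), which this description does not state explicitly.

```python
from collections import defaultdict


def compute_betas(holdings, shares_out):
    """holdings: {(cik, cusip): shares}; shares_out: {cusip: shares outstanding}.

    Returns {cusip: {cik: beta}} with beta = shares / shares_out, after
    dropping any CUSIP where a single investor reports more than 50% of
    shares outstanding or where all investors together exceed 120%.
    """
    by_cusip = defaultdict(dict)
    for (cik, cusip), shares in holdings.items():
        by_cusip[cusip][cik] = shares / shares_out[cusip]
    return {
        cusip: betas
        for cusip, betas in by_cusip.items()
        if max(betas.values()) <= 0.5 and sum(betas.values()) <= 1.2
    }


def profit_weight(betas, cusip_from, cusip_to):
    """kappa from one firm to another under proportional control
    (an assumption of this sketch):
    sum_s beta_from,s * beta_to,s / sum_s beta_from,s ** 2.
    """
    b_from, b_to = betas[cusip_from], betas[cusip_to]
    numerator = sum(b * b_to.get(cik, 0.0) for cik, b in b_from.items())
    denominator = sum(b * b for b in b_from.values())
    return numerator / denominator


# Toy quarter: investors A and B, firms X, Y, Z with 100 shares each.
holdings = {("A", "X"): 10, ("B", "X"): 20, ("A", "Y"): 30, ("A", "Z"): 60}
shares_out = {"X": 100, "Y": 100, "Z": 100}
betas = compute_betas(holdings, shares_out)  # Z is dropped (60% holder)
print(profit_weight(betas, "X", "Y"))
```

Note that \kappa is not symmetric: the weight from X to Y generally differs from the weight from Y to X because the denominators differ. A firm whose recorded holdings are all zero would produce a zero denominator and needs separate handling.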
<h3>Example of Parsing Challenge</h3>
Prior to the XML era, filings were far from uniform, which creates a notable challenge for parsing them for holdings. In the examples directory we include several example text files of raw 13f filings.
<ul>
<li>Example 1 is a "well behaved" filing, with CUSIP, followed by value, followed by number of shares, as recommended by the SEC.
<li>Example 2 shows a case where the ordering is changed: CUSIP, then shares, then value. The column headers show "item 5" coming before "item 4".
<li>Example 3 shows a case of a fixed width table, which in principle could be parsed very easily using the &lt;C&gt; tags at the top, although not all filings consistently use these tags.
<li>Example 4 shows a case with a fixed width table, with no &lt;C&gt; tag for the CUSIP column. Also, notice that if the investor holds more than 10M shares of a firm, that number occupies the entire width of the column and there is no longer a column separator (i.e. Cisco Systems on line 374).
<li>Example 5 shows a comma-separated table format.
<li>Example 6 shows a case of changing the column ordering, but also adding an (unrequired) column for share price.
<li>Example 7 shows a case where the table is split across subsequent pages, and so the CUSIP appears on a different line than the number of shares.
</ul>
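The missing column separator noted in Example 4 can be illustrated with a short Python sketch. The row and the column offsets below are made up for illustration (they are not taken from the example files), and split_fixed_width is a hypothetical helper: it shows why splitting on whitespace fails once a number fills its column, and how slicing at known character offsets avoids the problem when those offsets can be recovered from the filing.

```python
def split_fixed_width(line, boundaries):
    """Slice a line at the given character offsets and strip each field."""
    edges = list(boundaries) + [len(line)]
    return [line[start:end].strip() for start, end in zip(edges, edges[1:])]


# Made-up row: name (16 chars), CUSIP (10), value (8), shares (8).
# The 8-digit share count fills its column, so it touches the value field.
row = f"{'CISCO SYS INC':<16}{'17275R102':<10}{250000:>8}{10000000:>8}"

# Whitespace splitting fuses value and shares into one token.
print(row.split())

# Slicing at the (assumed known) column offsets keeps them apart.
print(split_fixed_width(row, [0, 16, 26, 34]))
```

In practice the offsets differ from filing to filing and are not always recoverable, which is why a price-based logic check remains useful as a second line of validation.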
Provider:
Harvard Dataverse
Created:
2020-08-14



