Jidi1997/gprop_training_dataset

Name: Jidi1997/gprop_training_dataset
Creator: Jidi1997
Published: 2026-03-24 14:01:45
License: No description available

Source: Hugging Face. Updated 2026-03-24; indexed 2026-03-29.

Download link:

https://hf-mirror.com/datasets/Jidi1997/gprop_training_dataset


Resource description:

---
license: cc-by-nc-4.0
task_categories:
- text-classification
language:
- en
tags:
- climate
- ESG
- shareholder-proposals
- sustainable-finance
- ClimateBERT
- ISS
pretty_name: green_proposal_training_dataset
size_categories:
- 1K<n<10K
dataset_info:
  features:
  - name: text
    dtype: string
  - name: label
    dtype: int64
  splits:
  - name: train
    num_bytes: 614145.6
    num_examples: 1200
  - name: test
    num_bytes: 153536.4
    num_examples: 300
  download_size: 165315
  dataset_size: 767682.0
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: test
    path: data/test-*
---

# 🌍 Green Shareholder Proposal Classification Dataset

## 📖 Dataset Summary

This dataset contains manually curated and labelled ISS (Institutional Shareholder Services) shareholder proposals used to fine-tune [`climatebert/distilroberta-base-climate-detector`](https://huggingface.co/climatebert/distilroberta-base-climate-detector) for binary classification of **green (climate/environmental) shareholder proposals**.

Each record provides a structured natural-language description of a shareholder proposal, including the filing sponsor's type, the target company's industry, the resolution text, and a broader agenda class context, alongside a binary label indicating whether the proposal is classified as a green/climate resolution.

The model fine-tuned on this dataset achieves 📈 **F1 = 0.981** (validation set, best checkpoint at epoch 4).

---

## 🔍 Dataset Details

### 📝 Dataset Description

The source data are shareholder proposals drawn from the ISS database. This training dataset is a stratified sample of **1,500 observations** constructed to support supervised fine-tuning of a climate text classifier.

* 🎯 **Sampling design:** A preliminary LLM classification pass was first run on the full corpus to generate a confidence score (`score`) for each observation. Proposals were then ranked into quartiles by score.
  To maximise label informativeness, the training sample was drawn from:
  * **Q1 (low confidence):** 750 randomly sampled proposals where the preliminary model was least certain, including cases where the prior label conflicted with the first-round prediction.
  * **Q2 (mid confidence):** 750 randomly sampled proposals from the second quartile.

* 🧩 **Text construction:** Each `text` field is a single constructed sentence combining four information sources:

  > **A(An) `[sponsor_type]`-type sponsor has filed a shareholder proposal to a(an) `[sic2_des]`-sector company. This proposal requests: `[resolution]`.** *(Optional suffix)* **[It falls under a broader agenda class that may include items not directly relevant to this specific proposal: `[AgendaCodeInformation]`]**

  Where the variables represent:
  1. `sponsor` information
  2. `industry` classification
  3. Raw `resolution` text
  4. Broader `agenda` context (when available)

  The `AgendaCodeInformation` clause is drawn from the ISS Agenda Code taxonomy (2023 edition).

* 🏷️ **Label construction and manual corrections:** The binary label was initially derived from a prior classification round and subsequently corrected through systematic rule-based review. Proposals matching the following topics were manually relabelled as non-green (`label = 0`) regardless of model output:

  > GMO / genetically engineered products and ingredients, pesticides, glyphosate, nanomaterials, tobacco, food waste, food safety, palm oil, water rights (human right to water / sanitation), net neutrality, mandatory arbitration, child labour, animal welfare (battery cages / gestation crates), mountaintop mining financing, mercury in dental amalgam, China ESG congruence, bulldozer sales to Israel, gender equality sustainability reports, cocoa supply chain, REIT conversion, spin-off postponement, water risk management.
  These exclusions reflect the conceptual boundary of the label: the target class captures proposals directly related to **🌱 climate change and environmental sustainability**, and excludes proposals that are primarily social, governance, or only tangentially environmental.

* 🛠️ **Curation:** Manual rule-based review + stratified sampling
* 🌐 **Language:** English
* ⚖️ **License & Usage:** This dataset is released under a restrictive license for non-commercial academic research only. While the data has been strictly anonymized (removing company names, sponsor identities, and direct identifiers) and recontextualized for machine learning purposes, the underlying resolution texts are derived from Institutional Shareholder Services (ISS). Users of this dataset must ensure their use cases comply with ISS licensing terms and do not use this data for commercial financial products.

### 🏗️ Dataset Structure

#### 🗂️ Data Fields

| Field | Type | Description |
|---|---|---|
| `order` | integer | Row index |
| `resolution` | string | Raw resolution text as provided by ISS |
| `text` | string | Constructed LLM input text (see text construction above) |
| `label` | integer | Binary label: `1` = green/climate proposal, `0` = non-green |

#### 📊 Label Distribution

| Label | Class | Approximate share |
|---|---|---|
| 1 | Green proposal | ~51.5% |
| 0 | Non-green proposal | ~48.5% |

#### 📂 Data Splits

The dataset is provided as a single file (`gprop_train.csv`). The fine-tuning notebook applies an 80/20 train–validation split (`seed=42`), yielding 1,200 training examples and 300 validation examples.

---

## 🚀 Uses

### ✅ Direct Use

Fine-tuning or evaluation of text classifiers for identifying climate- and environment-related shareholder proposals. The structured `text` field is designed as a drop-in input for transformer-based sequence classifiers.

### ❌ Out-of-Scope Use

- Classification of proposals outside the ISS universe without verifying domain coverage.
- Generalisation to non-English-language proposals.
- Use cases requiring legal or compliance-grade classification; this dataset reflects research-oriented annotation decisions.

---

## ⚙️ Dataset Creation

### 🏛️ Source Data

Proposals and sponsor types were drawn from the ISS database. Industry descriptions derive from the SIC-2 taxonomy. Agenda code information is sourced from the ISS Agenda Code table (2023 edition).

### ✍️ Annotations

Labels originate from a multi-stage process:

1. **Seed labels** from a prior classification dataset containing human-verified green proposal labels propagated across duplicate resolution texts.
2. **First-round LLM pass** using a ClimateBERT-based classifier on text constructed without agenda code context, producing `prediction` and `score` fields.
3. **Stratified sampling** based on first-round confidence scores to select the most informative training examples.
4. **Manual rule-based correction** applying the exclusion rules listed above to align labels with the intended conceptual scope.

### 🛡️ Personal and Sensitive Information

The dataset contains no individual-level personal data. Proposal texts can be obtained from DEF 14A filings or other public shareholder meeting filings.

---

## 🤖 Model Trained on This Dataset

This dataset was created specifically to fine-tune the climate-related shareholder proposal classifier.

**Base model:** [`climatebert/distilroberta-base-climate-detector`](https://huggingface.co/climatebert/distilroberta-base-climate-detector)

Using this dataset, the fine-tuned model achieved an 📈 **F1 score of 0.981** and an **accuracy of 0.980** on the validation set.

For detailed training hyperparameters, evaluation metrics, and model limitations, please refer to the **[Model Card](mylink)**.

---

## 📜 Citation

If you use this dataset, please cite the associated working paper (forthcoming). This section will be updated with the full DOI/BibTeX once the working paper is published.

---

## 📬 Contact

For questions about the dataset or associated research, please open an issue on this repository.
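
---

## 🧩 Illustrative Sketch: Building the `text` Field

The text construction template given under Dataset Description can be sketched in Python roughly as follows. This is a minimal illustration, not the authors' preprocessing code. The function name `build_text`, its parameter names, and the sample values are hypothetical. The sketch also assumes that "A(An)" and "a(an)" appear literally in the constructed text, and that the outer square brackets in the template only mark the agenda clause as optional.

```python
from typing import Optional


def build_text(
    sponsor_type: str,
    sic2_des: str,
    resolution: str,
    agenda_info: Optional[str] = None,
) -> str:
    """Assemble a classifier input string from the four information sources.

    Hypothetical helper: the wording follows the template in the dataset
    card, but the exact punctuation and bracket handling are assumptions.
    """
    text = (
        f"A(An) {sponsor_type}-type sponsor has filed a shareholder proposal "
        f"to a(an) {sic2_des}-sector company. "
        f"This proposal requests: {resolution}."
    )
    # The agenda clause is optional and is appended only when available.
    if agenda_info:
        text += (
            " It falls under a broader agenda class that may include items "
            "not directly relevant to this specific proposal: "
            f"{agenda_info}"
        )
    return text


# Invented sample values, for illustration only (not taken from the dataset).
example = build_text(
    sponsor_type="institutional",
    sic2_des="Oil and Gas Extraction",
    resolution="Report on greenhouse gas emission reduction targets",
    agenda_info="Environmental reporting",
)
print(example)
```

Appending the agenda clause only when a value is supplied mirrors the "(when available)" condition attached to the agenda context in the template.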

Provider:

Jidi1997
