
josuediazflores/aao-eb1a-decisions

Hugging Face · Updated 2026-04-06 · Indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/josuediazflores/aao-eb1a-decisions
Resource description:
---
license: mit
task_categories:
- text-classification
- text-generation
- question-answering
language:
- en
tags:
- legal
- immigration
- lora
- fine-tuning
- o1a
- eb1a
- uscis
- aao
pretty_name: AAO EB-1A Extraordinary Ability Decisions (Structured)
size_categories:
- 1K<n<10K
---

# AAO EB-1A Extraordinary Ability Decisions: Structured Dataset

Structured extractions from **1,466 USCIS Administrative Appeals Office (AAO) non-precedent decisions** on EB-1A extraordinary ability petitions (I-140). Each case has been decomposed into structured components by Claude Sonnet for use in fine-tuning legal reasoning models.

Part of [Project Greenlight](https://github.com/josuediazflores/ProjectGreenlight), an AI-powered O-1A/EB-1A visa intelligence system.

## Dataset Description

Each JSON file represents one AAO decision with the following fields:

| Field | Type | Description |
|-------|------|-------------|
| `petitioner_background` | string | Who the petitioner is, their field, nationality, career summary (2-4 sentences) |
| `evidence_per_criterion` | array | Per-criterion breakdown with evidence submitted, AAO analysis, and met/not-met |
| `outcome` | string | `sustain`, `dismiss`, or `remand` |
| `outcome_reasoning` | string | AAO's overall reasoning for the final decision (2-5 sentences) |
| `legal_citations` | array | Every case, statute, or regulation cited |
| `fraud_or_procedural_issues` | string/null | Any fraud findings or procedural problems |
| `filename` | string | Source PDF filename |
| `score` | float | Quality score (0-10) from automated rubric |
| `original_outcome` | string | Outcome from initial scoring pass |
| `original_criteria` | array | Criteria identified in initial scoring pass |

### Evidence Per Criterion Schema

```json
{
  "criterion": "original contributions",
  "evidence_submitted": "Detailed description of what the petitioner submitted...",
  "aao_analysis": "What the AAO said about this evidence and why it met or failed...",
  "met": false
}
```

Valid criterion values: `awards`, `membership`, `published material`, `judging`, `original contributions`, `scholarly articles`, `exhibition`, `leading role`, `high salary`, `commercial success`

## Dataset Statistics

| Metric | Value |
|--------|-------|
| Total cases | 1,466 |
| Avg criteria per case | 5.9 |
| Cases with fraud/procedural issues | 989 (67%) |

### Outcome Distribution

| Outcome | Count | Percentage |
|---------|-------|------------|
| Dismiss | 1,390 | 94.8% |
| Sustain | 45 | 3.1% |
| Remand | 31 | 2.1% |

### Criteria Frequency

| Criterion | Cases Discussing |
|-----------|-----------------|
| Original contributions | 1,186 |
| Published material | 1,135 |
| Awards | 1,120 |
| Leading role | 1,118 |
| Judging | 983 |
| Membership | 976 |
| Scholarly articles | 750 |
| High salary | 628 |
| Exhibition | 574 |
| Commercial success | 247 |

### Top Legal Citations

| Citation | Frequency |
|----------|-----------|
| 8 C.F.R. § 204.5(h)(2) | 1,441 |
| 8 C.F.R. § 204.5(h)(3) | 1,422 |
| INA § 203(b)(1)(A) | 1,338 |
| Kazarian v. USCIS, 596 F.3d 1115 (9th Cir. 2010) | 959 |

## Data Source

All cases sourced from the [USCIS AAO Non-Precedent Decisions repository](https://www.uscis.gov/administrative-appeals/aao-decisions/aao-non-precedent-decisions), filtered to I-140 Extraordinary Ability (EB-1A) petitions. These are public domain government documents.

> **Note:** USCIS stopped publishing new AAO decisions in March 2025.

## Data Pipeline

1. **Scrape**: 4,643 EB-1A PDFs downloaded from USCIS AAO repository
2. **Parse**: Text extracted via PyMuPDF + Tesseract OCR (0 failures)
3. **Score**: Quality-scored against a detailed rubric (criteria coverage, analytical depth, evidence discussion, legal reasoning, outcome clarity)
4. **Deduplicate**: TF-IDF cosine similarity at 0.95 threshold removed 18 near-duplicate pairs
5. **Filter**: 1,467 cases selected (score ≥ 7.0)
6. **Extract**: Claude Sonnet decomposed each decision into structured JSON (1,466 succeeded, 1 failed)

## Intended Use

- Fine-tuning language models for immigration legal reasoning (LoRA adapters)
- Training data for O-1A/EB-1A criteria analysis, gap identification, RFE prediction, and outcome prediction
- Research on legal NLP and administrative decision-making patterns

## Limitations

- Heavy class imbalance: 95% dismiss, 3% sustain, 2% remand
- Structured extractions were generated by an LLM (Claude Sonnet), not manually verified at scale
- Pre-March 2025 decisions only
- EB-1A decisions used as proxy for O-1A (identical evidentiary criteria)

## License

MIT
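
Because the extractions were generated by an LLM and not verified at scale, it may be worth checking each record against the documented schema before training. The following is a minimal sketch, not part of the dataset's own tooling: `validate_record` and `outcome_shares` are hypothetical helper names, and the assumption that `met` is always a boolean comes only from the single schema example in this card.

```python
# Hypothetical helpers: a sanity check against the fields documented in this card.
VALID_OUTCOMES = {"sustain", "dismiss", "remand"}
VALID_CRITERIA = {
    "awards", "membership", "published material", "judging",
    "original contributions", "scholarly articles", "exhibition",
    "leading role", "high salary", "commercial success",
}


def validate_record(record: dict) -> list[str]:
    """Return a list of schema problems found in one decision record."""
    problems = []
    if record.get("outcome") not in VALID_OUTCOMES:
        problems.append(f"unexpected outcome: {record.get('outcome')!r}")
    for i, item in enumerate(record.get("evidence_per_criterion", [])):
        if item.get("criterion") not in VALID_CRITERIA:
            problems.append(f"item {i}: unknown criterion {item.get('criterion')!r}")
        # Assumption: `met` is a strict boolean, as in the schema example.
        if not isinstance(item.get("met"), bool):
            problems.append(f"item {i}: `met` is not a boolean")
    return problems


def outcome_shares(counts: dict[str, int]) -> dict[str, float]:
    """Convert outcome counts to percentages rounded to one decimal place."""
    total = sum(counts.values())
    return {label: round(100 * n / total, 1) for label, n in counts.items()}


if __name__ == "__main__":
    sample = {
        "outcome": "dismiss",
        "evidence_per_criterion": [
            {"criterion": "original contributions", "met": False},
        ],
    }
    print(validate_record(sample))
    # Counts taken from the Outcome Distribution table in this card.
    print(outcome_shares({"dismiss": 1390, "sustain": 45, "remand": 31}))
```

Applied to the counts in the Outcome Distribution table, `outcome_shares` gives the same percentages the card reports. With roughly 95% of cases dismissed, class weighting or stratified sampling may be needed for outcome prediction.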
Provider:
josuediazflores