AI-MO/olympiads-ref

Hugging Face | Updated: 2025-11-06 | Indexed: 2026-01-03
Download link:
https://hf-mirror.com/datasets/AI-MO/olympiads-ref
Description:
---
pretty_name: Olympiads Reference Dataset
dataset_info:
  features:
  - name: year
    dtype: string
  - name: tier
    dtype: string
  - name: problem_label
    dtype: string
  - name: problem_type
    dtype: string
  - name: exam
    dtype: string
  - name: problem
    dtype: string
  - name: solution
    dtype: string
  - name: metadata
    struct:
    - name: resource_path
      dtype: string
    - name: problem_match
      dtype: string
    - name: solution_match
      dtype: string
configs:
- config_name: default
  data_files:
  - split: train
    path: '**/segmented/**/*.jsonl'
---

# AI-MO Olympiad Reference Dataset

This dataset contains a structured collection of Olympiad problems and their solutions, organized by competition. It prioritizes high-quality data, in particular "official" solutions to problems.

## Structure

```
<competition name>/        # Problems and solutions from the International Mathematical Olympiad
├── raw/                   # Raw problem/solution statements (.pdf)
│   ├── file1.pdf
│   ├── file2.pdf
├── download_script/       # the scripts used to download raw data
│   ├── download.py
├── md/                    # .md files generated from raw/ files
│   ├── file1.md
│   ├── file2.md
├── segment_script/        # the scripts used to segment the data
│   ├── segment.py
└── segmented/             # .jsonl segmented data for easier processing
    ├── file1.jsonl
    ├── file2.jsonl
    └── file3.jsonl
```

Each `json` in a `jsonl` file follows this structure:

```json
{
  "problem": "string",        // Mandatory: The problem statement in latex or markdown
  "solution": "string",       // Mandatory: The solution for the problem
  "year": "int",              // Optional: Year when the problem was presented
  "problem_type": "string",   // Optional: The mathematical domain of the problem. Here are the supported types:
                              // ['Algebra', 'Geometry', 'Number Theory', 'Combinatorics', 'Calculus',
                              // 'Inequalities', 'Logic and Puzzles', 'Other']
  "question_type": "string",  // Optional: The form or style of the mathematical problem.
                              // The supported classes are: ['MCQ', 'proof' or 'math-word-problem'].
                              // 'math-word-problem' is a problem with output.
  "answer": "string",         // Optional: final answer if the question_type is "math-word-problem".
  "source": "string",         // Optional: TODO:describe
  "exam": "string",           // Optional: TODO:describe
  "difficulty": "int",        // Optional: TODO:describe
  "other": "...",             // Optional: You can add other fields with metadata
}
```

## Steps to collect data for formalization

### 1. Assign yourself a task

Check the [tracker](https://docs.google.com/spreadsheets/d/1PiK-lUjcZ8VKwjtyzYWbd_bLQXnlbIPl-jmm5ebZplw/edit?gid=0#gid=0) and assign yourself one line by updating these columns:

* status: IN PROGRESS
* assignee: your name

### 2. Setup

Download the data locally.

```bash
git lfs install
git clone git@hf.co:datasets/AI-MO/olympiads-ref
```

### 3. Find `.pdf` resources

First check whether `.pdf` files are already available in https://huggingface.co/AI-MO/olympiads-0.1

* if yes, upload them to `AI-MO/olympiads-ref/<competition>/raw/` and continue to step 4.
* if no, find sources on the internet (preferably with official solutions), download them, and upload them to `AI-MO/olympiads-ref/<competition>/raw/`

### 4. Find `.md` resources

First check whether `.md` files are already available in https://huggingface.co/AI-MO/olympiads-0.1

* if yes, upload them to `AI-MO/olympiads-ref/<competition>/md/` and continue to step 6.
* if no, find sources on the internet (preferably with official solutions), download them, and upload them to `AI-MO/olympiads-ref/<competition>/md/`

### 5. Convert `.pdf` to `.md` using Mathpix

Use [data_pipeline](https://github.com/project-numina/numina-math/blob/main/data_pipeline). Example:

```bash
python -m data_pipeline convert_to_md --method=pdf_to_md \
  --input_dir="/home/marvin/workspace/olympiads-ref/IMO/raw" \
  --output_dir="/home/marvin/workspace/olympiads-ref/IMO/md"
```

### 6. Find `.jsonl` resources

First check whether segmentations are already available as `.jsonl` in https://huggingface.co/datasets/AI-MO/olympiads-0.3. You can check whether the segmentation has been done in this [old tracker](https://docs.google.com/spreadsheets/d/1fw1nYQo2hN52PYTAT3SYwNTjUfjTmMRJOV84vSNxiTs).

* if yes, check the quality, upload to `AI-MO/olympiads-ref/<competition>/segmented/`, and continue to step 8.
* if no, continue to step 7.

### 7. Segment the `.md` files into `.jsonl`

Write a `segment.py` that can be applied to your data (please do sanity checks!). Examples are [this](https://huggingface.co/datasets/AI-MO/olympiads-ref/blob/main/IMO/segment_script/segment.py) or [that](https://huggingface.co/datasets/AI-MO/olympiads-ref/blob/main/IMO/segment_script/segment_compendium.py).

Once you are satisfied with your segmentation, upload the `.jsonl` to `AI-MO/olympiads-ref/<competition>/segmented/` and the `segment.py` to `AI-MO/olympiads-ref/<competition>/segment_script/`. Ask for a review.

### 8. Update the status in the trackers

Update the [tracker](https://docs.google.com/spreadsheets/d/1PiK-lUjcZ8VKwjtyzYWbd_bLQXnlbIPl-jmm5ebZplw/edit?gid=0#gid=0) with these columns:

* status: DONE + a link to your generated data on Hugging Face
* problem_count: count of problems in the data
* solution_count: count of solutions in the data (may differ from problem_count, since a problem can have several solutions)
* years: range of competition years covered in your data (so we can easily track whether many years are missing)
* assignee: your name

Update the [old tracker](https://docs.google.com/spreadsheets/d/1fw1nYQo2hN52PYTAT3SYwNTjUfjTmMRJOV84vSNxiTs) with this column:

* ref: color in green for the competition you segmented

### 9. Integrate the data in a base dataset

Create a ticket in git.

### Notes

* Image placeholders in the dataset (like: `![md5:f571b12c2c566ce1beedd8190c986910](f571b12c2c566ce1beedd8190c986910.jpeg)`) correspond to actual images stored in the `images.parquet` file.
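
For the sanity checks requested in step 7 and the `problem_count`, `solution_count`, and `years` columns requested in step 8, a minimal Python sketch along the following lines may help. It is not part of the repository: the function name `summarize_jsonl` is hypothetical, and it assumes that each line holds one problem/solution pair and that records sharing the same `problem` text are alternative solutions to one problem. Only the `problem`, `solution`, and `year` field names come from the structure described above.

```python
import json
from pathlib import Path


def summarize_jsonl(path):
    """Return tracker-style counts for one segmented .jsonl file.

    Hypothetical helper: assumes one problem/solution pair per line and
    treats identical problem statements as the same problem.
    """
    problems = set()     # distinct problem statements (assumed dedup key)
    solution_count = 0   # one per record with a non-empty solution
    years = set()
    bad_lines = []       # 1-based line numbers failing the mandatory-field check

    with Path(path).open(encoding="utf-8") as fh:
        for lineno, line in enumerate(fh, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            record = json.loads(line)  # raises on malformed JSON
            if not isinstance(record, dict):
                bad_lines.append(lineno)
                continue
            problem = record.get("problem")
            solution = record.get("solution")
            # "problem" and "solution" are the two mandatory fields.
            if not (isinstance(problem, str) and problem.strip()
                    and isinstance(solution, str) and solution.strip()):
                bad_lines.append(lineno)
                continue
            problems.add(problem.strip())
            solution_count += 1
            # "year" is optional and may be stored as an int or a string.
            try:
                years.add(int(str(record.get("year")).strip()))
            except ValueError:
                pass  # missing or non-numeric year: skip it

    return {
        "problem_count": len(problems),
        "solution_count": solution_count,
        "years": (min(years), max(years)) if years else None,
        "bad_lines": bad_lines,
    }
```

Calling `summarize_jsonl` on a file under `<competition>/segmented/` returns a dict whose `bad_lines` entry lists records missing a mandatory field; those are worth inspecting before asking for a review.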
Provider:
AI-MO