five

ppak10/AdditiveLLM2-OA

收藏
Hugging Face2026-03-25 更新2026-03-29 收录
下载链接:
https://hf-mirror.com/datasets/ppak10/AdditiveLLM2-OA
下载链接
链接失效反馈
官方服务:
资源简介:
--- language: - en license: other task_categories: - text-generation pretty_name: AdditiveLLM2-OA size_categories: - 1K<n<10K configs: - config_name: text data_files: - split: train path: "data/text/**/*.parquet" - config_name: images data_files: - split: train path: "data/images/**/*.parquet" - config_name: vit data_files: - split: train path: "data/vit/**/*.parquet" --- # AdditiveLLM2-OA Dataset Open Access journal articles (up to February 2026) used in domain adapting pretraining and instruction tuning for AdditiveLLM2. ## Dataset Split by Journal | `text` | `images` | `vit` | |:---:|:---:|:---:| | ![text](info/charts/journals_text_n1704.png) | ![images](info/charts/journals_images_n24031.png) | ![vit](info/charts/journals_vit_n20250.png) | ## Vocabulary Overlap Pairwise Jaccard similarity of word-level vocabularies (lowercase, 3+ letter tokens) across the four source journals. Run `info/vocabulary/vocabulary_overlap.py` to reproduce. ![Vocabulary Overlap](info/vocabulary/vocabulary_overlap.png) ## Top Phrases by Journal Most frequent bigrams and trigrams per journal after filtering URL/DOI fragments, reference abbreviations, and common function words. Run `info/vocabulary/ngrams.py` to reproduce. ![Top Phrases by Journal](info/vocabulary/ngrams.png) ## Top Keywords Most frequent author-supplied keywords across all 1,704 articles in the `text` config. "Additive manufacturing" is omitted as it appears in nearly every article and adds no discriminative signal. Keywords are normalised to lowercase before counting; capitalisation variants (e.g. `3D Printing` vs `3d printing`) are therefore merged. Run `info/charts/generate_keywords_pie_chart.py` to reproduce. ![Top Keywords](info/charts/keywords_top10.png) ## Source Datasets | Dataset | Journal | Volumes | |---|---|---| | `ppak10/Additive-Manufacturing-Letters` | *Additive Manufacturing Letters* | 001–016 | | `ppak10/Journal-of-Additive-Manufacturing` | *Journal of Additive Manufacturing* | 004–118 | | `ppak10/Rapid-Prototyping-Journal` | *Rapid Prototyping Journal* | 001–032 | | `ppak10/Journal-of-Manufacturing-Processes` | *Journal of Manufacturing Processes* | 001–163 | ## Token Statistics Tokenizer: `google/gemma-3-12b-it`. Image token counts are estimated by sampling 100 images per config. Run `info/tokens/calculate_tokens.py` to reproduce. | Config | Rows | Text Tokens | Image Tokens | Total | |---|---|---|---|---| | `text` | 1,704 | 29,334,571 | n/a | 29,334,571 | | `images` | 24,031 | 3,929,563 | 6,224,029 | 10,153,592 | | `vit` | 20,250 | 12,575,681 | 5,244,750 | 17,820,431 | | **Total** | | **45,839,815** | **11,468,779** | **57,308,594** | ## Configs ### `text` — full article text | Column | Type | Description | |---|---|---| | `text` | string | Full article text (primary training signal; title is included in the text body) | | `source` | string | Source journal name | | `volume` | string | Zero-padded volume number | | `filename` | string | Source PDF filename | | `title` | string | Article title | | `authors` | list[string] | Author names | | `doi` | string | Article DOI URL | | `access_type` | string | `"Open Access"` (all records) | | `keywords` | list[string] | Keywords from PDF metadata | ### `images` — figures and captions | Column | Type | Description | |---|---|---| | `image` | image | Figure image extracted from the PDF | | `caption` | string | Full figure caption text | | `figure_label` | string | Short label e.g. `"Fig. 1"` | | `page` | int32 | Page number within the source PDF | | `source` | string | Source journal name | | `volume` | string | Zero-padded volume number | | `filename` | string | Source PDF filename | | `doi` | string | Article DOI URL | | `title` | string | Article title | | `access_type` | string | `"Open Access"` (all records) | ### `vit` — figures with VLM-generated descriptions and conversations | Column | Type | Description | |---|---|---| | `image` | image | Figure image extracted from the PDF | | `figure_label` | string | Short label e.g. `"Fig. 1"` | | `caption` | string | Full figure caption text | | `conversations` | list[{question, answer}] | VLM-generated Q&A pairs about the figure | | `description` | string | VLM-generated figure description | | `page` | int32 | Page number within the source PDF | | `source` | string | Source journal name | | `volume` | string | Zero-padded volume number | | `filename` | string | Source PDF filename | | `doi` | string | Article DOI URL | | `title` | string | Article title | | `authors` | string | Author names | | `access_type` | string | `"Open Access"` (all records) | | `model` | string | VLM model used to generate descriptions and conversations | ### Loading for training ```python from datasets import load_dataset # Full article text for next token prediction text_ds = load_dataset("ppak10/AdditiveLLM2-OA", "text", split="train") # Figures and captions image_ds = load_dataset("ppak10/AdditiveLLM2-OA", "images", split="train") # VLM-generated descriptions and conversations vit_ds = load_dataset("ppak10/AdditiveLLM2-OA", "vit", split="train") ``` The `text` column of the `text` config is what you pass to your tokenizer during fine-tuning. ## Citation If you use this dataset, please cite the associated paper: ```bibtex @misc{pak2026additivellm2, title={AdditiveLLM2: A Multi-modal Large Language Model for Additive Manufacturing}, author={Peter Pak and Amir Barati Farimani}, year={2026}, eprint={2603.22017}, archivePrefix={arXiv}, primaryClass={cs.LG}, url={https://arxiv.org/abs/2603.22017} } ```
提供机构:
ppak10
5,000+
优质数据集
54 个
任务类型
进入经典数据集
二维码
社区交流群

面向社区/商业的数据集话题

二维码
科研交流群

面向高校/科研机构的开源数据集话题

数据驱动未来

携手共赢发展

商业合作