
VISAI-AI/JUSTNLP2025-L-Summ-formatted

Source: Hugging Face. Updated 2025-10-19. Indexed 2026-01-03.
Download link:
https://hf-mirror.com/datasets/VISAI-AI/JUSTNLP2025-L-Summ-formatted
Description:
---
dataset_info:
  features:
  - name: id
    dtype: string
  - name: judgment
    dtype: string
  - name: summary
    dtype: string
  - name: len_summ_word_bin
    dtype: string
  splits:
  - name: train
    num_bytes: 36709068
    num_examples: 841
  - name: val
    num_bytes: 8016188
    num_examples: 211
  download_size: 23172740
  dataset_size: 44725256
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: val
    path: data/val-*
task_categories:
- summarization
language:
- en
size_categories:
- 1K<n<10K
license: unknown
---

# JUSTNLP2025-L-SUMM Formatted Data

This repository provides a filtered and formatted dataset used to train and validate the model prior to submission.

## Data Filtering

We explore the relationship between the length of a judgment and the length of its summary, measured in both characters and words. When plotted on a log scale, summary length correlates strongly with judgment length.

![46e1b56a-38e7-4ce2-a73c-a7a920868adf](https://cdn-uploads.huggingface.co/production/uploads/6079391365b9d0165cb1837f/eJKzYXaztBU2IFb_QZP-X.png)

To reduce noise that could affect model performance, we remove samples whose summary length is out of proportion to the judgment length. We fit a simple linear regression of log summary characters on log judgment characters and drop every pair that falls outside the 80% prediction interval.

![e93b2413-a927-4d1d-bb0b-35c747c239be](https://cdn-uploads.huggingface.co/production/uploads/6079391365b9d0165cb1837f/N3eEaD-xCushmznlA6ZMh.png)

## Authors

Chompakorn Chaksangchaichot & Pawitsapak Akarajaradwong<br>
`{chompakornc_pro,pawitsapaka_visai}@vistec.ac.th`
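
## Example: Length-Based Filtering (Illustrative Sketch)

The filtering step described under Data Filtering could look roughly like the sketch below. It is not the authors' code. The function names `fit_log_log` and `filter_by_prediction_interval` are hypothetical, character counts are taken with `len()`, and the interval uses a normal quantile in place of the Student's t quantile, an approximation that is reasonable only for sample sizes in the hundreds, such as the splits listed above.

```python
import math
from statistics import NormalDist


def fit_log_log(pairs):
    """Least-squares fit of log(summary chars) on log(judgment chars).

    `pairs` is a list of (judgment, summary) strings. Hypothetical helper.
    """
    xs = [math.log(len(judgment)) for judgment, _ in pairs]
    ys = [math.log(len(summary)) for _, summary in pairs]
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    # Residual standard error with n - 2 degrees of freedom.
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    resid_se = math.sqrt(sse / (n - 2))
    return slope, intercept, resid_se, x_mean, sxx, n


def filter_by_prediction_interval(pairs, level=0.80):
    """Keep pairs whose log summary length lies inside the prediction interval.

    Assumption: a normal quantile approximates the t quantile.
    """
    slope, intercept, resid_se, x_mean, sxx, n = fit_log_log(pairs)
    z = NormalDist().inv_cdf(0.5 + level / 2)
    kept = []
    for judgment, summary in pairs:
        x = math.log(len(judgment))
        y = math.log(len(summary))
        # Prediction-interval half width for a new observation at x.
        half_width = z * resid_se * math.sqrt(
            1 + 1 / n + (x - x_mean) ** 2 / sxx
        )
        if abs(y - (intercept + slope * x)) <= half_width:
            kept.append((judgment, summary))
    return kept
```

Calling `filter_by_prediction_interval(pairs)` on a list of `(judgment, summary)` string tuples returns the pairs that fall inside the interval. Because the regression is fit on the same data it filters, a few extreme pairs can widen the interval. Whether the authors refit after dropping outliers is not stated in the card.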
Provided by: VISAI-AI