VISAI-AI/JUSTNLP2025-L-Summ-formatted

Name: VISAI-AI/JUSTNLP2025-L-Summ-formatted
Creator: VISAI-AI
Published: 2025-10-19 10:21:48
License: No description available

Source: Hugging Face. Updated 2025-10-19; indexed 2026-01-03.

Download link:

https://hf-mirror.com/datasets/VISAI-AI/JUSTNLP2025-L-Summ-formatted

Resource description:

---
dataset_info:
  features:
  - name: id
    dtype: string
  - name: judgment
    dtype: string
  - name: summary
    dtype: string
  - name: len_summ_word_bin
    dtype: string
  splits:
  - name: train
    num_bytes: 36709068
    num_examples: 841
  - name: val
    num_bytes: 8016188
    num_examples: 211
  download_size: 23172740
  dataset_size: 44725256
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: val
    path: data/val-*
task_categories:
- summarization
language:
- en
size_categories:
- 1K<n<10K
license: unknown
---

# JUSTNLP2025-L-SUMM Formatted Data

This repository provides a filtered and formatted dataset used to train and validate the model prior to submission.

## Data Filtering

We explore the relationship between the length of a judgment and the length of its summary, measured in both characters and words. When plotted on a log scale, summary length shows a strong correlation with judgment length.

![46e1b56a-38e7-4ce2-a73c-a7a920868adf](https://cdn-uploads.huggingface.co/production/uploads/6079391365b9d0165cb1837f/eJKzYXaztBU2IFb_QZP-X.png)

To reduce noise that could affect model performance, we remove samples whose summary length is out of proportion to the judgment length. We fit a simple linear regression of log summary characters on log judgment characters and drop any pair that falls outside the 80% prediction interval.

![e93b2413-a927-4d1d-bb0b-35c747c239be](https://cdn-uploads.huggingface.co/production/uploads/6079391365b9d0165cb1837f/N3eEaD-xCushmznlA6ZMh.png)

## Authors

Chompakorn Chaksangchaichot & Pawitsapak Akarajaradwong<br>
`{chompakornc_pro,pawitsapaka_visai}@vistec.ac.th`
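
The filtering step described under Data Filtering can be illustrated with a minimal sketch. This is not the authors' code: the function name `filter_by_prediction_interval` and its `level` parameter are hypothetical, and the interval uses a normal quantile in place of Student's t (an assumption that is reasonable for roughly a thousand samples, but may differ slightly from what was actually used). It relies only on the Python standard library.

```python
import math
from statistics import NormalDist, mean


def filter_by_prediction_interval(pairs, level=0.80):
    """Keep (judgment, summary) pairs whose log summary length lies inside
    the prediction interval of a log-log simple linear regression.

    Hypothetical helper: expects at least 3 pairs of non-empty strings
    with at least two distinct judgment lengths.
    """
    # Log character counts, as described in the dataset card.
    xs = [math.log(len(judgment)) for judgment, _ in pairs]
    ys = [math.log(len(summary)) for _, summary in pairs]
    n = len(pairs)

    # Ordinary least squares fit of y = intercept + slope * x.
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar

    # Residual standard error with n - 2 degrees of freedom.
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    s = math.sqrt(sse / (n - 2))

    # Assumption: normal quantile instead of the Student's t quantile.
    z = NormalDist().inv_cdf(0.5 + level / 2)

    kept = []
    for pair, x, y in zip(pairs, xs, ys):
        # Prediction interval half-width for a new observation at x.
        half_width = z * s * math.sqrt(1 + 1 / n + (x - x_bar) ** 2 / sxx)
        if abs(y - (intercept + slope * x)) <= half_width:
            kept.append(pair)
    return kept
```

Calling the function on a list of `(judgment, summary)` string pairs returns the pairs that fall inside the interval. A pair whose summary is far shorter or longer than the fitted line predicts for its judgment length is dropped.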

Provided by:

VISAI-AI
