VISAI-AI/JUSTNLP2025-L-Summ-formatted
Hugging Face, updated 2025-10-19, indexed 2026-01-03
Download link:
https://hf-mirror.com/datasets/VISAI-AI/JUSTNLP2025-L-Summ-formatted
Description:
---
dataset_info:
  features:
  - name: id
    dtype: string
  - name: judgment
    dtype: string
  - name: summary
    dtype: string
  - name: len_summ_word_bin
    dtype: string
  splits:
  - name: train
    num_bytes: 36709068
    num_examples: 841
  - name: val
    num_bytes: 8016188
    num_examples: 211
  download_size: 23172740
  dataset_size: 44725256
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: val
    path: data/val-*
task_categories:
- summarization
language:
- en
size_categories:
- 1K<n<10K
license: unknown
---
# JUSTNLP2025-L-SUMM Formatted Data
This repository provides the filtered and formatted dataset we used to train and validate our model before submission. It contains 841 training examples and 211 validation examples.
## Data Filtering
We explored the relationship between the length of a judgment and the length of its summary, measured in both characters and words. On a log scale, summary length correlates strongly with judgment length.

To reduce noise that could hurt model performance, we remove samples whose summary length is out of proportion to the judgment length. We fit a simple linear regression between log judgment characters and log summary characters, and drop any pair that falls outside the 80% prediction interval.

## Authors
Chompakorn Chaksangchaichot & Pawitsapak Akarajaradwong<br>
`{chompakornc_pro,pawitsapaka_visai}@vistec.ac.th`
Provider:
VISAI-AI



