decodingchris/clean_squad_v2
Source: Hugging Face. Updated: 2025-01-30. Indexed: 2025-02-15.
Download link:
https://hf-mirror.com/datasets/decodingchris/clean_squad_v2
Description:
Clean SQuAD v2 is a refined version of the original SQuAD v2 dataset, preprocessed to improve data quality and usability for NLP tasks such as question answering. Preprocessing consists of three steps: trimming leading and trailing whitespace from the question field, filtering out questions with fewer than 12 characters, and stratifying the validation and test sets by the presence or absence of answers, to give a balanced representation of answerable and unanswerable questions in both. The dataset has three splits: training, validation, and test. Each record contains a unique identifier for the question-context pair, the title of the article the context comes from, the context paragraph, the preprocessed question string, and, where available, the text of the correct answers with their starting positions in the context.
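The three preprocessing steps described above can be sketched in plain Python. This is a minimal illustration, not the dataset author's actual script: the field names (`id`, `title`, `context`, `question`, `answers`) follow the usual SQuAD v2 layout, while the helper names, the order of trimming before filtering, the 50/50 split ratio, the random seed, and the toy records are all assumptions made for the example.

```python
import random
from collections import defaultdict

MIN_QUESTION_CHARS = 12  # threshold stated in the description


def clean(records):
    """Trim each question, then drop questions shorter than the threshold.

    Trimming before filtering is an assumption; the description does not
    state the order.
    """
    cleaned = []
    for rec in records:
        question = rec["question"].strip()
        if len(question) >= MIN_QUESTION_CHARS:
            cleaned.append({**rec, "question": question})
    return cleaned


def is_answerable(rec):
    """A record is answerable when it carries at least one answer text."""
    return len(rec["answers"]["text"]) > 0


def stratified_split(records, seed=0):
    """Split records into two halves with a similar share of answerable items.

    The 50/50 ratio and the seed are hypothetical choices for illustration.
    """
    groups = defaultdict(list)
    for rec in records:
        groups[is_answerable(rec)].append(rec)

    rng = random.Random(seed)
    val_set, test_set = [], []
    for group in groups.values():
        rng.shuffle(group)
        half = len(group) // 2
        val_set.extend(group[:half])
        test_set.extend(group[half:])
    return val_set, test_set


# Toy records in a SQuAD-v2-like layout (invented data, not from the dataset).
records = []
for i in range(4):
    records.append({
        "id": f"a{i}",
        "title": "Toy article",
        "context": f"value {i} is stored in item {i}.",
        "question": f"  What is the value of item {i}?  ",
        "answers": {"text": [f"value {i}"], "answer_start": [0]},
    })
for i in range(4):
    records.append({
        "id": f"u{i}",
        "title": "Toy article",
        "context": f"value {i} is stored in item {i}.",
        "question": f"Who invented item number {i}?",
        "answers": {"text": [], "answer_start": []},
    })
# Too short once trimmed (10 characters), so it is filtered out.
records.append({
    "id": "short",
    "title": "Toy article",
    "context": "value 0 is stored in item 0.",
    "question": "   Which one?   ",
    "answers": {"text": [], "answer_start": []},
})

cleaned = clean(records)
val_set, test_set = stratified_split(cleaned)
```

Splitting each group (answerable, unanswerable) separately is what keeps the two proportions aligned between the resulting sets; shuffling the whole list and cutting it in half would only match them on average.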
Provider:
decodingchris



