Dataset for Word Difficulty Prediction
Source: ieee-dataport.org (indexed 2025-03-22)
Download link:
https://ieee-dataport.org/open-access/dataset-word-difficulty-prediction
Description:
Most text-simplification systems require an indicator of word complexity. The prevalent approaches to word difficulty prediction are based on manual feature engineering, while deep learning based models remain largely unexplored because of their comparatively poor performance. We explored the use of one such model for predicting the difficulty of words, treating the task as a binary classification problem. We trained traditional machine learning models and evaluated their performance on the task. One of our primary aims was to remove the dependency on the frequency of previously acquired words when measuring difficulty. We then analyzed a convolutional neural network based prediction model that operates at the character level and evaluated its efficiency against the other models.
This dataset contains 40481 data instances. The column headers are: Word, Length, Freq_HAL, Log_Freq_HAL, I_Mean_RT, I_Zscore, I_SD, Obs, I_Mean_Accuracy.
Further details of the dataset and the method used to obtain the difficulty labels are given in the linked research publication. For open access to the publication, visit https://garain.codes. Please cite both the dataset and the conference paper if you use the dataset.
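As a rough illustration of what "operates at the character level" can mean for a word classifier, the sketch below turns a word into a fixed-length sequence of character indices and then into one-hot rows, which is a common input format for a character-level convolutional network. It is not the authors' preprocessing: the alphabet, the padding scheme, the value of `MAX_LEN`, and the function names `encode_word` and `one_hot` are all assumptions made for this example.

```python
import string

# Assumed alphabet: lowercase ASCII letters. Index 0 is reserved for padding
# and for characters outside the alphabet. The publication may use a
# different vocabulary.
ALPHABET = string.ascii_lowercase
CHAR_TO_INDEX = {ch: i + 1 for i, ch in enumerate(ALPHABET)}
MAX_LEN = 20  # hypothetical fixed input length, not taken from the paper


def encode_word(word, max_len=MAX_LEN):
    """Map a word to a fixed-length list of character indices.

    Longer words are truncated; shorter words are right-padded with 0.
    """
    indices = [CHAR_TO_INDEX.get(ch, 0) for ch in word.lower()[:max_len]]
    return indices + [0] * (max_len - len(indices))


def one_hot(indices, vocab_size=len(ALPHABET) + 1):
    """Expand character indices into one-hot rows, one row per position."""
    return [[1 if i == idx else 0 for i in range(vocab_size)] for idx in indices]


if __name__ == "__main__":
    encoded = encode_word("cat")
    matrix = one_hot(encoded)
    print(encoded[:5])
    print(len(matrix), len(matrix[0]))
```

An encoding of this kind uses only the spelling of the word, so it does not depend on corpus frequency columns such as Freq_HAL, which is in line with the stated aim of reducing the dependency on frequency.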
Provider:
IEEE Dataport



