Korean Disaster Safety Information Sign Language Translation Benchmark Dataset
Dataset Overview
Dataset Name
SSL: Korean Disaster Safety Information Sign Language Translation Benchmark Dataset
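Dataset Description
This is a benchmark dataset for sign language translation of Korean disaster safety information. It addresses three problems with existing datasets: heavy computational resource requirements, heterogeneity between the training and test sets, and unrefined data. By refining the original data and releasing the result, it provides a new benchmark for Korean sign language translation research.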
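Dataset Structure
README.md: project overview and description
main.py: main entry point of the project
requirements.txt: list of project dependencies
src/: project source code
    __init__.py: package initialization file
    args.py: command-line argument handling
    keypoint_extractor.py: keypoint extraction module
    language_processor.py: language processing module
    processor.py: general processing module
    sign_processor.py: sign language preprocessing module
    video_processor.py: video processing module
visualize_keypoint.ipynb: Jupyter notebook for keypoint visualization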
How to Run
Run the data preprocessing with the following command:
    python main.py --root_path <path_to_downloaded_data> --save_path <path_to_save_results>
If --save_path is not set, a result folder is created automatically at ./result by default.
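The two flags described above could be parsed roughly as follows. This is a minimal sketch, not the contents of the project's src/args.py: the function name build_parser, the help strings, and the choice to make --root_path required are assumptions made for illustration. Only the flag names and the ./result default come from the description above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the two documented flags (hypothetical helper)."""
    parser = argparse.ArgumentParser(
        description="Preprocess the downloaded sign language data."
    )
    # Assumption: the input path is mandatory; the real script may differ.
    parser.add_argument(
        "--root_path",
        required=True,
        help="Path to the downloaded raw data.",
    )
    # Documented behaviour: results go to ./result when the flag is omitted.
    parser.add_argument(
        "--save_path",
        default="./result",
        help="Path where the preprocessed results are written.",
    )
    return parser


# Parse an explicit argument list so the example does not depend on sys.argv.
args = build_parser().parse_args(["--root_path", "data/raw"])
print(args.root_path, args.save_path)
```

Omitting --save_path in the call above leaves args.save_path at the default, which mirrors the behaviour described for main.py.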
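Output Structure
Running main.py produces the following folder structure:
result/: output generated by main.py
    Train/: results for the training data
        Keypoint/: npy files of the keypoints extracted from each frame of the sign language videos
        Language/: json and vocab files
        Video/: videos preprocessed frame by frame, with every frame saved
    Validation/: same structure as Train/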
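After a run, the layout described above can be checked with a few lines of Python. This is a sketch that relies only on the folder names listed in this section; the helper name missing_result_dirs is hypothetical and is not part of the project.

```python
from pathlib import Path

# Folder names taken from the output structure described above.
SPLITS = ("Train", "Validation")
SUBDIRS = ("Keypoint", "Language", "Video")


def missing_result_dirs(save_path: str = "./result") -> list[str]:
    """Return the expected split/subfolder pairs that do not exist yet."""
    root = Path(save_path)
    return [
        f"{split}/{sub}"
        for split in SPLITS
        for sub in SUBDIRS
        if not (root / split / sub).is_dir()
    ]


# An empty list means every expected folder is present.
missing = missing_result_dirs("./result")
print("missing folders:", missing)
```

The check only confirms that the folders exist. It does not validate the npy, json, or vocab files inside them, whose naming is not specified here.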
Citation
If you use this code in your research, please cite the following paper:
@inproceedings{kim-etal-2024-korean-disaster,
    title = "{K}orean Disaster Safety Information Sign Language Translation Benchmark Dataset",
    author = "Kim, Wooyoung and Kim, TaeYong and Kim, Byeongjin and Lee, Myeong Jin MJ and Lee, Gitaek and Kim, Kirok and Cha, Jisoo and Kim, Wooju",
    editor = "Calzolari, Nicoletta and Kan, Min-Yen and Hoste, Veronique and Lenci, Alessandro and Sakti, Sakriani and Xue, Nianwen",
    booktitle = "Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)",
    month = may,
    year = "2024",
    address = "Torino, Italy",
    publisher = "ELRA and ICCL",
    url = "https://aclanthology.org/2024.lrec-main.869",
    pages = "9948--9953",
    abstract = "Sign language is a crucial means of communication for deaf communities. However, those outside deaf communities often lack understanding of sign language, leading to inadequate communication accessibility for the deaf. Therefore, sign language translation is a significantly important research area. In this context, we present a new benchmark dataset for Korean sign language translation named SSL:korean disaster Safety information Sign Language translation benchmark dataset. Korean sign language translation datasets provided by the National Information Society Agency in South Korea have faced challenges related to computational resources, heterogeneity between train and test sets, and unrefined data. To alleviate the aforementioned issue, we refine the origin data and release them. Additionally, we report experimental results of baseline using a transformer architecture. We empirically demonstrate that the baseline performance varies depending on the tokenization method applied to gloss sequences. In particular, tokenization based on characteristics of sign language outperforms tokenization considering characteristics of spoken language and tokenization utilizing statistical techniques. We release materials at our https://github.com/SSL-Sign-Language/Korean-Disaster-Safety-Information-Sign-Language-Translation-Benchmark-Dataset",
}




