
kha2612/ViSL-News

Source: Hugging Face. Updated: 2026-03-26. Indexed: 2026-03-29.
Download link:
https://hf-mirror.com/datasets/kha2612/ViSL-News
Resource description:
---
language:
- vi
pretty_name: ViSL-News
task_categories:
- translation
- text-generation
- video-text-to-text
size_categories:
- 10K<n<100K
license: other
annotations_creators:
- machine-generated
- expert-generated
source_datasets:
- original
---

# ViSL-News

## Dataset Summary

**ViSL-News** is a sentence-level Vietnamese Sign Language dataset built from news videos with Vietnamese sign language interpretation. The dataset was created from **HTV Tin Tức** news videos published during **2024–2025**, and each sample corresponds to **one complete sentence** paired with a short video clip.

The dataset was constructed using **ViSL Tool**, a semi-automatic pipeline that supports:

- data filtering,
- automatic sentence-level labeling,
- signer clustering,
- metadata generation,
- and human post-checking.

ViSL-News is intended to support research on:

- Vietnamese Sign Language Translation,
- sentence-level video-to-text learning,
- dataset construction for sign language research,
- and data-centric methods for sign language processing.

## Languages

- **Sign language:** Vietnamese Sign Language (as interpreted in broadcast news videos)
- **Text language:** Vietnamese

## Dataset Structure

Each sample is sentence-level and typically contains:

- a short video clip,
- a Vietnamese sentence,
- start and end timestamps,
- and signer-related metadata when available.

Example fields may include:

- `sample_id`
- `video`
- `text`
- `start_time`
- `end_time`
- `signer_id`
- `split`

The exact field names may vary depending on the released version.

## Dataset Creation

### Source Data

The dataset was collected from **HTV Tin Tức** news videos on YouTube that include Vietnamese sign language interpretation.
Within the scope of the thesis, the data source is limited to **HTV Tin Tức (2024–2025)** news videos to ensure feasibility and quality control.

### Creation Pipeline

ViSL-News was built with a semi-automatic workflow:

1. collect source videos,
2. filter suitable signing segments,
3. generate sentence-level labels,
4. align timestamps,
5. group signer information,
6. perform post-checking on selected data,
7. export the final dataset.

The system was designed to reduce manual workload while preserving a human review step for quality assurance.

### Annotation Process

The dataset combines:

- **automatic processing** from the ViSL Tool pipeline, and
- **expert/human post-checking** for quality control.

A portion of the data was post-checked to build more reliable validation and test subsets.

## Dataset Statistics

According to the thesis, ViSL-News includes:

- **26,395 samples**
- **59 hours** of video
- **9,918-word vocabulary**
- video resolution of **326×426**
- **25 FPS**

## Intended Uses

This dataset is released for:

- academic research,
- educational use,
- benchmarking and experimentation,
- and non-commercial scientific work on Vietnamese sign language processing.

Example use cases:

- sign language translation,
- sentence-level video-to-text modeling,
- data-centric studies on alignment and annotation,
- dataset construction research.

## Out-of-Scope Uses

This dataset is **not** intended for:

- commercial deployment,
- commercial redistribution,
- monetized reuse of source videos,
- or any use that violates the rights of the original content owner.
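
As a rough sanity check on the figures under Dataset Statistics, the sketch below derives the average clip length implied by 26,395 samples, 59 hours, and 25 FPS. It also models one record using the example field names from Dataset Structure. The `VislSample` class, the choice of seconds as the timestamp unit, and the file-path type for `video` are assumptions made for illustration; the released files may differ.

```python
from dataclasses import dataclass
from typing import Optional

# Figures quoted from the Dataset Statistics section of this card.
NUM_SAMPLES = 26_395
TOTAL_HOURS = 59  # reported as a rounded value, so results are approximate
FPS = 25


@dataclass
class VislSample:
    """Hypothetical record mirroring the example fields listed in the card."""

    sample_id: str
    video: str                       # assumed: path to the clip file
    text: str                        # Vietnamese sentence
    start_time: float                # assumed unit: seconds
    end_time: float                  # assumed unit: seconds
    signer_id: Optional[str] = None  # present only when available
    split: Optional[str] = None

    @property
    def duration(self) -> float:
        """Clip length in seconds, from the two timestamps."""
        return self.end_time - self.start_time


def average_clip_seconds(total_hours: float, num_samples: int) -> float:
    """Mean clip length in seconds implied by total duration and sample count."""
    return total_hours * 3600 / num_samples


avg_seconds = average_clip_seconds(TOTAL_HOURS, NUM_SAMPLES)
avg_frames = avg_seconds * FPS

print(f"average clip length: {avg_seconds:.2f} s (~{avg_frames:.0f} frames at {FPS} FPS)")
```

Under these assumptions the average comes out to roughly 8 seconds, or about 200 frames per sentence. That is a useful order-of-magnitude figure when choosing a maximum sequence length for a video encoder.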

## Copyright and Usage Notice

### Copyright

The **original broadcast videos and source content remain the property of HTV Tin Tức**.

This dataset is derived from HTV Tin Tức news videos for research purposes. The dataset publisher does **not** claim ownership of the original source videos or broadcast content.

### Usage Restriction

This dataset is provided **for research and non-commercial use only**.

By using this dataset, you agree that:

- you will use it only for **research or educational purposes**,
- you will **not** use it for commercial or business purposes,
- you will **not** redistribute or exploit the original source videos for commercial gain,
- and you will properly acknowledge **HTV Tin Tức** as the original content source.

If the copyright holder requests modification or removal of the dataset, the maintainer will cooperate accordingly.

## Limitations

- The dataset is limited to the **news domain**, so it may not represent all Vietnamese Sign Language usage contexts.
- The data comes from interpreted broadcast videos rather than unconstrained real-world conversations.
- Sentence alignment and labeling are supported by a semi-automatic pipeline, so some noise may still remain.
- Signer diversity is constrained by the available source videos and broadcast setting.

## Citation

If you use this dataset, please cite the corresponding thesis or dataset release.

```bibtex
@misc{visl_news_2026,
  title  = {ViSL-News: A Sentence-Level Vietnamese Sign Language Dataset from HTV Tin Tuc News Videos},
  author = {Nguyen Hieu Kha},
  year   = {2026},
  note   = {Research-use-only dataset derived from HTV Tin Tuc source videos}
}
```
Provider:
kha2612