Script Dataset of Hindi Movies, Subtitle Dataset of Marathi Movies

Name: Script Dataset of Hindi Movies, Subtitle Dataset of Marathi Movies
Creator: IEEE DataPort
Published: 2024-09-17 12:32:14
License: No description available

DataCite Commons: updated 2024-09-17, indexed 2025-04-16

Download link:

https://ieee-dataport.org/documents/script-dataset-hindi-movies-subtitle-dataset-marathi-movies

Description:

This collection contains a script dataset of Hindi movies released after 2018 and a subtitle dataset of Marathi movies released after 2010, both in UTF-8 text format.

For the Hindi movie script dataset, several websites were crawled, including but not limited to filmcompanion.in and scribd.com. Tools such as open web scraper and Scrapy were used for this purpose. Many of the scripts were not well or uniformly formatted, because the Hindi movie industry does not follow specific standards for writing scripts. Scripts were found in Devanagari, phonetic Hindi, and English; some were PDF files and some were only scanned copies. All scripts obtained from web crawling were unified into Unicode UTF-8 text format. Scripts in phonetic Hindi and those written in English were converted into Devanagari Hindi using tools such as Google Translate. Spell checks were run on the scripts available as English or Hindi electronic text, using Microsoft Word macros, to create a cleaner dataset. These were later reviewed by language experts. Scripts gathered as scanned copies were typed manually to convert them into Devanagari Hindi electronic text. All scripts were then arranged in one uniform format, namely UTF-8 Devanagari Hindi text files, and various regular expressions were used during text processing to bring the text into that format. In total, 100 Hindi movie scripts across different genres, from movies released since 2018, were selected; after pre-processing, the dataset consists of 100 UTF-8 text files of Hindi text in Devanagari. These contain 170744 lines and 1029826 words in total.

For the Marathi movie dataset, collecting data was even more challenging, as no significant body of readily available scripts existed. We therefore gathered subtitles (.srt) of Marathi movies in different languages and translated them into Marathi with the help of Google Translate, later verifying the output with language experts.
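For the Hindi scripts, the description says that regular expressions were used to bring the text into a uniform UTF-8 format and reports line and word totals, but it does not give the expressions or the counting rules. The following is only a minimal sketch of what such a step might look like. The function names `normalize_text` and `count_lines_and_words`, the use of Unicode NFC normalization, and the definition of a word as a whitespace-separated token are all assumptions, not the dataset authors' method.

```python
import re
import unicodedata

# Assumption: runs of spaces, tabs, and no-break spaces collapse to one space.
MULTISPACE = re.compile(r"[ \t\u00a0]+")


def normalize_text(text):
    """Return text in a uniform layout: NFC, LF line endings, no blank lines."""
    # NFC gives one canonical code point sequence for equivalent Devanagari text.
    text = unicodedata.normalize("NFC", text)
    # Unify Windows and old Mac line endings.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Collapse internal whitespace and trim each line.
    lines = (MULTISPACE.sub(" ", line).strip() for line in text.split("\n"))
    # Drop lines that are empty after trimming.
    return "\n".join(line for line in lines if line)


def count_lines_and_words(text):
    """Count non-empty lines and whitespace-separated tokens (assumed definition)."""
    lines = text.split("\n") if text else []
    words = sum(len(line.split()) for line in lines)
    return len(lines), words


if __name__ == "__main__":
    sample = "राम   घर गया।\r\n\r\n  सीता\tआई।  "
    cleaned = normalize_text(sample)
    print(cleaned)
    print(count_lines_and_words(cleaned))
```

Writing the result with `open(path, "w", encoding="utf-8")` would store it as UTF-8. Totals computed this way need not match the published figures, since the original counting rules are not stated.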
One hundred Marathi movies of different genres released since 2010 were selected. Time stamps were then removed from the subtitle files using a regex on the text files that deletes lines consisting only of numbers, colons (:), and commas (,). At this point only dialogue, sound-effect words, time cues, background-noise words, and non-verbal communication remain. We kept these words as they are, so that the subtitles are richer and closer to a script.

Because the original text was in subtitle format, sentences (dialogues) were spread across several lines. To combine the lines belonging to one sentence into a single line, we again used a regex on the text files. It merges sentences that are split across multiple lines by matching termination symbols such as . (dot), ? (question mark), and ! (exclamation mark), which are used in Marathi much as they are in English. The result is a dataset of 100 different Marathi movie subtitles, in the form of 100 UTF-8 text files of Marathi text in Devanagari after pre-processing. These contain 153565 lines and 851377 words in total.
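The two subtitle-cleaning steps described above (dropping timing lines, then merging lines until a sentence terminator) can be sketched as follows. This is an illustration under stated assumptions, not the authors' actual expressions, which are not given. In particular, the pattern below also allows spaces and the `-->` arrow of standard .srt timing lines, which the description does not mention, and the names `TIMING_LINE`, `SPLIT_SENTENCE`, and `clean_subtitles` are hypothetical.

```python
import re

# Lines made up only of digits, colons, and commas. Assumption: spaces and
# the "-->" arrow of standard .srt timing lines are allowed as well.
TIMING_LINE = re.compile(r"^[\d:,\s>-]+$")

# A newline that is NOT preceded by a sentence terminator (. ? !).
SPLIT_SENTENCE = re.compile(r"(?<![.?!])\n")


def clean_subtitles(srt_text):
    """Return one string per sentence from raw .srt text."""
    kept = []
    for raw in srt_text.splitlines():
        line = raw.strip()
        # Skip blank lines, cue numbers, and timing lines.
        if not line or TIMING_LINE.match(line):
            continue
        kept.append(line)
    joined = "\n".join(kept)
    # Join lines that do not end a sentence with the line that follows.
    merged = SPLIT_SENTENCE.sub(" ", joined)
    return merged.split("\n") if merged else []


if __name__ == "__main__":
    sample = (
        "1\n"
        "00:00:01,000 --> 00:00:03,000\n"
        "तू कुठे\n"
        "चालला आहेस?\n"
        "\n"
        "2\n"
        "00:00:04,000 --> 00:00:06,000\n"
        "[दार वाजते]\n"
        "मी घरी जातो.\n"
    )
    for sentence in clean_subtitles(sample):
        print(sentence)
```

Two side effects of this simple rule are worth noting: a dialogue line consisting only of a number (a year, for example) is removed along with the cue numbers, and a sound-effect line without a terminator is merged into the sentence that follows it.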

Provider:

IEEE DataPort

Created:

2024-09-17
