hindi_peoms

Source: DataCite Commons. Updated: 2024-01-03. Indexed: 2025-04-16.
Download link:
https://ieee-dataport.org/documents/hindipeoms
Description:
We downloaded a dataset of Hindi poems from a website; it contains around 2,500 poems (the original description gives the source only as "link"). In the first phase of preprocessing, we extracted the text from the 2,500 HTML files and merged it into a single consolidated text file for further processing. We then removed all characters that do not belong to the Devanagari script, so that later analysis deals only with the relevant linguistic content. To keep the formatting consistent, we replaced runs of consecutive newline characters with a single newline and stripped leading and trailing spaces. Finally, we filtered out numerals written in Hindi (Devanagari) script, since they are not linguistic content. The result is a cleaned, uniformly formatted corpus of Devanagari text intended for subsequent research and analysis.
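The preprocessing steps described above could be sketched roughly as follows. This is an illustrative sketch, not the authors' code: the function names, the `*.html` file pattern, and the directory layout are hypothetical, and it assumes "Devanagari" means the Unicode block U+0900 to U+097F, with Hindi numerals at U+0966 to U+096F. It also assumes whitespace is kept during filtering and that spaces are stripped per line, neither of which the description specifies.

```python
import re
from html.parser import HTMLParser
from pathlib import Path


class _TextExtractor(HTMLParser):
    """Collect text nodes, skipping <script> and <style> content."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def html_to_text(html):
    """Return the visible text of one HTML document."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


# Assumption: keep only the Devanagari block (U+0900-U+097F) plus whitespace.
NON_DEVANAGARI = re.compile(r"[^\u0900-\u097F\s]")
# Devanagari digits occupy U+0966-U+096F.
DEVANAGARI_DIGITS = re.compile(r"[\u0966-\u096F]")
MULTI_NEWLINE = re.compile(r"\n{2,}")


def clean_text(text):
    """Apply the cleaning steps from the description to one string."""
    text = NON_DEVANAGARI.sub("", text)
    # Digits are removed before newlines are collapsed, so lines that
    # held only numerals do not leave blank lines behind.
    text = DEVANAGARI_DIGITS.sub("", text)
    # Strip leading and trailing spaces on every line (an assumption).
    text = "\n".join(line.strip() for line in text.split("\n"))
    text = MULTI_NEWLINE.sub("\n", text)
    return text.strip()


def build_corpus(html_dir, out_path):
    """Merge all *.html files in html_dir into one cleaned text file."""
    texts = []
    for path in sorted(Path(html_dir).glob("*.html")):
        raw = path.read_text(encoding="utf-8", errors="ignore")
        texts.append(html_to_text(raw))
    cleaned = clean_text("\n".join(texts))
    Path(out_path).write_text(cleaned, encoding="utf-8")
    return cleaned
```

Note that the Devanagari block also contains the danda (।) and double danda (॥), so verse punctuation survives this filter; whether the original pipeline kept them is not stated.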
Provider:
IEEE DataPort
Created:
2024-01-03