
hindi_peoms

IEEE, indexed 2026-04-17
Download link:
https://ieee-dataport.org/documents/hindipeoms
Description:
We downloaded a dataset of around 2,500 Hindi poems from a website; the source link is: link. In the first phase of our preprocessing pipeline, we collected the text from a diverse set of HTML files, 2,500 documents in total, and merged all of the extracted text into a single consolidated text file for later processing. We then cleaned the corpus in four steps. First, we removed every character that does not belong to the Devanagari script, so that later analysis covers only the relevant linguistic content. Second, we replaced runs of consecutive newline characters with a single newline to keep the text readable and consistent across the dataset. Third, we stripped leading and trailing spaces to give the text a uniform, standardized format. Finally, we filtered out numerical characters written in Hindi script, which are non-linguistic elements for our purposes. Together, these steps reduce the many source files to one corpus that contains only Devanagari text and is ready for subsequent research and analysis.
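The cleaning steps described above can be sketched in Python as follows. This is a minimal illustration, not the authors' code. It assumes that "Devanagari script" means the Unicode block U+0900 to U+097F, that the digits to drop are U+0966 to U+096F, and that spaces and newlines are kept so the poems retain their line structure. The function name `clean_text` is hypothetical, and the input is assumed to be text already extracted from the HTML files. The whitespace steps run last here so that lines emptied by the character filters are also collapsed; the description lists the digit filter last.

```python
import re

# Assumption: keep only the Devanagari Unicode block, plus space and newline.
NON_DEVANAGARI = re.compile(r"[^\u0900-\u097F \n]")
# Assumption: "numerical characters written in Hindi script" are U+0966-U+096F.
DEVANAGARI_DIGITS = re.compile(r"[\u0966-\u096F]")
MULTI_NEWLINE = re.compile(r"\n{2,}")


def clean_text(text: str) -> str:
    """Apply the four cleaning steps to already-extracted text."""
    # Drop every character outside the Devanagari block (whitespace kept).
    text = NON_DEVANAGARI.sub("", text)
    # Drop Devanagari digits.
    text = DEVANAGARI_DIGITS.sub("", text)
    # Strip leading and trailing spaces on each line.
    text = "\n".join(line.strip() for line in text.split("\n"))
    # Collapse runs of newlines into a single newline.
    text = MULTI_NEWLINE.sub("\n", text)
    # Strip leading and trailing whitespace of the whole corpus.
    return text.strip()


if __name__ == "__main__":
    sample = "Hello नमस्ते 123 १२३\n\n\n  दुनिया!  \n"
    print(clean_text(sample))
```

Because the danda (।) and double danda (॥) fall inside the same Unicode block, this sketch keeps them; whether the published dataset keeps them is not stated in the description.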

Provided by:
Shah, Kavach