
Sep_Ngram_Tel-Ham01

DataCite Commons | Updated 2025-04-01 | Indexed 2025-04-16
Download link:
https://data.mendeley.com/datasets/g4tnnf683m
Description:
1. Introduction

In computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech. Depending on the application, these items can be phonemes, syllables, letters, words, etc. In this corpus, word n-grams (more precisely, in language-processing terms, token n-grams) have been calculated for Persian texts from the Hamshahri corpus and from Telegram messages. The texts were tokenized with the tokenizer developed at the Computerized Intelligence Systems Laboratory of University of Tabriz. This tokenizer also performs sentence separation.

1.1 The Hamshahri corpus

The Hamshahri corpus consists of texts published in the Hamshahri online newspaper. It is distributed as XML files, one file per day, with XML tags marking the news text, the category, and the submission date. The corpus is available in two versions. Sep_Ngram_Tel-Ham01 uses the first version of Hamshahri, which includes more than 80,000 articles from July 23, 1996 to June 20, 2003.

1.2 The Telegram corpus

The main problem with the Hamshahri corpus is that, because of its age, it lacks newer words (such as Telegram, Barjam, etc.). For this reason, Sep_Ngram_Tel-Ham01 uses the newer Telegram corpus in the calculations in addition to the Hamshahri corpus. The Telegram corpus consists of more than 1,800,000 posts published in well-known Persian groups and channels from March 15, 2017 to December 31, 2017. It was collected through the Telegram API at the Computerized Intelligence Systems Laboratory of University of Tabriz. Note that one month of this corpus has been manually labeled by topic and is available [1].
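As an illustration of what counting token n-grams involves, here is a minimal Python sketch. It is not the procedure used to build this dataset: the laboratory's tokenizer is not reproduced here, so whitespace splitting stands in for it, and the input is assumed to be already split into sentences. The function names `ngrams` and `count_ngrams` are hypothetical, and counting each sentence separately (so that no n-gram crosses a sentence boundary) is an assumption rather than something the description states.

```python
from collections import Counter
from typing import Iterable, Iterator, List, Tuple


def ngrams(tokens: List[str], n: int) -> Iterator[Tuple[str, ...]]:
    """Yield every contiguous run of n tokens (nothing if the list is too short)."""
    for i in range(len(tokens) - n + 1):
        yield tuple(tokens[i:i + n])


def count_ngrams(sentences: Iterable[str], n: int) -> Counter:
    """Count token n-grams, sentence by sentence.

    Assumption: str.split() is only a placeholder for the real tokenizer,
    and n-grams are not allowed to span two sentences.
    """
    counts: Counter = Counter()
    for sentence in sentences:
        tokens = sentence.split()  # placeholder tokenizer (whitespace only)
        counts.update(ngrams(tokens, n))
    return counts


if __name__ == "__main__":
    # Toy input; real input would be sentences from the two corpora.
    sample = ["a b a b", "b a"]
    for gram, freq in count_ngrams(sample, 2).most_common():
        print(" ".join(gram), freq)
```

Tuples are used as keys so that the same function works for any n, and a `Counter` keeps the frequency table in one pass over the text.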
Provider:
Mendeley
Created:
2022-05-23