Corpus of Law, Academic, and News

Source: DataCite Commons. Updated: 2020-10-20. Indexed: 2025-04-16.
Download link:
https://catalog.ldc.upenn.edu/LDC2020T23
Resource description:
Introduction

Corpus of Law, Academic, and News consists of 400 Persian documents divided into three genres: legal, academic, and news.

The legal section contains texts from official publications, including the civil penal code, the criminal penal code, and the constitution of the Islamic Republic of Iran. The academic sub-corpus is comprised of published academic abstracts in various disciplinary areas, such as Art and Humanities, Social Sciences, and Natural Sciences. The news sub-corpus was extracted from an archive of ten Iranian news outlets spanning the period 2010-2020.

Data

The document and token counts are as follows: 48 legal documents, 88,170 tokens; 274 academic documents, 85,765 tokens; and 78 news documents, 101,055 tokens.

Each document contains metadata in the file's header with information such as specific text type, dates and source, and also contains annotations marking title and body paragraphs.

All documents are presented as UTF-8 encoded XML with internal DTDs.

Samples

Please view this sample (XML).

Updates

None at this time.

Copyright

Portions © 2020 Ariana N. Mohammadi, © 2020 Trustees of the University of Pennsylvania
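The per-genre counts listed under Data can be tallied as a quick consistency check against the stated total of 400 documents. The sketch below uses only the figures quoted in the description; the variable names are illustrative and not part of the corpus distribution.

```python
# Per-genre counts as stated in the corpus description: (documents, tokens).
counts = {
    "legal": (48, 88_170),
    "academic": (274, 85_765),
    "news": (78, 101_055),
}

# Sum documents and tokens across the three genres.
total_docs = sum(docs for docs, _ in counts.values())
total_tokens = sum(tokens for _, tokens in counts.values())

print(f"total documents: {total_docs}")
print(f"total tokens: {total_tokens}")

# Average document length per genre, rounded to the nearest token.
for genre, (docs, tokens) in counts.items():
    print(f"{genre}: {tokens / docs:.0f} tokens per document on average")
```

The document counts add up to the 400 documents given in the introduction. The per-genre averages suggest that the academic abstracts are much shorter than the legal and news texts, which is why the three sub-corpora end up with broadly comparable token counts despite very different document counts.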
Provider:
Linguistic Data Consortium
Created:
2020-10-20