Corpus of Law, Academic, and News
Source: DataCite Commons. Updated: 2020-10-20. Indexed: 2025-04-16.
Download link:
https://catalog.ldc.upenn.edu/LDC2020T23
Resource description:
Introduction

Corpus of Law, Academic, and News consists of 400 Persian documents divided into three genres: legal, academic, and news.

The legal section contains texts from official publications, including the civil penal code, the criminal penal code, and the constitution of the Islamic Republic of Iran. The academic sub-corpus comprises published academic abstracts from various disciplinary areas, such as Art and Humanities, Social Sciences, and Natural Sciences. The news sub-corpus was extracted from an archive of ten Iranian news outlets spanning the period 2010-2020.

Data

The document and token counts are as follows: 48 legal documents, 88,170 tokens; 274 academic documents, 85,765 tokens; and 78 news documents, 101,055 tokens.
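
As a quick consistency check, the per-genre figures above can be tallied. This is a minimal sketch that uses only the counts stated in this description; the variable and function names are illustrative, and the per-document averages are derived here rather than taken from the corpus documentation.

```python
# Per-genre counts as stated in the corpus description.
counts = {
    "legal": {"documents": 48, "tokens": 88_170},
    "academic": {"documents": 274, "tokens": 85_765},
    "news": {"documents": 78, "tokens": 101_055},
}

# Totals across the three genres.
total_documents = sum(c["documents"] for c in counts.values())
total_tokens = sum(c["tokens"] for c in counts.values())


def mean_tokens_per_document(genre):
    """Average document length in tokens for one genre (derived figure)."""
    c = counts[genre]
    return c["tokens"] / c["documents"]


print(total_documents)  # 400, matching the total given in the introduction
print(total_tokens)     # 274990
for genre in counts:
    print(genre, round(mean_tokens_per_document(genre), 1))
```

The document total agrees with the 400 documents stated in the introduction. The averages suggest that the legal documents are much longer than the academic abstracts, even though the three genres have token totals of the same order.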

Each document carries metadata in its file header, such as the specific text type, dates, and source, and is annotated to mark the title and body paragraphs.

All documents are presented as UTF-8 encoded XML with internal DTDs.
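
A document in this format (UTF-8 XML with an internal DTD, header metadata, and title and paragraph markup) could be read roughly as sketched below with Python's standard `xml.etree.ElementTree`. The element names (`doc`, `header`, `genre`, `source`, `date`, `text`, `title`, `p`) and the sample content are assumptions made up for illustration. The actual schema is defined by the DTD inside each corpus file and may differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document: the element names and content are assumptions,
# not the corpus's actual schema. The internal DTD sits in the DOCTYPE block.
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE doc [
  <!ELEMENT doc (header, text)>
  <!ELEMENT header (genre, source, date)>
  <!ELEMENT genre (#PCDATA)>
  <!ELEMENT source (#PCDATA)>
  <!ELEMENT date (#PCDATA)>
  <!ELEMENT text (title, p+)>
  <!ELEMENT title (#PCDATA)>
  <!ELEMENT p (#PCDATA)>
]>
<doc>
  <header>
    <genre>news</genre>
    <source>example-outlet</source>
    <date>2015-01-01</date>
  </header>
  <text>
    <title>عنوان نمونه</title>
    <p>این یک جمله نمونه است.</p>
    <p>پاراگراف دوم.</p>
  </text>
</doc>
""".encode("utf-8")


def read_document(xml_bytes):
    """Return (header metadata, title, body paragraphs) from one document."""
    # ElementTree parses past the internal DTD but does not validate against it.
    root = ET.fromstring(xml_bytes)
    header = {child.tag: (child.text or "").strip() for child in root.find("header")}
    title = (root.findtext("text/title") or "").strip()
    paragraphs = [(p.text or "").strip() for p in root.iterfind("text/p")]
    return header, title, paragraphs


header, title, paragraphs = read_document(SAMPLE)
# Whitespace splitting is only a rough proxy; the corpus's own tokenization
# behind the published token counts is not described here.
n_tokens = sum(len(p.split()) for p in paragraphs)
print(header, title, len(paragraphs), n_tokens)
```

Passing bytes rather than a decoded string lets the parser honor the encoding named in the XML declaration. For real files, `ET.parse(path)` would do the same from disk, with the tag names adjusted to whatever the file's DTD declares.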

Samples

Please view this sample (XML).

Updates

None at this time.

Copyright

Portions © 2020 Ariana N. Mohammadi, © 2020 Trustees of the University of Pennsylvania
Provider:
Linguistic Data Consortium
Created:
2020-10-20



