CETEMpublico

Source: DataCite Commons · Updated: 2021-07-01 · Indexed: 2025-04-16
Download link:
https://catalog.ldc.upenn.edu/LDC2001T62
Resource description:
<h3>Introduction</h3><br>
<p>CETEMPublico Version 1.7 (Corpus de Extractos de Textos Electronicos MCT/Publico), produced by the Linguistic Data Consortium (LDC) as catalog number LDC2001T62 with ISBN 1-58563-206-6, is a corpus of newspaper texts from the Portuguese daily newspaper Publico, compiled for purposes of research and development in natural language processing (NLP) by the Computational Processing of Portuguese Project, under an agreement between Publico and the Portuguese Ministry of Science and Technology (MCT).</p><br>
<h3>Data</h3><br>
<p>The corpus includes the text of approximately 2,600 editions of Publico, produced between 1991 and 1998, and amounting to approximately 180 million words. CETEMPublico Version 1.7 contains 1,504,258 extracts (CETEMPublico Version 1.0 had 1,567,625). Version 1.7 was created in Oslo on August 6, 2001 and uses SGML tagging. The corpus is in 196 compressed text files, with names in the form cetemXXX.gz, from cetem001.gz to cetem196.gz.</p><br>
<p>This corpus was designed to assist researchers who develop computer programs processing the Portuguese language and who would need raw material for their work. In addition, the authors wished for the corpus to be useful to everyone who studies the Portuguese language and wishes to verify their hypotheses in previously organized text material. The online and the CQP versions are meant for such users, who are, in any case, also welcome to get it on CD in order to process the corpus locally, possibly by means of the corpus processing system of their choice.</p><br>
<p>More detailed information is available at <a href="http://www.linguateca.pt/cetempublico/" rel="nofollow">http://www.linguateca.pt/cetempublico</a>.</p><br>
<h3>Updates</h3><br>
<p>There are no updates at this time.</p>
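For readers who process the corpus locally, the file layout described above (196 gzip archives named cetem001.gz to cetem196.gz) could be walked roughly as follows. This is a minimal sketch, not part of the LDC distribution: the helper names `cetem_filenames` and `iter_lines` are hypothetical, and the `latin-1` text encoding is an assumption that should be checked against the corpus documentation.

```python
import gzip
from pathlib import Path


def cetem_filenames(first=1, last=196):
    """Return the expected archive names, cetem001.gz through cetem196.gz."""
    return [f"cetem{i:03d}.gz" for i in range(first, last + 1)]


def iter_lines(corpus_dir, encoding="latin-1"):
    """Yield (filename, line) pairs from each archive found in corpus_dir.

    The encoding default is an assumption, not a documented fact.
    Archives missing from the directory are skipped silently.
    """
    for name in cetem_filenames():
        path = Path(corpus_dir) / name
        if not path.exists():
            continue
        # "rt" opens the gzip stream in text mode so lines arrive as str.
        with gzip.open(path, "rt", encoding=encoding, errors="replace") as fh:
            for line in fh:
                yield name, line.rstrip("\n")
```

Because the function yields one line at a time, the archives do not need to be decompressed to disk first; the SGML tags arrive as ordinary text lines and can be passed to whatever parser is preferred.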
Provider:
Linguistic Data Consortium
Created:
2020-11-30