TS Wikipedia

Name: TS Wikipedia
Creator: Linguistic Data Consortium
Published: 2021-07-01 16:27:55
License: No description available

DataCite Commons: updated 2021-07-01, indexed 2025-04-16

Download link:

https://catalog.ldc.upenn.edu/LDC2015T15

Resource description:

Introduction

TS Wikipedia is a collection of approximately 1.6 million processed Turkish Wikipedia pages. The data is tokenized and includes part-of-speech tags, morphological analysis, lemmas, bi-grams and tri-grams.

Data

The data is in a word-per-line format with five tab-separated columns: token, part-of-speech tag, morphological analysis, lemma and corrected token spelling if needed. All data is presented in UTF-8 XML files and was selected and filtered to reduce non-Turkish characters, mathematical formulas and non-Turkish entries.

Samples

Please view this sample: desc/addenda/LDC2015T15.jpg

Updates

None at this time.

Portions © 2015 Taner Sezer, © 2015 Trustees of the University of Pennsylvania
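The word-per-line layout described under Data (five tab-separated columns: token, part-of-speech tag, morphological analysis, lemma, corrected spelling) could be read with a short parser along the following lines. This is a minimal sketch, not the corpus's official tooling. The names `TokenRow` and `parse_rows` are hypothetical, the rows in `SAMPLE` are invented placeholders rather than corpus data, and the tag strings do not reflect the real tagset. It also assumes that XML markup sits on its own lines beginning with `<`, which the description does not confirm.

```python
from typing import Iterable, Iterator, NamedTuple, Optional


class TokenRow(NamedTuple):
    """One word-per-line record (hypothetical container, not part of the corpus)."""
    token: str
    pos: str
    morph: str
    lemma: str
    corrected: Optional[str]  # None when no corrected spelling is given


def parse_rows(lines: Iterable[str]) -> Iterator[TokenRow]:
    """Yield a TokenRow for each tab-separated data line."""
    for line in lines:
        line = line.rstrip("\n")
        # Assumption: blank lines and lines starting with "<" are XML markup,
        # not token rows, so they are skipped.
        if not line.strip() or line.lstrip().startswith("<"):
            continue
        cols = line.split("\t")
        if len(cols) < 4:
            continue  # too few columns to be a token row
        # Pad so that a missing fifth column is treated as "no correction".
        cols += [""] * (5 - len(cols))
        token, pos, morph, lemma, corrected = cols[:5]
        yield TokenRow(token, pos, morph, lemma, corrected or None)


# Invented placeholder rows for illustration only (not taken from the corpus).
SAMPLE = "\n".join([
    "<doc>",
    "\t".join(["kitaplar", "NOUN", "kitap+Noun+A3pl", "kitap", ""]),
    "\t".join(["geliyom", "VERB", "gel+Verb+Prog1+A1sg", "gel", "geliyorum"]),
    "</doc>",
])

for row in parse_rows(SAMPLE.splitlines()):
    print(row.token, row.lemma, row.corrected)
```

To read an actual file, an open file handle (for example `open(path, encoding="utf-8")`) can be passed to `parse_rows` in place of the sample lines, since the data is distributed as UTF-8.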

Provider:

Linguistic Data Consortium

Created:

2020-11-30
