Monitor corpus of Slovene Trendi 2023-02

SSH Open MarketPlace (updated 2025-07-04, indexed 2025-07-05)
Download link:
https://marketplace.sshopencloud.eu/dataset/ZNwCDM
Description:
Trendi is a monitor corpus of Slovene containing news from 107 media websites, published by 72 publishers. Trendi 2023-02 covers the period from January 2019 to February 2023 and complements the [Gigafida 2.0](http://hdl.handle.net/11356/1320) reference corpus of written Slovene. All contents of the Trendi corpus are currently obtained through the [Jožef Stefan Institute Newsfeed service](http://newsfeed.ijs.si/). The texts have been annotated with the [CLASSLA-Stanza pipeline](https://github.com/clarinsi/classla), including syntactic parsing according to [Universal Dependencies](https://universaldependencies.org/sl/) and annotation of [Named Entities](https://nl.ijs.si/janes/wp-content/uploads/2017/09/SlovenianNER-eng-v1.1.pdf). An important addition is the set of topics, or thematic categories, automatically assigned to each text. There are 13 categories: Arts and culture, Crime and accidents, Economy, Environment, Health, Leisure, Politics and Law, Science and Technology, Society, Sports, Weather, Entertainment, and Education. The text classification models are available as [Text classification model SloBERTa-Trendi-Topics 1.0](http://hdl.handle.net/11356/1709), [Text classification model fastText-Trendi-Topics 1.0](http://hdl.handle.net/11356/1710), and the [SloBERTa model](https://huggingface.co/cjvt/sloberta-trendi-topics) on Hugging Face. The corpus is currently not available as a downloadable dataset due to copyright restrictions, but we hope to make at least part of it available in the near future. It can be queried through the noSketchEngine and KonText concordancers.
Created:
2025-07-04
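
Since the corpus itself cannot be downloaded, the most directly reusable piece of the description is the topic classifier. The following is a minimal sketch, not the authors' own code, of how the 13 categories might be assigned to new texts. It assumes that the `cjvt/sloberta-trendi-topics` model linked above loads through the generic Hugging Face `text-classification` pipeline and that the `transformers` package is installed. The helper names `classify_topics` and `topic_distribution` are hypothetical, and the exact label strings the model returns are not stated in the description.

```python
from collections import Counter

# The 13 topic categories listed in the corpus description.
TRENDI_TOPICS = [
    "Arts and culture", "Crime and accidents", "Economy", "Environment",
    "Health", "Leisure", "Politics and Law", "Science and Technology",
    "Society", "Sports", "Weather", "Entertainment", "Education",
]


def classify_topics(texts, model_name="cjvt/sloberta-trendi-topics"):
    """Return one predicted label per text (hypothetical helper).

    Assumption: the model works with the generic text-classification
    pipeline; its label strings may differ from TRENDI_TOPICS.
    """
    # Imported lazily so the rest of the module works without transformers.
    from transformers import pipeline

    classifier = pipeline("text-classification", model=model_name)
    results = classifier(list(texts), truncation=True)
    return [result["label"] for result in results]


def topic_distribution(labels):
    """Count how many texts fall under each label, most frequent first."""
    return Counter(labels).most_common()
```

A call such as `topic_distribution(classify_topics(articles))`, where `articles` is a list of Slovene news texts, would then give a rough per-topic breakdown. The first call downloads the model weights, so it needs network access.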