The news dataset for discriminating between Bosnian, Croatian and Serbian SETimes.HBS 1.0
Source: hdl.handle.net, indexed 2025-01-15
Download link:
http://hdl.handle.net/11356/1461
Resource description:
The SETimes.HBS dataset consists of parallel documents written in Bosnian, Croatian and Serbian, harvested from the now-inactive setimes.com website, which published news in the languages of South-Eastern Europe. While the writing process of the documents is not known, they are quite likely independent translations from English. The main intended use of this dataset is discrimination between closely related languages. This dataset is not a traditional parallel dataset, as there are no explicit links between parallel documents. Special care was taken to ensure that the training, development and testing bins each contain the same documents in all three languages, because, given the similarity of the three languages, leakage of a document from one bin into another could be problematic for benchmarking.
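The leakage concern described above can be illustrated with a group-aware split. The following is only a minimal sketch, not the procedure used to build SETimes.HBS: it assumes each document carries a hypothetical `story_id` shared by its Bosnian, Croatian and Serbian versions (the released dataset has no such explicit links), and the function names and the 80/10/10 proportions are likewise assumptions made for illustration.

```python
import hashlib


def assign_bin(story_id: str, dev_pct: int = 10, test_pct: int = 10) -> str:
    """Deterministically map a (hypothetical) story identifier to a bin.

    Hashing the story identifier, rather than the individual document,
    sends every language version of the same story to the same bin.
    """
    digest = hashlib.sha256(story_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    if bucket < test_pct:
        return "test"
    if bucket < test_pct + dev_pct:
        return "dev"
    return "train"


def split_documents(docs):
    """Distribute documents over train/dev/test by their story identifier."""
    bins = {"train": [], "dev": [], "test": []}
    for doc in docs:
        bins[assign_bin(doc["story_id"])].append(doc)
    return bins


def leaked_stories(bins):
    """Return the story identifiers that occur in more than one bin."""
    first_bin = {}
    leaked = set()
    for name, docs in bins.items():
        for doc in docs:
            if first_bin.setdefault(doc["story_id"], name) != name:
                leaked.add(doc["story_id"])
    return leaked


# Toy data: 50 hypothetical stories, each in three language versions.
docs = [
    {"story_id": f"story-{i}", "lang": lang, "text": f"{lang} text {i}"}
    for i in range(50)
    for lang in ("bs", "hr", "sr")
]
bins = split_documents(docs)
```

With a split of this kind, `leaked_stories(bins)` returns an empty set, which is the property the description asks for: a classifier cannot be evaluated on a story it has already seen in a sibling language.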
Provided by:
hdl.handle.net



