Albanian corpus used for pretraining large language models

Indexed by NIAID Data Ecosystem on 2026-05-01

Download link:

https://zenodo.org/record/10778229

Resource description:

The corpus used for pretraining large language models is an SQLite (v3) database with the following tables:

- corp_src: the sources of the Albanian text
- corp_doc: the corpus source (names) and source files
- doc: joins from sentences to corpus document source (`corp_doc`)
- sent: the Albanian sentences with tokenization and token length

This query shows how to get the corpus sources and constituent counts:

    select cs.id as name, cs.url, count(*) as count
      from corp_src as cs, corp_doc as cd, doc as d, sent as s
      where cd.name = cs.id and
            cd.doc_id = d.rowid and
            cd.doc_id = s.doc_id
      group by cs.id;
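As a minimal sketch, the source-count query from the description can be run with Python's standard `sqlite3` module. The query text is taken from the description; everything else is an assumption for illustration. In particular, `build_toy_db` is a hypothetical helper that creates a small in-memory stand-in containing only the columns the query touches, and its column types and sample rows are invented, not taken from the real corpus.

```python
import sqlite3

# Query as given in the corpus description, reflowed only for readability.
SOURCE_COUNTS_SQL = """
select cs.id as name, cs.url, count(*) as count
  from corp_src as cs, corp_doc as cd, doc as d, sent as s
  where cd.name = cs.id and
        cd.doc_id = d.rowid and
        cd.doc_id = s.doc_id
  group by cs.id
"""


def source_counts(conn):
    """Return (name, url, count) rows, one per corpus source."""
    return conn.execute(SOURCE_COUNTS_SQL).fetchall()


def build_toy_db():
    """Build an in-memory stand-in for the corpus database.

    ASSUMPTION: only the columns referenced by the query are modelled.
    The real tables likely have more columns and different types.
    """
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        create table corp_src (id text primary key, url text);
        create table corp_doc (name text, doc_id integer);
        create table doc (name text);
        create table sent (doc_id integer, sent text);
        """
    )
    # Invented sample rows: two sources, three documents, four sentences.
    conn.executemany(
        "insert into corp_src values (?, ?)",
        [("news", "https://example.org/news"),
         ("wiki", "https://example.org/wiki")],
    )
    # Inserted in order, so these rows receive rowid 1, 2 and 3.
    conn.executemany("insert into doc (name) values (?)",
                     [("n1",), ("n2",), ("w1",)])
    conn.executemany("insert into corp_doc values (?, ?)",
                     [("news", 1), ("news", 2), ("wiki", 3)])
    conn.executemany("insert into sent values (?, ?)",
                     [(1, "a"), (1, "b"), (2, "c"), (3, "d")])
    return conn


if __name__ == "__main__":
    for name, url, count in sorted(source_counts(build_toy_db())):
        print(name, url, count)
```

To run the same function against the downloaded database, pass `sqlite3.connect(path)` instead of the toy connection, where `path` is wherever the SQLite file from the Zenodo record was saved locally. The file name is not stated in the description, so it is left unspecified here.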

Created:

2024-03-07
