Albanian corpus used for pretraining large language models
Source: NIAID Data Ecosystem (indexed 2026-05-01)
Download link:
https://zenodo.org/record/10778229
Description:
The corpus used for pretraining large language models is an SQLite (v3) database with the following tables:
corp_src: the sources of the Albanian text
corp_doc: the corpus source (names) and source files
doc: joins from sentences to corpus document source (`corp_doc`)
sent: the Albanian sentences with tokenization and token length
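The description names the tables but not their columns. As a minimal sketch, assuming only Python's standard `sqlite3` module, the helper below lists every table in an open database together with its column names. The function name `describe_tables` is hypothetical and not part of the dataset.

```python
import sqlite3


def describe_tables(conn):
    """Map each table name in the database to its list of column names."""
    tables = [
        row[0]
        for row in conn.execute(
            "select name from sqlite_master where type = 'table' order by name"
        )
    ]
    schema = {}
    for table in tables:
        # PRAGMA table_info yields one row per column; index 1 holds the column name.
        columns = conn.execute(f'pragma table_info("{table}")').fetchall()
        schema[table] = [column[1] for column in columns]
    return schema
```

Calling `describe_tables(sqlite3.connect(path))` on the downloaded file, where `path` is wherever the database was saved locally, should show the columns of `corp_src`, `corp_doc`, `doc`, and `sent`.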
This query returns each corpus source with its URL and the number of sentences it contributes:
select cs.id as name, cs.url, count(*) as count
from corp_src as cs
join corp_doc as cd on cd.name = cs.id
join doc as d on d.rowid = cd.doc_id
join sent as s on s.doc_id = cd.doc_id
group by cs.id;
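The query can also be run from a script. The sketch below assumes Python's standard `sqlite3` module and a local copy of the database. The helper names `open_corpus` and `source_counts` are hypothetical, and the file name inside the Zenodo record is not stated here, so the path is left to the caller.

```python
import sqlite3
from pathlib import Path

# Query taken from the dataset description.
SOURCE_COUNTS_SQL = """
select cs.id as name, cs.url, count(*) as count
from corp_src as cs, corp_doc as cd, doc as d, sent as s
where cd.name = cs.id and
      cd.doc_id = d.rowid and
      cd.doc_id = s.doc_id
group by cs.id;
"""


def open_corpus(db_path):
    """Open the SQLite file read-only so the corpus cannot be changed by accident."""
    # db_path is supplied by the caller; it is not a name defined by the dataset.
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    return sqlite3.connect(uri, uri=True)


def source_counts(conn):
    """Return (name, url, count) tuples, one per corpus source."""
    return conn.execute(SOURCE_COUNTS_SQL).fetchall()
```

A typical use would be `for name, url, count in source_counts(open_corpus(path)): print(name, url, count)`, with `path` pointing at the downloaded database.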
Created:
2024-03-07



