aleksasp/lt-stressed-corpus
Hugging Face. Updated 2026-04-05; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/aleksasp/lt-stressed-corpus
Resource description:
---
language:
- lt
license: other
pretty_name: LT Stressed Corpus
configs:
- config_name: default
  data_files:
  - split: train
    path: data/default/train.parquet
- config_name: full
  data_files:
  - split: train
    path: data/full/train.parquet
---
# LT Stressed Corpus
A sentence-level Lithuanian corpus with word stress marks added to the text.
This release combines MATAS v1.0 and ALKSNIS v3.0 and keeps only fully stressed sentences. Sentences containing numeric expressions are excluded because Lithuanian number normalization is handled separately.
Each row keeps the original sentence, the stressed sentence, token-level analysis, and stress-source provenance.
Two configs are published from the same sentence set:
- `default`: Keeps the sentence text, token records, and stress/provenance counts.
- `full`: Adds serialized CoNLL-U and full sentence metadata on top of the slim schema.
The main fields are listed below. For a full field-by-field schema reference, see [COLUMNS.md](./COLUMNS.md).
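
As a rough sketch of how the two configs might be loaded, the snippet below assumes the Hugging Face `datasets` library is installed and that the repository id matches the page title, `aleksasp/lt-stressed-corpus`. The helper name `load_corpus` is illustrative and not part of the dataset.

```python
# Config names and parquet paths as declared in the card's YAML header.
CONFIG_FILES = {
    "default": "data/default/train.parquet",
    "full": "data/full/train.parquet",
}

# Assumption: the repository id is the one shown in the page title.
REPO_ID = "aleksasp/lt-stressed-corpus"


def load_corpus(config="default"):
    """Load the single `train` split of one config (hypothetical helper)."""
    if config not in CONFIG_FILES:
        raise ValueError(
            f"unknown config {config!r}; expected one of {sorted(CONFIG_FILES)}"
        )
    # Imported lazily so the module can be inspected without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(REPO_ID, config, split="train")
```

Calling `load_corpus("full")` would fetch the larger variant, which needs network access on first use.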
## What Is Included
- `111,358` sentences
- `1,456,222` tokens
- `1,147,025` stressable words
- `0` unresolved stressable words in this export
## Sources
- [Lithuanian Treebank - ALKSNIS v3.0](https://clarin.vdu.lt/xmlui/handle/20.500.11821/21): Lithuanian dependency treebank distributed here in CoNLL-U. Used here for `2,603` exported sentences.
- [MATAS v1.0](https://clarin.vdu.lt/xmlui/handle/20.500.11821/33): Manually checked, morphologically annotated Lithuanian corpus distributed here in CoNLL-U. Used here for `108,755` exported sentences.
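
The per-source sentence counts can be cross-checked against the totals above. The short script below is only an arithmetic sanity check on the published figures; the variable names are illustrative, and the stressable-word share assumes each stressable word is counted as one token.

```python
# Figures copied from the "What Is Included" and "Sources" sections.
TOTAL_SENTENCES = 111_358
TOTAL_TOKENS = 1_456_222
STRESSABLE_WORDS = 1_147_025
SENTENCES_BY_SOURCE = {"ALKSNIS v3.0": 2_603, "MATAS v1.0": 108_755}

# The two source counts should account for every exported sentence.
assert sum(SENTENCES_BY_SOURCE.values()) == TOTAL_SENTENCES

tokens_per_sentence = TOTAL_TOKENS / TOTAL_SENTENCES
stressable_share = STRESSABLE_WORDS / TOTAL_TOKENS

for name, count in SENTENCES_BY_SOURCE.items():
    print(f"{name}: {count / TOTAL_SENTENCES:.1%} of sentences")
print(f"about {tokens_per_sentence:.1f} tokens per sentence")
print(f"about {stressable_share:.1%} of tokens are stressable words")
```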
## How Stress Was Added
Stress was added with a conservative pipeline that prefers reusable and auditable sources over guessing.
1. Existing stressed MATAS and ALKSNIS releases were reused first.
2. Cached and morphology-aware sources were consulted next, when available.
3. `lki.words` was used as a dictionary fallback.
4. A final sentence-level fallback was used only for still-unresolved words in number-free sentences.
In the provenance columns this final fallback is named `phonology_engine`. In practice it refers to Lithuanian speech-synthesis preprocessing of the kind used by the LIEPA synthesizer: sentence normalization followed by accent assignment in context.
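
The ordered-fallback idea behind the steps above can be sketched as a small priority cascade. This is not the pipeline's actual code: the function `resolve_stress`, the toy lookup tables, and the label `reused_archive` are hypothetical, step 2 (cached and morphology-aware sources) is omitted for brevity, and the final engine is replaced by a placeholder that never answers. Only the `phonology_engine` label is taken from the card.

```python
from typing import Callable, Optional

# A source maps a word to a stressed form, or returns None if it has no answer.
Source = Callable[[str], Optional[str]]


def resolve_stress(
    word: str, sources: list[tuple[str, Source]]
) -> tuple[Optional[str], Optional[str]]:
    """Return (stressed_form, source_label) from the first source that answers."""
    for label, lookup in sources:
        stressed = lookup(word)
        if stressed is not None:
            return stressed, label
    return None, None  # still unresolved after every source


# Toy stand-ins for the real sources; the entries are illustrative only.
reused_archive = {"labas": "la\u0303bas"}  # tilde on "a"
dictionary = {"rytas": "ry\u0301tas"}      # acute on "y"

sources = [
    ("reused_archive", reused_archive.get),  # hypothetical label
    ("lki.words", dictionary.get),           # dictionary fallback
    ("phonology_engine", lambda w: None),    # placeholder; no engine is called
]

print(resolve_stress("labas", sources))
print(resolve_stress("rytas", sources))
```

Recording the label next to each stressed form is what makes the result auditable: every word can be traced back to the source that supplied its stress.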
Most words in this release come from reused stressed archives. Smaller portions were filled in from `lki.words` and the final sentence-level fallback.
- Reused existing stressed archives: `1,138,365` words
- `lki.words` fallback: `1,681` words
- Final sentence-level fallback: `6,979` words
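
To put these counts in proportion, the following sketch checks that the three provenance buckets add up to the stressable-word total and prints each bucket's share. The dictionary keys are descriptive labels chosen here, not column values from the dataset.

```python
# Figures copied from this section and from "What Is Included".
STRESSABLE_WORDS = 1_147_025
WORDS_BY_PROVENANCE = {
    "reused stressed archives": 1_138_365,
    "lki.words fallback": 1_681,
    "final sentence-level fallback": 6_979,
}

# With zero unresolved words, the buckets should cover every stressable word.
assert sum(WORDS_BY_PROVENANCE.values()) == STRESSABLE_WORDS

shares = {
    name: count / STRESSABLE_WORDS for name, count in WORDS_BY_PROVENANCE.items()
}
for name, share in shares.items():
    print(f"{name}: {share:.2%}")
```

On these figures, reused archives account for a little over 99% of stressable words, and the two fallbacks together cover under 1%.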
## Main Fields
- `original_text`: original sentence text
- `stressed_text_unicode`: stressed sentence with Unicode combining stress marks
- `tokens`: token-level records with tags, features, dependency fields, stressed form, and stress-source provenance
The `default` config omits `conllu_sentence` and `sentence_metadata` to keep the download lighter. The `full` config adds those two provenance-heavy fields back.
Both configs keep the full `tokens` records and omit duplicate flat `token_*` arrays and other convenience columns that can be reconstructed from the remaining data.
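
Because `stressed_text_unicode` carries stress as combining marks, the unstressed text can in principle be recovered by removing those marks. The sketch below assumes the marks are the three conventional Lithuanian accent signs (grave U+0300, acute U+0301, tilde U+0303); the card does not list the exact code points, and whether the result always equals `original_text` is untested here. The function name `strip_stress` is illustrative.

```python
import unicodedata

# Assumption: stress is encoded with these three combining marks only.
STRESS_MARKS = {"\u0300", "\u0301", "\u0303"}  # grave, acute, tilde


def strip_stress(text: str) -> str:
    """Remove stress marks while keeping letters such as ė, ą, š, ž intact."""
    # Decompose so precomposed accented letters also expose their marks.
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if ch not in STRESS_MARKS)
    # Recompose so letters like "ė" return to their single-code-point form.
    return unicodedata.normalize("NFC", kept)


print(strip_stress("la\u0303bas ry\u0301tas"))
```

Marks that belong to the Lithuanian alphabet itself (dot above, ogonek, caron, macron) are left untouched. Foreign words that legitimately contain an acute, grave, or tilde would lose it, so this is best treated as a heuristic.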
## License
- Both source corpora are marked on their CLARIN-LT item pages as publicly available under `PUB_CLARIN-LT_End-User-Licence-Agreement_EN-LT`.
- MATAS source page: [MATAS v1.0](https://clarin.vdu.lt/xmlui/handle/20.500.11821/33)
- ALKSNIS source page: [Lithuanian Treebank - ALKSNIS v3.0](https://clarin.vdu.lt/xmlui/handle/20.500.11821/21)
- Licence text: [PUB_CLARIN-LT_End-User-Licence-Agreement_EN-LT](https://clarin.vdu.lt/licenses/eula/PUB_CLARIN-LT_End-User-Licence-Agreement_EN-LT.htm)
- This stressed derivative preserves the original sentence content and adds stress annotation. Use it in compliance with the upstream CLARIN-LT licence terms; no broader rights are asserted here.
## Citation
- If you use this stressed derivative dataset, cite the Hugging Face dataset or repository release you obtained it from, and also cite both original corpora below.
- MATAS: Rimkutė E., Daudaravičius V., Utka A. 2007: Morphological Annotation of the Lithuanian Corpus. Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics; Workshop Balto-Slavonic Natural Language Processing 2007, Prague, 94–99.
- ALKSNIS: Bielinskienė A., Boizou L., Kovalevskaitė J., Rimkutė E. 2016: Lithuanian Dependency Treebank ALKSNIS. Proceedings of the Seventh International Conference Baltic HLT 2016. Amsterdam: IOS Press, 107–114.
Provider: aleksasp



