hugosousa/professor_heideltime_en
Source: Hugging Face. Updated 2023-11-13; indexed 2024-03-04.
Download link: https://hf-mirror.com/datasets/hugosousa/professor_heideltime_en
---
annotations_creators:
- machine-generated
language:
- en
- fr
- pt
- de
- it
- es
language_creators:
- found
license:
- mit
multilinguality:
- multilingual
pretty_name: Professor HeidelTime
size_categories:
- 100K<n<1M
source_datasets:
- original
tags:
- Timex
- Timexs
- Temporal Expression
- Temporal Expressions
- Temporal Information
- Timex Identification
- Timex Classification
- Timex Extraction
task_categories:
- token-classification
task_ids:
- parsing
- part-of-speech
- named-entity-recognition
configs:
- config_name: portuguese
  data_files: "portuguese.json"
- config_name: english
  data_files: "english.json"
- config_name: french
  data_files: "french.json"
- config_name: italian
  data_files: "italian.json"
- config_name: spanish
  data_files: "spanish.json"
- config_name: german
  data_files: "german.json"
---
# Professor HeidelTime
[Paper](https://dl.acm.org/doi/10.1145/3583780.3615130)
[Code](https://github.com/hmosousa/professor_heideltime)
Professor HeidelTime is a project to create a multilingual corpus weakly labeled with [HeidelTime](https://github.com/HeidelTime/heideltime), a temporal tagger.
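The card metadata declares one configuration per language, so a single language can be loaded on its own. The following is a minimal sketch, assuming the Hugging Face `datasets` library is installed and that the repository id matches the one in this page's title. The helper name `load_language` is hypothetical, and the split and column names are not documented in this card, so inspect the returned object before relying on any field.

```python
# Configuration names and data files, copied from the card metadata.
CONFIGS = {
    "portuguese": "portuguese.json",
    "english": "english.json",
    "french": "french.json",
    "italian": "italian.json",
    "spanish": "spanish.json",
    "german": "german.json",
}

# Assumption: repository id taken from this page's title.
REPO_ID = "hugosousa/professor_heideltime_en"


def load_language(language: str):
    """Load one language configuration from the Hugging Face Hub.

    Hypothetical helper: validates the configuration name locally
    before making any network request.
    """
    if language not in CONFIGS:
        valid = ", ".join(sorted(CONFIGS))
        raise ValueError(f"Unknown configuration {language!r}; expected one of: {valid}")

    # Imported lazily so this module can be read without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(REPO_ID, language)


if __name__ == "__main__":
    dataset = load_language("english")
    # Print the available splits and columns rather than assuming them.
    print(dataset)
```

Validating the name before the import keeps typos from triggering a download attempt.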
## Corpus Details
The weak labeling was performed in six languages. Here are the specifics of the corpus for each language:
| Dataset | Language | Documents | From | To | Tokens | Timexs |
| ----------------------- | -------- | --------- | ---------- | ---------- | ---------- | -------- |
| All the News 2.0 | EN | 24,642 | 2016-01-01 | 2020-04-02 | 18,755,616 | 254,803 |
| Italian Crime News | IT | 9,619 | 2011-01-01 | 2021-12-31 | 3,296,898 | 58,823 |
| German News Dataset | DE | 33,266 | 2003-01-01 | 2022-12-31 | 21,617,888 | 348,011 |
| ElMundo News | ES | 19,095 | 2005-12-02 | 2021-10-18 | 12,515,410 | 194,043 |
| French Financial News | FR | 24,293 | 2017-10-19 | 2021-03-19 | 1,673,053 | 83,431 |
| Público News | PT | 27,154 | 2000-11-14 | 2002-03-20 | 5,929,377 | 111,810 |
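The raw counts in the table are easier to compare once normalized. The sketch below derives two illustrative ratios from the table values: timexs per 1,000 tokens and tokens per document. The class and function names are hypothetical, and these ratios are not reported by the authors; they are simple arithmetic over the figures above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CorpusStats:
    """One row of the corpus table (counts copied from the table above)."""
    dataset: str
    language: str
    documents: int
    tokens: int
    timexs: int

    @property
    def timexs_per_1k_tokens(self) -> float:
        # Density of temporal expressions, normalized by corpus size.
        return 1000 * self.timexs / self.tokens

    @property
    def tokens_per_document(self) -> float:
        # Average document length in tokens.
        return self.tokens / self.documents


CORPUS = [
    CorpusStats("All the News 2.0", "EN", 24_642, 18_755_616, 254_803),
    CorpusStats("Italian Crime News", "IT", 9_619, 3_296_898, 58_823),
    CorpusStats("German News Dataset", "DE", 33_266, 21_617_888, 348_011),
    CorpusStats("ElMundo News", "ES", 19_095, 12_515_410, 194_043),
    CorpusStats("French Financial News", "FR", 24_293, 1_673_053, 83_431),
    CorpusStats("Público News", "PT", 27_154, 5_929_377, 111_810),
]


def summarize(corpus: list[CorpusStats]) -> None:
    """Print each language, ordered from highest to lowest timex density."""
    ranked = sorted(corpus, key=lambda row: row.timexs_per_1k_tokens, reverse=True)
    for row in ranked:
        print(
            f"{row.language}: {row.timexs_per_1k_tokens:.2f} timexs per 1k tokens, "
            f"{row.tokens_per_document:.0f} tokens per document"
        )


if __name__ == "__main__":
    summarize(CORPUS)
```

Normalizing by token count matters here because the corpora differ in size by more than a factor of ten.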
## Contact
For more information, reach out to [Hugo Sousa](https://hugosousa.net) at <hugo.o.sousa@inesctec.pt>.
This framework is part of the [Text2Story](https://text2story.inesctec.pt) project, which is financed by the ERDF (European Regional Development Fund) through the North Portugal Regional Operational Programme (NORTE 2020), under PORTUGAL 2020, and by National Funds through the Portuguese funding agency FCT (Fundação para a Ciência e a Tecnologia) within project PTDC/CCI-COM/31857/2017 (NORTE-01-0145-FEDER-03185).
## Cite
If you use this work, please cite the following [paper](https://dl.acm.org/doi/10.1145/3583780.3615130):
```bibtex
@inproceedings{10.1145/3583780.3615130,
author = {Sousa, Hugo and Campos, Ricardo and Jorge, Al\'{\i}pio},
title = {TEI2GO: A Multilingual Approach for Fast Temporal Expression Identification},
year = {2023},
isbn = {9798400701245},
publisher = {Association for Computing Machinery},
url = {https://doi.org/10.1145/3583780.3615130},
doi = {10.1145/3583780.3615130},
booktitle = {Proceedings of the 32nd ACM International Conference on Information and Knowledge Management},
pages = {5401--5406},
numpages = {6},
keywords = {temporal expression identification, multilingual corpus, weak label},
location = {Birmingham, United Kingdom},
series = {CIKM '23}
}
```
Provider: hugosousa
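Summary of original information
Dataset overview
Basic information
- Name: Professor HeidelTime
- Languages: multilingual (English, French, Portuguese, German, Italian, Spanish)
- Language creators: found
- License: MIT
- Multilinguality: multilingual
- Size: 100K<n<1M
- Source datasets: original
Tasks and configurations
- Task category: token classification
- Task IDs: parsing, part-of-speech tagging, named-entity recognition
- Configurations:
  - Portuguese: data file "portuguese.json"
  - English: data file "english.json"
  - French: data file "french.json"
  - Italian: data file "italian.json"
  - Spanish: data file "spanish.json"
  - German: data file "german.json"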
Dataset details
- Dataset: All the News 2.0
  - Language: English
  - Documents: 24,642
  - Date range: 2016-01-01 to 2020-04-02
  - Tokens: 18,755,616
  - Temporal expressions: 254,803
- Dataset: Italian Crime News
  - Language: Italian
  - Documents: 9,619
  - Date range: 2011-01-01 to 2021-12-31
  - Tokens: 3,296,898
  - Temporal expressions: 58,823
- Dataset: German News Dataset
  - Language: German
  - Documents: 33,266
  - Date range: 2003-01-01 to 2022-12-31
  - Tokens: 21,617,888
  - Temporal expressions: 348,011
- Dataset: ElMundo News
  - Language: Spanish
  - Documents: 19,095
  - Date range: 2005-12-02 to 2021-10-18
  - Tokens: 12,515,410
  - Temporal expressions: 194,043
- Dataset: French Financial News
  - Language: French
  - Documents: 24,293
  - Date range: 2017-10-19 to 2021-03-19
  - Tokens: 1,673,053
  - Temporal expressions: 83,431
- Dataset: Público News
  - Language: Portuguese
  - Documents: 27,154
  - Date range: 2000-11-14 to 2002-03-20
  - Tokens: 5,929,377
  - Temporal expressions: 111,810



