hugosousa/professor_heideltime_en

Name: hugosousa/professor_heideltime_en
Creator: hugosousa
Published: 2023-11-13 17:28:12
License: MIT

Source: Hugging Face. Updated: 2023-11-13. Indexed: 2024-03-04.

Download link:

https://hf-mirror.com/datasets/hugosousa/professor_heideltime_en


Description:

---
annotations_creators:
- machine-generated
language:
- en
- fr
- pt
- de
- fr
- it
- es
language_creators:
- found
license:
- mit
multilinguality:
- multilingual
pretty_name: Professor HeidelTime
size_categories:
- 100K<n<1M
source_datasets:
- original
tags:
- Timex
- Timexs
- Temporal Expression
- Temporal Expressions
- Temporal Information
- Timex Identification
- Timex Classification
- Timex Extraction
task_categories:
- token-classification
task_ids:
- parsing
- part-of-speech
- named-entity-recognition
configs:
- config_name: portuguese
  data_files: "portuguese.json"
- config_name: english
  data_files: "english.json"
- config_name: french
  data_files: "french.json"
- config_name: italian
  data_files: "italian.json"
- config_name: spanish
  data_files: "spanish.json"
- config_name: german
  data_files: "german.json"
---

# Professor HeidelTime

[![Paper](https://img.shields.io/badge/Paper-557C55)](https://dl.acm.org/doi/10.1145/3583780.3615130) [![GitHub](https://img.shields.io/badge/GitHub-A6CF98)](https://github.com/hmosousa/professor_heideltime)

Professor HeidelTime is a project to create a multilingual corpus weakly labeled with [HeidelTime](https://github.com/HeidelTime/heideltime), a temporal tagger.

## Corpus Details

The weak labeling was performed in six languages. Here are the specifics of the corpus for each language:

| Dataset               | Language | Documents | From       | To         | Tokens     | Timexs  |
| --------------------- | -------- | --------- | ---------- | ---------- | ---------- | ------- |
| All the News 2.0      | EN       | 24,642    | 2016-01-01 | 2020-04-02 | 18,755,616 | 254,803 |
| Italian Crime News    | IT       | 9,619     | 2011-01-01 | 2021-12-31 | 3,296,898  | 58,823  |
| German News Dataset   | DE       | 33,266    | 2003-01-01 | 2022-12-31 | 21,617,888 | 348,011 |
| ElMundo News          | ES       | 19,095    | 2005-12-02 | 2021-10-18 | 12,515,410 | 194,043 |
| French Financial News | FR       | 24,293    | 2017-10-19 | 2021-03-19 | 1,673,053  | 83,431  |
| Público News          | PT       | 27,154    | 2000-11-14 | 2002-03-20 | 5,929,377  | 111,810 |

## Contact

For more information, reach out to [Hugo Sousa](https://hugosousa.net) at <hugo.o.sousa@inesctec.pt>.

This framework is a part of the [Text2Story](https://text2story.inesctec.pt) project. This project is financed by the ERDF – European Regional Development Fund through the North Portugal Regional Operational Programme (NORTE 2020), under the PORTUGAL 2020 and by National Funds through the Portuguese funding agency, FCT - Fundação para a Ciência e a Tecnologia within project PTDC/CCI-COM/31857/2017 (NORTE-01-0145-FEDER-03185).

## Cite

If you use this work, please cite the following [paper](https://dl.acm.org/doi/10.1145/3583780.3615130):

```bibtex
@inproceedings{10.1145/3583780.3615130,
  author = {Sousa, Hugo and Campos, Ricardo and Jorge, Al\'{\i}pio},
  title = {TEI2GO: A Multilingual Approach for Fast Temporal Expression Identification},
  year = {2023},
  isbn = {9798400701245},
  publisher = {Association for Computing Machinery},
  url = {https://doi.org/10.1145/3583780.3615130},
  doi = {10.1145/3583780.3615130},
  booktitle = {Proceedings of the 32nd ACM International Conference on Information and Knowledge Management},
  pages = {5401–5406},
  numpages = {6},
  keywords = {temporal expression identification, multilingual corpus, weak label},
  location = {Birmingham, United Kingdom},
  series = {CIKM '23}
}
```

Provided by:

hugosousa

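Summary of the original information

Dataset overview

Basic information

Name: Professor HeidelTime
Languages: multilingual (English, French, Portuguese, German, Italian, Spanish)
Language creators: found
License: MIT
Multilinguality: multilingual
Size: 100K<n<1M
Source datasets: original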

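Tasks and configurations

Task category: token-classification
Task IDs: parsing, part-of-speech, named-entity-recognition
Configurations:
- portuguese: data file "portuguese.json"
- english: data file "english.json"
- french: data file "french.json"
- italian: data file "italian.json"
- spanish: data file "spanish.json"
- german: data file "german.json"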
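
The configuration list above pairs each language with a single JSON file. A minimal sketch of how that mapping might be used with the Hugging Face `datasets` library is given below. The helper names `data_file_for` and `load_language` are hypothetical, the repository ID is taken from this page's download link, and whether that repository actually exposes all six configs has not been verified here.

```python
# Config names and data files exactly as listed in the dataset card.
CONFIG_FILES = {
    "portuguese": "portuguese.json",
    "english": "english.json",
    "french": "french.json",
    "italian": "italian.json",
    "spanish": "spanish.json",
    "german": "german.json",
}

# Assumption: the repository ID shown in this page's download link.
REPO_ID = "hugosousa/professor_heideltime_en"


def data_file_for(config_name: str) -> str:
    """Return the JSON file behind a config name (hypothetical helper)."""
    try:
        return CONFIG_FILES[config_name]
    except KeyError:
        valid = ", ".join(sorted(CONFIG_FILES))
        raise ValueError(
            f"unknown config {config_name!r}; expected one of: {valid}"
        ) from None


def load_language(config_name: str):
    """Load one language config (hypothetical helper).

    Requires the third-party `datasets` package and network access.
    """
    data_file_for(config_name)  # validate the name before any download
    from datasets import load_dataset  # lazy import: the mapping works without it

    return load_dataset(REPO_ID, name=config_name)
```

Calling `load_language("english")` would be expected to return a `DatasetDict`; the record schema is not described in the card, so inspect the returned object's `features` before relying on any field name.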

Dataset details

Dataset: All the News 2.0
- Language: English
- Documents: 24,642
- Date range: 2016-01-01 to 2020-04-02
- Tokens: 18,755,616
- Temporal expressions (timexs): 254,803
Dataset: Italian Crime News
- Language: Italian
- Documents: 9,619
- Date range: 2011-01-01 to 2021-12-31
- Tokens: 3,296,898
- Temporal expressions (timexs): 58,823
Dataset: German News Dataset
- Language: German
- Documents: 33,266
- Date range: 2003-01-01 to 2022-12-31
- Tokens: 21,617,888
- Temporal expressions (timexs): 348,011
Dataset: ElMundo News
- Language: Spanish
- Documents: 19,095
- Date range: 2005-12-02 to 2021-10-18
- Tokens: 12,515,410
- Temporal expressions (timexs): 194,043
Dataset: French Financial News
- Language: French
- Documents: 24,293
- Date range: 2017-10-19 to 2021-03-19
- Tokens: 1,673,053
- Temporal expressions (timexs): 83,431
Dataset: Público News
- Language: Portuguese
- Documents: 27,154
- Date range: 2000-11-14 to 2002-03-20
- Tokens: 5,929,377
- Temporal expressions (timexs): 111,810
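
The per-corpus figures above can be turned into comparable densities with simple arithmetic. The sketch below copies the counts from the corpus table and derives tokens per document and timexs per 1,000 tokens; the `CorpusStats` class and the `totals` and `report` functions are illustrative names, not part of the dataset or its tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CorpusStats:
    """Counts for one source corpus, copied from the table in the card."""
    name: str
    language: str
    documents: int
    tokens: int
    timexs: int

    @property
    def tokens_per_document(self) -> float:
        return self.tokens / self.documents

    @property
    def timexs_per_1k_tokens(self) -> float:
        return 1000 * self.timexs / self.tokens


CORPORA = [
    CorpusStats("All the News 2.0", "EN", 24_642, 18_755_616, 254_803),
    CorpusStats("Italian Crime News", "IT", 9_619, 3_296_898, 58_823),
    CorpusStats("German News Dataset", "DE", 33_266, 21_617_888, 348_011),
    CorpusStats("ElMundo News", "ES", 19_095, 12_515_410, 194_043),
    CorpusStats("French Financial News", "FR", 24_293, 1_673_053, 83_431),
    CorpusStats("Público News", "PT", 27_154, 5_929_377, 111_810),
]


def totals(corpora):
    """Return (documents, tokens, timexs) summed over all corpora."""
    return (
        sum(c.documents for c in corpora),
        sum(c.tokens for c in corpora),
        sum(c.timexs for c in corpora),
    )


def report(corpora):
    """Format one summary line per corpus."""
    return [
        f"{c.language}: {c.tokens_per_document:.0f} tokens/doc, "
        f"{c.timexs_per_1k_tokens:.1f} timexs per 1k tokens"
        for c in corpora
    ]
```

Summing the document column gives 138,069 documents across the six corpora, which is consistent with the `100K<n<1M` size category declared in the card's metadata.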
