
hugosousa/professor_heideltime_en

Source: Hugging Face. Updated: 2023-11-13. Indexed: 2024-03-04.
Download link:
https://hf-mirror.com/datasets/hugosousa/professor_heideltime_en
Resource description:
---
annotations_creators:
- machine-generated
language:
- en
- fr
- pt
- de
- it
- es
language_creators:
- found
license:
- mit
multilinguality:
- multilingual
pretty_name: Professor HeidelTime
size_categories:
- 100K<n<1M
source_datasets:
- original
tags:
- Timex
- Timexs
- Temporal Expression
- Temporal Expressions
- Temporal Information
- Timex Identification
- Timex Classification
- Timex Extraction
task_categories:
- token-classification
task_ids:
- parsing
- part-of-speech
- named-entity-recognition
configs:
- config_name: portuguese
  data_files: "portuguese.json"
- config_name: english
  data_files: "english.json"
- config_name: french
  data_files: "french.json"
- config_name: italian
  data_files: "italian.json"
- config_name: spanish
  data_files: "spanish.json"
- config_name: german
  data_files: "german.json"
---

# Professor HeidelTime

[![Paper](https://img.shields.io/badge/Paper-557C55)](https://dl.acm.org/doi/10.1145/3583780.3615130) [![GitHub](https://img.shields.io/badge/GitHub-A6CF98)](https://github.com/hmosousa/professor_heideltime)

Professor HeidelTime is a project to create a multilingual corpus weakly labeled with [HeidelTime](https://github.com/HeidelTime/heideltime), a temporal tagger.

## Corpus Details

The weak labeling was performed in six languages. The specifics of the corpus for each language are:

| Dataset               | Language | Documents | From       | To         | Tokens     | Timexs  |
| --------------------- | -------- | --------- | ---------- | ---------- | ---------- | ------- |
| All the News 2.0      | EN       | 24,642    | 2016-01-01 | 2020-04-02 | 18,755,616 | 254,803 |
| Italian Crime News    | IT       | 9,619     | 2011-01-01 | 2021-12-31 | 3,296,898  | 58,823  |
| German News Dataset   | DE       | 33,266    | 2003-01-01 | 2022-12-31 | 21,617,888 | 348,011 |
| ElMundo News          | ES       | 19,095    | 2005-12-02 | 2021-10-18 | 12,515,410 | 194,043 |
| French Financial News | FR       | 24,293    | 2017-10-19 | 2021-03-19 | 1,673,053  | 83,431  |
| Público News          | PT       | 27,154    | 2000-11-14 | 2002-03-20 | 5,929,377  | 111,810 |

## Contact

For more information, reach out to [Hugo Sousa](https://hugosousa.net) at <hugo.o.sousa@inesctec.pt>.

This framework is part of the [Text2Story](https://text2story.inesctec.pt) project. The project is financed by the ERDF – European Regional Development Fund through the North Portugal Regional Operational Programme (NORTE 2020), under PORTUGAL 2020, and by National Funds through the Portuguese funding agency, FCT - Fundação para a Ciência e a Tecnologia, within project PTDC/CCI-COM/31857/2017 (NORTE-01-0145-FEDER-03185).

## Cite

If you use this work, please cite the following [paper](https://dl.acm.org/doi/10.1145/3583780.3615130):

```bibtex
@inproceedings{10.1145/3583780.3615130,
  author    = {Sousa, Hugo and Campos, Ricardo and Jorge, Al\'{\i}pio},
  title     = {TEI2GO: A Multilingual Approach for Fast Temporal Expression Identification},
  year      = {2023},
  isbn      = {9798400701245},
  publisher = {Association for Computing Machinery},
  url       = {https://doi.org/10.1145/3583780.3615130},
  doi       = {10.1145/3583780.3615130},
  booktitle = {Proceedings of the 32nd ACM International Conference on Information and Knowledge Management},
  pages     = {5401--5406},
  numpages  = {6},
  keywords  = {temporal expression identification, multilingual corpus, weak label},
  location  = {Birmingham, United Kingdom},
  series    = {CIKM '23}
}
```
Provider:
hugosousa
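Summary of original information

Dataset overview

Basic information

  • Name: Professor HeidelTime
  • Languages: multilingual (English, French, Portuguese, German, Italian, Spanish)
  • Language creators: found
  • License: MIT
  • Multilinguality: multilingual
  • Size category: 100K<n<1M
  • Source datasets: original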

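Tasks and configurations

  • Task category: token classification
  • Task IDs: parsing, part-of-speech, named-entity-recognition
  • Configurations:
    • portuguese: data file "portuguese.json"
    • english: data file "english.json"
    • french: data file "french.json"
    • italian: data file "italian.json"
    • spanish: data file "spanish.json"
    • german: data file "german.json"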
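
The configuration list above maps each language to one JSON file. A minimal sketch of how a single configuration might be loaded with the Hugging Face `datasets` library follows. The repository ID is taken from the download link on this page; the helper name `load_config` is hypothetical, and whether the repository still exposes exactly these configuration names is an assumption.

```python
# Config name -> data file, copied from the dataset card's `configs` section.
CONFIG_FILES = {
    "portuguese": "portuguese.json",
    "english": "english.json",
    "french": "french.json",
    "italian": "italian.json",
    "spanish": "spanish.json",
    "german": "german.json",
}

# Repository ID as it appears in the download link (assumed to be current).
REPO_ID = "hugosousa/professor_heideltime_en"


def load_config(name: str):
    """Load one language configuration from the Hugging Face Hub.

    `load_config` is an illustrative helper, not part of the dataset
    or of the `datasets` library.
    """
    if name not in CONFIG_FILES:
        known = ", ".join(sorted(CONFIG_FILES))
        raise ValueError(f"unknown config {name!r}; expected one of: {known}")
    # Imported lazily so the mapping above is usable without `datasets`.
    from datasets import load_dataset

    return load_dataset(REPO_ID, name)
```

Calling `load_config("english")` would download the English configuration on first use, so it needs network access and the `datasets` package installed.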

Dataset details

  • Dataset: All the News 2.0

    • Language: English
    • Documents: 24,642
    • Date range: 2016-01-01 to 2020-04-02
    • Tokens: 18,755,616
    • Temporal expressions (timexs): 254,803
  • Dataset: Italian Crime News

    • Language: Italian
    • Documents: 9,619
    • Date range: 2011-01-01 to 2021-12-31
    • Tokens: 3,296,898
    • Temporal expressions (timexs): 58,823
  • Dataset: German News Dataset

    • Language: German
    • Documents: 33,266
    • Date range: 2003-01-01 to 2022-12-31
    • Tokens: 21,617,888
    • Temporal expressions (timexs): 348,011
  • Dataset: ElMundo News

    • Language: Spanish
    • Documents: 19,095
    • Date range: 2005-12-02 to 2021-10-18
    • Tokens: 12,515,410
    • Temporal expressions (timexs): 194,043
  • Dataset: French Financial News

    • Language: French
    • Documents: 24,293
    • Date range: 2017-10-19 to 2021-03-19
    • Tokens: 1,673,053
    • Temporal expressions (timexs): 83,431
  • Dataset: Público News

    • Language: Portuguese
    • Documents: 27,154
    • Date range: 2000-11-14 to 2002-03-20
    • Tokens: 5,929,377
    • Temporal expressions (timexs): 111,810
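
The per-corpus counts above can be compared more easily once normalized. The sketch below computes two derived figures that are not in the original card: timexs per document and timexs per 1,000 tokens. The class and function names are illustrative, and only the raw counts come from the source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CorpusStats:
    """Raw counts for one corpus, as reported in the dataset card."""

    name: str
    language: str
    documents: int
    tokens: int
    timexs: int

    @property
    def timexs_per_document(self) -> float:
        return self.timexs / self.documents

    @property
    def timexs_per_1k_tokens(self) -> float:
        return 1000 * self.timexs / self.tokens


CORPORA = [
    CorpusStats("All the News 2.0", "EN", 24_642, 18_755_616, 254_803),
    CorpusStats("Italian Crime News", "IT", 9_619, 3_296_898, 58_823),
    CorpusStats("German News Dataset", "DE", 33_266, 21_617_888, 348_011),
    CorpusStats("ElMundo News", "ES", 19_095, 12_515_410, 194_043),
    CorpusStats("French Financial News", "FR", 24_293, 1_673_053, 83_431),
    CorpusStats("Público News", "PT", 27_154, 5_929_377, 111_810),
]


def totals(corpora):
    """Return (documents, tokens, timexs) summed over all corpora."""
    return (
        sum(c.documents for c in corpora),
        sum(c.tokens for c in corpora),
        sum(c.timexs for c in corpora),
    )


def report(corpora):
    """Print one line of derived statistics per corpus."""
    for c in corpora:
        print(
            f"{c.language}  {c.name:<22} "
            f"{c.timexs_per_document:6.2f} timexs/doc  "
            f"{c.timexs_per_1k_tokens:6.2f} timexs/1k tokens"
        )
```

Running `report(CORPORA)` prints the normalized figures, and `totals(CORPORA)` gives the corpus-wide sums, which can be checked against the `100K<n<1M` size category if that category counts documents.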