
DFKI-SLT/mobie

Source: Hugging Face. Updated 2024-05-22; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/DFKI-SLT/mobie
Resource description:
---
annotations_creators:
- expert-generated
language_creators:
- found
language:
- de
license:
- cc-by-4.0
multilinguality:
- monolingual
size_categories:
- 10K<n<100K
source_datasets:
- original
task_categories:
- text-classification
- token-classification
task_ids:
- named-entity-recognition
- entity-linking-classification
- multi-class-classification
paperswithcode_id: mobie
pretty_name: MobIE
tags:
- structure-prediction
- mobility
- relation extraction
- entity linking
- named entity recognition
dataset_info:
- config_name: ee
  features:
  - name: id
    dtype: string
  - name: text
    dtype: string
  - name: entity_mentions
    list:
    - name: text
      dtype: string
    - name: start
      dtype: int32
    - name: end
      dtype: int32
    - name: char_start
      dtype: int32
    - name: char_end
      dtype: int32
    - name: type
      dtype:
        class_label:
          names:
            '0': O
            '1': date
            '2': disaster-type
            '3': distance
            '4': duration
            '5': event-cause
            '6': location
            '7': location-city
            '8': location-route
            '9': location-stop
            '10': location-street
            '11': money
            '12': number
            '13': organization
            '14': organization-company
            '15': org-position
            '16': percent
            '17': person
            '18': set
            '19': time
            '20': trigger
    - name: entity_id
      dtype: string
    - name: refids
      list:
      - name: key
        dtype: string
      - name: value
        dtype: string
  - name: event_mentions
    list:
    - name: id
      dtype: string
    - name: trigger
      struct:
      - name: text
        dtype: string
      - name: start
        dtype: int32
      - name: end
        dtype: int32
      - name: char_start
        dtype: int32
      - name: char_end
        dtype: int32
    - name: arguments
      list:
      - name: text
        dtype: string
      - name: start
        dtype: int32
      - name: end
        dtype: int32
      - name: char_start
        dtype: int32
      - name: char_end
        dtype: int32
      - name: role
        dtype:
          class_label:
            names:
              '0': no_arg
              '1': trigger
              '2': location
              '3': delay
              '4': direction
              '5': start_loc
              '6': end_loc
              '7': start_date
              '8': end_date
              '9': cause
              '10': jam_length
              '11': route
      - name: type
        dtype:
          class_label:
            names:
              '0': O
              '1': date
              '2': disaster-type
              '3': distance
              '4': duration
              '5': event-cause
              '6': location
              '7': location-city
              '8': location-route
              '9': location-stop
              '10': location-street
              '11': money
              '12': number
              '13': organization
              '14': organization-company
              '15': org-position
              '16': percent
              '17': person
              '18': set
              '19': time
              '20': trigger
    - name: event_type
      dtype:
        class_label:
          names:
            '0': O
            '1': Accident
            '2': CanceledRoute
            '3': CanceledStop
            '4': Delay
            '5': Obstruction
            '6': RailReplacementService
            '7': TrafficJam
  - name: tokens
    sequence: string
  - name: pos_tags
    sequence: string
  - name: lemma
    sequence: string
  - name: ner_tags
    sequence:
      class_label:
        names:
          '0': O
          '1': B-date
          '2': B-disaster-type
          '3': B-distance
          '4': B-duration
          '5': B-event-cause
          '6': B-location
          '7': B-location-city
          '8': B-location-route
          '9': B-location-stop
          '10': B-location-street
          '11': B-money
          '12': B-number
          '13': B-organization
          '14': B-organization-company
          '15': B-org-position
          '16': B-percent
          '17': B-person
          '18': B-set
          '19': B-time
          '20': B-trigger
          '21': I-date
          '22': I-disaster-type
          '23': I-distance
          '24': I-duration
          '25': I-event-cause
          '26': I-location
          '27': I-location-city
          '28': I-location-route
          '29': I-location-stop
          '30': I-location-street
          '31': I-money
          '32': I-number
          '33': I-organization
          '34': I-organization-company
          '35': I-org-position
          '36': I-percent
          '37': I-person
          '38': I-set
          '39': I-time
          '40': I-trigger
  splits:
  - name: train
    num_bytes: 3757740
    num_examples: 2115
  - name: test
    num_bytes: 1334445
    num_examples: 623
  - name: validation
    num_bytes: 827821
    num_examples: 494
  download_size: 1891736
  dataset_size: 5920006
- config_name: el
  features:
  - name: id
    dtype: string
  - name: text
    dtype: string
  - name: entity_mentions
    list:
    - name: text
      dtype: string
    - name: start
      dtype: int32
    - name: end
      dtype: int32
    - name: char_start
      dtype: int32
    - name: char_end
      dtype: int32
    - name: type
      dtype:
        class_label:
          names:
            '0': O
            '1': date
            '2': disaster-type
            '3': distance
            '4': duration
            '5': event-cause
            '6': location
            '7': location-city
            '8': location-route
            '9': location-stop
            '10': location-street
            '11': money
            '12': number
            '13': organization
            '14': organization-company
            '15': org-position
            '16': percent
            '17': person
            '18': set
            '19': time
            '20': trigger
    - name: entity_id
      dtype: string
    - name: refids
      list:
      - name: key
        dtype: string
      - name: value
        dtype: string
  splits:
  - name: train
    num_bytes: 1487615
    num_examples: 2115
  - name: test
    num_bytes: 557349
    num_examples: 623
  - name: validation
    num_bytes: 329567
    num_examples: 494
  download_size: 819444
  dataset_size: 2374531
- config_name: ner
  features:
  - name: id
    dtype: string
  - name: tokens
    sequence: string
  - name: ner_tags
    sequence:
      class_label:
        names:
          '0': O
          '1': B-date
          '2': B-disaster-type
          '3': B-distance
          '4': B-duration
          '5': B-event-cause
          '6': B-location
          '7': B-location-city
          '8': B-location-route
          '9': B-location-stop
          '10': B-location-street
          '11': B-money
          '12': B-number
          '13': B-organization
          '14': B-organization-company
          '15': B-org-position
          '16': B-percent
          '17': B-person
          '18': B-set
          '19': B-time
          '20': B-trigger
          '21': I-date
          '22': I-disaster-type
          '23': I-distance
          '24': I-duration
          '25': I-event-cause
          '26': I-location
          '27': I-location-city
          '28': I-location-route
          '29': I-location-stop
          '30': I-location-street
          '31': I-money
          '32': I-number
          '33': I-organization
          '34': I-organization-company
          '35': I-org-position
          '36': I-percent
          '37': I-person
          '38': I-set
          '39': I-time
          '40': I-trigger
  splits:
  - name: train
    num_bytes: 1112606
    num_examples: 2115
  - name: test
    num_bytes: 354244
    num_examples: 623
  - name: validation
    num_bytes: 251031
    num_examples: 494
  download_size: 486201
  dataset_size: 1717881
- config_name: re
  features:
  - name: id
    dtype: string
  - name: tokens
    sequence: string
  - name: entities
    sequence:
      list: int32
  - name: entity_roles
    sequence:
      class_label:
        names:
          '0': no_arg
          '1': trigger
          '2': location
          '3': delay
          '4': direction
          '5': start_loc
          '6': end_loc
          '7': start_date
          '8': end_date
          '9': cause
          '10': jam_length
          '11': route
  - name: entity_types
    sequence:
      class_label:
        names:
          '0': O
          '1': date
          '2': disaster-type
          '3': distance
          '4': duration
          '5': event-cause
          '6': location
          '7': location-city
          '8': location-route
          '9': location-stop
          '10': location-street
          '11': money
          '12': number
          '13': organization
          '14': organization-company
          '15': org-position
          '16': percent
          '17': person
          '18': set
          '19': time
          '20': trigger
  - name: event_type
    dtype:
      class_label:
        names:
          '0': O
          '1': Accident
          '2': CanceledRoute
          '3': CanceledStop
          '4': Delay
          '5': Obstruction
          '6': RailReplacementService
          '7': TrafficJam
  - name: entity_ids
    sequence: string
  splits:
  - name: train
    num_bytes: 1048457
    num_examples: 1199
  - name: test
    num_bytes: 501336
    num_examples: 609
  - name: validation
    num_bytes: 179001
    num_examples: 228
  download_size: 342446
  dataset_size: 1728794
configs:
- config_name: ee
  data_files:
  - split: train
    path: ee/train-*
  - split: test
    path: ee/test-*
  - split: validation
    path: ee/validation-*
- config_name: el
  data_files:
  - split: train
    path: el/train-*
  - split: test
    path: el/test-*
  - split: validation
    path: el/validation-*
- config_name: ner
  data_files:
  - split: train
    path: ner/train-*
  - split: test
    path: ner/test-*
  - split: validation
    path: ner/validation-*
  default: true
- config_name: re
  data_files:
  - split: train
    path: re/train-*
  - split: test
    path: re/test-*
  - split: validation
    path: re/validation-*
---

# Dataset Card for "MobIE"

## Table of Contents
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** [https://github.com/dfki-nlp/mobie](https://github.com/dfki-nlp/mobie)
- **Repository:** [https://github.com/dfki-nlp/mobie](https://github.com/dfki-nlp/mobie)
- **Paper:** [https://aclanthology.org/2021.konvens-1.22/](https://aclanthology.org/2021.konvens-1.22/)
- **Point of Contact:** See [https://github.com/dfki-nlp/mobie](https://github.com/dfki-nlp/mobie)
- **Size of downloaded dataset files:** 8.2 MB
- **Size of the generated dataset:** 1.7 MB
- **Total amount of disk used:** 9.9 MB

### Dataset Summary

This script is for loading the MobIE dataset from https://github.com/dfki-nlp/mobie.

MobIE is a German-language dataset which is human-annotated with 20 coarse- and fine-grained entity types and entity linking information for geographically linkable entities. The dataset consists of 3,232 social media texts and traffic reports with 91K tokens, and contains 20.5K annotated entities, 13.1K of which are linked to a knowledge base. A subset of the dataset is human-annotated with seven mobility-related, n-ary relation types, while the remaining documents are annotated using a weakly-supervised labeling approach implemented with the Snorkel framework. The dataset combines annotations for NER, EL and RE, and thus can be used for joint and multi-task learning of these fundamental information extraction tasks.
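A minimal loading sketch, assuming the Hugging Face `datasets` library is installed and that the Hub repository id is `DFKI-SLT/mobie`, as in the download link for this card. The configuration names come from the card metadata; `load_mobie` is a hypothetical helper introduced here for illustration, not part of the dataset or the library.

```python
# Hub repository id (assumption: taken from the download link of this card).
MOBIE_REPO = "DFKI-SLT/mobie"
# Configuration names as listed in the card metadata; "ner" is marked default there.
MOBIE_CONFIGS = ("ner", "el", "re", "ee")


def load_mobie(config="ner", split="train"):
    """Load one MobIE configuration and split (hypothetical convenience wrapper)."""
    if config not in MOBIE_CONFIGS:
        raise ValueError(f"unknown config {config!r}; expected one of {MOBIE_CONFIGS}")
    # Imported lazily so the module can be inspected without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(MOBIE_REPO, config, split=split)


# Example usage (downloads data from the Hub, so it is left commented out):
# ner_train = load_mobie("ner", "train")
# print(ner_train[0]["tokens"])
```

The config check runs before anything is downloaded, so a mistyped configuration name fails immediately.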
This version of the dataset loader provides configurations for:

- Named Entity Recognition (`ner`): NER tags use the `BIO` tagging scheme
- Entity Linking (`el`): Entity mentions are linked to an internal knowledge base and Open Street Map
- Relation Extraction (`re`): n-ary Relation Extraction
- Event Extraction (`ee`): formatted similarly to https://github.com/nlpcl-lab/ace2005-preprocessing?tab=readme-ov-file#format

For more details see https://github.com/dfki-nlp/mobie and https://aclanthology.org/2021.konvens-1.22/.

### Supported Tasks and Leaderboards

- **Tasks:** Named Entity Recognition, Entity Linking, n-ary Relation Extraction, Event Extraction
- **Leaderboards:**

### Languages

German

## Dataset Structure

### Data Instances

#### ner

- **Size of downloaded dataset files:** 8.2 MB
- **Size of the generated dataset:** 1.7 MB
- **Total amount of disk used:** 9.9 MB

An example of 'train' looks as follows.

```json
{
  "id": "http://www.ndr.de/nachrichten/verkehr/index.html#2@2016-05-04T21:02:14.000+02:00",
  "tokens": ["Vorsicht", "bitte", "auf", "der", "A28", "Leer", "Richtung", "Oldenburg", "zwischen", "Zwischenahner", "Meer", "und", "Neuenkruge", "liegen", "Gegenstände", "!"],
  "ner_tags": [0, 0, 0, 0, 19, 13, 0, 13, 0, 11, 12, 0, 11, 0, 0, 0]
}
```

#### el

- **Size of downloaded dataset files:** 8.2 MB
- **Size of the generated dataset:** 2.1 MB
- **Total amount of disk used:** 10.3 MB

An example of 'train' looks as follows.

```json
{
  "id": "1108129826844672001",
  "text": "#S4 #RegioNDS #Teilausfall #Mellendorf(23.03)> #Bennemühlen(23.07). Grund: technische Störung an der Strecke. Bitte nutzen Sie #RB38 nach Soltau über Bennemühlen Abfahrt: 23:08 Uhr vom Gleis 2",
  "entity_mentions": [
    {"text": "#S4", "start": 0, "end": 1, "char_start": 0, "char_end": 3, "type": 7, "entity_id": "NIL", "refids": [{"key": "spreeDBReferenceId", "value": "24007"}]},
    {"text": "#RegioNDS", "start": 1, "end": 2, "char_start": 4, "char_end": 13, "type": 13, "entity_id": "NIL", "refids": [{"key": "spreeDBReferenceId", "value": "NIL"}]},
    {"text": "#Teilausfall", "start": 2, "end": 3, "char_start": 14, "char_end": 26, "type": 19, "entity_id": "NIL", "refids": [{"key": "spreeDBReferenceId", "value": "NIL"}]},
    {"text": "#Mellendorf", "start": 3, "end": 4, "char_start": 27, "char_end": 38, "type": 8, "entity_id": "NIL", "refids": [{"key": "spreeDBReferenceId", "value": "8003957"}]},
    {"text": "23.03", "start": 5, "end": 6, "char_start": 39, "char_end": 44, "type": 0, "entity_id": "NIL", "refids": [{"key": "spreeDBReferenceId", "value": "NIL"}]},
    {"text": "#Bennemühlen", "start": 8, "end": 9, "char_start": 47, "char_end": 59, "type": 6, "entity_id": "29589800", "refids": [{"key": "spreeDBReferenceId", "value": "29589800"}, {"key": "osm_id", "value": "29589800"}]},
    {"text": "23.07", "start": 10, "end": 11, "char_start": 60, "char_end": 65, "type": 0, "entity_id": "NIL", "refids": [{"key": "spreeDBReferenceId", "value": "NIL"}]},
    {"text": "technische Störung", "start": 15, "end": 17, "char_start": 76, "char_end": 94, "type": 4, "entity_id": "NIL", "refids": [{"key": "spreeDBReferenceId", "value": "NIL"}]},
    {"text": "#RB38", "start": 24, "end": 25, "char_start": 128, "char_end": 133, "type": 7, "entity_id": "NIL", "refids": [{"key": "spreeDBReferenceId", "value": "23138"}]},
    {"text": "Soltau", "start": 26, "end": 27, "char_start": 139, "char_end": 145, "type": 6, "entity_id": "1809016", "refids": [{"key": "spreeDBReferenceId", "value": "-1809016"}, {"key": "osm_id", "value": "1809016"}]},
    {"text": "Bennemühlen", "start": 28, "end": 29, "char_start": 151, "char_end": 162, "type": 8, "entity_id": "NIL", "refids": [{"key": "spreeDBReferenceId", "value": "8000871"}]},
    {"text": "23:08 Uhr", "start": 31, "end": 33, "char_start": 172, "char_end": 181, "type": 18, "entity_id": "NIL", "refids": [{"key": "spreeDBReferenceId", "value": "NIL"}]},
    {"text": "2", "start": 35, "end": 36, "char_start": 192, "char_end": 193, "type": 11, "entity_id": "NIL", "refids": [{"key": "spreeDBReferenceId", "value": "NIL"}]}
  ]
}
```

#### re

- **Size of downloaded dataset files:** 8.2 MB
- **Size of the generated dataset:** 1.7 MB
- **Total amount of disk used:** 9.9 MB

An example of 'train' looks as follows.

```json
{
  "id": "1111185208647274501_1",
  "text": "RT @SBahn_Stuttgart: 🚨Störung🚨 Derzeit steht eine #S2 Richtung Filderstadt mit einer Türstörung in Stg-Rohr. Es kommt auf den Linien #S1, #…",
  "tokens": ["RT", "@SBahn_Stuttgart", ":", "🚨", "Störung", "🚨 ", "Derzeit", "steht", "eine", "#S2", "Richtung", "Filderstadt", "mit", "einer", "Türstörung", "in", "Stg", "-", "Rohr", ".", "Es", "kommt", "auf", "den", "Linien", "#S1", ",", "#", "…"],
  "entities": [[1, 2], [4, 5], [9, 10], [11, 12], [14, 15], [16, 19], [25, 26]],
  "entity_roles": [0, 1, 2, 0, 0, 0, 0],
  "entity_types": [13, 4, 7, 6, 4, 8, 7],
  "event_type": 5,
  "entity_ids": ["NIL", "NIL", "NIL", "2796535", "NIL", "NIL", "NIL"]
}
```

#### ee

- **Size of downloaded dataset files:** 8.2 MB
- **Size of the generated dataset:** 5.9 MB
- **Total amount of disk used:** 14.1 MB

An example of 'train' looks as follows.

```json
{
  "id": "1111185208647274501",
  "text": "RT @SBahn_Stuttgart: 🚨Störung🚨 Derzeit steht eine #S2 Richtung Filderstadt mit einer Türstörung in Stg-Rohr. Es kommt auf den Linien #S1, #…",
  "entity_mentions": [
    {"text": "@SBahn_Stuttgart", "start": 1, "end": 2, "char_start": 3, "char_end": 19, "type": 13, "entity_id": "NIL", "refids": [{"key": "spreeDBReferenceId", "value": "NIL"}]},
    {"text": "Störung", "start": 4, "end": 5, "char_start": 22, "char_end": 29, "type": 4, "entity_id": "NIL", "refids": [{"key": "spreeDBReferenceId", "value": "NIL"}]},
    {"text": "#S2", "start": 9, "end": 10, "char_start": 50, "char_end": 53, "type": 7, "entity_id": "NIL", "refids": [{"key": "spreeDBReferenceId", "value": "17171"}]},
    {"text": "Filderstadt", "start": 11, "end": 12, "char_start": 63, "char_end": 74, "type": 6, "entity_id": "2796535", "refids": [{"key": "spreeDBReferenceId", "value": "-2796535"}, {"key": "osm_id", "value": "2796535"}]},
    {"text": "Türstörung", "start": 14, "end": 15, "char_start": 85, "char_end": 95, "type": 4, "entity_id": "NIL", "refids": [{"key": "spreeDBReferenceId", "value": "NIL"}]},
    {"text": "Stg-Rohr", "start": 16, "end": 19, "char_start": 99, "char_end": 107, "type": 8, "entity_id": "NIL", "refids": [{"key": "spreeDBReferenceId", "value": "NIL"}]},
    {"text": "#S1", "start": 25, "end": 26, "char_start": 133, "char_end": 136, "type": 7, "entity_id": "NIL", "refids": [{"key": "spreeDBReferenceId", "value": "16703"}]}
  ],
  "event_mentions": [
    {
      "id": "r/0f748b57-63ec-4cb9-ab54-e35d29ac44f8",
      "trigger": {"text": "Störung", "start": 4, "end": 5, "char_start": 22, "char_end": 29},
      "arguments": [
        {"text": "#S2", "start": 9, "end": 10, "char_start": 50, "char_end": 53, "role": 1, "type": 7}
      ],
      "event_type": 5
    }
  ],
  "tokens": ["RT", "@SBahn_Stuttgart", ":", "🚨", "Störung", "🚨 ", "Derzeit", "steht", "eine", "#S2", "Richtung", "Filderstadt", "mit", "einer", "Türstörung", "in", "Stg", "-", "Rohr", ".", "Es", "kommt", "auf", "den", "Linien", "#S1", ",", "#", "…"],
  "pos_tags": ["NN", "NN", "$.", "CARD", "NN", "CARD", "ADV", "VVFIN", "ART", "NN", "NN", "NE", "APPR", "ART", "NN", "APPR", "NE", "$[", "NE", "$.", "PPER", "VVFIN", "APPR", "ART", "NN", "CARD", "$,", "CARD", "$["],
  "lemma": ["rt", "@sbahn_stuttgart", ":", "🚨", "störung", "🚨", "derzeit", "steht", "eine", "#s2", "richtung", "filderstadt", "mit", "einer", "türstörung", "in", "stg", "-", "rohr", ".", "es", "kommt", "auf", "den", "linien", "#s1", ",", "#", "..."],
  "ner_tags": [0, 14, 0, 0, 5, 0, 0, 0, 0, 8, 0, 7, 0, 0, 5, 0, 9, 29, 29, 0, 0, 0, 0, 0, 0, 8, 0, 0, 0]
}
```

### Data Fields

#### ner

- `id`: example identifier, a `string` feature.
- `tokens`: list of tokens, a `list` of `string` features.
- `ner_tags`: a `list` of classification labels, with possible values including `O` (0), `B-date` (1), `B-disaster-type` (2), `B-distance` (3), `B-duration` (4), ...

#### el

- `id`: example identifier, a `string` feature.
- `text`: example text, a `string` feature.
- `entity_mentions`: a `list` of `struct` features.
  - `text`: a `string` feature.
  - `start`: token offset start, an `int32` feature.
  - `end`: token offset end, an `int32` feature.
  - `char_start`: character offset start, an `int32` feature.
  - `char_end`: character offset end, an `int32` feature.
  - `type`: a classification label, with possible values including `O` (0), `date` (1), `disaster-type` (2), `distance` (3), `duration` (4), `event-cause` (5), ...
  - `entity_id`: Open Street Map ID, a `string` feature.
  - `refids`: knowledge base ids, a `list` of `struct` features.
    - `key`: name of the knowledge base, a `string` feature.
    - `value`: identifier, a `string` feature.

#### re

- `id`: example identifier, a `string` feature.
- `text`: example text, a `string` feature.
- `tokens`: list of tokens, a `list` of `string` features.
- `entities`: a list of token spans, each a `list` of `int32` features.
- `entity_roles`: a `list` of classification labels, with possible values including `no_arg` (0), `trigger` (1), `location` (2), `delay` (3), `direction` (4), ...
- `entity_types`: a `list` of classification labels, with possible values including `O` (0), `date` (1), `disaster-type` (2), `distance` (3), `duration` (4), `event-cause` (5), ...
- `event_type`: a classification label, with possible values including `O` (0), `Accident` (1), `CanceledRoute` (2), `CanceledStop` (3), `Delay` (4), ...
- `entity_ids`: list of Open Street Map IDs, a `list` of `string` features.

#### ee

- `id`: example identifier, a `string` feature.
- `text`: example text, a `string` feature.
- `entity_mentions`: a `list` of `struct` features.
  - `text`: a `string` feature.
  - `start`: token offset start, an `int32` feature.
  - `end`: token offset end, an `int32` feature.
  - `char_start`: character offset start, an `int32` feature.
  - `char_end`: character offset end, an `int32` feature.
  - `type`: a classification label, with possible values including `O` (0), `date` (1), `disaster-type` (2), `distance` (3), `duration` (4), `event-cause` (5), ...
  - `entity_id`: Open Street Map ID, a `string` feature.
  - `refids`: knowledge base ids, a `list` of `struct` features.
    - `key`: name of the knowledge base, a `string` feature.
    - `value`: identifier, a `string` feature.
- `event_mentions`: a list of `struct` features.
  - `id`: event identifier, a `string` feature.
  - `trigger`: a `struct` feature.
    - `text`: a `string` feature.
    - `start`: token offset start, an `int32` feature.
    - `end`: token offset end, an `int32` feature.
    - `char_start`: character offset start, an `int32` feature.
    - `char_end`: character offset end, an `int32` feature.
  - `arguments`: a list of `struct` features.
    - `text`: a `string` feature.
    - `start`: token offset start, an `int32` feature.
    - `end`: token offset end, an `int32` feature.
    - `char_start`: character offset start, an `int32` feature.
    - `char_end`: character offset end, an `int32` feature.
    - `role`: a classification label, with possible values including `no_arg` (0), `trigger` (1), `location` (2), `delay` (3), `direction` (4), ...
    - `type`: a classification label, with possible values including `O` (0), `date` (1), `disaster-type` (2), `distance` (3), `duration` (4), `event-cause` (5), ...
  - `event_type`: a classification label, with possible values including `O` (0), `Accident` (1), `CanceledRoute` (2), `CanceledStop` (3), `Delay` (4), ...
- `tokens`: list of tokens, a `list` of `string` features.
- `pos_tags`: list of part-of-speech tags, a `list` of `string` features.
- `lemma`: list of lemmatized tokens, a `list` of `string` features.
- `ner_tags`: a `list` of classification labels, with possible values including `O` (0), `B-date` (1), `B-disaster-type` (2), `B-distance` (3), `B-duration` (4), ...

### Data Splits

|     | Train | Dev | Test |
|-----|-------|-----|------|
| NER | 2115  | 494 | 623  |
| EL  | 2115  | 494 | 623  |
| RE  | 1199  | 228 | 609  |
| EE  | 2115  | 494 | 623  |

## Dataset Creation

### Curation Rationale

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Source Data

#### Initial Data Collection and Normalization

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

#### Who are the source language producers?

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Annotations

#### Annotation process

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

#### Who are the annotators?

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Personal and Sensitive Information

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Discussion of Biases

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Other Known Limitations

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

## Additional Information

### Dataset Curators

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Licensing Information

[CC BY-SA 4.0 license](https://creativecommons.org/licenses/by-sa/4.0/)

### Citation Information

```
@inproceedings{hennig-etal-2021-mobie,
    title = "{M}ob{IE}: A {G}erman Dataset for Named Entity Recognition, Entity Linking and Relation Extraction in the Mobility Domain",
    author = "Hennig, Leonhard and Truong, Phuc Tran and Gabryszak, Aleksandra",
    booktitle = "Proceedings of the 17th Conference on Natural Language Processing (KONVENS 2021)",
    month = "6--9 " # sep,
    year = "2021",
    address = {D{\"u}sseldorf, Germany},
    publisher = "KONVENS 2021 Organizers",
    url = "https://aclanthology.org/2021.konvens-1.22",
    pages = "223--227",
}
```

### Contributions
Provider:
DFKI-SLT
Summary of the original information

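Dataset overview

  • Dataset name: MobIE
  • Language: German
  • License: CC-BY-4.0
  • Multilinguality: monolingual
  • Dataset size: 10K<n<100K
  • Source datasets: original
  • Task categories:
    • Text classification
    • Token classification
  • Task IDs:
    • Named entity recognition
    • Entity linking classification
    • Multi-class classification
  • Tags:
    • Structure prediction
    • Mobility
    • Relation extraction
    • Entity linking
    • Named entity recognition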

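Dataset structure

Data fields (ee configuration)

  • id: string
  • text: string
  • entity_mentions: list with the following fields:
    • text: string
    • start: integer
    • end: integer
    • char_start: integer
    • char_end: integer
    • type: class label covering the entity types (e.g. date, disaster-type, distance)
    • entity_id: string
    • refids: list of key-value pairs (key: string, value: string)
  • event_mentions: list with the following fields:
    • id: string
    • trigger: struct with text, start, end, char_start and char_end fields
    • arguments: list with text, start, end, char_start, char_end, role and type fields
    • event_type: class label covering the event types (e.g. Accident, CanceledRoute)
  • tokens: sequence of strings
  • pos_tags: sequence of strings
  • lemma: sequence of strings
  • ner_tags: sequence of class labels using the BIO tagging scheme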
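The integer `ner_tags` can be turned back into typed token spans by following the BIO scheme. The sketch below is one way to do that, assuming the label order given in the card metadata (`O` at 0, the twenty `B-` labels at 1 to 20, the twenty `I-` labels at 21 to 40). `ID2LABEL` and `bio_to_spans` are hypothetical names introduced for illustration, and the input is the `ner_tags` list of the `ee` train example from the card.

```python
# Entity types in the order used by the card metadata (ids 1..20 after "O").
ENTITY_TYPES = [
    "date", "disaster-type", "distance", "duration", "event-cause",
    "location", "location-city", "location-route", "location-stop",
    "location-street", "money", "number", "organization",
    "organization-company", "org-position", "percent", "person",
    "set", "time", "trigger",
]
# id 0 -> "O", ids 1..20 -> "B-<type>", ids 21..40 -> "I-<type>"
ID2LABEL = ["O"] + [f"B-{t}" for t in ENTITY_TYPES] + [f"I-{t}" for t in ENTITY_TYPES]


def bio_to_spans(tag_ids):
    """Convert BIO tag ids into (start, end, type) spans with an exclusive end."""
    spans = []
    start, current = None, None
    for i, tag_id in enumerate(tag_ids):
        prefix, _, etype = ID2LABEL[tag_id].partition("-")
        if prefix == "I" and etype == current:
            continue  # the open span simply grows by one token
        if current is not None:
            spans.append((start, i, current))  # close the open span
        if prefix == "O":
            start, current = None, None
        else:
            # "B-" opens a span; a stray "I-" is treated as an opening tag too
            start, current = i, etype
    if current is not None:
        spans.append((start, len(tag_ids), current))
    return spans


# `ner_tags` of the `ee` train example shown in the card.
EXAMPLE_TAGS = [0, 14, 0, 0, 5, 0, 0, 0, 0, 8, 0, 7, 0, 0, 5, 0,
                9, 29, 29, 0, 0, 0, 0, 0, 0, 8, 0, 0, 0]

for span in bio_to_spans(EXAMPLE_TAGS):
    print(span)
```

For this example the recovered token spans coincide with the `entities` list of the corresponding `re` example, which is a quick way to cross-check the two configurations.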

Data splits (ee configuration)

  • train:
    • Bytes: 3757740
    • Examples: 2115
  • test:
    • Bytes: 1334445
    • Examples: 623
  • validation:
    • Bytes: 827821
    • Examples: 494
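As a small sanity check on these numbers, the sketch below sums the split sizes and derives each split's share. The example counts are copied from the card metadata for the `ee` and `re` configurations; the variable names are illustrative only.

```python
# Example counts per split, copied from the card metadata.
SPLITS = {
    "ee": {"train": 2115, "test": 623, "validation": 494},
    "re": {"train": 1199, "test": 609, "validation": 228},
}

TOTALS = {config: sum(sizes.values()) for config, sizes in SPLITS.items()}

# Share of each split in percent, rounded to one decimal place.
SHARES = {
    config: {name: round(100 * n / TOTALS[config], 1) for name, n in sizes.items()}
    for config, sizes in SPLITS.items()
}

for config in SPLITS:
    print(config, TOTALS[config], SHARES[config])
```

The `ee` total comes to 3,232 documents, the figure quoted in the dataset summary, while `re` covers a smaller set of 2,036 examples.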

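Dataset creation

  • Annotation creators: expert-generated
  • Language creators: found
  • Source data: original
  • Annotations: 20 coarse- and fine-grained entity types plus entity linking information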
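Entity linking information is stored per mention as a `refids` list of key/value pairs, next to character offsets into `text`. The sketch below shows one way to read both, using three mentions copied from the `ee` train example in the card (the text is truncated after "Filderstadt", which leaves the offsets of these mentions unchanged). Treating the string `"NIL"` as "not linked" is an assumption based on that example, and `linked_ids` and `offsets_match` are hypothetical helper names.

```python
# Start of the `ee` train example text; \U0001F6A8 is the siren emoji, written
# as an escape so that it counts as exactly one character in the source.
TEXT = ("RT @SBahn_Stuttgart: \U0001F6A8Störung\U0001F6A8 Derzeit steht eine "
        "#S2 Richtung Filderstadt")

MENTIONS = [
    {"text": "Störung", "char_start": 22, "char_end": 29, "entity_id": "NIL",
     "refids": [{"key": "spreeDBReferenceId", "value": "NIL"}]},
    {"text": "#S2", "char_start": 50, "char_end": 53, "entity_id": "NIL",
     "refids": [{"key": "spreeDBReferenceId", "value": "17171"}]},
    {"text": "Filderstadt", "char_start": 63, "char_end": 74, "entity_id": "2796535",
     "refids": [{"key": "spreeDBReferenceId", "value": "-2796535"},
                {"key": "osm_id", "value": "2796535"}]},
]


def linked_ids(mention):
    """Map knowledge base name to identifier, skipping "NIL" placeholders (assumption)."""
    return {ref["key"]: ref["value"] for ref in mention["refids"] if ref["value"] != "NIL"}


def offsets_match(text, mention):
    """Check that char_start/char_end (exclusive end) select the mention text."""
    return text[mention["char_start"]:mention["char_end"]] == mention["text"]


for mention in MENTIONS:
    print(mention["text"], linked_ids(mention), offsets_match(TEXT, mention))
```

Only "Filderstadt" carries an `osm_id` here; "#S2" is linked to the internal knowledge base alone, and "Störung" has no link at all.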

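Considerations for using the data

  • Social impact: the dataset targets information extraction tasks and may influence research and applications in related fields
  • Discussion of biases: potential biases in the dataset require further analysis
  • Other known limitations: specific limitations should be assessed for the intended use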

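Additional information

  • Dataset curators: see the GitHub repository
  • Licensing information: CC-BY-4.0
  • Citation information: see the paper and the GitHub repository
  • Contributions: contributions are welcome; see the GitHub repository for details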