DFKI-SLT/smartdata-coprpus
Dataset Overview
Dataset Description
Dataset Name
SmartData Corpus
Dataset Language
German
Dataset Tasks
- Named Entity Recognition (NER)
- Relation Extraction (RE)
- Event Extraction (EE)
Dataset Tags
- finance
- relation extraction
- event extraction
- traffic
- industry
Dataset Size
- 1K < n < 10K
Dataset Structure
Configurations
- ee: event extraction
- ner: named entity recognition
- re: relation extraction
Data Fields
ee
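- id: example identifier, string.
- text: example text, string.
- entity_mentions: list of entity mentions, struct.
  - id: entity identifier, string.
  - text: entity text, string.
  - start: start token offset, int64.
  - end: end token offset, int64.
  - char_start: start character offset, int64.
  - char_end: end character offset, int64.
  - type: entity type, string.
- event_mentions: list of event mentions, struct.
  - id: event identifier, string.
  - trigger: event trigger, struct.
    - text: trigger text, string.
    - start: start token offset, int64.
    - end: end token offset, int64.
    - char_start: start character offset, int64.
    - char_end: end character offset, int64.
  - arguments: list of arguments, struct.
    - text: argument text, string.
    - start: start token offset, int64.
    - end: end token offset, int64.
    - char_start: start character offset, int64.
    - char_end: end character offset, int64.
    - role: argument role, string.
    - type: argument entity type, string.
  - event_type: event type, string.
- tokens: list of tokens, sequence of strings.
- pos_tags: list of part-of-speech tags, sequence of strings.
- lemma: list of lemmatized tokens, sequence of strings.
- ner_tags: list of NER tags, sequence of strings.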
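The nested ee layout is easier to follow on a concrete record. The sketch below uses an invented record (the sentence, identifiers, type labels, role name, and event type are all hypothetical, not taken from the corpus) and assumes that `end` and `char_end` are exclusive offsets, which the card does not state.

```python
# Hypothetical record shaped like the ee fields; every value is invented.
record = {
    "id": "example-0",
    "text": "Stau auf der A9 bei Nürnberg",
    "tokens": ["Stau", "auf", "der", "A9", "bei", "Nürnberg"],
    "entity_mentions": [
        {"id": "e1", "text": "A9", "start": 3, "end": 4,
         "char_start": 13, "char_end": 15, "type": "location_street"},
        {"id": "e2", "text": "Nürnberg", "start": 5, "end": 6,
         "char_start": 20, "char_end": 28, "type": "location_city"},
    ],
    "event_mentions": [
        {"id": "ev1",
         "trigger": {"text": "Stau", "start": 0, "end": 1,
                     "char_start": 0, "char_end": 4},
         "arguments": [
             {"text": "A9", "start": 3, "end": 4,
              "char_start": 13, "char_end": 15,
              "role": "location", "type": "location_street"},
         ],
         "event_type": "TrafficJam"},
    ],
}


def span_tokens(tokens, span):
    """Return the tokens covered by a span (assumes `end` is exclusive)."""
    return tokens[span["start"]:span["end"]]


def char_span_matches(text, span):
    """Check that the character offsets reproduce the span text
    (assumes `char_end` is exclusive)."""
    return text[span["char_start"]:span["char_end"]] == span["text"]


def summarize_events(rec):
    """Flatten event mentions into (event_type, trigger, role, argument) rows."""
    rows = []
    for event in rec["event_mentions"]:
        trigger_text = event["trigger"]["text"]
        for arg in event["arguments"]:
            rows.append((event["event_type"], trigger_text,
                         arg["role"], arg["text"]))
    return rows
```

Running `char_span_matches` over every mention of a real record is a quick way to confirm or reject the exclusive-offset assumption before relying on it.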
ner
- id: example identifier, string.
- tokens: list of tokens, sequence of strings.
- ner_tags: list of NER tags, sequence of strings.
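Because `ner_tags` holds one tag per token, entity spans have to be rebuilt from the tag sequence. The following is a minimal sketch that assumes the tags follow a BIO scheme (`B-`, `I-`, `O` prefixes); the card does not name the scheme, and the `LOC` label is a placeholder rather than a label from the corpus.

```python
def bio_to_spans(tags):
    """Group BIO tags into (start, end, label) spans with an exclusive end.

    Assumption: tags look like "B-X", "I-X", or "O". An "I-X" tag that does
    not continue a span of the same label is treated as opening a new span.
    """
    spans = []
    start, label = None, None
    for i, tag in enumerate(tags):
        if tag.startswith("I-") and label == tag[2:]:
            continue  # still inside the current span
        if label is not None:
            spans.append((start, i, label))  # close the span that just ended
            start, label = None, None
        if tag.startswith("B-") or tag.startswith("I-"):
            start, label = i, tag[2:]  # open a new span
    if label is not None:
        spans.append((start, len(tags), label))  # span reaching the last token
    return spans


# Hypothetical tag sequence for a six-token sentence.
example_tags = ["O", "O", "O", "B-LOC", "O", "B-LOC"]
example_spans = bio_to_spans(example_tags)
```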
re
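- id: example identifier, string.
- tokens: list of tokens, sequence of strings.
- entities: list of entity token spans, sequence of int64.
- entity_roles: list of entity roles, sequence of strings.
- event_type: event type, string.
- entity_ids: list of entity identifiers, sequence of strings.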
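In the re configuration, `entities`, `entity_roles`, and `entity_ids` read as parallel lists. The sketch below pairs them up under two assumptions that the card does not confirm: each entry of `entities` is a `[start, end]` token pair with an exclusive end, and the three lists are aligned by index. The example record, role names, and event type are invented.

```python
# Hypothetical re example; values are invented for illustration.
example = {
    "id": "re-0",
    "tokens": ["Stau", "auf", "der", "A9", "bei", "Nürnberg"],
    "entities": [[0, 1], [3, 4], [5, 6]],
    "entity_roles": ["trigger", "location", "no_arg"],
    "event_type": "TrafficJam",
    "entity_ids": ["e0", "e1", "e2"],
}


def role_fillers(ex):
    """Join each entity span with its role and identifier.

    Assumes index-aligned lists and [start, end) token spans.
    """
    n = len(ex["entities"])
    if not (n == len(ex["entity_roles"]) == len(ex["entity_ids"])):
        raise ValueError("entities, entity_roles and entity_ids differ in length")
    fillers = []
    for (start, end), role, ent_id in zip(
        ex["entities"], ex["entity_roles"], ex["entity_ids"]
    ):
        fillers.append({
            "id": ent_id,
            "role": role,
            "text": " ".join(ex["tokens"][start:end]),
        })
    return fillers


fillers = role_fillers(example)
```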
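Data Splits
ee
- train: 6532239 bytes, 1861 examples
- validation: 792697 bytes, 228 examples
- test: 802322 bytes, 230 examples
ner
- train: 2062754 bytes, 1861 examples
- validation: 250635 bytes, 228 examples
- test: 255164 bytes, 230 examples
re
- train: 2116771 bytes, 1007 examples
- validation: 265248 bytes, 129 examples
- test: 238094 bytes, 128 examples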
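The split sizes imply roughly an 80/10/10 division in every configuration. The short sketch below recomputes totals and percentage shares from the example counts listed on this card; the helper name `split_shares` is illustrative only.

```python
# Example counts per configuration and split, copied from the card.
SPLIT_COUNTS = {
    "ee": {"train": 1861, "validation": 228, "test": 230},
    "ner": {"train": 1861, "validation": 228, "test": 230},
    "re": {"train": 1007, "validation": 129, "test": 128},
}


def split_shares(counts):
    """Return each split's share of the configuration total, in percent."""
    total = sum(counts.values())
    return {name: 100 * n / total for name, n in counts.items()}


for config, counts in SPLIT_COUNTS.items():
    shares = split_shares(counts)
    summary = ", ".join(f"{name} {share:.1f}%" for name, share in shares.items())
    print(f"{config}: {sum(counts.values())} examples ({summary})")
```

The ee and ner configurations share the same 2319 examples, while re contains 1264, and both totals fall inside the stated 1K < n < 10K size range.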
Data File Paths
ee
- train: ee/train-*
- validation: ee/validation-*
- test: ee/test-*
ner
- train: ner/train-*
- validation: ner/validation-*
- test: ner/test-*
re
- train: re/train-*
- validation: re/validation-*
- test: re/test-*
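The path patterns follow one rule, `<config>/<split>-*`, so they can be generated rather than typed out. The sketch below builds the split-to-pattern mapping and uses the standard-library `fnmatch` to assign a file path to its configuration and split. The helper names and the shard file name in the usage line are hypothetical; the card does not state the actual file names or format.

```python
from fnmatch import fnmatch

CONFIGS = ("ee", "ner", "re")
SPLITS = ("train", "validation", "test")


def data_files(config):
    """Return the split -> glob pattern mapping listed on the card."""
    if config not in CONFIGS:
        raise ValueError(f"unknown configuration: {config!r}")
    return {split: f"{config}/{split}-*" for split in SPLITS}


def locate(path):
    """Return (config, split) for a repository-relative path, or None."""
    for config in CONFIGS:
        for split, pattern in data_files(config).items():
            if fnmatch(path, pattern):
                return config, split
    return None


# Hypothetical shard name, used only to exercise the patterns.
located = locate("ee/train-00000-of-00001.parquet")
```

A mapping of this shape is what the `data_files` argument of the Hugging Face `datasets.load_dataset` function accepts, should the files need to be loaded by explicit pattern instead of by configuration name.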
License
CC BY-SA 4.0
Citation Information
BibTeX
@InProceedings{SCHIERSCH18.85,
  author    = {Martin Schiersch and Veselina Mironova and Maximilian Schmitt and Philippe Thomas and Aleksandra Gabryszak and Leonhard Hennig},
  title     = "{A German Corpus for Fine-Grained Named Entity Recognition and Relation Extraction of Traffic and Industry Events}",
  booktitle = {Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)},
  year      = {2018},
  month     = {May 7-12, 2018},
  address   = {Miyazaki, Japan},
  editor    = {Nicoletta Calzolari (Conference chair) and Khalid Choukri and Christopher Cieri and Thierry Declerck and Sara Goggi and Koiti Hasida and Hitoshi Isahara and Bente Maegaard and Joseph Mariani and Hélène Mazo and Asuncion Moreno and Jan Odijk and Stelios Piperidis and Takenobu Tokunaga},
  publisher = {European Language Resources Association (ELRA)},
  isbn      = {979-10-95546-00-9},
  language  = {english}
}
APA
- Schiersch, M., Mironova, V., Schmitt, M., Thomas, P., Gabryszak, A., & Hennig, L. (2018). A German Corpus for Fine-Grained Named Entity Recognition and Relation Extraction of Traffic and Industry Events. In N. Calzolari (Conference chair), K. Choukri, C. Cieri, T. Declerck, S. Goggi, K. Hasida, H. Isahara, B. Maegaard, J. Mariani, H. Mazo, A. Moreno, J. Odijk, S. Piperidis, & T. Tokunaga (Eds.), Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018). Miyazaki, Japan: European Language Resources Association (ELRA). ISBN: 979-10-95546-00-9.



