five

bond005/NEREL_instruct

收藏
Hugging Face2026-02-27 更新2026-03-29 收录
下载链接:
https://hf-mirror.com/datasets/bond005/NEREL_instruct
下载链接
链接失效反馈
官方服务:
资源简介:
--- language: - ru size_categories: - 10K<n<100K license: mit task_categories: - summarization - text-generation pretty_name: NEREL-instruct tags: - ner - nerel - named-entity-recognition - relation-extraction - entity-normalization - entity-definition - relation-definition configs: - config_name: entity_definition data_files: - split: train path: entity_definition/train-* - split: validation path: entity_definition/validation-* - split: test path: entity_definition/test-* - config_name: entity_normalization data_files: - split: train path: entity_normalization/train-* - split: validation path: entity_normalization/validation-* - split: test path: entity_normalization/test-* - config_name: entity_recognition data_files: - split: train path: entity_recognition/train-* - split: validation path: entity_recognition/validation-* - split: test path: entity_recognition/test-* - config_name: relation_definition data_files: - split: train path: relation_definition/train-* - split: validation path: relation_definition/validation-* - split: test path: relation_definition/test-* - config_name: relation_extraction data_files: - split: train path: relation_extraction/train-* - split: validation path: relation_extraction/validation-* - split: test path: relation_extraction/test-* - config_name: relation_keywords data_files: - split: train path: relation_keywords/train-* - split: validation path: relation_keywords/validation-* - split: test path: relation_keywords/test-* dataset_info: - config_name: entity_definition features: - name: system dtype: string - name: query dtype: string - name: response dtype: string splits: - name: train num_bytes: 93308083 num_examples: 25402 - name: validation num_bytes: 13575050 num_examples: 3699 - name: test num_bytes: 14788092 num_examples: 3483 download_size: 53536545 dataset_size: 121671225 - config_name: entity_normalization features: - name: system dtype: string - name: query dtype: string - name: response dtype: string splits: - name: train num_bytes: 104423413 num_examples: 29395 - name: validation num_bytes: 14906566 num_examples: 4211 - name: test num_bytes: 16715164 num_examples: 4016 download_size: 59087358 dataset_size: 136045143 - config_name: entity_recognition features: - name: system dtype: string - name: query dtype: string - name: response dtype: string splits: - name: train num_bytes: 2594962 num_examples: 603 - name: validation num_bytes: 338443 num_examples: 82 - name: test num_bytes: 375510 num_examples: 80 download_size: 1501938 dataset_size: 3308915 - config_name: relation_definition features: - name: system dtype: string - name: query dtype: string - name: response dtype: string splits: - name: train num_bytes: 155330180 num_examples: 38475 - name: validation num_bytes: 22804661 num_examples: 5589 - name: test num_bytes: 23583595 num_examples: 5154 download_size: 80794042 dataset_size: 201718436 - config_name: relation_extraction features: - name: system dtype: string - name: query dtype: string - name: response dtype: string splits: - name: train num_bytes: 107317166 num_examples: 20592 - name: validation num_bytes: 13727050 num_examples: 2712 - name: test num_bytes: 17931184 num_examples: 2912 download_size: 12074921 dataset_size: 138975400 - config_name: relation_keywords features: - name: system dtype: string - name: query dtype: string - name: response dtype: string splits: - name: train num_bytes: 98052664 num_examples: 23181 - name: validation num_bytes: 14335829 num_examples: 3358 - name: test num_bytes: 14880037 num_examples: 3124 download_size: 50428522 dataset_size: 127268530 --- # NEREL-instruct ## Dataset Description - **Repository:** [HuggingFace Datasets](https://huggingface.co/datasets/bond005/NEREL_instruct) - **Original NEREL Paper:** [NEREL: A Russian Dataset with Nested Named Entities, Relations and Events](https://acl-bg.org/proceedings/2021/RANLP%202021/pdf/2021.ranlp-1.100.pdf) - **Original NEREL GitHub:** [https://github.com/nerel-ds/NEREL](https://github.com/nerel-ds/NEREL) - **Language:** Russian - **License:** MIT NEREL-instruct is an instruction-based dataset derived from the [NEREL](https://github.com/nerel-ds/NEREL) corpus — a large Russian dataset annotated with nested named entities, relations, and events. The original NEREL annotations (texts + manual entity/relation markup) were converted into a structured instruction-following format using **Qwen2.5-32B-Instruct**. The result is a semi‑synthetic dataset designed for fine‑tuning large language models (LLMs) on a variety of information extraction tasks. The dataset comprises **six distinct tasks** (subsets), each formatted as conversational prompts (`system`, `user`, `assistant`) ready for supervised fine‑tuning. All texts are in Russian, and every example originates from a real NEREL document, ensuring high factual coverage and linguistic diversity. ## Tasks (Configurations) | Config name | Description | # Train examples | |--------------------------|------------------------------------------------------------------------------------------------------------------------------------------|------------------| | `entity_recognition` | Given a text, generate a list of all normalized named entities (one per line). | 603 | | `entity_normalization` | Given a text and a raw (non‑normalized) entity mention, produce its normalized form. | 29 395 | | `entity_definition` | Given a text and a normalized entity, explain what this entity means in the context of the text. | 25 402 | | `relation_extraction` | Given a text, the full list of normalized entities, and a target entity, list all normalized entities that are directly related to it. | 20 592 | | `relation_definition` | Given a text and two normalized entities, describe the relation that holds between them (or state that no relation exists). | 38 475 | | `relation_keywords` | Given a text, two entities, and a textual description of their relation, extract one or more keywords (comma‑separated) that capture the relation type. | 23 181 | Each configuration follows the same three‑column structure: - `system` – the system prompt defining the task. - `query` – the user message containing the input (text, entities, etc.). - `response` – the expected assistant output (gold answer). ## Dataset Statistics The dataset is split by **document** (no overlapping texts between train, validation, and test). Below are the sizes per configuration: | Config | Train examples | Validation examples | Test examples | |----------------------|----------------|---------------------|---------------| | entity_recognition | 603 | 82 | 80 | | entity_normalization | 29 395 | 4 211 | 4 016 | | entity_definition | 25 402 | 3 699 | 3 483 | | relation_extraction | 20 592 | 2 712 | 2 912 | | relation_definition | 38 475 | 5 589 | 5 154 | | relation_keywords | 23 181 | 3 358 | 3 124 | ## Data Construction 1. **Source**: The original NEREL dataset contains >900 Russian Wikinews articles with manual annotation of 29 entity types, 49 relation types, and event mentions. 2. **Transformation**: For each task, the original annotations were converted into natural language instructions. 3. **Synthetic enhancement**: A large language model (Qwen2.5-32B-Instruct) was used to generate the final instruction‑response pairs, guided by the original gold annotations. This process ensures that the responses are faithful to the ground truth while providing diverse, fluent formulations. All examples are therefore *semi‑synthetic*: the underlying facts come from human annotations, but the phrasing of instructions and answers is model‑generated. ## Usage Example Loading a specific configuration with 🤗 Datasets: ```python from datasets import load_dataset # Load the entity recognition subset dataset = load_dataset("your-username/NEREL-instruct", "entity_recognition") # Example from the training split example = dataset["train"][0] print(example["system"]) print(example["query"]) print(example["response"]) ``` The `query` and `response` fields are already formatted as plain text; they can be directly used with standard chat templates. ## Intended Uses - Fine‑tuning LLMs (e.g., LLaMA, Mistral, Qwen, Meno) for Russian information extraction. - Evaluating model performance on nested entity recognition, relation extraction, and related sub‑tasks. - Studying instruction‑following capabilities in the context of structured knowledge extraction. ## Considerations and Limitations - **Language**: All content is in Russian. The dataset is not suitable for cross‑lingual transfer without adaptation. - **Domain**: Texts are news articles from Wikinews; performance may vary on other genres (e.g., scientific papers, social media). - **Synthetic nature**: Although grounded in human annotations, the instructions are model‑generated and may contain occasional stylistic biases or repetitions. ## Citation If you use NEREL-instruct, please cite both the original NEREL paper and this dataset: ```bibtex @inproceedings{loukachevitch2021nerel, title={{NEREL: A Russian} Dataset with Nested Named Entities, Relations and Events}, author={Loukachevitch, Natalia and Artemova, Ekaterina and Batura, Tatiana and Braslavski, Pavel and Denisov, Ilia and Ivanov, Vladimir and Manandhar, Suresh and Pugachev, Alexander and Tutubalina, Elena}, booktitle={Proceedings of RANLP}, pages={876--885}, year={2021} } @misc{nerel-instruct, author = {Bondarenko, Ivan}, title = {NEREL-instruct: An Instruction-based Dataset for Russian Information Extraction}, year = {2026}, publisher = {Hugging Face}, journal = {Hugging Face Datasets}, howpublished = {\url{https://huggingface.co/datasets/bond005/NEREL_instruct}} } ``` ## Dataset Card Authors Ivan Bondarenko ([@bond005](https://huggingface.co/bond005)), Novosibirsk State University ## Dataset Card Contact For questions, feedback, or collaboration inquiries, please open an issue on the [dataset repository](https://huggingface.co/datasets/bond005/NEREL_instruct) or contact Ivan Bondarenko via Hugging Face. ## License The dataset is released under the **MIT License**. The original NEREL data is available under Creative Commons BY 2.5 (as per Wikinews).
提供机构:
bond005
5,000+
优质数据集
54 个
任务类型
进入经典数据集
二维码
社区交流群

面向社区/商业的数据集话题

二维码
科研交流群

面向高校/科研机构的开源数据集话题

数据驱动未来

携手共赢发展

商业合作