
Netsoft/oai-instruct

Source: Hugging Face. Updated: 2024-12-16. Indexed: 2025-04-26.
Download link:
https://hf-mirror.com/datasets/Netsoft/oai-instruct
Resource overview:
---
task_categories:
- text-generation
language:
- en
size_categories:
- 10K<n<100K
---

## OAI Instruct

### Overview

OAI Instruct is a dataset for fine-tuning large language models (LLMs), with data structured to support natural language processing tasks. It originates from the "5G INSTRUCT Forge" pipeline, which processes 3GPP specifications to generate training and testing data. The primary aim is to enable LLMs to comprehend and operate on the intricacies of these technical specifications.

### Dataset Composition

- **Training Set**: Contains 87,719 entries, with columns including:
  - `instruction`: Descriptive task guidance
  - `task_type`: Categorization of the instruction
  - `input`: Inputs relevant to the tasks
  - `completion`: Expected model outputs
- **Test Set**: Comprises 9,557 entries, featuring:
  - `prompt`: Basis for generating completions
  - `completion`: Standard outputs for performance assessment
  - `completion2`: Alternative outputs for comparative evaluation

### Evaluation Metrics

Performance on the test set is evaluated with the following metrics:

- **BERTScore**: Measures the semantic similarity between generated texts and reference texts.
- **SemScore**: Provides a score based on semantic accuracy relative to the task requirements.

### Total Dataset Size

- **Approximate Size**: 100 MB

### Usage

The OAI Instruct dataset is a resource for developing LLMs capable of understanding and interacting with complex technical standards. It serves as a proof of concept for the "5G INSTRUCT Forge" pipeline.
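As a rough illustration of how the training columns listed under Dataset Composition might be combined for fine-tuning, the sketch below joins `instruction` and `input` into a single prompt and keeps `completion` as the target. The function name `build_training_text`, the Alpaca-style section markers, and the sample record are all assumptions made for this example; the dataset card does not prescribe a prompt template, and the record shown is a placeholder, not an actual row.

```python
from typing import Dict


def build_training_text(record: Dict[str, str]) -> Dict[str, str]:
    """Turn one training record into a prompt/completion pair.

    The template below is an assumed, Alpaca-style layout; it is not
    defined by the dataset card.
    """
    instruction = record["instruction"].strip()
    # `input` may be empty or missing for some tasks, so treat it as optional.
    task_input = (record.get("input") or "").strip()

    if task_input:
        prompt = (
            f"### Instruction:\n{instruction}\n\n"
            f"### Input:\n{task_input}\n\n"
            "### Response:\n"
        )
    else:
        prompt = f"### Instruction:\n{instruction}\n\n### Response:\n"

    return {"prompt": prompt, "completion": record["completion"].strip()}


# Hypothetical placeholder record using the documented column names.
example = {
    "instruction": "Summarize the following specification excerpt.",
    "task_type": "summarization",
    "input": "<excerpt text>",
    "completion": "<expected summary>",
}

pair = build_training_text(example)
print(pair["prompt"] + pair["completion"])
```

With the Hugging Face `datasets` library, a function of this shape could be applied to every row via `dataset['train'].map(build_training_text)`, which would add the `prompt` column and overwrite `completion` with its stripped form.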
### Repository

For more detail on the dataset generation and the overall pipeline, visit the GitLab repository: [5G INSTRUCT Forge Repository](https://gitlab.eurecom.fr/Azzedde1/5g_instruct_forge)

### Instructions for Import and Use (Using Hugging Face)

To use OAI Instruct with Hugging Face, follow these instructions:

- **Set Up Environment**: Ensure you have Python and the `datasets` library from Hugging Face installed. If not, you can install it using pip:

  ```bash
  pip install datasets
  ```

- **Import the Dataset**: Use the `datasets` library to load the dataset directly from Hugging Face:

  ```python
  from datasets import load_dataset

  # Load the dataset
  dataset = load_dataset('Netsoft/oai-instruct')
  ```

- **Explore the Dataset**: Begin by exploring the dataset to understand its structure and contents:

  ```python
  # Print the dataset structure
  print(dataset)

  # Access the training set
  train_set = dataset['train']
  print(train_set.column_names)
  print(train_set[0])  # Display the first entry

  # Access the test set
  test_set = dataset['test']
  print(test_set.column_names)
  print(test_set[0])  # Display the first entry
  ```

### Citation

When citing our dataset in your research, please use the following citation:

**BibTeX**:

```bibtex
@article{said20245g,
  title={5G INSTRUCT Forge: An Advanced Data Engineering Pipeline for Making LLMs Learn 5G},
  author={Said, Azzedine Idir Ait and Mekrache, Abdelkader and Boutiba, Karim and Ramantas, Kostas and Ksentini, Adlen and Rahmani, Moufida},
  journal={IEEE Transactions on Cognitive Communications and Networking},
  year={2024},
  publisher={IEEE}
}
```
Provided by:
Netsoft