sublime-security/babbelphish

Name: sublime-security/babbelphish
Creator: sublime-security
Published: 2023-08-04 17:19:51
License: MIT

Source: Hugging Face. Updated 2023-08-04. Indexed 2024-03-04.

Download link:

https://hf-mirror.com/datasets/sublime-security/babbelphish

Resource description:

---
license: mit
task_categories:
- translation
pretty_name: 'BabbelPhish: Natural Language to Message Query Language'
size_categories:
- 1K<n<10K
---

---

# BabbelPhish

BabbelPhish is a dataset based on the [Sublime Security Message Query Language (MQL)](https://docs.sublimesecurity.com/docs/message-query-language), which is used for email security detection engineering. The dataset was created for the BabbelPhish project, which focuses on leveraging large language models to facilitate the work of detection engineers.

The dataset comprises around 3,000 examples drawn from the following sources:

- [Sublime Security Documentation](https://docs.sublimesecurity.com/docs/message-query-language)
- [Message Data Model (Schema)](https://docs.sublimesecurity.com/docs/message-query-language)
- [Sublime Rules Repo](https://github.com/sublime-security/sublime-rules/)
- [Sublime Community Slack](https://join.slack.com/t/sublimecommunity/shared_invite/zt-1hhwosroy-LvflKNVE3HEtgIcbHdB1sw)

We also used human-in-the-loop annotation to generate the prompts in this dataset. Each example pairs a natural language description with an MQL query.

BabbelPhish has no natural online source such as Stack Overflow, so we made a significant effort to generate a unique dataset that closely mirrors the real-world challenges detection engineers face. We hope this data provides a detailed view of translating natural language prompts into MQL and serves as a valuable resource for similar tasks and research.

## Dataset description

The BabbelPhish dataset contains the following fields:

- *id*: A unique identifier for each record in the dataset.
- *prompt*: A natural language description or question that outlines the intended task or the specific information to be queried. This forms the input for our language model.
- *completion*: An MQL code snippet corresponding to the prompt. This is the target output generated by the language model.
- *prompt_size*: The character length of the prompt.
- *completion_size*: The character length of the MQL completion.
- *min_line_size*: The minimum line size in the MQL completion.
- *max_line_size*: The maximum line size in the MQL completion.
- *mean_line_size*: The average line size in the MQL completion.
- *ratio*: The record's character/token ratio, computed with the tokenizer.

## Usage

```python
from datasets import load_dataset

# Download the dataset from the Hugging Face Hub and show its splits.
dataset = load_dataset("sublime-security/babbelphish")
print(dataset)

# Output, as shown on the dataset card:
# DatasetDict({
#     train: Dataset({
#         features: ['id', 'prompt', 'completion', 'prompt_size', 'completion_size', 'min_line_size', 'max_line_size', 'mean_line_size', 'ratio'],
#         num_rows: 2857
#     })
#     test: Dataset({
#         features: ['id', 'prompt', 'completion', 'prompt_size', 'completion_size', 'min_line_size', 'max_line_size', 'mean_line_size', 'ratio'],
#         num_rows: 50
#     })
# })
```

## Additional resources

- [Sublime Security Homepage](https://www.sublime.security)
- [BabbelPhish Github Repo](https://github.com/bfilar/babbelphish)
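Because each record pairs a natural language prompt with an MQL completion, a record can be reshaped into a source/target pair for a text-to-text model. The following is only a minimal sketch: the instruction prefix, the `to_text_pair` helper, and the sample record are hypothetical and are not taken from the dataset or the BabbelPhish project.

```python
from typing import Dict

# Hypothetical instruction prefix; the dataset card does not prescribe one.
INSTRUCTION = "Translate the following description into MQL:"


def to_text_pair(record: Dict[str, str]) -> Dict[str, str]:
    """Turn one BabbelPhish-style record into a source/target pair."""
    source = f"{INSTRUCTION}\n{record['prompt'].strip()}"
    target = record["completion"].strip()
    return {"source": source, "target": target}


# Illustrative record written for this sketch, not copied from the dataset.
example = {
    "id": "0",
    "prompt": "Find inbound messages with an HTML attachment. ",
    "completion": 'type.inbound and any(attachments, .file_extension == "html")\n',
}

pair = to_text_pair(example)
print(pair["source"])
print(pair["target"])
```

With the real dataset, the same helper could be applied to every record through `dataset["train"].map(to_text_pair)`, assuming the `prompt` and `completion` columns listed in the dataset description.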

Provided by:

sublime-security

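Summary of original information

Dataset overview

Dataset name

Name: BabbelPhish
Alias: Natural Language to Message Query Language

Dataset size

Size: about 3,000 examples

Dataset purpose

Purpose: email security detection engineering, in particular helping detection engineers work with large language models.

Dataset content

Content: natural language descriptions paired with the corresponding MQL queries.
Data sources:
- Sublime Security Documentation
- Message Data Model (Schema)
- Sublime Rules Repo
- Sublime Community Slack

Dataset structure

Fields:
- id: unique identifier
- prompt: natural language description or question
- completion: MQL code snippet
- prompt_size: character length of the prompt
- completion_size: character length of the MQL completion
- min_line_size: minimum line length in the MQL completion
- max_line_size: maximum line length in the MQL completion
- mean_line_size: average line length in the MQL completion
- ratio: computed character/token ratio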
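
The size fields above are simple character counts, so they can be recomputed from a prompt and completion. The sketch below is one plausible reading of those definitions, not the authors' code: it assumes lines are split on newlines and lengths are counted in characters. The `ratio` field is left out because the source does not name the tokenizer. The `size_fields` helper and the sample strings are hypothetical.

```python
from statistics import mean
from typing import Dict, Union


def size_fields(prompt: str, completion: str) -> Dict[str, Union[int, float]]:
    """Recompute the size columns for one prompt/completion pair."""
    # Assumption: a "line" is whatever str.splitlines() yields; an empty
    # completion is treated as a single empty line.
    line_lengths = [len(line) for line in (completion.splitlines() or [""])]
    return {
        "prompt_size": len(prompt),
        "completion_size": len(completion),
        "min_line_size": min(line_lengths),
        "max_line_size": max(line_lengths),
        "mean_line_size": mean(line_lengths),
    }


# Illustrative strings written for this sketch, not copied from the dataset.
sample_prompt = "Find inbound messages from gmail.com"
sample_completion = 'type.inbound\nand sender.email.domain.domain == "gmail.com"'

fields = size_fields(sample_prompt, sample_completion)
for name, value in fields.items():
    print(f"{name}: {value}")
```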

Dataset license

License: MIT

Dataset splits

Splits: a training set and a test set
- Training set: 2,857 records
- Test set: 50 records
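
As a quick sanity check, the split sizes can be added up and compared with the "about 3,000 examples" figure and the `1K<n<10K` size category. This sketch uses only the row counts stated above; the variable names are illustrative.

```python
# Row counts as reported for the two splits.
TRAIN_ROWS = 2857
TEST_ROWS = 50

total_rows = TRAIN_ROWS + TEST_ROWS
test_fraction = TEST_ROWS / total_rows

# The total should fall inside the declared size category (1K < n < 10K).
in_size_category = 1_000 < total_rows < 10_000

print(f"total rows: {total_rows}")
print(f"test share: {test_fraction:.2%}")
print(f"within 1K<n<10K: {in_size_category}")
```

The test split is small relative to the training split, so evaluation metrics computed on it alone may be noisy.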

Dataset usage example

```python
from datasets import load_dataset

dataset = load_dataset("sublime-security/babbelphish")
print(dataset)
```
