sublime-security/babbelphish
Hugging Face · Updated 2023-08-04 · Indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/sublime-security/babbelphish
Resource description:
---
license: mit
task_categories:
- translation
pretty_name: 'BabbelPhish: Natural Language to Message Query Language'
size_categories:
- 1K<n<10K
---
---
# BabbelPhish
BabbelPhish is a dataset based on the [Sublime Security Message Query Language (MQL)](https://docs.sublimesecurity.com/docs/message-query-language), which is used for email security detection engineering. The dataset was created specifically for the BabbelPhish project, which focuses on using large language models to support the work of detection engineers.
This dataset comprises around 3,000 examples drawn from various sources. We've utilized the following:
- [Sublime Security Documentation](https://docs.sublimesecurity.com/docs/message-query-language)
- [Message Data Model (Schema)](https://docs.sublimesecurity.com/docs/message-query-language)
- [Sublime Rules Repo](https://github.com/sublime-security/sublime-rules/)
- [Sublime Community Slack](https://join.slack.com/t/sublimecommunity/shared_invite/zt-1hhwosroy-LvflKNVE3HEtgIcbHdB1sw)
We also used human-in-the-loop annotation to generate the prompts in this dataset. Each example pairs a natural language description with an MQL query.
The BabbelPhish dataset does not have a natural online source like Stack Overflow. We therefore made a significant effort to build a unique dataset that closely mirrors the real-world challenges detection engineers face.
We hope this data provides a detailed view of translating natural language prompts into MQL, serving as a valuable resource for similar tasks and research.
## Dataset description
The BabbelPhish dataset contains the following fields:
- *id*: A unique identifier for each record in the dataset.
- *prompt*: A natural language description or question that outlines the intended task or the specific information to be queried. This forms the input for our language model.
- *completion*: An MQL code snippet corresponding to the prompt. This is the target output generated by the language model.
- *prompt_size*: The character length of the prompt.
- *completion_size*: The character length of the MQL completion.
- *min_line_size*: The minimum line size in the MQL completion.
- *max_line_size*: The maximum line size in the MQL completion.
- *mean_line_size*: The average line size in the MQL completion.
- *ratio*: The record's computed character/token ratio, generated using the tokenizer.
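The size fields can be recomputed from a record's text. The sketch below is a minimal illustration, assuming that sizes are character counts and that lines are split on newline characters; the dataset card does not state either detail. The `ratio` field is left out because the tokenizer is not named, and the helper name `size_fields` is hypothetical.

```python
from statistics import mean


def size_fields(prompt: str, completion: str) -> dict:
    """Recompute the size columns for one record (assumed definitions)."""
    # Assumption: line sizes are character counts per newline-separated line.
    line_lengths = [len(line) for line in completion.splitlines()] or [0]
    return {
        "prompt_size": len(prompt),
        "completion_size": len(completion),
        "min_line_size": min(line_lengths),
        "max_line_size": max(line_lengths),
        "mean_line_size": mean(line_lengths),
    }


# Placeholder strings for illustration only, not rows from the dataset.
example = size_fields("find inbound mail", "type.inbound\nand true")
print(example)
```

Comparing these recomputed values against the stored columns is a quick way to check whether the assumed definitions match how the dataset was actually built.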
## Usage
```python
from datasets import load_dataset

# Download the dataset from the Hugging Face Hub (requires network access).
dataset = load_dataset("sublime-security/babbelphish")
print(dataset)
# Output:
# DatasetDict({
#     train: Dataset({
#         features: ['id', 'prompt', 'completion', 'prompt_size', 'completion_size', 'min_line_size', 'max_line_size', 'mean_line_size', 'ratio'],
#         num_rows: 2857
#     })
#     test: Dataset({
#         features: ['id', 'prompt', 'completion', 'prompt_size', 'completion_size', 'min_line_size', 'max_line_size', 'mean_line_size', 'ratio'],
#         num_rows: 50
#     })
# })
```
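Once loaded, each record can be turned into a single training string for a natural-language-to-MQL model. The following is a rough sketch only: the delimiter and the function name `to_training_text` are assumptions made for illustration, not the format used by the BabbelPhish project.

```python
def to_training_text(record: dict, sep: str = "\n### MQL\n") -> str:
    """Join a record's prompt and completion into one training string."""
    # `sep` is an arbitrary, hypothetical delimiter between input and target.
    return record["prompt"].strip() + sep + record["completion"].strip()


# Stand-in record with the same keys as the dataset; the values are placeholders.
record = {"prompt": " describe the query ", "completion": "<MQL query>\n"}
text = to_training_text(record)
print(text)
```

With the real dataset, the same function could be applied to every row, for example with `dataset["train"].map(lambda r: {"text": to_training_text(r)})`, which adds a `text` column alongside the existing fields.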
## Additional resources
- [Sublime Security Homepage](https://www.sublime.security)
- [BabbelPhish GitHub Repo](https://github.com/bfilar/babbelphish)
Provider:
sublime-security
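Summary of original information
Dataset overview
Dataset name
- Name: BabbelPhish
- Alias: Natural Language to Message Query Language
Dataset size
- Size: about 3,000 examples
Dataset purpose
- Purpose: email security detection engineering, in particular helping detection engineers make use of large language models in their work.
Dataset content
- Description: natural language descriptions paired with the corresponding MQL query code.
- Data sources:
  - Sublime Security Documentation
  - Message Data Model (Schema)
  - Sublime Rules Repo
  - Sublime Community Slack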
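Dataset structure
- Fields:
  - id: unique identifier
  - prompt: natural language description or question
  - completion: MQL code snippet
  - prompt_size: character length of the prompt
  - completion_size: character length of the MQL completion
  - min_line_size: minimum line length in the MQL completion
  - max_line_size: maximum line length in the MQL completion
  - mean_line_size: mean line length in the MQL completion
  - ratio: computed character/token ratio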
Dataset license
- License: MIT
Dataset splits
- Splits: a training set and a test set
  - Training set: 2,857 records
  - Test set: 50 records
Dataset usage example
```python
from datasets import load_dataset

dataset = load_dataset("sublime-security/babbelphish")
dataset
```



