five

squeezebits/dynamic_sonnet_llama2

收藏
Hugging Face2024-08-14 更新2025-04-12 收录
下载链接:
https://hf-mirror.com/datasets/squeezebits/dynamic_sonnet_llama2
下载链接
链接失效反馈
官方服务:
资源简介:
--- task_categories: - question-answering - text-generation language: - en size_categories: - 1K<n<10K configs: - config_name: default data_files: - split: 1k path: "dynamic_sonnet_llama_2_prefix_256_max_1024_1024_sampled.parquet" - split: 2k path: "dynamic_sonnet_llama_2_prefix_512_max_2048_1024_sampled.parquet" - split: 4k path: "dynamic_sonnet_llama_2_prefix_1024_max_4096_1024_sampled.parquet" - split: 8k path: "dynamic_sonnet_llama_2_prefix_2048_max_8192_1024_sampled.parquet" --- # Dynamic Sonnet - Llama2 *Curated dataset for benchmarking LLM serving systems* ![plot](distribution.png) In real-world service scenarios, each request comes with varying input token lengths. Some requests generate only a few tokens, while others produce a significant number. Traditional fixed-length benchmarks fail to capture this variability, making it difficult to accurately assess real-world throughput performance. This dynamic nature of input token lengths is crucial as it directly affects key features of LLM serving systems, such as continuous batching, which are essential for optimal performance. To address this challenge, we introduce ***Dynamic Sonnet***—a dataset designed specifically for benchmarking LLM serving systems under realistic conditions. ***Dynamic Sonnet*** comprises four subsets: 1k, 2k, 4k, and 8k. Each subset is carefully curated to have an average token length of 512, 1k, 3k, and 7k, respectively. This variability in token length within the dataset allows for a more accurate and comprehensive evaluation of LLM serving systems in environments that mirror real-world usage. Furthermore, in real-world scenarios, requests often share common prefixes. Advanced systems can leverage this by caching these prefixes to boost performance. ***Dynamic Sonnet*** simulates this behavior by incorporating a common prefix that constitutes approximately 25% of the maximum length in each subset (N/4 for an Nk subset). This design allows for more realistic benchmarking of systems that optimize for such efficiencies. ## Details The Dynamic Sonnet dataset consists of five columns: `id`, `system_prompt`, `user_prompt`, `formatted_input` and `tok_inputs` * `id`: A unique identifier (index) for each prompt * `system_prompt`: A common prefix that instructs the agent to select specific lines from the following text * `user_prompt`: The lines selected from Shakespeare's sonnets * `formatted_input`: The prompt(`system_prompt`+`user_prompt`) formatted according to a specific chat template * `tok_inputs`: The tokenized version of the `formatted_input` ## Usage To benchmark with ***Dynamic Sonnet***, users can pass the token IDs (tok_inputs) directly to the LLM serving system. For benchmarking an OpenAI-compatible system, users can concatenate the `system_prompt` and `user_prompt`, and then send a request to `v1/chat/completions` endpoint, using the concatenated result as the request body.

--- 任务类别: - 问答(question-answering) - 文本生成(text-generation) 语言: - 英语(en) 样本量范围: - 1000 < 样本数 < 10000 配置项: - 配置名称:default 数据文件: - 拆分集:1k,路径:dynamic_sonnet_llama_2_prefix_256_max_1024_1024_sampled.parquet - 拆分集:2k,路径:dynamic_sonnet_llama_2_prefix_512_max_2048_1024_sampled.parquet - 拆分集:4k,路径:dynamic_sonnet_llama_2_prefix_1024_max_4096_1024_sampled.parquet - 拆分集:8k,路径:dynamic_sonnet_llama_2_prefix_2048_max_8192_1024_sampled.parquet --- # 动态十四行诗(Dynamic Sonnet)- Llama2 *用于大语言模型(LLM)服务系统基准测试的精选数据集* ![plot](distribution.png) 在真实的服务场景中,每个请求的输入Token(Token)长度各不相同:部分请求仅生成少量Token(Token),而另一些请求则会生成大量Token(Token)。传统的固定长度基准测试无法捕捉这种差异性,难以准确评估真实场景下的吞吐性能。输入Token(Token)长度的这种动态特性至关重要,因为它直接影响大语言模型(LLM)服务系统的关键特性(如连续批处理),而这些特性是实现最优性能的核心。 为解决这一挑战,我们推出了**动态十四行诗(Dynamic Sonnet)**——一款专为在真实条件下基准测试大语言模型(LLM)服务系统而设计的数据集。该数据集包含四个子集:1k、2k、4k和8k,每个子集的平均Token(Token)长度分别为512、1024、3072和7168。该数据集内Token(Token)长度的差异性,使得我们能够在更贴近真实使用场景的环境中,对大语言模型(LLM)服务系统进行更准确且全面的评估。 此外,在真实场景中,请求通常会共享公共前缀。先进的服务系统可通过缓存这些公共前缀来提升性能。**动态十四行诗(Dynamic Sonnet)**通过在每个子集中引入占最大长度约25%的公共前缀(对于Nk子集,该前缀长度为N/4)来模拟这一行为。此设计使得针对此类效率优化的系统能够进行更贴合实际的基准测试。 ## 数据集详情 动态十四行诗数据集包含五列:`id`、`system_prompt`、`user_prompt`、`formatted_input`和`tok_inputs` * `id`:每个提示的唯一标识符(索引) * `system_prompt`:用于指示AI智能体(AI Agent)从后续文本中选取特定内容的公共前缀 * `user_prompt`:从莎士比亚十四行诗中选取的文本片段 * `formatted_input`:按照特定对话模板格式化后的提示(`system_prompt`+`user_prompt`) * `tok_inputs`:`formatted_input`的Token化版本 ## 使用方法 若使用**动态十四行诗(Dynamic Sonnet)**进行基准测试,用户可直接将Token ID(`tok_inputs`)传入大语言模型(LLM)服务系统。若针对兼容OpenAI接口的系统进行基准测试,用户可将`system_prompt`与`user_prompt`拼接,随后将拼接结果作为请求体,发送至`v1/chat/completions`接口。
提供机构:
squeezebits
5,000+
优质数据集
54 个
任务类型
进入经典数据集
二维码
社区交流群

面向社区/商业的数据集话题

二维码
科研交流群

面向高校/科研机构的开源数据集话题

数据驱动未来

携手共赢发展

商业合作