edteams/ref-ops-v2
Source: Hugging Face. Updated 2024-12-17; indexed 2025-11-01.
Download link:
https://hf-mirror.com/datasets/edteams/ref-ops-v2
Description:
# Reference Operations V2
## Dot Product
- MXINT8 dot product
```python
# sum(a .* b), dot product
# a.shape: [m]
# b.shape: [m]
# output.shape: scalar
output = dot_product(a, b)
```
- MXINT8 vector-matrix multiplication
```python
# vector @ matrix, vector-matrix multiplication
# vector.shape: [m]
# matrix.shape: [m, k]
# output.shape: [k]
output = gevm(vector, matrix)
```
- MXINT8 matrix-matrix multiplication
```python
# input @ other, matrix-matrix multiplication
# input.shape: [m, k]
# other.shape: [k, n]
# output.shape: [m, n]
output = gemm(input, other)
```
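As a shape check for the three operations above, here is a minimal NumPy sketch in plain FP32. It does not model MXINT8 quantization at all; `np.dot` and `@` are only stand-ins for the dataset's `dot_product`, `gevm`, and `gemm`, and the sizes `m`, `k`, `n` are arbitrary values chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
m, k, n = 8, 4, 3  # illustrative sizes, not taken from the dataset

# dot product: [m] . [m] -> scalar
a = rng.standard_normal(m).astype(np.float32)
b = rng.standard_normal(m).astype(np.float32)
dot_out = np.dot(a, b)

# vector-matrix multiplication: [m] @ [m, k] -> [k]
vector = rng.standard_normal(m).astype(np.float32)
matrix = rng.standard_normal((m, k)).astype(np.float32)
gevm_out = vector @ matrix

# matrix-matrix multiplication: [m, k] @ [k, n] -> [m, n]
inp = rng.standard_normal((m, k)).astype(np.float32)
other = rng.standard_normal((k, n)).astype(np.float32)
gemm_out = inp @ other
```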
## RoPE constants
### BF16
Dumped tensors:
- `rope.freq_bf16`:
  - shape: `[1, 4096, 64]`, where 4096 is the maximum sequence length and 64 is half of the head dimension (128/2 = 64)
  - each element is $m \theta_i$, where $m$ is the position id and $i$ is the index along the last axis (one rotation frequency $\theta_i$ per pair of head dimensions)
- `rope.cos_bf16`: `cos_bf16 = cos(freq_bf16)`
- `rope.sin_bf16`: `sin_bf16 = sin(freq_bf16)`
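The layout of these tensors can be sketched as follows. This assumes the common RoPE definition $\theta_i = \text{base}^{-2i/d}$ with base 10000, which this card does not state, and it computes everything in FP32, so it reproduces the shapes and the $m \theta_i$ structure but not the BF16 rounding of the dumped values.

```python
import numpy as np

MAX_SEQ_LEN = 4096   # from the tensor shape above
HEAD_DIM = 128       # from the tensor shape above (64 = 128 / 2)
ROPE_BASE = 10000.0  # assumption: the usual RoPE base, not stated in this card

# theta_i = base^(-2i / head_dim) for i = 0 .. head_dim/2 - 1
i = np.arange(HEAD_DIM // 2, dtype=np.float32)
theta = ROPE_BASE ** (-2.0 * i / HEAD_DIM)

# position ids m = 0 .. MAX_SEQ_LEN - 1
m = np.arange(MAX_SEQ_LEN, dtype=np.float32)

# freq[0, m, i] = m * theta_i, with a leading axis of size 1
freq = (m[:, None] * theta[None, :])[None, :, :]
cos = np.cos(freq)
sin = np.sin(freq)
```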
### FP32 for reference
Hugging Face keeps the RoPE constants in FP32 because both BF16 and FP16 introduce errors in the computed cos and sin values. For example, BF16 cannot represent the integers 4088, 4089, ..., 4095 exactly (all of them round to 4096). [This pull request](https://github.com/huggingface/transformers/pull/29285) discusses the problem.
Therefore, we also provide FP32 versions of the RoPE constants for reference. These are computed entirely in FP32.
- `rope.freq_fp32`
- `rope.cos_fp32`
- `rope.sin_fp32`
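The BF16 rounding of position ids mentioned above can be checked with a small NumPy sketch. `round_to_bf16` is a hypothetical helper written for this illustration (NumPy has no native BF16 dtype); it emulates round-to-nearest-even by manipulating the FP32 bit pattern and does not handle NaN or infinity specially.

```python
import numpy as np

def round_to_bf16(x):
    """Round values to the nearest BF16 value (ties to even), returned as float32."""
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    # Add half of the 16 dropped bits, plus the lowest kept bit so ties go to even.
    rounding_bias = np.uint32(0x7FFF) + ((bits >> np.uint32(16)) & np.uint32(1))
    rounded_bits = (bits + rounding_bias) & np.uint32(0xFFFF0000)
    return rounded_bits.view(np.float32)

positions = np.arange(4080, 4096)
rounded = round_to_bf16(positions)
for p, r in zip(positions, rounded):
    print(int(p), "->", float(r))
```

Between 2048 and 4096 the spacing of BF16 values is 16, so only 4080 and 4096 are representable in this range and every position from 4088 upward lands on 4096.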
## Softmax
`softmaxed_v` is a function that computes the `matmul(softmax(qk_T), v)` operation in the attention layer. To match the hardware behavior, `softmaxed_v` is broken down into the following steps:
```python
def softmaxed_v(qk_T, v):
    """
    qk_T.shape: [bs, num_heads, seq_q_len, seq_kv_len]
    v.shape: [bs, num_heads, seq_kv_len, head_dim]
    """
    # elementwise exp
    exp_bf16 = row_exp(qk_T, dim=-1)  # shape: [bs, num_heads, seq_q_len, seq_kv_len]
    # sum exp along the last (seq_kv_len) dimension
    exp_sum = reduce_sum(exp_bf16, dim=-1)  # shape: [bs, num_heads, seq_q_len, 1]
    # quantize to mxint8 for vector-matrix multiplication
    exp_mxint8 = quantize(exp_bf16)
    # vector-matrix multiplication between two mxint8 tensors
    scaled_v = mxint8_gevm(exp_mxint8, v)  # shape: [bs, num_heads, seq_q_len, head_dim]
    # inverse of the sum
    inv = 1.0 / exp_sum  # shape: [bs, num_heads, seq_q_len, 1]
    # adjust scaled_v by the inverse
    out = scaled_v * inv  # shape: [bs, num_heads, seq_q_len, head_dim]
    return out
```
| Notation | Description |
| --- | --- |
| `bs` | batch size, 1 in this case |
| `num_heads` | number of attention heads, 32 for llama-2-7b |
| `seq_q_len` | query sequence length, 1 in this case |
| `seq_kv_len` | key-value sequence length, which increments by 1 for each decoding step |
| `head_dim` | head dimension, 128 for llama-2-7b |
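To see why normalizing after the multiplication still yields `matmul(softmax(qk_T), v)`, the steps above can be mirrored in plain floating point. This is only a sketch: `softmaxed_v_float` and `softmax_then_matmul` are hypothetical names, BF16 rounding and MXINT8 quantization are left out, and `row_exp` is assumed to be a plain `exp` without max subtraction, which the card does not specify.

```python
import numpy as np

def softmaxed_v_float(qk_T, v):
    """Float-only analogue of the steps above (no BF16, no MXINT8)."""
    exp = np.exp(qk_T)                         # [bs, num_heads, seq_q_len, seq_kv_len]
    exp_sum = exp.sum(axis=-1, keepdims=True)  # [bs, num_heads, seq_q_len, 1]
    scaled_v = exp @ v                         # [bs, num_heads, seq_q_len, head_dim]
    inv = 1.0 / exp_sum
    return scaled_v * inv

def softmax_then_matmul(qk_T, v):
    """Conventional order: normalize first, then multiply by v."""
    exp = np.exp(qk_T - qk_T.max(axis=-1, keepdims=True))
    probs = exp / exp.sum(axis=-1, keepdims=True)
    return probs @ v

rng = np.random.default_rng(0)
bs, num_heads, seq_q_len, seq_kv_len, head_dim = 1, 32, 1, 4, 128
qk_T = rng.standard_normal((bs, num_heads, seq_q_len, seq_kv_len))
v = rng.standard_normal((bs, num_heads, seq_kv_len, head_dim))
out = softmaxed_v_float(qk_T, v)
```

Because the division by `exp_sum` is a per-row scalar, it commutes with the multiplication by `v`, so both orders agree up to floating-point error.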
The folder name takes the form `kv-size-<seq_kv_len>_seed-<seed>`.
For example, `kv-size-4_seed-0` means the key-value sequence length is 4 and the random seed is 0.
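A folder name in this form can be split back into its two fields with a short regular expression. `parse_folder_name` is a hypothetical helper for illustration and is not part of the dataset.

```python
import re

# Matches names like "kv-size-4_seed-0".
FOLDER_PATTERN = re.compile(r"^kv-size-(?P<seq_kv_len>\d+)_seed-(?P<seed>\d+)$")

def parse_folder_name(name):
    """Return (seq_kv_len, seed) parsed from a folder name."""
    match = FOLDER_PATTERN.match(name)
    if match is None:
        raise ValueError(f"unexpected folder name: {name!r}")
    return int(match["seq_kv_len"]), int(match["seed"])

seq_kv_len, seed = parse_folder_name("kv-size-4_seed-0")
```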
The following tensors are dumped:
| File Name | Description |
| --- | --- |
| `softmaxed_v.qk_T` | BF16 `qk_T` tensor |
| `softmaxed_v.exp-bf16` | BF16 `exp_bf16` tensor |
| `softmaxed_v.exp_sum` | BF16 `exp_sum` tensor |
| `softmaxed_v.v` | MXINT8 `v` tensor |
| `softmaxed_v.exp_mxint8` | MXINT8 `exp_mxint8` tensor |
| `softmaxed_v.scaled_v` | BF16 `scaled_v` tensor |
| `softmaxed_v.inv` | BF16 `inv` tensor |
| `softmaxed_v.out` | BF16 `out` tensor |
Provided by:
edteams



