Qwen2.5-VL Performance Benchmark

Environment setup

# vllm
cd /workspace/vllm
git reset --hard 2918c1b49c88c29783c86f78d2c4221cb9622379

# vllm-ascend: main
cd /workspace/vllm-ascend
pip install -r benchmarks/requirements-bench.txt

Run:

bash benchmarks/scripts/run-performance-benchmarks.sh

Benchmark results

Before (before removing any layers):

============ Serving Benchmark Result ============
Successful requests: 200
Failed requests: 0
Request rate configured (RPS): 16.00
Benchmark duration (s): 45.35
Total input tokens: 20026
Total generated tokens: 20430
Request throughput (req/s): 4.41
Output token throughput (tok/s): 450.48
Peak output token throughput (tok/s): 2055.00
Peak concurrent requests: 194.00
Total Token throughput (tok/s): 892.06
---------------Time to First Token----------------
Mean TTFT (ms): 11300.73
Median TTFT (ms): 11307.59
P99 TTFT (ms): 23844.70
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 243.95
Median TPOT (ms): 235.41
P99 TPOT (ms): 454.32
---------------Inter-token Latency----------------
Mean ITL (ms): 219.73
Median ITL (ms): 79.90
P99 ITL (ms): 666.86
==================================================
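To compare runs programmatically rather than by eye, a report in this format can be parsed into a dictionary. The following is a minimal sketch, not part of the benchmark scripts: the function name `parse_benchmark` is hypothetical, and it assumes every metric line has the form `label: number`, as in the output above.

```python
import re

def parse_benchmark(text: str) -> dict[str, float]:
    """Collect 'label: number' lines from a serving benchmark report."""
    metrics: dict[str, float] = {}
    for line in text.splitlines():
        # Separator and section-title lines contain no "label: number" pair
        # and are skipped because the pattern does not match them.
        m = re.match(r"^\s*(.+?):\s+([-+]?\d+(?:\.\d+)?)\s*$", line)
        if m:
            metrics[m.group(1).strip()] = float(m.group(2))
    return metrics

# A shortened excerpt of the report above, used only to exercise the parser.
sample = """\
============ Serving Benchmark Result ============
Successful requests: 200
Request throughput (req/s): 4.41
---------------Time to First Token----------------
Mean TTFT (ms): 11300.73
=================================================="""

report = parse_benchmark(sample)
print(report["Mean TTFT (ms)"])
```

The keys are the labels exactly as printed, including the unit suffix, so lookups such as `report["Request throughput (req/s)"]` work without any renaming step.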

Before (no optimization, direct convolution):

============ Serving Benchmark Result ============
Successful requests: 200
Failed requests: 0
Request rate configured (RPS): 16.00
Benchmark duration (s): 37.62
Total input tokens: 20026
Total generated tokens: 20727
Request throughput (req/s): 5.32
Output token throughput (tok/s): 551.01
Peak output token throughput (tok/s): 2257.00
Peak concurrent requests: 194.00
Total Token throughput (tok/s): 1083.38
---------------Time to First Token----------------
Mean TTFT (ms): 8246.29
Median TTFT (ms): 8305.02
P99 TTFT (ms): 17224.27
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 196.13
Median TPOT (ms): 193.99
P99 TPOT (ms): 362.01
---------------Inter-token Latency----------------
Mean ITL (ms): 174.21
Median ITL (ms): 75.02
P99 ITL (ms): 544.52
==================================================

After (convolution replaced only):

============ Serving Benchmark Result ============
Successful requests: 200
Failed requests: 0
Request rate configured (RPS): 16.00
Benchmark duration (s): 36.47
Total input tokens: 20026
Total generated tokens: 21020
Request throughput (req/s): 5.48
Output token throughput (tok/s): 576.31
Peak output token throughput (tok/s): 2275.00
Peak concurrent requests: 194.00
Total Token throughput (tok/s): 1125.37
---------------Time to First Token----------------
Mean TTFT (ms): 7706.34
Median TTFT (ms): 7604.02
P99 TTFT (ms): 16479.20
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 192.98
Median TPOT (ms): 189.42
P99 TPOT (ms): 352.45
---------------Inter-token Latency----------------
Mean ITL (ms): 171.75
Median ITL (ms): 73.73
P99 ITL (ms): 569.89
==================================================

After (entire model replaced):

============ Serving Benchmark Result ============
Successful requests: 200
Failed requests: 0
Request rate configured (RPS): 16.00
Benchmark duration (s): 37.55
Total input tokens: 20026
Total generated tokens: 19561
Request throughput (req/s): 5.33
Output token throughput (tok/s): 520.90
Peak output token throughput (tok/s): 1950.00
Peak concurrent requests: 191.00
Total Token throughput (tok/s): 1054.19
---------------Time to First Token----------------
Mean TTFT (ms): 8501.95
Median TTFT (ms): 8653.75
P99 TTFT (ms): 16959.75
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 200.08
Median TPOT (ms): 193.59
P99 TPOT (ms): 358.28
---------------Inter-token Latency----------------
Mean ITL (ms): 173.30
Median ITL (ms): 81.29
P99 ITL (ms): 574.16
==================================================
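The four runs are easier to compare as relative changes. The sketch below hard-codes three metrics copied from the reports above and prints the percentage change of each run against a baseline. Treating the "no optimization, direct convolution" run as the baseline is an assumption made here for illustration, and the short key names (`req_s`, `mean_ttft_ms`, `mean_tpot_ms`) and the helper `pct_change` are hypothetical.

```python
# Metrics copied from the four reports above (one run each).
runs = {
    "before (no layers removed)":  {"req_s": 4.41, "mean_ttft_ms": 11300.73, "mean_tpot_ms": 243.95},
    "before (direct convolution)": {"req_s": 5.32, "mean_ttft_ms": 8246.29,  "mean_tpot_ms": 196.13},
    "after (convolution only)":    {"req_s": 5.48, "mean_ttft_ms": 7706.34,  "mean_tpot_ms": 192.98},
    "after (entire model)":        {"req_s": 5.33, "mean_ttft_ms": 8501.95,  "mean_tpot_ms": 200.08},
}

def pct_change(new: float, old: float) -> float:
    """Relative change of `new` with respect to `old`, in percent."""
    return (new - old) / old * 100.0

# Assumption: the unoptimized direct-convolution run is the reference point.
baseline = runs["before (direct convolution)"]

for name, metrics in runs.items():
    changes = {key: pct_change(value, baseline[key]) for key, value in metrics.items()}
    print(
        f"{name:30s} "
        f"req/s {changes['req_s']:+6.2f}%  "
        f"TTFT {changes['mean_ttft_ms']:+6.2f}%  "
        f"TPOT {changes['mean_tpot_ms']:+6.2f}%"
    )
```

For throughput a positive change is better; for TTFT and TPOT a negative change is better. Each configuration was measured once, so differences of a few percent may fall within run-to-run noise.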

Errors:

AttributeError: 'AscendRMSNorm' object has no attribute 'next_need_quant_fusion_linear'

AttributeError: '_OpNamespace' '_C_ascend' object has no attribute 'weak_ref_tensor'