Qwen2.5-VL performance test

Environment setup

# vllm
cd /workspace/vllm
git reset --hard 2918c1b49c88c29783c86f78d2c4221cb9622379

# vllm-ascend: main
cd /workspace/vllm-ascend
pip install -r benchmarks/requirements-bench.txt

Run:

bash benchmarks/scripts/run-performance-benchmarks.sh

Benchmark results

Before (before any layer was removed):

============ Serving Benchmark Result ============
Successful requests:                     200       
Failed requests:                         0         
Request rate configured (RPS):           16.00     
Benchmark duration (s):                  45.35     
Total input tokens:                      20026     
Total generated tokens:                  20430     
Request throughput (req/s):              4.41      
Output token throughput (tok/s):         450.48    
Peak output token throughput (tok/s):    2055.00   
Peak concurrent requests:                194.00    
Total Token throughput (tok/s):          892.06    
---------------Time to First Token----------------
Mean TTFT (ms):                          11300.73  
Median TTFT (ms):                        11307.59  
P99 TTFT (ms):                           23844.70  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          243.95    
Median TPOT (ms):                        235.41    
P99 TPOT (ms):                           454.32    
---------------Inter-token Latency----------------
Mean ITL (ms):                           219.73    
Median ITL (ms):                         79.90     
P99 ITL (ms):                            666.86    
==================================================
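
As a sanity check on how to read these blocks, the throughput lines can be approximately recomputed from the request count, token counts, and duration printed in the block above. This is a minimal sketch, not the benchmark script's own code; the variable names are illustrative, and small differences from the reported values are expected because the printed duration is rounded to two decimals.

```python
# Values copied from the first "Before" result block above.
successful_requests = 200
duration_s = 45.35          # rounded value as printed by the benchmark
input_tokens = 20026
generated_tokens = 20430

# Throughput is a count divided by the wall-clock benchmark duration.
request_throughput = successful_requests / duration_s
output_throughput = generated_tokens / duration_s
total_throughput = (input_tokens + generated_tokens) / duration_s

print(f"Request throughput (req/s):      {request_throughput:.2f}")
print(f"Output token throughput (tok/s): {output_throughput:.2f}")
print(f"Total token throughput (tok/s):  {total_throughput:.2f}")
```

The recomputed numbers should land within a few hundredths of the reported 4.41, 450.48, and 892.06.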

Before (no optimization, convolution performed directly):

============ Serving Benchmark Result ============
Successful requests:                     200       
Failed requests:                         0         
Request rate configured (RPS):           16.00     
Benchmark duration (s):                  37.62     
Total input tokens:                      20026     
Total generated tokens:                  20727     
Request throughput (req/s):              5.32      
Output token throughput (tok/s):         551.01    
Peak output token throughput (tok/s):    2257.00   
Peak concurrent requests:                194.00    
Total Token throughput (tok/s):          1083.38   
---------------Time to First Token----------------
Mean TTFT (ms):                          8246.29   
Median TTFT (ms):                        8305.02   
P99 TTFT (ms):                           17224.27  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          196.13    
Median TPOT (ms):                        193.99    
P99 TPOT (ms):                           362.01    
---------------Inter-token Latency----------------
Mean ITL (ms):                           174.21    
Median ITL (ms):                         75.02     
P99 ITL (ms):                            544.52    
==================================================

After (only the convolution replaced):

============ Serving Benchmark Result ============
Successful requests:                     200       
Failed requests:                         0         
Request rate configured (RPS):           16.00     
Benchmark duration (s):                  36.47     
Total input tokens:                      20026     
Total generated tokens:                  21020     
Request throughput (req/s):              5.48      
Output token throughput (tok/s):         576.31    
Peak output token throughput (tok/s):    2275.00   
Peak concurrent requests:                194.00    
Total Token throughput (tok/s):          1125.37   
---------------Time to First Token----------------
Mean TTFT (ms):                          7706.34   
Median TTFT (ms):                        7604.02   
P99 TTFT (ms):                           16479.20  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          192.98    
Median TPOT (ms):                        189.42    
P99 TPOT (ms):                           352.45    
---------------Inter-token Latency----------------
Mean ITL (ms):                           171.75    
Median ITL (ms):                         73.73     
P99 ITL (ms):                            569.89    
==================================================

After (the entire model replaced):

============ Serving Benchmark Result ============
Successful requests:                     200       
Failed requests:                         0         
Request rate configured (RPS):           16.00     
Benchmark duration (s):                  37.55     
Total input tokens:                      20026     
Total generated tokens:                  19561     
Request throughput (req/s):              5.33      
Output token throughput (tok/s):         520.90    
Peak output token throughput (tok/s):    1950.00   
Peak concurrent requests:                191.00    
Total Token throughput (tok/s):          1054.19   
---------------Time to First Token----------------
Mean TTFT (ms):                          8501.95   
Median TTFT (ms):                        8653.75   
P99 TTFT (ms):                           16959.75  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          200.08    
Median TPOT (ms):                        193.59    
P99 TPOT (ms):                           358.28    
---------------Inter-token Latency----------------
Mean ITL (ms):                           173.30    
Median ITL (ms):                         81.29     
P99 ITL (ms):                            574.16    
==================================================
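
To compare the four runs side by side, the headline metrics can be expressed as a percentage change against a baseline. The sketch below is an illustration only: the run names are made up for this example, and treating the "no optimization, direct convolution" run as the baseline is an assumption, since the notes do not say which "Before" pairs with which "After".

```python
# Headline metrics copied from the four result blocks above.
runs = {
    "before_no_layer_removed": {"mean_ttft_ms": 11300.73, "mean_tpot_ms": 243.95, "output_tok_s": 450.48},
    "before_direct_conv": {"mean_ttft_ms": 8246.29, "mean_tpot_ms": 196.13, "output_tok_s": 551.01},
    "after_conv_replaced": {"mean_ttft_ms": 7706.34, "mean_tpot_ms": 192.98, "output_tok_s": 576.31},
    "after_model_replaced": {"mean_ttft_ms": 8501.95, "mean_tpot_ms": 200.08, "output_tok_s": 520.90},
}


def pct_change(new: float, old: float) -> float:
    """Relative change of `new` versus `old`, in percent."""
    return (new - old) / old * 100.0


# Assumption: the "direct convolution" run is the baseline for the comparison.
baseline_name = "before_direct_conv"
baseline = runs[baseline_name]

changes = {
    name: {metric: pct_change(value, baseline[metric]) for metric, value in metrics.items()}
    for name, metrics in runs.items()
    if name != baseline_name
}

for name, deltas in changes.items():
    summary = ", ".join(f"{metric} {delta:+.2f}%" for metric, delta in deltas.items())
    print(f"{name}: {summary}")
```

Lower is better for TTFT and TPOT, higher is better for throughput. The number of generated tokens differs between runs (19561 to 21020), so small differences may partly reflect run-to-run variation rather than the code change.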

Errors:

AttributeError: 'AscendRMSNorm' object has no attribute 'next_need_quant_fusion_linear'

AttributeError: '_OpNamespace' '_C_ascend' object has no attribute 'weak_ref_tensor'