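Qwen3-VL Video Modality: Full CUDA Graph Support

This change builds on PR #35963 (already merged into main), which added CUDA Graph support for images, and extends it to provide Full CUDA Graph support for the video modality.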
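Overview of modified files

| File | Change type | Description |
| --- | --- | --- |
| `vllm/v1/worker/gpu/mm/encoder_cudagraph_defs.py` | Data structure extension | New multimodal config fields and state fields |
| `vllm/v1/worker/gpu/mm/encoder_cudagraph.py` | Core logic extension | Per-modality budget management, automatic modality detection, temporal frame constraint check |
| `vllm/model_executor/models/interfaces.py` | Protocol extension | New `get_encoder_cudagraph_num_seqs` method |
| `vllm/model_executor/models/qwen3_vl.py` | Model implementation | Video support added to all protocol methods |
| `tests/v1/cudagraph/test_encoder_cudagraph.py` | Test update | Adapted to the new multimodal API |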
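1. encoder_cudagraph_defs.py

New fields

```python
from dataclasses import dataclass
from typing import Any

import torch


@dataclass
class EncoderCudaGraphConfig:
    modalities: list[str]
    input_key: str
    buffer_keys: list[str]
    out_hidden_size: int
    # Per-modality input key, e.g. {"image": "pixel_values", "video": "pixel_values_videos"}
    modality_input_keys: dict[str, str] | None = None


@dataclass
class EncoderCudaGraphCaptureInputs:
    mm_kwargs: dict[str, Any]
    buffers: dict[str, torch.Tensor]
    # Upper bound on cu_seqlens segments for this capture; 0 disables the check
    max_num_seqs: int = 0


@dataclass
class EncoderCudaGraphReplayBuffers:
    buffers: dict[str, torch.Tensor | None]
    # False when the batch does not fit the captured graph; the caller falls back to eager
    fits: bool = True
```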
2. encoder_cudagraph.py

Key changes

`BudgetGraphMetadata`: adds `max_num_seqs: int = 0`.

Type change of `budget_graphs`:

```python
# Before: keyed by token budget only
budget_graphs: dict[int, BudgetGraphMetadata]

# After: keyed by modality, then by token budget
budget_graphs: dict[str, dict[int, BudgetGraphMetadata]]
```
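The nested layout means a lookup first selects the modality and then a token budget. The following is a minimal sketch of such a lookup, not the manager's actual code: `find_budget`, the `token_budget` field on the stand-in `BudgetGraphMetadata`, and the "smallest budget that fits" selection rule are all assumptions made for illustration.

```python
from bisect import bisect_left
from dataclasses import dataclass


@dataclass
class BudgetGraphMetadata:
    # Illustrative stand-in: the real class also holds the captured graph and buffers.
    token_budget: int  # hypothetical field, used here only to identify the entry
    max_num_seqs: int = 0


def find_budget(budget_graphs, modality, num_tokens):
    """Return the smallest captured budget that can hold num_tokens, or None."""
    per_modality = budget_graphs.get(modality, {})
    budgets = sorted(per_modality)
    idx = bisect_left(budgets, num_tokens)
    return per_modality[budgets[idx]] if idx < len(budgets) else None


# Two modalities, each with its own set of captured budgets (values are made up).
budget_graphs = {
    "image": {b: BudgetGraphMetadata(b) for b in (1024, 4096)},
    "video": {b: BudgetGraphMetadata(b, max_num_seqs=16) for b in (2048, 8192)},
}
```

With these made-up budgets, a 3000-token video batch would map to the 8192 video graph, while a 5000-token image batch finds no graph and would have to run eagerly.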
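New helper methods:

```python
def _detect_modality(self, mm_kwargs) -> str:
    """Infer the modality from the keys present in mm_kwargs."""

def _get_input_key(self, modality: str) -> str:
    """Return the input tensor key for the given modality."""
```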
`capture()` loops over all modalities:

```python
def capture(self):
    for modality in self.config.modalities:
        for token_budget in self.token_budgets:
            self._capture_budget_graph(modality, token_budget)
```
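`_run_budget_graph()` adds a temporal constraint check:

```python
# Reject the replay if the batch has more cu_seqlens segments than were captured
if graph_meta.max_num_seqs > 0 and actual_num_seqs > graph_meta.max_num_seqs:
    self.graph_misses += num_items
    return None
```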
3. interfaces.py (SupportsEncoderCudaGraph protocol)

New method, plus a new `modality` parameter on `prepare_encoder_cudagraph_capture_inputs`:

```python
def get_encoder_cudagraph_num_seqs(self, mm_kwargs: dict[str, Any]) -> int:
    """
    Return the total number of segments in cu_seqlens.

    Image: equals the number of images (t is fixed at 1).
    Video: equals sum(t for each video), i.e. the total number of temporal frames.
    """
    ...

def prepare_encoder_cudagraph_capture_inputs(
    self,
    token_budget,
    max_batch_size,
    device,
    dtype,
    modality: str = "image",
) -> EncoderCudaGraphCaptureInputs:
    ...
```
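4. qwen3_vl.py (core implementation)

Design constant

```python
# Number of temporal frames per video assumed when capturing video graphs
_VIDEO_CUDAGRAPH_T_CAPTURE: int = 8
```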
get_encoder_cudagraph_config()

```python
return EncoderCudaGraphConfig(
    modalities=["image", "video"],
    input_key="pixel_values",
    buffer_keys=["pos_embeds", "rotary_pos_emb_cos", "rotary_pos_emb_sin",
                 "cu_seqlens", "max_seqlen", "sequence_lengths"],
    out_hidden_size=self.visual.out_hidden_size,
    modality_input_keys={
        "image": "pixel_values",
        "video": "pixel_values_videos",
    },
)
```
Helper methods for automatic modality detection

```python
@staticmethod
def _get_grid_thw_key(mm_kwargs) -> str:
    return "video_grid_thw" if "video_grid_thw" in mm_kwargs else "image_grid_thw"

@staticmethod
def _get_pixel_values_key(mm_kwargs) -> str:
    return "pixel_values_videos" if "pixel_values_videos" in mm_kwargs else "pixel_values"
```
get_encoder_cudagraph_num_seqs()

```python
def get_encoder_cudagraph_num_seqs(self, mm_kwargs) -> int:
    grid_key = self._get_grid_thw_key(mm_kwargs)
    # One cu_seqlens segment per temporal frame: images contribute 1, videos contribute t
    return sum(t for t, h, w in mm_kwargs[grid_key])
```
prepare_encoder_cudagraph_capture_inputs() (video branch)

```python
if modality == "video":
    t_capture = self._VIDEO_CUDAGRAPH_T_CAPTURE
    # Split the token budget evenly across videos, then across frames
    per_video_output = token_budget // max_batch_size
    per_frame_spatial = max(1, per_video_output // t_capture)
    grid_config = [
        [t_capture, spatial_merge_size, per_frame_spatial * spatial_merge_size]
        for _ in range(max_batch_size)
    ]
    # cu_seqlens has one segment per temporal frame, not per video
    max_num_seqs = max_batch_size * t_capture
    buffers = self.visual.prepare_encoder_metadata(
        grid_config,
        max_batch_size=max_num_seqs,
        max_seqlen_override=token_budget * (spatial_merge_size**2),
    )
    # dummy_pixel_values is defined outside this excerpt
    mm_kwargs = {
        "pixel_values_videos": dummy_pixel_values,
        "video_grid_thw": grid_config,
    }
```
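To make the grid arithmetic concrete, the sketch below reproduces the same computation as a standalone function and checks that the dummy grid consumes the full token budget. The function names are hypothetical, `spatial_merge_size=2` is an assumed value, and the output-token formula `t * h * w / spatial_merge_size**2` is an assumption about how the vision encoder merges patches.

```python
def video_capture_grid(token_budget, max_batch_size, t_capture=8, spatial_merge_size=2):
    """Mirror the grid arithmetic of the video capture branch (illustrative only)."""
    per_video_output = token_budget // max_batch_size
    per_frame_spatial = max(1, per_video_output // t_capture)
    grid = [
        [t_capture, spatial_merge_size, per_frame_spatial * spatial_merge_size]
        for _ in range(max_batch_size)
    ]
    max_num_seqs = max_batch_size * t_capture
    return grid, max_num_seqs


def output_tokens(grid, spatial_merge_size=2):
    # Assumption: merged output tokens per item = t * h * w / spatial_merge_size**2
    return sum(t * h * w // spatial_merge_size**2 for t, h, w in grid)


grid, max_num_seqs = video_capture_grid(token_budget=4096, max_batch_size=2)
```

For a budget of 4096 tokens and two videos, each video gets 2048 output tokens, each of its 8 frames gets 256, and the dummy grid is `[8, 2, 512]` per video with 16 cu_seqlens segments in total.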
prepare_encoder_cudagraph_replay_buffers() (video branch)

```python
if grid_key == "video_grid_thw":
    actual_total_seqs = sum(t for t, h, w in grid_thw_list)
    max_seqs_for_capture = max_batch_size * self._VIDEO_CUDAGRAPH_T_CAPTURE
    # Too many temporal frames for the captured cu_seqlens buffer: signal eager fallback
    if actual_total_seqs > max_seqs_for_capture:
        return EncoderCudaGraphReplayBuffers(buffers={}, fits=False)
    buffers = self.visual.prepare_encoder_metadata(
        grid_thw_list,
        max_batch_size=max_seqs_for_capture,
    )
```
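The capacity check in this branch can be isolated as a small pure function. This is a sketch for illustration; `video_batch_fits` and `T_CAPTURE` are hypothetical names that mirror the logic and constant shown above.

```python
T_CAPTURE = 8  # mirrors _VIDEO_CUDAGRAPH_T_CAPTURE


def video_batch_fits(grid_thw_list, max_batch_size, t_capture=T_CAPTURE):
    """Return True if the batch's total temporal frames fit the captured capacity."""
    actual_total_seqs = sum(t for t, _h, _w in grid_thw_list)
    return actual_total_seqs <= max_batch_size * t_capture
```

With `max_batch_size=2` the capacity is 16 frames: two videos with `t=4` and `t=6` fit, while `t=12` plus `t=8` does not. Because the bound applies to the batch total, a single video with `t=16` also fits under this check.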
5. Execution flow

```mermaid
flowchart TD
    A[execute mm_kwargs] --> B[_detect_modality]
    B -->|image| C[image budget_graphs]
    B -->|video| D[video budget_graphs]
    C --> E[greedy pack items]
    D --> E
    E --> F{token_budget found?}
    F -->|No| G[eager fallback\ngraph_misses++]
    F -->|Yes| H[prepare_replay_buffers]
    H --> I{fits?}
    I -->|False, t exceeds limit| G
    I -->|True| J[get_encoder_cudagraph_num_seqs]
    J --> K[_run_budget_graph\nactual_num_seqs check]
    K --> L{actual_num_seqs\n≤ max_num_seqs?}
    L -->|No| G
    L -->|Yes| M[graph.replay\ngraph_hits++]
    M --> N[scatter output slices]
    G --> N
    N --> O[return list of tensors]
```
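6. Design notes

Why is max_num_seqs needed?

The `cu_seqlens` buffer for video is allocated by the total number of temporal frames, not by the number of videos. Capture fixes `t_capture=8`, so `cu_seqlens` is sized for `max_batch_size * 8` segments. If the actual total frame count at replay exceeds that value, the buffer would overflow, so the replay must be rejected.

`prepare_encoder_metadata` uses `max_batch_size` internally to decide the padded size of `cu_seqlens`. Video attention splits sequences per temporal frame, so the number of `cu_seqlens` segments equals the total number of temporal frames, not the number of videos. This is why the video branches pass `max_batch_size * t_capture` as `max_batch_size`.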
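The relationship between frames and `cu_seqlens` segments can be shown in plain Python. This is a sketch, not the encoder's code: `build_cu_seqlens` and `pad_cu_seqlens` are hypothetical helpers, and padding by repeating the last offset (zero-length segments) is an assumed strategy.

```python
from itertools import accumulate


def build_cu_seqlens(grid_thw_list):
    """Cumulative sequence offsets with one segment of h*w patches per temporal frame."""
    seg_lens = [h * w for t, h, w in grid_thw_list for _ in range(t)]
    return [0, *accumulate(seg_lens)]


def pad_cu_seqlens(cu_seqlens, max_num_seqs):
    """Pad to max_num_seqs segments by repeating the last offset (assumed strategy)."""
    num_seqs = len(cu_seqlens) - 1
    if num_seqs > max_num_seqs:
        raise ValueError("batch has more segments than the captured buffer can hold")
    return cu_seqlens + [cu_seqlens[-1]] * (max_num_seqs - num_seqs)


# Two videos with t=2 and t=3 produce 5 segments, not 2.
cu = build_cu_seqlens([[2, 4, 4], [3, 2, 2]])
```

Here `cu` has five segments for two videos, and padding to a larger capacity only appends empty segments, whereas a batch with more segments than the capacity cannot be represented at all.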
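Why does gpu_model_runner.py need no changes?

`EncoderCudaGraphManager._detect_modality()` infers the modality from the keys in `mm_kwargs` (`pixel_values` vs `pixel_values_videos`), so the caller does not need to know which modality it is passing.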
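Key-based detection of this kind can be sketched as follows. The function bodies are assumptions written for illustration, not the manager's actual code; in particular, raising `ValueError` when neither key is present and falling back to a default input key are choices made for this sketch.

```python
def detect_modality(mm_kwargs):
    """Infer the modality from which pixel-value key is present (illustrative)."""
    if "pixel_values_videos" in mm_kwargs:
        return "video"
    if "pixel_values" in mm_kwargs:
        return "image"
    raise ValueError("mm_kwargs contains neither image nor video pixel values")


def get_input_key(modality, modality_input_keys=None, default="pixel_values"):
    """Resolve the input key, falling back to the single-modality default."""
    if modality_input_keys is None:
        return default
    return modality_input_keys.get(modality, default)


keys = {"image": "pixel_values", "video": "pixel_values_videos"}
```

The caller passes the same `mm_kwargs` dictionary in both cases, and the manager derives everything else from it.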
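Behavior when video t exceeds the limit

When a video batch carries more temporal frames than the capture allows, that is, when the total t exceeds `max_batch_size * _VIDEO_CUDAGRAPH_T_CAPTURE` (for example, long videos), `prepare_encoder_cudagraph_replay_buffers` returns `fits=False`. The manager then runs the eager forward path instead of replaying a budget graph, which keeps the result functionally correct.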
7. Test file updates (test_encoder_cudagraph.py)

| Change | Details |
| --- | --- |
| `_make_manager_with_budgets` | `budget_graphs = {"image": {}}` |
| `_make_manager_for_gpu` | `budget_graphs = {"image": {}}` |
| `SimpleMockViTModel` | Adds `get_encoder_cudagraph_num_seqs()`; `prepare_encoder_cudagraph_capture_inputs` gains a `modality` parameter |
| `test_capture_creates_one_graph_per_budget` | Assertions now use `budget_graphs["image"]` |