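torch.cuda.Event() is PyTorch's wrapper around CUDA events. It is used to time GPU operations precisely and to synchronize work at a fine granularity.
Basic usage
Creating a CUDA event

    import torch

    # enable_timing=True is required for elapsed_time(); it defaults to False
    start_event = torch.cuda.Event(enable_timing=True)
    end_event = torch.cuda.Event(enable_timing=True)
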
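The record() method
record() enqueues the event on the current CUDA stream. The call returns immediately; the event completes once all work queued on that stream before it has finished.
Basic record() usage

    # Record the start event
    start_event.record()

    # GPU work to be timed
    x = torch.randn(1000, 1000, device='cuda')
    y = x * x + 2 * x + 1

    # Record the end event
    end_event.record()
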
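Because record() only enqueues the event, it can be useful to see the difference between "recorded" and "completed". The sketch below uses Event.query(), which reports completion without blocking. The function name `record_is_asynchronous` and the workload are illustrative assumptions, and the function returns None when no CUDA device is available.

```python
import torch


def record_is_asynchronous():
    """Return (done_right_after_record, done_after_synchronize), or None without CUDA."""
    if not torch.cuda.is_available():
        return None  # nothing to demonstrate without a GPU

    event = torch.cuda.Event()
    x = torch.randn(4096, 4096, device="cuda")

    # Queue some work, then record the event behind it on the current stream.
    for _ in range(10):
        x = torch.tanh(x @ x)  # tanh keeps the values bounded between iterations
    event.record()

    # query() does not block: it only reports whether the event has completed.
    done_before = event.query()

    # synchronize() blocks the CPU until the event has completed.
    event.synchronize()
    done_after = event.query()
    return done_before, done_after
```

The first value is usually False on a busy GPU but is not guaranteed to be; the second is True because synchronize() has already returned.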
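The synchronize() method
synchronize() blocks the CPU until the event has completed.
Basic synchronize() usage

    # Wait for the end event to complete
    end_event.synchronize()

    # elapsed_time() returns the time between the two events in milliseconds
    elapsed_time = start_event.elapsed_time(end_event)
    print(f"GPU operation took: {elapsed_time:.2f} ms")
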
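To illustrate why events are preferred over a host-side clock, here is a small sketch that measures the same matrix multiplication twice: once with time.perf_counter() around the launch only, and once with a pair of events. The function name `compare_timers` and the matrix size are assumptions made for the example; it returns None without a CUDA device.

```python
import time

import torch


def compare_timers(n=4096):
    """Return (launch_ms, gpu_ms) for one n x n matmul, or None without CUDA."""
    if not torch.cuda.is_available():
        return None

    a = torch.randn(n, n, device="cuda")
    torch.mm(a, a)  # warm-up so one-time initialization is not measured
    torch.cuda.synchronize()

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)

    t0 = time.perf_counter()
    start.record()
    torch.mm(a, a)
    end.record()
    # Host-side time spent enqueueing the work; the kernel may still be running.
    launch_ms = (time.perf_counter() - t0) * 1000.0

    end.synchronize()
    gpu_ms = start.elapsed_time(end)  # time between the two events on the GPU
    return launch_ms, gpu_ms
```

On most systems launch_ms comes out much smaller than gpu_ms, because kernel launches are asynchronous and the host clock stops before the GPU finishes.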
Practical examples
1. Timing a GPU operation

    import torch

    def measure_gpu_operation():
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)

        # Prepare the inputs before timing starts
        a = torch.randn(10000, 10000, device='cuda')
        b = torch.randn(10000, 10000, device='cuda')

        # Time only the matrix multiplication
        start.record()
        c = torch.mm(a, b)
        end.record()

        # Wait for all queued GPU work to finish
        torch.cuda.synchronize()

        elapsed = start.elapsed_time(end)
        print(f"Matrix multiplication took: {elapsed:.2f} ms")
        return elapsed

2. Timing a data transfer

    import torch

    def measure_data_transfer():
        start_event = torch.cuda.Event(enable_timing=True)
        end_event = torch.cuda.Event(enable_timing=True)

        # Pinned (page-locked) host memory allows an asynchronous copy
        cpu_tensor = torch.randn(5000, 5000).pin_memory()

        start_event.record()
        gpu_tensor = cpu_tensor.to('cuda', non_blocking=True)
        end_event.record()

        # Wait for the copy to complete
        end_event.synchronize()

        elapsed = start_event.elapsed_time(end_event)
        print(f"CPU->GPU transfer took: {elapsed:.2f} ms")
        return elapsed

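The example above pins the host tensor because non_blocking=True can only overlap the copy with host execution when the source memory is page-locked. A rough way to see the effect is to time the same copy from pageable and from pinned memory. The helper names `transfer_ms` and `compare_pinned_vs_pageable` are hypothetical, and the numbers depend entirely on the hardware.

```python
import torch


def transfer_ms(cpu_tensor):
    """Time one host-to-device copy with CUDA events; None without CUDA."""
    if not torch.cuda.is_available():
        return None
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    gpu_tensor = cpu_tensor.to("cuda", non_blocking=True)
    end.record()
    end.synchronize()  # wait until the copy has finished
    return start.elapsed_time(end)


def compare_pinned_vs_pageable(n=2048):
    """Return copy times for pageable and pinned sources, or None without CUDA."""
    if not torch.cuda.is_available():
        return None
    pageable = torch.randn(n, n)
    pinned = pageable.pin_memory()  # page-locked copy of the same data
    transfer_ms(pinned)  # warm-up so CUDA initialization is not measured
    return {
        "pageable_ms": transfer_ms(pageable),
        "pinned_ms": transfer_ms(pinned),
    }
```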
3. Precise timing in a training loop

    import torch

    def training_loop_with_timing(dataloader, model, optimizer):
        model = model.cuda()

        # One reusable event pair per phase
        data_transfer_start = torch.cuda.Event(enable_timing=True)
        data_transfer_end = torch.cuda.Event(enable_timing=True)
        forward_start = torch.cuda.Event(enable_timing=True)
        forward_end = torch.cuda.Event(enable_timing=True)
        backward_start = torch.cuda.Event(enable_timing=True)
        backward_end = torch.cuda.Event(enable_timing=True)

        for batch_idx, (data, target) in enumerate(dataloader):
            # Host-to-device transfer
            data_transfer_start.record()
            data = data.to('cuda', non_blocking=True)
            target = target.to('cuda', non_blocking=True)
            data_transfer_end.record()

            # Forward pass
            forward_start.record()
            output = model(data)
            loss = torch.nn.functional.cross_entropy(output, target)
            forward_end.record()

            # Backward pass and optimizer step
            backward_start.record()
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            backward_end.record()

            # Wait for all queued GPU work before reading the timings
            torch.cuda.synchronize()

            transfer_time = data_transfer_start.elapsed_time(data_transfer_end)
            forward_time = forward_start.elapsed_time(forward_end)
            backward_time = backward_start.elapsed_time(backward_end)

            if batch_idx % 100 == 0:
                print(f"Batch {batch_idx}: transfer={transfer_time:.2f}ms, "
                      f"forward={forward_time:.2f}ms, backward={backward_time:.2f}ms")

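The loop above synchronizes after every batch, which is simple but stalls the CPU each iteration. One possible alternative, sketched here under the assumption that timings are only needed after the loop, is to create a fresh event pair per step and read them all after a single synchronization. `time_steps` and `step_fn` are hypothetical names for this illustration.

```python
import torch


def time_steps(step_fn, num_steps):
    """Time each call of step_fn on the GPU, synchronizing only once at the end.

    Returns a list of per-step times in milliseconds, or None without CUDA.
    """
    if not torch.cuda.is_available():
        return None

    pairs = []
    for _ in range(num_steps):
        # A fresh pair per step, so earlier measurements are not overwritten.
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        step_fn()
        end.record()
        pairs.append((start, end))

    # A single synchronization after all steps have been queued.
    torch.cuda.synchronize()
    return [start.elapsed_time(end) for start, end in pairs]
```

The trade-off is that this keeps two events alive per step until the end, so for very long loops it may be preferable to drain the list periodically.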
4. Multi-stream synchronization

    import torch

    def multi_stream_sync():
        stream1 = torch.cuda.Stream()
        stream2 = torch.cuda.Stream()

        # Event used to order work across the two streams
        sync_event = torch.cuda.Event()

        with torch.cuda.stream(stream1):
            a = torch.randn(1000, 1000, device='cuda')
            b = torch.mm(a, a)
            # Mark the point in stream1 at which b is ready
            sync_event.record(stream=stream1)

        with torch.cuda.stream(stream2):
            # stream2 waits on the GPU for the event; the CPU is not blocked
            torch.cuda.current_stream().wait_event(sync_event)
            c = b * 2

        # Wait for both streams to finish
        torch.cuda.synchronize()

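As a variation on the example above, the following sketch uses known input values so the cross-stream result can be checked, and adds Tensor.record_stream() so the caching allocator knows the tensor is also in use on the consuming stream. The function and stream names are assumptions for the example; it returns None without a CUDA device.

```python
import torch


def produce_then_consume():
    """Return True if the consumer stream saw the producer's finished result."""
    if not torch.cuda.is_available():
        return None

    producer = torch.cuda.Stream()
    consumer = torch.cuda.Stream()
    ready = torch.cuda.Event()

    with torch.cuda.stream(producer):
        a = torch.ones(256, 256, device="cuda")
        b = torch.mm(a, a)  # every entry is 256.0
        ready.record()      # recorded on producer, the current stream here

    with torch.cuda.stream(consumer):
        consumer.wait_event(ready)  # GPU-side wait; the CPU does not block
        b.record_stream(consumer)   # tell the allocator b is used on consumer
        c = b * 2

    # Block the CPU until the consumer has finished before reading c.
    consumer.synchronize()
    return bool((c == 512.0).all().item())
```

Without wait_event(), the multiplication on the consumer stream could start before the matrix product has been written, so the check would not be reliable.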
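5. Synchronizing asynchronous operations correctly

    import torch

    def async_operations():
        # Event marking the point at which the data is ready
        data_ready = torch.cuda.Event()

        # Produce the data on the current (default) stream
        data = torch.randn(1000, 1000, device='cuda')
        data_ready.record()

        # Consume it on a separate stream
        compute_stream = torch.cuda.Stream()
        with torch.cuda.stream(compute_stream):
            # Make compute_stream wait until the data is ready
            compute_stream.wait_event(data_ready)
            result = data * data + 2

        # Block the CPU until the computation has finished
        compute_stream.synchronize()
        return result
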
Important notes
The enable_timing parameter:

    # Timing is off by default; pass enable_timing=True to use elapsed_time()
    event = torch.cuda.Event(enable_timing=True)

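Since enable_timing defaults to False, calling elapsed_time() on events created without it is expected to fail. The sketch below is one way to observe that. The function name `timing_requires_flag` is hypothetical, and because the exact exception type may depend on the PyTorch version, the sketch catches both RuntimeError and ValueError rather than assuming one.

```python
import torch


def timing_requires_flag():
    """Return the name of the exception raised by elapsed_time(), or None without CUDA."""
    if not torch.cuda.is_available():
        return None

    start = torch.cuda.Event()  # enable_timing defaults to False
    end = torch.cuda.Event()
    start.record()
    end.record()
    torch.cuda.synchronize()

    try:
        start.elapsed_time(end)
    except (RuntimeError, ValueError) as exc:  # assumption: type varies by version
        return type(exc).__name__
    return "no error"
```

Events without timing are still fully usable for synchronization, as in the multi-stream examples, and avoid the overhead of recording timestamps.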
Stream association:

    # record() uses the current stream by default; pass stream= to choose another
    event.record(stream=torch.cuda.current_stream())

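Choosing a synchronization method:

    # Wait only for this event
    event.synchronize()
    # Wait for all queued work on every stream of the current device
    torch.cuda.synchronize()
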
Performance profiling:

    def profile_operation():
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)

        # Warm up so one-time CUDA initialization is not included in the timing
        warmup = torch.randn(100, 100, device='cuda')
        _ = warmup * warmup
        torch.cuda.synchronize()

        start.record()
        # (the operation being measured goes here)
        end.record()

        torch.cuda.synchronize()
        return start.elapsed_time(end)

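A single measurement can be noisy, so a common extension of the pattern above is to warm up, repeat the measurement, and report a robust statistic. This is only a sketch: `benchmark_ms`, its parameters, and the choice of the median are assumptions, not part of the PyTorch API.

```python
import statistics

import torch


def benchmark_ms(fn, warmup=3, iters=10):
    """Median GPU time of fn() in milliseconds, or None without CUDA."""
    if not torch.cuda.is_available():
        return None

    # Warm-up runs absorb one-time costs such as kernel loading.
    for _ in range(warmup):
        fn()
    torch.cuda.synchronize()

    times = []
    for _ in range(iters):
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        fn()
        end.record()
        end.synchronize()  # wait for this run only
        times.append(start.elapsed_time(end))
    return statistics.median(times)
```

A call might look like `benchmark_ms(lambda: torch.mm(a, b))` with `a` and `b` already on the GPU, so that allocation and transfer are kept out of the measurement.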
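torch.cuda.Event gives precise control over the synchronization and timing of GPU operations, which makes it useful for performance optimization and debugging.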