Neural Networks: From Fundamentals to Modern AI · Interpretation and Visualization of Neural Networks

Model profiling: parameters, FLOPs, inference time

Introduction

Any discussion of how "fast" or "big" a neural network is quickly reduces to three metrics: the number of parameters (which determines the on-disk footprint and the memory needed to hold the weights), the number of floating-point operations, or FLOPs (a proxy for compute cost), and actual inference time (end-to-end latency on specific hardware). The three are only loosely correlated: one model may have twice as many parameters as another yet fewer FLOPs, or fewer FLOPs yet higher latency. This lesson covers:

(1) how to count parameters: k_h · k_w · C_in · C_out weights plus C_out biases for a convolution, and D_in · D_out + D_out for a Linear layer;
(2) how to count FLOPs at the layer level: for Conv2d, 2 · k_h · k_w · C_in · C_out · H_out · W_out; some sources omit the factor of 2, so two conventions coexist;
(3) the difference between FLOPs and MACs (multiply-accumulate operations): one MAC is one multiply plus one add, i.e. 2 FLOPs under the "×2" convention;
(4) typical budgets: AlexNet 60M params / 0.7 GFLOPs, ResNet-50 25M / 4.1 GFLOPs, EfficientNet-B0 5.3M / 0.39 GFLOPs, MobileNet-V3 5.4M / 0.22 GFLOPs, ViT-B 86M / 17.6 GFLOPs (published figures like these often count one multiply-accumulate as one operation, so check the convention before comparing);
(5) memory-bound vs compute-bound workloads (arithmetic intensity = FLOPs / bytes accessed, the Roofline model);
(6) acceleration techniques: depthwise-separable convolutions, group convolutions, channel pruning, quantization (FP32 → INT8 → INT4), knowledge distillation;
(7) practical profilers: PyTorch Profiler (torch.profiler), nvprof / Nsight Systems, TensorBoard, py-spy;
(8) how to measure latency properly: warmup runs, CUDA synchronization, statistics over many runs (e.g. N = 100), reported separately for batch=1 and batch=32.
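The counting rules in items (1) to (3) and the arithmetic-intensity definition in item (5) can be sketched in a few lines of plain Python. This is a minimal illustration, not a profiler: the function names are hypothetical, the two example layers (a 7×7 convolution from 3 to 64 channels with a 112×112 output, and a 2048 → 1000 Linear layer) are chosen only for illustration, bias additions are ignored in the operation counts, and the intensity estimate assumes that every weight, input and output value is moved through memory exactly once.

```python
"""Back-of-the-envelope parameter, MAC and FLOP counts for single layers."""


def conv2d_params(k_h, k_w, c_in, c_out, bias=True):
    # k_h * k_w * C_in weights per output channel, plus one bias per channel.
    return k_h * k_w * c_in * c_out + (c_out if bias else 0)


def linear_params(d_in, d_out, bias=True):
    # D_in * D_out weights, plus one bias per output feature.
    return d_in * d_out + (d_out if bias else 0)


def conv2d_macs(k_h, k_w, c_in, c_out, h_out, w_out):
    # Each of the C_out * H_out * W_out output values needs
    # k_h * k_w * C_in multiply-accumulates (bias add ignored).
    return k_h * k_w * c_in * c_out * h_out * w_out


def linear_macs(d_in, d_out):
    # One multiply-accumulate per weight, per sample.
    return d_in * d_out


def macs_to_flops(macs):
    # "x2" convention: one MAC = one multiply + one add = 2 FLOPs.
    return 2 * macs


def linear_arithmetic_intensity(d_in, d_out, batch=1, bytes_per_value=4):
    # FLOPs / bytes accessed. Simplifying assumption: weights, bias,
    # inputs and outputs each cross the memory bus exactly once.
    flops = macs_to_flops(linear_macs(d_in, d_out)) * batch
    values = d_in * d_out + d_out + batch * (d_in + d_out)
    return flops / (values * bytes_per_value)


if __name__ == "__main__":
    macs = conv2d_macs(7, 7, 3, 64, 112, 112)
    print("conv params (no bias):", conv2d_params(7, 7, 3, 64, bias=False))
    print("conv MACs:", macs, "-> FLOPs:", macs_to_flops(macs))
    print("linear params:", linear_params(2048, 1000))
    for b in (1, 32):
        ai = linear_arithmetic_intensity(2048, 1000, batch=b)
        print(f"linear FLOPs/byte at batch={b}: {ai:.2f}")
```

Under these assumptions the Linear layer performs roughly half a FLOP per byte at batch=1, because the weight matrix dominates memory traffic, and about 30 times more at batch=32, because the same weights are reused across the batch. That is one reason the same layer can sit on the memory-bound side of the Roofline at batch=1 and move toward the compute-bound side at larger batches, and why latency should be reported per batch size.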