Creating a TensorRT Engine from PyTorch
What is TensorRT?
A library for optimizing (lightening) deep learning models so they run efficiently on NVIDIA GPUs.
- Quantization & Precision Calibration
- TensorRT uses symmetric linear quantization, which lowers the precision of the FP32 data used by most deep learning frameworks to FP16 or INT8 (a toy sketch follows this list).
- Because a lower-precision network uses fewer bits for its weights and data, it can run faster and more efficiently.
- Lowering the precision to FP16 has little impact on model accuracy.
- Lowering the precision to INT8, however, can affect accuracy significantly.
- TensorRT therefore provides additional calibration methods: EntropyCalibrator, EntropyCalibrator2, and MinMaxCalibrator.
- Using these, the information lost in weights and intermediate tensors during quantization can be minimized.
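As a rough numpy illustration of the symmetric linear quantization idea above (a toy sketch of the concept, not TensorRT's actual implementation), a single per-tensor scale maps the largest absolute value to 127:

import numpy as np

def symmetric_quantize_int8(x):
    # One scale per tensor: the largest magnitude maps to 127.
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    # Approximate reconstruction of the original FP32 values.
    return q.astype(np.float32) * scale

x = np.random.randn(4, 4).astype(np.float32)
q, scale = symmetric_quantize_int8(x)
print("max abs error:", np.abs(x - dequantize(q, scale)).max())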
- Graph Optimization
- TensorRT optimizes the graph by applying Layer Fusion and Tensor Fusion together.
- Layer Fusion includes both Vertical Layer Fusion and Horizontal Layer Fusion.
- Tensor Fusion simplifies the model graph, reducing the number of layers in the model.
- In practice, optimizing backbone networks such as ResNet or MobileNet with TensorRT can reduce the original number of nodes by several tens of times.
- Kernel Auto-tuning
- TensorRT helps generate runtimes tailored to NVIDIA's various platforms and architectures.
- The optimal kernel differs per product depending on the number of CUDA cores, the architecture, the memory, and whether a serialized engine is included.
- These kernels are selected while the TensorRT runtime engine is built, so that the optimal engine binary is produced.
- Dynamic Tensor Memory & Multi-stream Execution
- Dynamic tensor memory manages memory to reduce the footprint and allow allocations to be reused.
- Multi-stream execution uses CUDA Streams to schedule multiple input streams and maximize parallel efficiency.
Creating a TensorRT Engine
Converting PyTorch to ONNX
- When creating the ONNX model, the input must have the same shape as the input the PyTorch model expects.
- As long as the shape matches, any random values will do.
- For torch.onnx.export, only the PyTorch model and an input value are needed to create an ONNX model.
- Because torch.onnx.export uses tracing rather than scripting by default, an example input must be provided.
- The supported operators differ by opset_version, so the version may need to be changed.
import io
import numpy as np
import torch
import torch.onnx
import torch.nn as nn
import torch.nn.init as init
import torch.utils.model_zoo as model_zoo

batch_size = 1

# torch_model is assumed to be an already-instantiated nn.Module,
# and PATH the path to its trained checkpoint.
torch_model.load_state_dict(torch.load(PATH))
torch_model.eval()

# Random input: only the shape has to match what the model expects.
x = torch.randn(batch_size, 1, 224, 224, requires_grad=True)
torch_out = torch_model(x)

torch.onnx.export(torch_model,              # model to be exported
                  x,                        # model input (or a tuple of inputs)
                  "super_resolution.onnx",  # where to save the model (file or file-like object)
                  export_params=True,       # store the trained weights inside the model file
                  opset_version=10,         # ONNX opset version to use for the export
                  do_constant_folding=True, # apply constant folding during optimization
                  input_names=['input'],    # name of the model's input
                  output_names=['output'],  # name of the model's output
                  dynamic_axes={'input': {0: 'batch_size'},    # axes with variable length
                                'output': {0: 'batch_size'}})
- Running the code above produces the file super_resolution.onnx.
- The .onnx format is widely used because it is compatible with many deep learning frameworks; a quick sanity check of the exported model is sketched below.
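A minimal sketch for validating the exported model, assuming the onnx and onnxruntime packages are installed and that x and torch_out from the export code above are still in scope:

import numpy as np
import onnx
import onnxruntime

# Structural check of the exported graph.
onnx_model = onnx.load("super_resolution.onnx")
onnx.checker.check_model(onnx_model)

# Run the same input through onnxruntime and compare against the PyTorch output.
ort_session = onnxruntime.InferenceSession("super_resolution.onnx")
ort_inputs = {ort_session.get_inputs()[0].name: x.detach().numpy()}
ort_outs = ort_session.run(None, ort_inputs)
np.testing.assert_allclose(torch_out.detach().numpy(), ort_outs[0], rtol=1e-03, atol=1e-05)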
Converting ONNX to a TensorRT Engine
- The .onnx file created in the previous step is now optimized with TensorRT.
- Because TensorRT is difficult to match with compatible versions of CUDA, cuDNN, and so on, using a Docker image saves a lot of time on environment setup.
- An example is shown below; only the version tag (23.01-py3 in the example) needs to be adjusted according to the release notes.
docker pull nvcr.io/nvidia/tensorrt:23.01-py3
- The version information is listed at the link below.
https://docs.nvidia.com/deeplearning/tensorrt/container-release-notes/running.html
- Once the Docker image is pulled, trtexec in the path below can be used to convert the ONNX file into a TensorRT .engine file.
workspace/tensorrt/bin#
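A typical way to start the container is sketched below; the --gpus flag requires the NVIDIA Container Toolkit, and mounting the current directory to /root is an assumption so that the ONNX file is reachable at /root/model.onnx as in the trtexec command later in the post:

docker run --gpus all -it --rm \
    -v $(pwd):/root \
    nvcr.io/nvidia/tensorrt:23.01-py3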
How to use trtexec (20.11-py3)
&&&& RUNNING TensorRT.trtexec # ./trtexec --help
=== Model Options ===
--uff=<file> UFF model
--onnx=<file> ONNX model
--model=<file> Caffe model (default = no model, random weights used)
--deploy=<file> Caffe prototxt file
--output=<name>[,<name>]* Output names (it can be specified multiple times); at least one output is required for UFF and Caffe
--uffInput=<name>,X,Y,Z Input blob name and its dimensions (X,Y,Z=C,H,W), it can be specified multiple times; at least one is required for UFF models
--uffNHWC Set if inputs are in the NHWC layout instead of NCHW (use X,Y,Z=H,W,C order in --uffInput)
=== Build Options ===
--maxBatch Set max batch size and build an implicit batch engine (default = 1)
--explicitBatch Use explicit batch sizes when building the engine (default = implicit)
--minShapes=spec Build with dynamic shapes using a profile with the min shapes provided
--optShapes=spec Build with dynamic shapes using a profile with the opt shapes provided
--maxShapes=spec Build with dynamic shapes using a profile with the max shapes provided
--minShapesCalib=spec Calibrate with dynamic shapes using a profile with the min shapes provided
--optShapesCalib=spec Calibrate with dynamic shapes using a profile with the opt shapes provided
--maxShapesCalib=spec Calibrate with dynamic shapes using a profile with the max shapes provided
Note: All three of min, opt and max shapes must be supplied.
However, if only opt shapes is supplied then it will be expanded so
that min shapes and max shapes are set to the same values as opt shapes.
In addition, use of dynamic shapes implies explicit batch.
Input names can be wrapped with escaped single quotes (ex: \\'Input:0\\').
Example input shapes spec: input0:1x3x256x256,input1:1x3x128x128
Each input shape is supplied as a key-value pair where key is the input name and
value is the dimensions (including the batch dimension) to be used for that input.
Each key-value pair has the key and value separated using a colon (:).
Multiple input shapes can be provided via comma-separated key-value pairs.
--inputIOFormats=spec Type and format of each of the input tensors (default = all inputs in fp32:chw)
See --outputIOFormats help for the grammar of type and format list.
Note: If this option is specified, please set comma-separated types and formats for all
inputs following the same order as network inputs ID (even if only one input
needs specifying IO format) or set the type and format once for broadcasting.
--outputIOFormats=spec Type and format of each of the output tensors (default = all outputs in fp32:chw)
Note: If this option is specified, please set comma-separated types and formats for all
outputs following the same order as network outputs ID (even if only one output
needs specifying IO format) or set the type and format once for broadcasting.
IO Formats: spec ::= IOfmt[","spec]
IOfmt ::= type:fmt
type ::= "fp32"|"fp16"|"int32"|"int8"
fmt ::= ("chw"|"chw2"|"chw4"|"hwc8"|"chw16"|"chw32"|"dhwc8")["+"fmt]
--workspace=N Set workspace size in megabytes (default = 16)
--noBuilderCache Disable timing cache in builder (default is to enable timing cache)
--nvtxMode=mode Specify NVTX annotation verbosity. mode ::= default|verbose|none
--minTiming=M Set the minimum number of iterations used in kernel selection (default = 1)
--avgTiming=M Set the number of times averaged in each iteration for kernel selection (default = 8)
--noTF32 Disable tf32 precision (default is to enable tf32, in addition to fp32)
--refit Mark the engine as refittable. This will allow the inspection of refittable layers
and weights within the engine.
--fp16 Enable fp16 precision, in addition to fp32 (default = disabled)
--int8 Enable int8 precision, in addition to fp32 (default = disabled)
--best Enable all precisions to achieve the best performance (default = disabled)
--calib=<file> Read INT8 calibration cache file
--safe Only test the functionality available in safety restricted flows
--saveEngine=<file> Save the serialized engine
--loadEngine=<file> Load a serialized engine
--tacticSources=tactics Specify the tactics to be used by adding (+) or removing (-) tactics from the default
tactic sources (default = all available tactics).
Note: Currently only cuBLAS and cuBLAS LT are listed as optional tactics.
Tactic Sources: tactics ::= [","tactic]
tactic ::= (+|-)lib
lib ::= "cublas"|"cublasLt"
=== Inference Options ===
--batch=N Set batch size for implicit batch engines (default = 1)
--shapes=spec Set input shapes for dynamic shapes inference inputs.
Note: Use of dynamic shapes implies explicit batch.
Input names can be wrapped with escaped single quotes (ex: \\'Input:0\\').
Example input shapes spec: input0:1x3x256x256, input1:1x3x128x128
Each input shape is supplied as a key-value pair where key is the input name and
value is the dimensions (including the batch dimension) to be used for that input.
Each key-value pair has the key and value separated using a colon (:).
Multiple input shapes can be provided via comma-separated key-value pairs.
--loadInputs=spec Load input values from files (default = generate random inputs). Input names can be wrapped with single quotes (ex: 'Input:0')
Input values spec ::= Ival[","spec]
Ival ::= name":"file
--iterations=N Run at least N inference iterations (default = 10)
--warmUp=N Run for N milliseconds to warmup before measuring performance (default = 200)
--duration=N Run performance measurements for at least N seconds wallclock time (default = 3)
--sleepTime=N Delay inference start with a gap of N milliseconds between launch and compute (default = 0)
--streams=N Instantiate N engines to use concurrently (default = 1)
--exposeDMA Serialize DMA transfers to and from device. (default = disabled)
--noDataTransfers Do not transfer data to and from the device during inference. (default = disabled)
--useSpinWait Actively synchronize on GPU events. This option may decrease synchronization time but increase CPU usage and power (default = disabled)
--threads Enable multithreading to drive engines with independent threads (default = disabled)
--useCudaGraph Use cuda graph to capture engine execution and then launch inference (default = disabled)
--separateProfileRun Do not attach the profiler in the benchmark run; if profiling is enabled, a second profile run will be executed (default = disabled)
--buildOnly Skip inference perf measurement (default = disabled)
=== Build and Inference Batch Options ===
When using implicit batch, the max batch size of the engine, if not given,
is set to the inference batch size;
when using explicit batch, if shapes are specified only for inference, they
will be used also as min/opt/max in the build profile; if shapes are
specified only for the build, the opt shapes will be used also for inference;
if both are specified, they must be compatible; and if explicit batch is
enabled but neither is specified, the model must provide complete static
dimensions, including batch size, for all inputs
=== Reporting Options ===
--verbose Use verbose logging (default = false)
--avgRuns=N Report performance measurements averaged over N consecutive iterations (default = 10)
--percentile=P Report performance for the P percentage (0<=P<=100, 0 representing max perf, and 100 representing min perf; (default = 99%)
--dumpRefit Print the refittable layers and weights from a refittable engine
--dumpOutput Print the output tensor(s) of the last inference iteration (default = disabled)
--dumpProfile Print profile information per layer (default = disabled)
--exportTimes=<file> Write the timing results in a json file (default = disabled)
--exportOutput=<file> Write the output tensors to a json file (default = disabled)
--exportProfile=<file> Write the profile information per layer in a json file (default = disabled)
=== System Options ===
--device=N Select cuda device N (default = 0)
--useDLACore=N Select DLA core N for layers that support DLA (default = none)
--allowGPUFallback When DLA is enabled, allow GPU fallback for unsupported layers (default = disabled)
--plugins Plugin library (.so) to load (can be specified multiple times)
=== Help ===
--help, -h Print this message
- TensorRT engine settings (version 20.11-py3)
- Model Options
- --model=<file>
- Caffe model (default = no model, random weights used)
- Build Options
- --maxBatch
- Set max batch size and build an implicit batch engine (default = 1)
- --explicitBatch
- Use explicit batch sizes when building the engine (default = implicit)
- --inputIOFormats=spec
- Type and format of each of the input tensors (default = all inputs in fp32:chw)
- See --outputIOFormats help for the grammar of type and format list. Note: If this option is specified, please set comma-separated types and formats for all inputs following the same order as network inputs ID (even if only one input needs specifying IO format) or set the type and format once for broadcasting.
- --outputIOFormats=spec
- Type and format of each of the output tensors (default = all outputs in fp32:chw)
- Note: If this option is specified, please set comma-separated types and formats for all outputs following the same order as network outputs ID (even if only one output needs specifying IO format) or set the type and format once for broadcasting.
- IO Formats:
- spec ::= IOfmt[","spec]
- IOfmt ::= type:fmt
- type ::= "fp32"|"fp16"|"int32"|"int8"
- fmt ::= ("chw"|"chw2"|"chw4"|"hwc8"|"chw16"|"chw32"|"dhwc8")["+"fmt]
- --workspace=N
- Set workspace size in megabytes (default = 16)
- --noBuilderCache
- Disable timing cache in builder (default is to enable timing cache)
- --minTiming=M
- Set the minimum number of iterations used in kernel selection (default = 1)
- --avgTiming=M
- Set the number of times averaged in each iteration for kernel selection (default = 8)
- --noTF32
- Disable tf32 precision (default is to enable tf32, in addition to fp32)
- --fp16
- Enable fp16 precision, in addition to fp32 (default = disabled)
- --int8
- Enable int8 precision, in addition to fp32 (default = disabled)
- --best
- Enable all precisions to achieve the best performance (default = disabled)
- Inference Options
- --batch=N
- Set batch size for implicit batch engines (default = 1)
- --loadInputs=spec
- Load input values from files (default = generate random inputs). Input names can be wrapped with single quotes (ex: 'Input:0') Input values spec ::= Ival[","spec] Ival ::= name":"file
- --iterations=N
- Run at least N inference iterations (default = 10)
- --warmUp=N
- Run for N milliseconds to warmup before measuring performance (default = 200)
- --duration=N
- Run performance measurements for at least N seconds wallclock time (default = 3)
- --sleepTime=N
- Delay inference start with a gap of N milliseconds between launch and compute (default = 0)
- --streams=N
- Instantiate N engines to use concurrently (default = 1)
- --exposeDMA
- Serialize DMA transfers to and from device. (default = disabled)
- --noDataTransfers
- Do not transfer data to and from the device during inference. (default = disabled)
- --useSpinWait
- Actively synchronize on GPU events. This option may decrease synchronization time but increase CPU usage and power (default = disabled)
- --threads
- Enable multithreading to drive engines with independent threads (default = disabled)
- --useCudaGraph
- Use cuda graph to capture engine execution and then launch inference (default = disabled)
- --separateProfileRun
- Do not attach the profiler in the benchmark run; if profiling is enabled, a second profile run will be executed (default = disabled)
- --buildOnly
- Skip inference perf measurement (default = disabled)
- Build and Inference Batch Options
- When using implicit batch, the max batch size of the engine, if not given, is set to the inference batch size;
- when using explicit batch, if shapes are specified only for inference, they will be used also as min/opt/max in the build profile; if shapes are specified only for the build, the opt shapes will be used also for inference;
- if both are specified, they must be compatible; and if explicit batch is enabled but neither is specified, the model must provide complete static dimensions, including batch size, for all inputs
- Reporting Options
- --verbose
- Use verbose logging (default = false)
- --avgRuns=N
- Report performance measurements averaged over N consecutive iterations (default = 10)
- --percentile=P
- Report performance for the P percentage (0<=P<=100, 0 representing max perf, and 100 representing min perf; (default = 99%)
- --dumpRefit
- Print the refittable layers and weights from a refittable engine
- --dumpOutput
- Print the output tensor(s) of the last inference iteration (default = disabled)
- --dumpProfile
- Print profile information per layer (default = disabled)
- --exportTimes=<file>
- Write the timing results in a json file (default = disabled)
- --exportOutput=<file>
- Write the output tensors to a json file (default = disabled)
- --exportProfile=<file>
- Write the profile information per layer in a json file (default = disabled)
- System Options
- --device=N
- Select cuda device N (default = 0)
- --useDLACore=N
- Select DLA core N for layers that support DLA (default = none)
- --allowGPUFallback
- When DLA is enabled, allow GPU fallback for unsupported layers (default = disabled)
- --plugins
- Plugin library (.so) to load (can be specified multiple times)
- Convert the ONNX file into an engine file with trtexec (example below).
workspace/tensorrt/bin# ./trtexec --onnx=/root/model.onnx --saveEngine=./model.engine --verbose
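The command above builds a default FP32 engine. As a sketch of how the build options listed earlier could be combined (the input name 'input' matches the export code above; the min/opt/max shapes are assumptions), an FP16 build with a dynamic batch dimension might look like this:

workspace/tensorrt/bin# ./trtexec --onnx=/root/model.onnx --saveEngine=./model_fp16.engine \
    --fp16 \
    --minShapes=input:1x1x224x224 \
    --optShapes=input:4x1x224x224 \
    --maxShapes=input:8x1x224x224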
- When the engine file is created, a log like the one below is printed.