Overview
NVIDIA GPU Passthrough Enabled
This container offers a ready-to-use environment with optimized AI frameworks, GPU computing passthrough, and industrial-grade reliability on NVIDIA Jetson platforms. It lets users focus on building AI applications on Advantech edge AI platforms accelerated by NVIDIA hardware, without the friction of hardware bring-up, AI framework installation, and version incompatibilities.
Key Features
- Full Hardware Acceleration: Optimized access to GPU, NVENC/NVDEC, and DLA
- Complete AI Framework Stack: PyTorch, TensorFlow, ONNX Runtime, and TensorRT
- Industrial Vision Support: Accelerated OpenCV and GStreamer pipelines
- Edge AI Capabilities: Support for computer vision, LLMs, and time-series analysis
- Performance Optimized: Tuned specifically for Jetson Orin NX 8GB
Hardware Specifications
Component | Specification |
---|---|
Target Hardware | Advantech EPC-7300 L2-02 / NVIDIA Jetson Orin NX |
GPU | NVIDIA Ampere architecture with 1024 CUDA cores |
DLA Cores | 1 (Deep Learning Accelerator) |
Memory | 8GB shared GPU/CPU memory |
JetPack Version | 5.1 (L4T R35.2.1) |
Software Components
Component | Version | Description |
---|---|---|
CUDA | 11.4.315 | GPU computing platform |
cuDNN | 8.6.0 | Deep Neural Network library |
TensorRT | 8.5.2.2 | Inference optimizer and runtime |
PyTorch | 2.0.0+nv23.02 | Deep learning framework |
TensorFlow | 2.12.0+nv23.05 | Machine learning framework |
ONNX Runtime | 1.16.3 | Cross-platform inference engine |
OpenCV | 4.5.0 | Computer vision library with CUDA |
GStreamer | 1.16.2 | Multimedia framework |
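To confirm the stack is wired up correctly inside the container, a quick sanity check such as the following can be run (a minimal sketch; it assumes `python3` and the frameworks above are on the container's default path):

```bash
# Check the CUDA toolkit version
nvcc --version

# Confirm each framework imports and sees the GPU
python3 -c "import tensorrt; print('TensorRT', tensorrt.__version__)"
python3 -c "import torch; print('PyTorch', torch.__version__, '| CUDA:', torch.cuda.is_available())"
python3 -c "import tensorflow as tf; print('TensorFlow', tf.__version__, '| GPUs:', tf.config.list_physical_devices('GPU'))"
python3 -c "import onnxruntime as ort; print('ONNX Runtime', ort.__version__, '| providers:', ort.get_available_providers())"
python3 -c "import cv2; print('OpenCV', cv2.__version__, '| CUDA devices:', cv2.cuda.getCudaEnabledDeviceCount())"
```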
Supported AI Capabilities
Vision Models
Model Family | Versions | Performance (FPS) | Quantization Support |
---|---|---|---|
YOLO | v3/v4/v5 (up to v5.6.0), v6 (up to v6.2), v7 (up to v7.0), v8 (up to v8.0) | YOLOv5s: 45-60 @ 640x640, YOLOv8n: 40-55 @ 640x640, YOLOv8s: 30-40 @ 640x640 | INT8, FP16, FP32 |
SSD | MobileNetV1/V2 SSD, EfficientDet-D0/D1 | MobileNetV2 SSD: 50-65 @ 300x300, EfficientDet-D0: 25-35 @ 512x512 | INT8, FP16, FP32 |
Faster R-CNN | ResNet50/ResNet101 backbones | ResNet50: 3-5 @ 1024x1024 | FP16, FP32 |
Segmentation | DeepLabV3+, UNet | DeepLabV3+ (MobileNetV2): 12-20 @ 512x512 | INT8, FP16, FP32 |
Classification | ResNet (18/50), MobileNet (V1/V2/V3), EfficientNet (B0-B2) | ResNet18: 120-150 @ 224x224, MobileNetV2: 180-210 @ 224x224 | INT8, FP16, FP32 |
Pose Estimation | PoseNet, HRNet (up to W18) | PoseNet: 15-25 @ 256x256 | FP16, FP32 |
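The quantization modes above map directly onto TensorRT engine builds. As an illustration, an FP16 engine for a YOLOv5s ONNX export could be built with `trtexec` as follows; the model file and its input tensor name (`images` in standard YOLOv5 exports) are placeholders for your own model:

```bash
# Build an FP16 TensorRT engine from an exported YOLOv5s ONNX model
/usr/src/tensorrt/bin/trtexec \
  --onnx=yolov5s.onnx \
  --saveEngine=yolov5s_fp16.engine \
  --fp16 \
  --shapes=images:1x3x640x640
```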
Recommended Language Models
Model Family | Versions | Memory Requirements | Performance Notes |
---|---|---|---|
DeepSeek Coder | Mini (1.3B), Light (1.5B) | 2-3 GB | 10-15 tokens/sec in FP16 |
TinyLlama | 1.1B | 2 GB | 8-12 tokens/sec in FP16 |
Phi | Phi-1.5 (1.3B), Phi-2 (2.7B) | 1.5-3 GB | Phi-1.5: 8-12 tokens/sec in FP16, Phi-2: 4-8 tokens/sec in FP16 |
Llama 2 | 7B (Quantized to 4-bit) | 3-4 GB | 1-2 tokens/sec in INT4/INT8 |
Mistral | 7B (Quantized to 4-bit) | 3-4 GB | 1-2 tokens/sec in INT4/INT8 |
DeepSeek R1 1.5B Optimization Recommendations:
- Supports INT4-8 quantization for inference
- Best performance with TensorRT engine conversion
- Typical throughput: 8-12 tokens/sec in FP16, 12-18 tokens/sec in INT8
- Recommended batch size: 1-2 for real-time applications
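As a rough illustration of running one of the smaller models in FP16 at batch size 1, the sketch below uses the Hugging Face transformers library, which is not part of the stock stack and would need to be pip-installed first; the TinyLlama checkpoint is a placeholder:

```bash
python3 - <<'EOF'
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint; any ~1-2B parameter model loads similarly
name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(
    name, torch_dtype=torch.float16).to("cuda").eval()

# Batch size 1, as recommended for real-time applications
inputs = tok("Edge AI on Jetson is", return_tensors="pt").to("cuda")
out = model.generate(**inputs, max_new_tokens=32)
print(tok.decode(out[0], skip_special_tokens=True))
EOF
```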
Supported AI Model Formats
Format | Support Level | Compatible Versions | Notes |
---|---|---|---|
ONNX | Full | 1.10.0 - 1.16.3 | Recommended for cross-framework compatibility |
TensorRT | Full | 7.x - 8.5.x | Best for performance-critical applications |
PyTorch (JIT) | Full | 1.8.0 - 2.0.0 | Native support via TorchScript |
TensorFlow SavedModel | Full | 2.8.0 - 2.12.0 | Recommended TF deployment format |
TFLite | Partial | Up to 2.12.0 | May have limited hardware acceleration |
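For ONNX models, ONNX Runtime can be pointed at its TensorRT and CUDA execution providers. This is a minimal sketch in which `model.onnx` is a placeholder and any dynamic input dimensions are pinned to 1:

```bash
python3 - <<'EOF'
import numpy as np
import onnxruntime as ort

# Prefer TensorRT, then CUDA, then CPU
sess = ort.InferenceSession(
    "model.onnx",  # placeholder path
    providers=["TensorrtExecutionProvider",
               "CUDAExecutionProvider",
               "CPUExecutionProvider"])

# Feed random data shaped like the model's first input
inp = sess.get_inputs()[0]
shape = [d if isinstance(d, int) else 1 for d in inp.shape]
x = np.random.rand(*shape).astype(np.float32)
print([o.shape for o in sess.run(None, {inp.name: x})])
EOF
```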
Hardware Acceleration Support
Accelerator | Support Level | Compatible Libraries | Notes |
---|---|---|---|
CUDA | Full | PyTorch, TensorFlow, OpenCV, ONNX Runtime | Primary acceleration method |
TensorRT | Full | ONNX, TensorFlow, PyTorch (via export) | Recommended for inference optimization |
cuDNN | Full | PyTorch, TensorFlow | Accelerates deep learning primitives |
NVDEC | Full | GStreamer, FFmpeg | Hardware video decoding |
NVENC | Full | GStreamer, FFmpeg | Hardware video encoding |
DLA | Partial | TensorRT | Requires specific model optimization |
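To target the DLA core from TensorRT, an engine can be built with GPU fallback enabled for layers the DLA does not support (the model path is a placeholder; DLA requires FP16 or INT8 precision):

```bash
# Build a TensorRT engine on DLA core 0 with GPU fallback
/usr/src/tensorrt/bin/trtexec \
  --onnx=model.onnx \
  --saveEngine=model_dla.engine \
  --fp16 \
  --useDLACore=0 \
  --allowGPUFallback
```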
Precision Support
Precision | Support Level | Compatible Frameworks | Notes |
---|---|---|---|
FP32 | Full | All | Baseline precision, highest accuracy |
FP16 | Full | All | 2x memory reduction, minimal accuracy impact |
INT8 | Partial | TensorRT, ONNX Runtime | 4x memory reduction, requires calibration |
INT4 | Limited | TensorRT (via plugins) | For models specifically optimized for INT4 |
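For INT8, TensorRT needs a calibration step over representative input data. With a calibration cache already generated, an INT8 engine build looks like this (file names are placeholders):

```bash
# Build an INT8 TensorRT engine using a pre-generated calibration cache
/usr/src/tensorrt/bin/trtexec \
  --onnx=model.onnx \
  --saveEngine=model_int8.engine \
  --int8 \
  --calib=calibration.cache
```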
Video/Camera Processing
GStreamer Integration
Built with NVIDIA-accelerated GStreamer plugins supporting:
Feature | Support Level | Profile / Version Support | Notes |
---|---|---|---|
H.264 Encoding | Full | Up to High profile | Hardware accelerated via NVENC |
H.265/HEVC Encoding | Full | Up to Main10 profile | Hardware accelerated via NVENC |
VP9 Encoding | Full | Up to Profile 0 | Hardware accelerated |
AV1 Encoding | Partial | Limited feature set | Experimental support |
Hardware Decoding | Full | H.264/H.265/VP9 | Via NVDEC |
RTSP Server | Full | GStreamer RTSP Server 1.16.2 | Streaming capabilities |
RTSP Client | Full | GStreamer 1.16.2 | Low-latency streaming reception |
Camera Capture | Full | V4L2, ArgusCamera | Direct camera integration |
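Two representative hardware-accelerated pipelines are sketched below, assuming a CSI camera and an RTSP source (the stream URL is a placeholder):

```bash
# Decode an RTSP H.264 stream via NVDEC and display it
gst-launch-1.0 rtspsrc location=rtsp://<camera-ip>/stream latency=100 ! \
  rtph264depay ! h264parse ! nvv4l2decoder ! nv3dsink

# Capture from a CSI camera and encode to H.265 via NVENC
gst-launch-1.0 nvarguscamerasrc ! \
  'video/x-raw(memory:NVMM),width=1920,height=1080,framerate=30/1' ! \
  nvv4l2h265enc bitrate=8000000 ! h265parse ! \
  matroskamux ! filesink location=capture.mkv
```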
Quick Start Guide
Installation
```bash
# Clone the repository
git clone https://github.com/Advantech_COE/L2-02
cd L2-02

# Make the build script executable
chmod +x build.sh

# Build and launch the container
./build.sh
```
Model Optimization Workflows
For optimal performance, follow these recommended model conversion paths:
ONNX Models
Original Framework → ONNX → TensorRT Engine
PyTorch Models
PyTorch → TorchScript → ONNX → TensorRT Engine
or
PyTorch → ONNX → TensorRT Engine
TensorFlow Models
TensorFlow → SavedModel → ONNX → TensorRT Engine
or
TensorFlow → SavedModel → TensorRT Engine
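As a concrete instance of the PyTorch path, the sketch below exports a model to ONNX and then builds an FP16 engine. The torchvision ResNet18 is only a stand-in for your own model, and the sketch assumes torchvision is installed alongside PyTorch:

```bash
# Step 1: export the PyTorch model to ONNX
python3 - <<'EOF'
import torch, torchvision

model = torchvision.models.resnet18(weights=None).eval()
dummy = torch.randn(1, 3, 224, 224)  # example input for tracing
torch.onnx.export(model, dummy, "resnet18.onnx",
                  input_names=["input"], output_names=["output"],
                  opset_version=13)
EOF

# Step 2: build a TensorRT engine from the ONNX file
/usr/src/tensorrt/bin/trtexec \
  --onnx=resnet18.onnx \
  --saveEngine=resnet18_fp16.engine \
  --fp16
```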
Best Practices
Memory Management
- Pre-allocate GPU memory where possible
- Batch inference for better throughput
- Use stream processing for continuous data
Precision Selection
- Start with FP16 for most models (good balance of accuracy/performance)
- Quantize to INT8 for vision models requiring higher throughput
- Use FP32 only when precision is critical
Model Optimization
- Optimize model architecture for edge deployment
- Remove training-specific layers
- Apply pruning techniques for smaller models
- Use TensorRT for inference optimization
Video Processing
- Use hardware-accelerated decoders/encoders
- Process at native resolution when possible
- Consider downsampling for higher throughput
Known Limitations
- LLM Support: Large language models over 3B parameters are not recommended due to memory constraints.
- ONNX Runtime: GPU acceleration is limited for complex operators; for best performance, convert ONNX models to TensorRT engines.
- Mixed Precision: Some operations may fall back to FP32 even in FP16 mode.
- Display Acceleration: X11 forwarding performance may be limited.
Copyright © 2025 Advantech Corporation. All rights reserved.