Overview
NVIDIA GPU Passthrough Enabled
This container offers a ready-to-use environment with optimized AI frameworks, GPU computing passthrough, and industrial-grade reliability on NVIDIA Jetson platforms. It lets users focus on building AI applications on Advantech edge AI platforms accelerated by NVIDIA hardware, without the friction of hardware bring-up, AI framework installation, and version incompatibilities.
Key Features
- Full Hardware Acceleration: Optimized access to GPU, NVENC/NVDEC, and DLA
- Complete AI Framework Stack: PyTorch, TensorFlow, ONNX Runtime, and TensorRT
- Industrial Vision Support: Accelerated OpenCV and GStreamer pipelines
- Edge AI Capabilities: Support for computer vision, LLMs, and time-series analysis
- Performance Optimized: Tuned specifically for Jetson Orin NX 8GB
Hardware Specifications
Component | Specification |
---|---|
Target Hardware | Advantech EPC-7300 L2-02 / NVIDIA Jetson Orin NX |
GPU | NVIDIA Ampere architecture with 1024 CUDA cores |
DLA Cores | 1 (Deep Learning Accelerator) |
Memory | 8GB shared GPU/CPU memory |
JetPack Version | 5.1 (L4T R35.2.1) |
Software Components
Component | Version | Description |
---|---|---|
CUDA | 11.4.315 | GPU computing platform |
cuDNN | 8.6.0 | Deep Neural Network library |
TensorRT | 8.5.2.2 | Inference optimizer and runtime |
PyTorch | 2.0.0+nv23.02 | Deep learning framework |
TensorFlow | 2.12.0+nv23.05 | Machine learning framework |
ONNX Runtime | 1.16.3 | Cross-platform inference engine |
OpenCV | 4.5.0 | Computer vision library with CUDA |
GStreamer | 1.16.2 | Multimedia framework |
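To confirm the stack is wired up correctly inside the container, a quick sanity check such as the following can be run (a minimal sketch; it assumes `python3` and the frameworks above are on the container's default path):

```bash
# Check the CUDA toolkit version
nvcc --version

# Confirm each framework imports and sees the GPU
python3 -c "import tensorrt; print('TensorRT', tensorrt.__version__)"
python3 -c "import torch; print('PyTorch', torch.__version__, '| CUDA:', torch.cuda.is_available())"
python3 -c "import tensorflow as tf; print('TensorFlow', tf.__version__, '| GPUs:', tf.config.list_physical_devices('GPU'))"
python3 -c "import onnxruntime as ort; print('ONNX Runtime', ort.__version__, '| providers:', ort.get_available_providers())"
python3 -c "import cv2; print('OpenCV', cv2.__version__, '| CUDA devices:', cv2.cuda.getCudaEnabledDeviceCount())"
```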
Supported AI Capabilities
Vision Models
Model Family | Versions | Performance (FPS) | Quantization Support |
---|---|---|---|
YOLO | v3/v4/v5 (up to v5.6.0), v6 (up to v6.2), v7 (up to v7.0), v8 (up to v8.0) | YOLOv5s: 45-60 @ 640x640, YOLOv8n: 40-55 @ 640x640, YOLOv8s: 30-40 @ 640x640 | INT8, FP16, FP32 |
SSD | MobileNetV1/V2 SSD, EfficientDet-D0/D1 | MobileNetV2 SSD: 50-65 @ 300x300, EfficientDet-D0: 25-35 @ 512x512 | INT8, FP16, FP32 |
Faster R-CNN | ResNet50/ResNet101 backbones | ResNet50: 3-5 @ 1024x1024 | FP16, FP32 |
Segmentation | DeepLabV3+, UNet | DeepLabV3+ (MobileNetV2): 12-20 @ 512x512 | INT8, FP16, FP32 |
Classification | ResNet (18/50), MobileNet (V1/V2/V3), EfficientNet (B0-B2) | ResNet18: 120-150 @ 224x224, MobileNetV2: 180-210 @ 224x224 | INT8, FP16, FP32 |
Pose Estimation | PoseNet, HRNet (up to W18) | PoseNet: 15-25 @ 256x256 | FP16, FP32 |
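The quantization modes above map directly onto TensorRT engine builds. As an illustration, an FP16 engine for a YOLOv5s ONNX export could be built with `trtexec` as follows; the model file and its input tensor name (`images` in standard YOLOv5 exports) are placeholders for your own model:

```bash
# Build an FP16 TensorRT engine from an exported YOLOv5s ONNX model
/usr/src/tensorrt/bin/trtexec \
  --onnx=yolov5s.onnx \
  --saveEngine=yolov5s_fp16.engine \
  --fp16 \
  --shapes=images:1x3x640x640
```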
Recommended Language Models
Model Family | Versions | Memory Requirements | Performance Notes |
---|---|---|---|
DeepSeek Coder | Mini (1.3B), Light (1.5B) | 2-3 GB | 10-15 tokens/sec in FP16 |
TinyLlama | 1.1B | 2 GB | 8-12 tokens/sec in FP16 |
Phi | Phi-1.5 (1.3B), Phi-2 (2.7B) | 1.5-3 GB | Phi-1.5: 8-12 tokens/sec in FP16, Phi-2: 4-8 tokens/sec in FP16 |
Llama 2 | 7B (Quantized to 4-bit) | 3-4 GB | 1-2 tokens/sec in INT4/INT8 |
Mistral | 7B (Quantized to 4-bit) | 3-4 GB | 1-2 tokens/sec in INT4/INT8 |
DeepSeek R1 1.5B Optimization Recommendations:
- Supports INT4-8 quantization for inference
- Best performance with TensorRT engine conversion
- Typical throughput: 8-12 tokens/sec in FP16, 12-18 tokens/sec in INT8
- Recommended batch size: 1-2 for real-time applications
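As a rough illustration of running one of the smaller models in FP16 at batch size 1, the sketch below uses the Hugging Face transformers library, which is not part of the stock stack and would need to be pip-installed first; the TinyLlama checkpoint is a placeholder:

```bash
python3 - <<'EOF'
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint; any ~1-2B parameter model loads similarly
name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(
    name, torch_dtype=torch.float16).to("cuda").eval()

# Batch size 1, as recommended for real-time applications
inputs = tok("Edge AI on Jetson is", return_tensors="pt").to("cuda")
out = model.generate(**inputs, max_new_tokens=32)
print(tok.decode(out[0], skip_special_tokens=True))
EOF
```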
Supported AI Model Formats
Format | Support Level | Compatible Versions | Notes |
---|---|---|---|
ONNX | Full | 1.10.0 - 1.16.3 | Recommended for cross-framework compatibility |
TensorRT | Full | 7.x - 8.5.x | Best for performance-critical applications |
PyTorch (JIT) | Full | 1.8.0 - 2.0.0 | Native support via TorchScript |
TensorFlow SavedModel | Full | 2.8.0 - 2.12.0 | Recommended TF deployment format |
TFLite | Partial | Up to 2.12.0 | May have limited hardware acceleration |
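For ONNX models, ONNX Runtime can be pointed at its TensorRT and CUDA execution providers. This is a minimal sketch in which `model.onnx` is a placeholder and any dynamic input dimensions are pinned to 1:

```bash
python3 - <<'EOF'
import numpy as np
import onnxruntime as ort

# Prefer TensorRT, then CUDA, then CPU
sess = ort.InferenceSession(
    "model.onnx",  # placeholder path
    providers=["TensorrtExecutionProvider",
               "CUDAExecutionProvider",
               "CPUExecutionProvider"])

# Feed random data shaped like the model's first input
inp = sess.get_inputs()[0]
shape = [d if isinstance(d, int) else 1 for d in inp.shape]
x = np.random.rand(*shape).astype(np.float32)
print([o.shape for o in sess.run(None, {inp.name: x})])
EOF
```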
Hardware Acceleration Support
Accelerator | Support Level | Compatible Libraries | Notes |
---|---|---|---|
CUDA | Full | PyTorch, TensorFlow, OpenCV, ONNX Runtime | Primary acceleration method |
TensorRT | Full | ONNX, TensorFlow, PyTorch (via export) | Recommended for inference optimization |
cuDNN | Full | PyTorch, TensorFlow | Accelerates deep learning primitives |
NVDEC | Full | GStreamer, FFmpeg | Hardware video decoding |
NVENC | Full | GStreamer, FFmpeg | Hardware video encoding |
DLA | Partial | TensorRT | Requires specific model optimization |
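To target the DLA core from TensorRT, an engine can be built with GPU fallback enabled for layers the DLA does not support (the model path is a placeholder; DLA requires FP16 or INT8 precision):

```bash
# Build a TensorRT engine on DLA core 0 with GPU fallback
/usr/src/tensorrt/bin/trtexec \
  --onnx=model.onnx \
  --saveEngine=model_dla.engine \
  --fp16 \
  --useDLACore=0 \
  --allowGPUFallback
```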
Precision Support
Precision | Support Level | Compatible Frameworks | Notes |
---|---|---|---|
FP32 | Full | All | Baseline precision, highest accuracy |
FP16 | Full | All | 2x memory reduction, minimal accuracy impact |
INT8 | Partial | TensorRT, ONNX Runtime | 4x memory reduction, requires calibration |
INT4 | Limited | TensorRT (via plugins) | For models specifically optimized for INT4 |
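For INT8, TensorRT needs a calibration step over representative input data. With a calibration cache already generated, an INT8 engine build looks like this (file names are placeholders):

```bash
# Build an INT8 TensorRT engine using a pre-generated calibration cache
/usr/src/tensorrt/bin/trtexec \
  --onnx=model.onnx \
  --saveEngine=model_int8.engine \
  --int8 \
  --calib=calibration.cache
```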
Video/Camera Processing
GStreamer Integration
Built with NVIDIA-accelerated GStreamer plugins supporting:
Feature | Support Level | Profile / Version Support | Notes |
---|---|---|---|
H.264 Encoding | Full | Up to High profile | Hardware accelerated via NVENC |
H.265/HEVC Encoding | Full | Up to Main10 profile | Hardware accelerated via NVENC |
VP9 Encoding | Full | Up to Profile 0 | Hardware accelerated |
AV1 Encoding | Partial | Limited feature set | Experimental support |
Hardware Decoding | Full | H.264/H.265/VP9 | Via NVDEC |
RTSP Server | Full | GStreamer RTSP Server 1.16.2 | Streaming capabilities |
RTSP Client | Full | GStreamer 1.16.2 | Low-latency streaming reception |
Camera Capture | Full | V4L2, ArgusCamera | Direct camera integration |
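Two representative hardware-accelerated pipelines are sketched below, assuming a CSI camera and an RTSP source (the stream URL is a placeholder):

```bash
# Decode an RTSP H.264 stream via NVDEC and display it
gst-launch-1.0 rtspsrc location=rtsp://<camera-ip>/stream latency=100 ! \
  rtph264depay ! h264parse ! nvv4l2decoder ! nv3dsink

# Capture from a CSI camera and encode to H.265 via NVENC
gst-launch-1.0 nvarguscamerasrc ! \
  'video/x-raw(memory:NVMM),width=1920,height=1080,framerate=30/1' ! \
  nvv4l2h265enc bitrate=8000000 ! h265parse ! \
  matroskamux ! filesink location=capture.mkv
```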
Quick Start Guide
Installation
```bash
# Clone the repository
git clone https://github.com/Advantech_COE/L2-02
cd L2-02

# Make the build script executable
chmod +x build.sh

# Build and launch the container
./build.sh
```
Model Optimization Workflows
For optimal performance, follow these recommended model conversion paths:
ONNX Models
Original Framework → ONNX → TensorRT Engine
PyTorch Models
PyTorch → TorchScript → ONNX → TensorRT Engine
or
PyTorch → ONNX → TensorRT Engine
TensorFlow Models
TensorFlow → SavedModel → ONNX → TensorRT Engine
or
TensorFlow → SavedModel → TensorRT Engine
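As a concrete instance of the PyTorch path, the sketch below exports a model to ONNX and then builds an FP16 engine. The torchvision ResNet18 is only a stand-in for your own model, and the sketch assumes torchvision is installed alongside PyTorch:

```bash
# Step 1: export the PyTorch model to ONNX
python3 - <<'EOF'
import torch, torchvision

model = torchvision.models.resnet18(weights=None).eval()
dummy = torch.randn(1, 3, 224, 224)  # example input for tracing
torch.onnx.export(model, dummy, "resnet18.onnx",
                  input_names=["input"], output_names=["output"],
                  opset_version=13)
EOF

# Step 2: build a TensorRT engine from the ONNX file
/usr/src/tensorrt/bin/trtexec \
  --onnx=resnet18.onnx \
  --saveEngine=resnet18_fp16.engine \
  --fp16
```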
Best Practices
Memory Management
- Pre-allocate GPU memory where possible
- Batch inference for better throughput
- Use stream processing for continuous data
Precision Selection
- Start with FP16 for most models (good balance of accuracy/performance)
- Quantize to INT8 for vision models requiring higher throughput
- Use FP32 only when precision is critical
Model Optimization
- Optimize model architecture for edge deployment
- Remove training-specific layers
- Apply pruning techniques for smaller models
- Use TensorRT for inference optimization
Video Processing
- Use hardware-accelerated decoders/encoders
- Process at native resolution when possible
- Consider downsampling for higher throughput
Known Limitations
- LLM Support: Large language models over 3B parameters are not recommended due to memory constraints.
- ONNX Runtime: GPU acceleration is limited for complex operators; for best performance, convert ONNX models to TensorRT engines.
- Mixed Precision: Some operations may fall back to FP32 even in FP16 mode.
- Display Acceleration: X11 forwarding performance may be limited.
Copyright © 2025 Advantech Corporation. All rights reserved.