Machine learning models are often written and trained in Python, but when it comes to deploying them at scale, Python-based inference servers can become a bottleneck due to slower startup times and higher runtime overhead. Rust solves this challenge by offering predictable performance, minimal memory footprint, and exceptional concurrency support—making it an excellent choice for a high-throughput, low-latency prediction API.
In this tutorial, you will learn how to build a fast REST API in Rust for serving machine learning model predictions using:
- Axum and Actix Web for web routing and request handling
- Serde for JSON serialization
- Tracing for observability
- Tract or ort (ONNX Runtime) for inference on a pre-exported model
- Tokio for async execution
- Cross-language model loading, allowing you to train in Python and serve in Rust
You will build a clean, modular project structure that loads an ML model at startup, exposes a /predict endpoint, and returns fast predictions with consistent response times.
What You Will Build
- A minimal but production-oriented Rust API
- A model loader abstraction
- An inference service that reuses a single loaded model
- A prediction endpoint that processes structured input
- Optional: benchmarking the Rust API versus a Python microservice
Prerequisites
Before diving in, ensure that you have:
- Rust 1.75+ installed
- Basic familiarity with Rust async programming
- A trained ML model exported to ONNX (or another supported format)
- Optional: a Python environment if you want to compare with Python inference
Project Setup and Dependencies
Before we start building the web API, let’s set up the project structure and install the required dependencies. We’ll use Axum for the initial web framework setup (a later section shows the prediction endpoint with Actix Web), Tokio as the async runtime, and a model inference library such as tract or ort (ONNX Runtime), depending on your selected ML model format. This setup ensures a lightweight, fast, and production-ready Rust environment.
1. Create a New Rust Project
Start by creating a new Rust binary project:
cargo new rust-ml-api
cd rust-ml-api
This generates the default Rust project layout:
rust-ml-api/
├── Cargo.toml
└── src/
    └── main.rs
2. Add Required Dependencies
Open Cargo.toml and add the following dependencies.
Axum and Tokio
We’ll use Axum to build the HTTP routes and Tokio for async execution.
[dependencies]
axum = "0.8.7"
tokio = { version = "1.40", features = ["full"] }
serde = { version = "1.0", features = ["derive"] }
serde_json = "1.0"
tower = "0.5"
tower-http = { version = "0.6.6", features = ["cors", "trace"] }
tracing = "0.1"
tracing-subscriber = { version = "0.3", features = ["fmt", "env-filter"] }
3. Add Your ML Inference Library
Depending on your model type, choose one of the following:
Option A: ONNX Models (Recommended) — ONNX Runtime for Rust
ort = { version = "1.20", features = ["download-binaries"] }
Benefits:
✔ Fast runtime
✔ Supports GPU/CPU
✔ Works with TensorFlow, PyTorch, Scikit-learn exported models
Option B: Pure-Rust ONNX Inference — Tract
tract-onnx = "0.22.0"
tract-core = "0.22.0"
Benefits:
✔ Pure Rust inference
✔ Great for embedded or serverless environments
4. Optional Utilities
Add these optional but highly recommended crates:
dotenvy = "0.15" # Environment variables
anyhow = "1.0" # Error handling
thiserror = "2.0.17" # Custom error types
5. Enable CORS (Optional but Useful for Client Apps)
If your API will be consumed by a frontend (React, Vue, Flutter, etc.), you’ll need CORS:
use axum::http::Method;
use tower_http::cors::{Any, CorsLayer};

let cors = CorsLayer::new()
    .allow_origin(Any)
    .allow_methods([Method::POST])
    .allow_headers(Any);
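To apply the layer, attach it to your router. A minimal sketch, assuming an Axum Router and a hypothetical predict_handler function:
use axum::{routing::post, Router};

// `predict_handler` is a placeholder for your own handler function
let app = Router::new()
    .route("/predict", post(predict_handler))
    .layer(cors);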
6. Final Project Structure Overview
At this stage, your project layout should look like this:
rust-ml-api/
├── Cargo.toml
├── .env             # optional
├── models/          # ML models (ONNX, TFLite)
└── src/
    ├── main.rs
    └── handlers.rs   # request handlers
Loading the ML Model in Rust (Using Tract)
Tract is a pure-Rust machine-learning inference engine with ONNX support. It’s fast, portable, has zero external dependencies, and works well for building high-performance Rust APIs.
1. Add Tract to Cargo.toml
Update your Cargo.toml with:
[dependencies]
tract-onnx = "0.22.0"
tract-core = "0.22.0"
That’s all you need — Tract brings ONNX support without GPU drivers, Python bindings, or external shared libraries.
2. Loading an ONNX Model
Below is a minimal example of loading and preparing an ONNX model with Tract:
Create a new module model.rs:
use tract_onnx::prelude::*;

pub fn load_model() -> TractResult<TypedSimplePlan<TypedModel>> {
    // Load the ONNX model file
    let model = tract_onnx::onnx()
        .model_for_path("model.onnx")?
        // Make the model "typed" (infer all shapes & types) and optimize the graph
        .into_optimized()?
        // Create an execution plan
        .into_runnable()?;

    Ok(model)
}
What this does:
- model_for_path loads the ONNX model.
- into_optimized() performs graph optimization (constant folding, operator fusion).
- into_runnable() builds a fast execution plan for inference.
This is usually run once at startup and stored in an Arc so the API can reuse it across requests.
3. Running Predictions with Tract
Later in the API (e.g., in your Axum or Actix handler), you’ll perform inference like this:
pub fn predict(
    model: &TypedSimplePlan<TypedModel>,
    input: Vec<f32>,
    shape: &[usize],
) -> TractResult<Tensor> {
    // Build a tensor with the expected shape from the flat input vector
    let input_tensor = tract_ndarray::ArrayD::from_shape_vec(shape.to_vec(), input)?.into_tensor();

    // Run the execution plan; outputs come back as TValues
    let result = model.run(tvec!(input_tensor.into_tvalue()))?;

    // Take the first output and return it as an owned Tensor
    Ok(result[0].clone().into_tensor())
}
Key Points:
- Inputs must be Vec<f32> (or whichever type your model requires).
- The shape must match the model’s expected dimensions, e.g. [1, 4] for a 4-value input.
- The output is returned as a Tensor, which you can extract with:
  let values: Vec<f32> = result.to_array_view::<f32>()?.iter().cloned().collect();
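For context, here is a minimal usage sketch. It assumes a hypothetical model that expects a [1, 4] f32 input; adjust the shape and values to your own model:
use tract_onnx::prelude::*;
use crate::model::{load_model, predict};

fn run_once() -> TractResult<()> {
    // Load the plan once, then reuse it for every prediction
    let plan = load_model()?;
    let output = predict(&plan, vec![0.1, 0.2, 0.3, 0.4], &[1, 4])?;

    // Convert the output tensor into a plain Vec<f32>
    let values: Vec<f32> = output.to_array_view::<f32>()?.iter().cloned().collect();
    println!("prediction: {:?}", values);
    Ok(())
}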
4. Recommended Model Initialization Pattern
Inside main.rs, load your model once and store it in shared state:
use std::sync::Arc;
use tract_onnx::prelude::*;
use crate::model::load_model;

mod model;

// Shared application state holding the compiled model plan
#[derive(Clone)]
struct AppState {
    model: Arc<TypedSimplePlan<TypedModel>>,
}

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    let model = Arc::new(load_model()?);
    // Pass the model into your web API state here (Axum example)
    let app_state = AppState { model };
    // Continue building your API...
    Ok(())
}
Creating the Prediction API Endpoint (Actix Web)
Now that your project is set up and your ONNX model loads correctly with tract-onnx, it's time to expose a fast prediction endpoint using Actix Web. The setup section used Axum, but the model-loading pattern is identical in either framework; for this endpoint, add actix-web = "4" to your Cargo.toml.
To keep things simple, all code stays inside:
src/main.rs
✅ Final Working main.rs
This version:
- Loads the ONNX model once at startup
- Wraps it in web::Data for a shared AppState
- Accepts JSON input
- Converts the input into a tract_onnx tensor
- Runs inference safely
- Returns predictions as JSON
main.rs
use actix_web::{ post, web, App, HttpResponse, HttpServer, Responder };
use serde::{ Deserialize, Serialize };
use tract_onnx::prelude::*;
use std::sync::Mutex;
#[derive(Deserialize)]
struct PredictRequest {
input: Vec<f32>,
}
#[derive(Serialize)]
struct PredictResponse {
prediction: Vec<f32>,
}
struct AppState {
model: Mutex<TypedSimplePlan<TypedModel>>,
}
#[post("/predict")]
async fn predict(data: web::Data<AppState>, req: web::Json<PredictRequest>) -> impl Responder {
let model = data.model.lock().unwrap();
// Create Tensor from the input vector
let input_tensor = Tensor::from_shape(&[req.input.len()], &req.input).unwrap();
// Run inference
let result = model.run(tvec!(input_tensor.into_tvalue()));
match result {
Ok(outputs) => {
let output_tensor = outputs[0].to_array_view::<f32>().unwrap();
let output_vec = output_tensor.iter().cloned().collect::<Vec<f32>>();
HttpResponse::Ok().json(PredictResponse {
prediction: output_vec,
})
}
Err(e) => HttpResponse::InternalServerError().body(format!("{:?}", e)),
}
}
#[actix_web::main]
async fn main() -> std::io::Result<()> {
// Load ONNX model
let model = tract_onnx::onnx()
.model_for_path("model.onnx")
.unwrap()
.into_optimized()
.unwrap()
.into_runnable()
.unwrap();
let state = web::Data::new(AppState {
model: Mutex::new(model),
});
HttpServer::new(move || { App::new().app_data(state.clone()).service(predict) })
.bind(("127.0.0.1", 8080))?
.run().await
}
Testing the Prediction Endpoint
Now that your /predict endpoint is running, let’s test it using curl, HTTPie, Postman, or Thunder Client.
🧪 1. Test With Curl
Send a JSON request containing a float vector:
curl -X POST http://localhost:8082/predict \
-H "Content-Type: application/json" \
-d '{"input": [0.5, 1.2, -0.7]}'
Expected response (exact values depend on your model):
{"prediction":[1.0,2.4,-1.4]}
🧪 2. Test With HTTPie
http POST :8082/predict input:='[0.5,1.2,-0.7]'
🧪 3. Test Using JavaScript (fetch)
const res = await fetch("http://localhost:8082/predict", {
method: "POST",
headers: { "Content-Type": "application/json" },
body: JSON.stringify({ input: [0.5, 1.2, -0.7] })
});
const data = await res.json();
console.log(data);
🧪 4. Expected Behavior
- The ONNX model is loaded once at startup and shared through web::Data
- The input vector is converted to a Tensor
- The model is executed using tract_onnx
- The first output tensor is extracted and returned as JSON
Performance Tips for a Fast Rust ML Prediction API
Rust already gives you excellent performance, but ONNX inference can still become a bottleneck if not optimized correctly.
Below are practical, real-world performance techniques tailored specifically for Actix-Web + Tract ONNX.
1. Load the Model Once (NOT Per Request)
If the model is loaded inside the handler (as in the simplified logging example later in this tutorial), that is fine for testing, but extremely slow for production.
Why?
Loading a Tract model:
- Reads model.onnx from disk
- Builds the computation graph
- Optimizes it
- Allocates tensors
This process is expensive.
Fix: Load Once at Startup
let model = web::Data::new(model); // wrap the compiled plan once

HttpServer::new(move || {
    App::new()
        .app_data(model.clone())
        .service(predict)
})
And inside predict, access it through web::Data and reuse it.
2. Pre-Optimize & Make Runnable at Startup
Tract supports model compilation:
- into_optimized()
- into_runnable()
Move these operations to the startup so inference becomes only:
input → run → output
3. Use Mutex or RwLock Sparingly
A compiled SimplePlan exposes run(&self), so the plan itself can be shared across workers behind an Arc without a lock. If you keep mutable per-request state (for example a reusable SimpleState or preallocated buffers), wrap that state in:
- Arc<Mutex<_>> (safe but serializes access)
- Arc<RwLock<_>> (better for many readers)
Keeping lock time short is crucial. Only hold the lock during:
let outputs = model.run(...)
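As a sketch (assuming the Mutex-wrapped AppState from the Actix example above), scope the lock so it is released before the response is serialized:
// Lock only for the duration of the inference call
let outputs = {
    let model = data.model.lock().unwrap();
    model.run(tvec!(input_tensor.into_tvalue()))
}; // the guard is dropped here, before JSON serialization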
4. Avoid Allocations for Each Request
Your handler builds a fresh tensor (and clones the request vector) on every call, for example:
let input_tensor: Tensor = Array1::from(req.input.clone()).into_tensor();
Allocations per request = slower API.
Improve: Reuse Tensor Buffers
For high-traffic APIs, prepare preallocated tensor buffers and fill them manually.
You can create a buffer once:
let mut tensor = tract_ndarray::Array1::<f32>::zeros(max_input).into_tensor();
Then update only the necessary values.
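A minimal sketch of the reuse pattern, assuming the preallocated tensor above lives in shared state and max_input is at least as large as the request:
// Copy the request values into the preallocated buffer instead of allocating
let slice = tensor.as_slice_mut::<f32>().expect("f32 tensor");
slice[..req.input.len()].copy_from_slice(&req.input);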
5. Keep the Model Small
Larger ONNX models:
- Load slower
- Infer slower
- Consume more memory
Tips:
- Use quantization (an int8 model instead of float32)
- Reduce layers
- Merge layers
- Use smaller ops (e.g., Conv → DepthwiseConv)
6. Use Release Mode in Production
Rust debug builds are very slow for numerical code.
Always run:
cargo run --release
or build:
cargo build --release
Performance is often 10× faster.
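Optionally, you can squeeze a little more out of release builds with link-time optimization. A common (optional) tweak in Cargo.toml:
[profile.release]
lto = true          # whole-program link-time optimization
codegen-units = 1   # slower compile, faster binary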
7. Use Multiple Workers for Multi-Core Scaling
Actix spawns one worker per physical CPU core by default. To set the worker count explicitly (this example uses the num_cpus crate):
HttpServer::new(|| App::new().service(predict))
    .workers(num_cpus::get())
This uses all CPU cores effectively.
8. Consider Axum as an Alternative Stack
The setup section already added Axum; for I/O-heavy workloads either framework performs well, so treat switching as an optional note for the tutorial.
9. Profile with perf or cargo flamegraph
To determine bottlenecks:
cargo install flamegraph
cargo flamegraph
You will see:
- Inference time
- Actix overhead
- Disk I/O
- Memory allocation
10. Use JSON Batching (Optional)
If your API predicts for many inputs, batching helps:
Input:
{ "inputs": [[...], [...], [...]] }
Tract then runs a single batched (matrix) inference, which is far faster than one call per input; see the sketch below.
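A minimal sketch of turning a batched request into a single [batch, features] tensor (the BatchRequest type and features count are assumptions for illustration):
use tract_onnx::prelude::*;

// Hypothetical request type for batched inputs
struct BatchRequest {
    inputs: Vec<Vec<f32>>,
}

fn batch_tensor(req: &BatchRequest, features: usize) -> TractResult<Tensor> {
    // Flatten the rows into one contiguous buffer
    let batch = req.inputs.len();
    let flat: Vec<f32> = req.inputs.iter().flatten().cloned().collect();
    // Shape [batch, features] so the model runs one matrix inference
    Ok(tract_ndarray::Array2::from_shape_vec((batch, features), flat)?.into_tensor())
}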
11. Optional: Use TensorRT or ORT for GPU Acceleration
Tract is CPU-only.
If you need GPU inference:
- Use onnxruntime (GPU execution provider)
- Use TensorRT via tensorrt-rs
For the tutorial, keep this as a “future enhancement.”
Logging & Error Handling Improvements
A production-ready ML inference API should include clear logging and safe error handling. This ensures you can troubleshoot failed predictions, model loading issues, and malformed inputs without exposing internal details to end users.
Below are the best practices and improvements you can apply to your Actix + Rust + Tract ONNX service.
1. Add Logging (env_logger)
Add the dependency:
[dependencies]
env_logger = "0.11"
log = "0.4"
Initialize logging in main():
env_logger::init();
log::info!("Starting Rust ML prediction API on http://localhost:8082");
Now you can log events inside handlers:
log::info!("Received prediction request with {} floats", req.input.len());
2. Improve Model Loading Error Messages
Replace .unwrap() with readable messages:
let model = match tract_onnx::onnx().model_for_path("model.onnx") {
    Ok(m) => m,
    Err(e) => {
        log::error!("Failed to load model: {}", e);
        return HttpResponse::InternalServerError()
            .body(format!("Model load error: {}", e));
    }
};
This prevents panics and logs the real issue.
3. Validate Input Shape Before Running the Model
Instead of assuming the user sends the correct number of values:
if req.input.is_empty() {
log::warn!("Empty input received");
return HttpResponse::BadRequest().body("Input array cannot be empty");
}
You can also enforce the exact expected length:
const EXPECTED_INPUT: usize = 3; // example
if req.input.len() != EXPECTED_INPUT {
log::error!(
"Invalid input length: got {}, expected {}",
req.input.len(),
EXPECTED_INPUT
);
return HttpResponse::BadRequest()
.body(format!("Input must contain exactly {} values", EXPECTED_INPUT));
}
4. Handle Model Inference Errors Gracefully
Replace:
let outputs = model.run(tvec![input_tensor.into_tvalue()]).unwrap();
With:
let outputs = match model.run(tvec![input_tensor.into_tvalue()]) {
Ok(out) => out,
Err(e) => {
log::error!("Inference error: {:?}", e);
return HttpResponse::InternalServerError()
.body(format!("Prediction error: {:?}", e));
}
};
This avoids server crashes if the model fails.
5. Add Logging Around Response
Before returning output:
log::info!("Prediction output: {:?}", output_vec);
This helps quickly verify model quality and detect anomalies.
6. Example: Updated Predict Handler With Logs + Errors
Here’s the improved predict function:
use actix_web::{ post, web, App, HttpResponse, HttpServer, Responder };
use serde::{ Deserialize, Serialize };
use tract_onnx::prelude::*;
#[derive(Deserialize)]
struct PredictRequest {
input: Vec<f32>,
}
#[derive(Serialize)]
struct PredictResponse {
prediction: Vec<f32>,
}
#[post("/predict")]
async fn predict(req: web::Json<PredictRequest>) -> impl Responder {
log::info!("Received /predict request");
if req.input.is_empty() {
log::warn!("Rejecting empty input array");
return HttpResponse::BadRequest().body("Input array cannot be empty");
}
// Try loading the model (per request here for simplicity;
// in production, load it once at startup as shown in Performance Tips)
let model = match tract_onnx::onnx().model_for_path("model.onnx") {
Ok(m) => m,
Err(e) => {
log::error!("Model load error: {}", e);
return HttpResponse::InternalServerError().body(format!("Model load error: {}", e));
}
};
// Build input shape
let input_shape = tvec![req.input.len()];
let model = match
model.with_input_fact(0, InferenceFact::dt_shape(f32::datum_type(), input_shape))
{
Ok(m) => m,
Err(e) => {
log::error!("Failed to set input shape: {}", e);
return HttpResponse::InternalServerError().body(format!("Model shape error: {}", e));
}
};
let model = match model.into_optimized().and_then(|m| m.into_runnable()) {
Ok(m) => m,
Err(e) => {
log::error!("Model optimization error: {}", e);
return HttpResponse::InternalServerError().body(
format!("Model optimization error: {}", e)
);
}
};
// Prepare input tensor
let input_tensor: Tensor = tract_ndarray::Array1::from(req.input.clone()).into_tensor();
// Run inference
let outputs = match model.run(tvec![input_tensor.into_tvalue()]) {
Ok(out) => out,
Err(e) => {
log::error!("Prediction failed: {:?}", e);
return HttpResponse::InternalServerError().body(format!("Prediction error: {:?}", e));
}
};
// Extract result
let output_vec: Vec<f32> = match outputs[0].to_array_view::<f32>() {
    Ok(view) => view.iter().cloned().collect(),
    Err(e) => {
        log::error!("Failed to read output tensor: {:?}", e);
        return HttpResponse::InternalServerError().body("Invalid output tensor");
    }
};
log::info!("Prediction result: {:?}", output_vec);
HttpResponse::Ok().json(PredictResponse {
prediction: output_vec,
})
}
#[actix_web::main]
async fn main() -> std::io::Result<()> {
env_logger::init();
log::info!("Starting Rust ML prediction API on http://localhost:8082");
HttpServer::new(|| App::new().service(predict))
.bind(("127.0.0.1", 8082))?
.run().await
}
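With env_logger in place, the log level is controlled through the RUST_LOG environment variable, for example:
RUST_LOG=info cargo run --release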
Deploying the Rust ML API (Docker, Release Build, systemd, etc.)
Once your Rust ML prediction API is working locally, the next step is deploying it into a production-ready environment. This section covers three common deployment approaches:
- Optimized release build
- Docker container
- Linux service using systemd
1. Build a Release-Optimized Binary
Rust’s --release flag enables all compiler optimizations, producing a much faster binary.
cargo build --release
Your optimized executable appears here:
target/release/rust-ml-api
Copy this binary together with your model.onnx file to your production server:
/opt/rust-ml-api/
├── rust-ml-api
└── model.onnx
Run it:
./rust-ml-api
2. Deploy Using Docker
2.1 Create a Production Dockerfile
Here is a minimal & efficient multi-stage Dockerfile:
# ========= Build Stage =========
FROM rust:1.78 AS builder
WORKDIR /app
COPY . .
RUN cargo build --release
# ========= Runtime Stage =========
FROM debian:stable-slim
# Install SSL certs and CA roots for Actix/TLS
RUN apt-get update && apt-get install -y ca-certificates && apt-get clean
WORKDIR /app
# Copy binary + model
COPY --from=builder /app/target/release/rust-ml-api /app/rust-ml-api
COPY model.onnx /app/model.onnx
EXPOSE 8082
CMD ["/app/rust-ml-api"]
2.2 Build & Run
Build:
docker build -t rust-ml-api .
Run:
docker run -p 8082:8082 rust-ml-api
Test:
curl -X POST http://localhost:8082/predict \
-H "Content-Type: application/json" \
-d '{"input":[0.5,1.2,-3.1]}'
3. Deploy as a systemd Service
This is ideal for VPS or bare-metal Linux servers.
3.1 Move application files
Place your binary and model:
/opt/rust-ml-api/
├── rust-ml-api
└── model.onnx
3.2 Create systemd service
Create file:
sudo nano /etc/systemd/system/rust-ml-api.service
Paste:
[Unit]
Description=Rust ML Prediction API
After=network.target
[Service]
WorkingDirectory=/opt/rust-ml-api
ExecStart=/opt/rust-ml-api/rust-ml-api
Restart=always
RestartSec=3
User=www-data
Environment=RUST_LOG=info
[Install]
WantedBy=multi-user.target
Enable + start service:
sudo systemctl daemon-reload
sudo systemctl enable rust-ml-api
sudo systemctl start rust-ml-api
Check logs:
sudo journalctl -u rust-ml-api -f
4. Reverse Proxy with Nginx (Optional)
Nginx config:
server {
listen 80;
server_name yourdomain.com;
location / {
proxy_pass http://127.0.0.1:8082;
proxy_set_header Host $host;
}
}
Reload:
sudo systemctl reload nginx
5. Security Recommendations
✔ Run as non-root
✔ Serve behind Nginx or Caddy
✔ Enable HTTPS using Let’s Encrypt
✔ Restrict inbound ports with a firewall
✔ Store ML models read-only
6. Updating Your Deployed API
Replace binary + model:
sudo systemctl stop rust-ml-api
cp rust-ml-api /opt/rust-ml-api/
sudo systemctl start rust-ml-api
Check status:
sudo systemctl status rust-ml-api
Benchmarking the API & Production-Ready Docker Compose Setup
Now that your Rust ML API is fully working and optimized, let’s benchmark it and prepare a production-ready deployment using Docker Compose. This will help you validate performance and reliably run the service on any server.
1. Benchmarking the Rust ML API
Rust is extremely fast, but you should still benchmark to measure:
- Requests per second
- Latency under concurrent load
- CPU and memory usage
Below are recommended tools and ready-to-run commands.
1.1 Using wrk (Recommended)
If you have wrk installed:
wrk -t4 -c100 -d30s http://localhost:8082/predict \
--latency \
-s bench.lua
Create bench.lua:
wrk.method = "POST"
wrk.body = '{"input":[1.0,2.0,3.0]}'
wrk.headers["Content-Type"] = "application/json"
This tests:
- 4 threads
- 100 concurrent connections
- 30 seconds duration
You will get results like:
- Requests/sec
- Latency distribution
- Transfer/sec
1.2 Using hey (a Go-based load testing tool)
Install:
brew install hey
Benchmark:
hey -n 10000 -c 100 -m POST \
-H "Content-Type: application/json" \
-d '{"input":[1.0,2.0,3.0]}' \
http://localhost:8082/predict
1.3 Using ab (ApacheBench)
ab -n 5000 -c 50 -p body.json -T application/json \
http://localhost:8082/predict
body.json:
{"input":[1.0,2.0,3.0]}
2. Profiling Your Rust ML API
For deeper analysis, you can run:
perf (Linux)
sudo perf record ./target/release/rust-ml-api
sudo perf report
cargo flamegraph
cargo install flamegraph
cargo flamegraph
These tools reveal bottlenecks inside ONNX inference and Actix handlers.
3. Production-Ready Docker Compose Setup
Now let’s make your deployment robust and reproducible.
3.1 Dockerfile (Optimized Release Build)
Create:
Dockerfile
# 1. Build stage
FROM rust:1.81-slim AS builder
WORKDIR /app
# Cache dependencies first
COPY Cargo.toml Cargo.lock ./
RUN mkdir src && echo "fn main() {}" > src/main.rs
RUN cargo build --release || true
# Copy real source
COPY . .
# Build release
RUN cargo build --release
# 2. Runtime stage
FROM debian:stable-slim
WORKDIR /app
COPY --from=builder /app/target/release/rust-ml-api .
COPY model.onnx .
EXPOSE 8082
CMD ["./rust-ml-api"]
3.2 Create docker-compose.yml
docker-compose.yml
version: "3.9"
services:
rust-ml-api:
build: .
container_name: rust_ml_api
restart: unless-stopped
environment:
RUST_LOG: info
ports:
- "8082:8082"
deploy:
resources:
limits:
cpus: "1.0"
memory: "512M"
This setup provides:
- Auto-restart
- CPU and memory limits
- Clean separation of build and runtime
- Exposed API on port 8082
3.3 Running in Production
docker compose up -d --build
Check logs:
docker compose logs -f rust-ml-api
Check if it works:
curl -X POST http://localhost:8082/predict \
-H "Content-Type: application/json" \
-d '{"input":[1.0,2.0,3.0]}'
4. Optional: Systemd for Bare-Metal Deployment
If deploying on a VPS without Docker, create:
/etc/systemd/system/rust-ml-api.service
[Unit]
Description=Rust ML Prediction API
After=network.target
[Service]
User=www-data
WorkingDirectory=/opt/rust-ml-api
ExecStart=/opt/rust-ml-api/rust-ml-api
Restart=always
Environment=RUST_LOG=info
[Install]
WantedBy=multi-user.target
Enable and run:
sudo systemctl enable rust-ml-api
sudo systemctl start rust-ml-api
Conclusion – Building a Fast Rust ML Prediction API
You’ve now built a complete, production-ready Rust Web API capable of running machine learning model predictions using Actix Web and Tract ONNX. Throughout this tutorial, you walked through every major step required to deploy a high-performance inference service:
🔥 What You Accomplished
1. Project Setup & Dependencies
You created a clean Rust project using Actix Web and the Tract ONNX inference engine ― optimized for speed and zero-copy tensor handling.
2. Loading & Optimizing the ML Model
You learned how Tract ingests ONNX models, infers shape information, and compiles an optimized computational graph for fast execution.
3. Building a Prediction Endpoint
You implemented a /predict POST endpoint that:
- Accepts a vector of floats in JSON format
- Converts it to a tensor
- Runs the ONNX model
- Returns predictions as JSON
Your final working implementation is clean, simple, and easy to extend.
4. Serving a Pre-Exported ONNX Model
You served a small, pre-exported ONNX model with Tract, so the API runs end to end without training a real neural network as part of the tutorial.
5. Performance Tips
You learned how to:
- Reuse the loaded model instead of loading it per request
- Choose release mode for real performance
- Run the service behind a reverse proxy such as NGINX
- Use controlled batching or request limits
6. Logging & Error Handling
You added improvements to make the API:
- More reliable
- Easier to debug
- More production-ready
7. Deployment
You explored different deployment approaches:
- Docker & multi-stage builds
- Systemd services
- Production optimizations for small image sizes
8. Benchmarking
You learned how to stress-test your API using tools like wrk, hey, or ab (ApacheBench) to measure:
- Latency
- Throughput
- Requests per second (RPS)
🚀 Final Thoughts
Rust continues to shine in scenarios where performance, safety, and low memory usage are critical. By combining Rust with Tract ONNX and Actix Web:
- You get an inference service that is as fast or faster than Python,
- With the robustness and safety you expect from Rust,
- In a tiny deployment footprint that fits well in containers or edge devices.
Even with a small demo ONNX model, this foundation is ready for real-world ML workloads — from recommendation systems to anomaly detection to embedded AI on IoT devices.
You can find the full source code on our GitHub.
That's just the basics. If you want to go deeper with Rust, you can take one of the following affordable courses:
- Learn to Code with Rust
- Rust: The Complete Developer's Guide
- Master The Rust Programming Language : Beginner To Advanced
- Embedded Rust Development with STM32: Absolute Beginners
- Build an AutoGPT Code Writing AI Tool With Rust and GPT-4
- Rust Programming Bootcamp - 100 Projects in 100 Days
- Learn Rust by Building Real Applications
- Building web APIs with Rust (advanced)
- Developing P2P Applications with Rust
- Real time web applications in Rust
Thanks!
