Creating a Fast Rust Web API for ML Model Predictions

by Didin J. on Nov 24, 2025

Create a fast Rust ML prediction API using Actix Web and Tract. Learn model loading, inference, performance tuning, deployment, and benchmarking.

Machine learning models are often written and trained in Python, but when it comes to deploying them at scale, Python-based inference servers can become a bottleneck due to slower startup times and higher runtime overhead. Rust solves this challenge by offering predictable performance, minimal memory footprint, and exceptional concurrency support—making it an excellent choice for a high-throughput, low-latency prediction API.

In this tutorial, you will learn how to build a fast REST API in Rust for serving machine learning model predictions using:

  • Axum 0.8+ or Actix Web 4 for web routing and request handling (the prediction endpoint later in this tutorial is implemented with Actix Web)

  • Serde for JSON serialization

  • Tracing for observability

  • Tract, onnxruntime (ort), or another engine that runs a pre-exported model format for inference

  • Tokio for async execution

  • Cross-language model loading, allowing you to train in Python and serve in Rust

You will build a clean, modular project structure that loads an ML model at startup, exposes a /predict endpoint, and returns fast predictions with consistent response times.

What You Will Build

  • A minimal but production-oriented Rust API

  • A model loader abstraction

  • An inference service with caching

  • A prediction endpoint that processes structured input

  • Optional: benchmarking the Rust API versus a Python microservice

Prerequisites

Before diving in, ensure that you have:

  • Rust 1.75+ installed

  • Basic familiarity with Rust async programming

  • A trained ML model exported to ONNX (or another supported format)

  • Optional: Python environment if you want to compare with Python inference


Project Setup and Dependencies

Before we start building the web API, let’s set up the project structure and install the required dependencies. We’ll start with Axum for the web framework, Tokio as the async runtime, and a model inference library such as tract or ort (ONNX Runtime), depending on your selected ML model format. Later, when we build the prediction endpoint, the HTTP layer switches to Actix Web; the model-loading code stays the same either way. This setup ensures a lightweight, fast, and production-ready Rust environment.

1. Create a New Rust Project

Start by creating a new Rust binary project:

cargo new rust-ml-api
cd rust-ml-api

This generates the default Rust project layout:

rust-ml-api/
 ├── Cargo.toml
 └── src/
      └── main.rs

2. Add Required Dependencies

Open Cargo.toml and add the following dependencies.

Axum and Tokio

We’ll use Axum to build the HTTP routes and Tokio for async execution.

[dependencies]
axum = "0.8.7"
tokio = { version = "1.40", features = ["full"] }
serde = { version = "1.0", features = ["derive"] }
serde_json = "1.0"
tower = "0.5"
tower-http = { version = "0.6.6", features = ["cors", "trace"] }
tracing = "0.1"
tracing-subscriber = { version = "0.3", features = ["fmt", "env-filter"] }

3. Add Your ML Inference Library

Depending on your model type, choose one of the following:

Option A: ONNX Models (Recommended) — ONNX Runtime for Rust

ort = { version = "1.20", features = ["download-binaries"] }

Benefits:
✔ Fast runtime
✔ Supports GPU/CPU
✔ Works with models exported from TensorFlow, PyTorch, and scikit-learn

Option B: Pure-Rust ONNX Inference — Tract

tract-onnx = "0.22.0"
tract-core = "0.22.0"

Benefits:
✔ Pure Rust inference
✔ Great for embedded or serverless environments

4. Optional Utilities

Add these optional but highly recommended crates:

dotenvy = "0.15"                                   # Environment variables
anyhow = "1.0"                                     # Error handling
thiserror = "2.0.17"                               # Custom error types

5. Enable CORS (Optional but Useful for Client Apps)

If your API will be consumed by a frontend (React, Vue, Flutter, etc.), you’ll need CORS:

use axum::http::Method;
use tower_http::cors::{Any, CorsLayer};

let cors = CorsLayer::new()
    .allow_origin(Any)
    .allow_methods([Method::POST])
    .allow_headers(Any);
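
For reference, here is a minimal sketch of how that layer could be attached to an Axum router. The predict_handler name is just a placeholder; the real handler is built later in this tutorial:

use axum::{http::Method, routing::post, Router};
use tower_http::cors::{Any, CorsLayer};

// Placeholder handler so the sketch compiles; replace it with your real handler.
async fn predict_handler() -> &'static str {
    "ok"
}

fn build_router() -> Router {
    let cors = CorsLayer::new()
        .allow_origin(Any)
        .allow_methods([Method::POST])
        .allow_headers(Any);

    Router::new()
        .route("/predict", post(predict_handler))
        .layer(cors)
}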

6. Final Project Structure Overview

At this stage, your project layout should look like this:

rust-ml-api/
 ├── Cargo.toml
 ├── .env                # optional
 ├── models/             # ML models (ONNX, TFLite)
 └── src/
      ├── main.rs
      └── handlers.rs    # request handlers


Loading the ML Model in Rust (Using Tract)

Tract is a pure-Rust machine-learning inference engine with ONNX support. It’s fast, portable, has zero external dependencies, and works well for building high-performance Rust APIs.

1. Add Tract to Cargo.toml

Update your Cargo.toml with:

[dependencies]
tract-onnx = "0.22.0"
tract-core = "0.22.0"

That’s all you need — Tract brings ONNX support without GPU drivers, Python bindings, or external shared libraries.

2. Loading an ONNX Model

Below is a minimal example of loading and preparing an ONNX model with Tract:

Create a new module model.rs:

use tract_onnx::onnx;
use tract_onnx::prelude::*;

pub fn load_model() -> TractResult<TypedSimplePlan<TypedModel>> {
    // Load the ONNX model file
    let model = onnx()
        .model_for_path("model.onnx")?
        // Type and optimize the graph (shape inference, constant folding, fusion)
        .into_optimized()?
        // Create an execution plan
        .into_runnable()?;

    Ok(model)
}

What this does:

  • model_for_path loads the ONNX model.

  • into_optimized() performs graph optimization (constant folding, operator fusion).

  • into_runnable() builds a fast execution plan for inference.

This is usually run once at startup and stored in an Arc so the API can reuse it across requests.

3. Running Predictions with Tract

Later in the API (e.g., in your Axum or Actix handler), you’ll perform inference like this:

pub fn predict(
    model: &TypedSimplePlan<TypedModel>,
    input: Vec<f32>,
    shape: &[usize],
) -> TractResult<Tensor> {
    // Build a tensor with the expected shape from the raw input values
    let input_tensor = tract_ndarray::Array::from_shape_vec(shape, input)?.into_tensor();

    // Run the execution plan and return the first output tensor
    let result = model.run(tvec!(input_tensor.into_tvalue()))?;
    Ok(result[0].clone().into_tensor())
}

Key Points:

  • Inputs must be Vec<f32> (or whichever type your model requires).

  • Shape must match the model’s expected dimension, e.g. [1, 4] for a 4-value input.

  • Output is returned as a Tensor, which you can extract with:

    let values: Vec<f32> = output.to_array_view::<f32>()?.iter().cloned().collect();
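
For reference, here is a minimal sketch of how these two functions fit together, assuming both live in model.rs; the input values and the [1, 4] shape are placeholders for whatever your model expects:

use crate::model::{load_model, predict};
use tract_onnx::prelude::*;

mod model;

fn main() -> TractResult<()> {
    // Compile the model once
    let model = load_model()?;

    // Run a single prediction; the shape is only an example
    let output = predict(&model, vec![0.1, 0.2, 0.3, 0.4], &[1, 4])?;
    let values: Vec<f32> = output.to_array_view::<f32>()?.iter().cloned().collect();
    println!("prediction: {:?}", values);
    Ok(())
}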

4. Recommended Model Initialization Pattern

Inside main.rs, load your model once and store it in shared state:

use std::sync::Arc;

use crate::model::load_model;

mod model;

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    let model = Arc::new(load_model()?);

    // Pass model into your web API state here (Axum example)
    let app_state = AppState { model };

    // Continue building your API...
    Ok(())
}
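
The AppState struct referenced above isn’t defined in this snippet; a minimal sketch, assuming the load_model() signature from model.rs, might look like this:

use std::sync::Arc;
use tract_onnx::prelude::*;

// Shared application state: the compiled Tract plan, reference-counted so
// every request handler can reuse it without reloading the model.
#[derive(Clone)]
pub struct AppState {
    pub model: Arc<TypedSimplePlan<TypedModel>>,
}

With Axum you would hand this to Router::with_state(app_state); the Actix Web version in the next section uses web::Data for the same purpose.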


Creating the Prediction API Endpoint (Actix Web)

Now that your project is set up and your ONNX model loads correctly with tract-onnx, it's time to expose a fast prediction endpoint using Actix Web. If you followed the Axum-based setup above, add actix-web = "4" to the [dependencies] section of Cargo.toml first.

To keep things simple, all code stays inside:

src/main.rs

✅ Final Working main.rs

This version:

  • Loads the ONNX model once at startup

  • Wraps it in web::Data for shared AppState

  • Accepts JSON input

  • Converts input into a tract_onnx tensor

  • Runs inference safely

  • Returns predictions as JSON

main.rs

use actix_web::{ post, web, App, HttpResponse, HttpServer, Responder };
use serde::{ Deserialize, Serialize };
use tract_onnx::prelude::*;
use std::sync::Mutex;

#[derive(Deserialize)]
struct PredictRequest {
    input: Vec<f32>,
}

#[derive(Serialize)]
struct PredictResponse {
    prediction: Vec<f32>,
}

struct AppState {
    model: Mutex<TypedSimplePlan<TypedModel>>,
}

#[post("/predict")]
async fn predict(data: web::Data<AppState>, req: web::Json<PredictRequest>) -> impl Responder {
    let model = data.model.lock().unwrap();

    // Create Tensor from the input vector
    let input_tensor = Tensor::from_shape(&[req.input.len()], &req.input).unwrap();

    // Run inference
    let result = model.run(tvec!(input_tensor.into_tvalue()));

    match result {
        Ok(outputs) => {
            let output_tensor = outputs[0].to_array_view::<f32>().unwrap();
            let output_vec = output_tensor.iter().cloned().collect::<Vec<f32>>();

            HttpResponse::Ok().json(PredictResponse {
                prediction: output_vec,
            })
        }
        Err(e) => HttpResponse::InternalServerError().body(format!("{:?}", e)),
    }
}

#[actix_web::main]
async fn main() -> std::io::Result<()> {
    // Load ONNX model
    let model = tract_onnx::onnx()
        .model_for_path("model.onnx")
        .unwrap()
        .into_optimized()
        .unwrap()
        .into_runnable()
        .unwrap();

    let state = web::Data::new(AppState {
        model: Mutex::new(model),
    });

    HttpServer::new(move || App::new().app_data(state.clone()).service(predict))
        .bind(("127.0.0.1", 8082))?
        .run()
        .await
}


Testing the Prediction Endpoint

Now that your /predict endpoint is running, let’s test it using curl, HTTPie, Postman, or Thunder Client.

🧪 1. Test With Curl

Send a JSON request containing a float vector:

curl -X POST http://localhost:8082/predict \
  -H "Content-Type: application/json" \
  -d '{"input": [0.5, 1.2, -0.7]}'

Expected response (the exact numbers depend on your model):

{"prediction":[1.0,2.4,-1.4]}

🧪 2. Test With HTTPie

http POST :8082/predict input:='[0.5,1.2,-0.7]'

🧪 3. Test Using JavaScript (fetch)

const res = await fetch("http://localhost:8082/predict", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({ input: [0.5, 1.2, -0.7] })
});

const data = await res.json();
console.log(data);

🧪 4. Expected Behavior

  • The ONNX model is loaded once at startup and shared with the handler via web::Data

  • The input vector is converted to a Tensor

  • The model is executed using tract_onnx

  • The first output tensor is extracted and returned as JSON


Performance Tips for a Fast Rust ML Prediction API

Rust already gives you excellent performance, but ONNX inference can still become a bottleneck if not optimized correctly.
Below are practical, real-world performance techniques tailored specifically for Actix-Web + Tract ONNX.

1. Load the Model Once (NOT Per Request)

Loading the model inside the handler (as the simplified error-handling example later in this article does) is fine for testing, but it is extremely slow for production.

Why?

Loading a Tract model:

  • Reads model.onnx from disk

  • Builds the computation graph

  • Optimizes it

  • Allocates tensors

This process is expensive.

Fix: Load Once at Startup

HttpServer::new(move || {
    App::new()
        .app_data(web::Data::new(model.clone()))
        .service(predict)
})

And inside predict, reuse it.
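
Below is a sketch of the full load-once pattern with Actix Web, reusing the request/response structs from the earlier main.rs. It shares the compiled plan without a Mutex, which works because Tract's SimplePlan::run only needs an immutable reference:

use std::sync::Arc;

use actix_web::{post, web, App, HttpResponse, HttpServer, Responder};
use serde::{Deserialize, Serialize};
use tract_onnx::prelude::*;

#[derive(Deserialize)]
struct PredictRequest {
    input: Vec<f32>,
}

#[derive(Serialize)]
struct PredictResponse {
    prediction: Vec<f32>,
}

#[post("/predict")]
async fn predict(
    model: web::Data<TypedSimplePlan<TypedModel>>, // already-compiled plan
    req: web::Json<PredictRequest>,
) -> impl Responder {
    // Per request we only build a tensor and run the plan; no model loading here.
    let tensor = match Tensor::from_shape(&[req.input.len()], &req.input) {
        Ok(t) => t,
        Err(e) => return HttpResponse::BadRequest().body(format!("{:?}", e)),
    };

    match model.run(tvec!(tensor.into_tvalue())) {
        Ok(out) => {
            let prediction: Vec<f32> =
                out[0].to_array_view::<f32>().unwrap().iter().cloned().collect();
            HttpResponse::Ok().json(PredictResponse { prediction })
        }
        Err(e) => HttpResponse::InternalServerError().body(format!("{:?}", e)),
    }
}

#[actix_web::main]
async fn main() -> std::io::Result<()> {
    // Load and compile the model exactly once, before the workers start.
    let model = Arc::new(
        tract_onnx::onnx()
            .model_for_path("model.onnx")
            .expect("failed to load model.onnx")
            .into_optimized()
            .expect("failed to optimize model")
            .into_runnable()
            .expect("failed to build execution plan"),
    );

    HttpServer::new(move || {
        App::new()
            // Data::from reuses the existing Arc instead of wrapping it again.
            .app_data(web::Data::from(model.clone()))
            .service(predict)
    })
    .bind(("127.0.0.1", 8082))?
    .run()
    .await
}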

2. Pre-Optimize & Make Runnable at Startup

Tract supports model compilation:

  • into_optimized()

  • into_runnable()

Move these operations to the startup so inference becomes only:

input → run → output

3. Use Mutex or RwLock Sparingly

Tract's SimplePlan exposes run(&self, ...), so in most cases sharing the plan behind a plain Arc is enough and no lock is required. If you keep mutable state next to the plan (per-session buffers, for example), wrap that state in:

  • Arc<Mutex<_>> (safe but serializes requests)

  • Arc<RwLock<_>> (better for many readers)

Keeping lock time short is crucial.
Only hold the lock around the call itself:

let outputs = model.run(...)

4. Avoid Allocations for Each Request

Each request currently builds a fresh tensor from the incoming JSON, for example:

let input_tensor: Tensor = tract_ndarray::Array1::from(req.input.clone()).into_tensor();

Allocations per request = slower API.

Improve: Reuse Tensor Buffers

For high-traffic APIs, prepare preallocated tensor buffers and fill them manually.

You can create a buffer once:

let mut tensor = tract_ndarray::Array1::<f32>::zeros(max_input).into_tensor();

Then update only the necessary values.

5. Keep the Model Small

Larger ONNX models:

  • Load slower

  • Infer slower

  • Consume more memory

Tips:

  • Use quantization (int8 model instead of float32)

  • Reduce layers

  • Merge layers

  • Use smaller ops (e.g., Conv → DepthWiseConv)

6. Use Release Mode in Production

Rust debug builds are very slow for numerical code.

Always run:

cargo run --release

or build:

cargo build --release

Performance is often 10× faster.
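
Optionally, you can squeeze a bit more out of release builds with standard Cargo profile settings in Cargo.toml; a typical tuning sketch:

[profile.release]
opt-level = 3       # maximum optimization (the release default)
lto = true          # link-time optimization across crates
codegen-units = 1   # slower compile, better runtime performance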

7. Use Multiple Actix Workers for Multi-Core Scaling

Actix spawns several worker threads by default, and you can set the count explicitly (the num_cpus crate provides the core count):

HttpServer::new(|| App::new().service(predict))
    .workers(num_cpus::get())

This uses all CPU cores effectively.

8. Offload Heavy Inference with web::block

Actix Web is already fully async, but Tract inference is CPU-bound and can tie up an async worker while it runs. For large models, consider moving the model.run call onto the blocking thread pool with web::block (or tokio::task::spawn_blocking), as sketched below.
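
A minimal sketch of the idea, assuming the load-once setup above; for brevity the handler accepts and returns a plain JSON array instead of the PredictRequest/PredictResponse structs:

use std::sync::Arc;

use actix_web::{post, web, HttpResponse, Responder};
use tract_onnx::prelude::*;

#[post("/predict")]
async fn predict_blocking(
    model: web::Data<TypedSimplePlan<TypedModel>>,
    req: web::Json<Vec<f32>>,
) -> impl Responder {
    // Clone the Arc handle and move it onto Actix's blocking thread pool,
    // so the async worker stays free to accept other requests meanwhile.
    let model: Arc<TypedSimplePlan<TypedModel>> = model.into_inner();
    let input = req.into_inner();

    let result = web::block(move || {
        let tensor = Tensor::from_shape(&[input.len()], &input)?;
        model.run(tvec!(tensor.into_tvalue()))
    })
    .await;

    match result {
        Ok(Ok(out)) => {
            let values: Vec<f32> =
                out[0].to_array_view::<f32>().unwrap().iter().cloned().collect();
            HttpResponse::Ok().json(values)
        }
        Ok(Err(e)) => HttpResponse::InternalServerError().body(format!("{:?}", e)),
        Err(e) => HttpResponse::InternalServerError().body(format!("{:?}", e)),
    }
}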

9. Profile with perf or cargo flamegraph

To determine bottlenecks:

cargo install flamegraph
cargo flamegraph

You will see:

  • inference time

  • Actix overhead

  • disk I/O

  • memory allocation

10. Use JSON Batching (Optional)

If your API predicts for many inputs, batching helps:

Input:

{ "inputs": [[...], [...], [...]] }

Tract runs matrix inference, which is far faster.
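
A rough sketch of how a batched request could be turned into a single rank-2 tensor. The field names and shape handling are assumptions; all rows must have the same length, and [batch, n] must match what your model expects:

use serde::Deserialize;
use tract_onnx::prelude::*;

#[derive(Deserialize)]
struct BatchPredictRequest {
    inputs: Vec<Vec<f32>>, // e.g. [[...], [...], [...]]
}

fn build_batch_tensor(req: &BatchPredictRequest) -> TractResult<Tensor> {
    let batch = req.inputs.len();
    let n = req.inputs.first().map(|row| row.len()).unwrap_or(0);

    // Flatten the rows into one contiguous buffer with shape [batch, n],
    // so the model processes the whole batch in a single run() call.
    let flat: Vec<f32> = req.inputs.iter().flatten().cloned().collect();
    Tensor::from_shape(&[batch, n], &flat)
}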

11. Optional: Use TensorRT or ORT for GPU Acceleration

Tract is CPU-only.

If you need GPU inference:

  • Use onnxruntime (GPU EP)

  • Use TensorRT via tensorrt-rs

For the tutorial, keep this as a “future enhancement.”


Logging & Error Handling Improvements

A production-ready ML inference API should include clear logging and safe error handling. This ensures you can troubleshoot failed predictions, model loading issues, and malformed inputs without exposing internal details to end users.

Below are the best practices and improvements you can apply to your Actix + Rust + Tract ONNX service.

1. Add Structured Logging (env_logger)

Add the dependency:

[dependencies]
env_logger = "0.11"
log = "0.4"

Initialize logging in main():

env_logger::init();
log::info!("Starting Rust ML prediction API on http://localhost:8082");
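
env_logger reads the log level from the RUST_LOG environment variable (only errors are printed by default), so run the server with, for example:

RUST_LOG=info cargo run --release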

Now you can log events inside handlers:

log::info!("Received prediction request with {} floats", req.input.len());

2. Improve Model Loading Error Messages

Replace .unwrap() with an explicit match that logs the real cause:

let model = match tract_onnx::onnx().model_for_path("model.onnx") {
    Ok(m) => m,
    Err(e) => {
        log::error!("Failed to load model: {}", e);
        return HttpResponse::InternalServerError()
            .body(format!("Model load error: {}", e));
    }
};

This prevents panics and logs the real issue.

3. Validate Input Shape Before Running the Model

Instead of assuming the user sends the correct number of values:

if req.input.is_empty() {
    log::warn!("Empty input received");
    return HttpResponse::BadRequest().body("Input array cannot be empty");
}

You can also enforce the exact expected length:

const EXPECTED_INPUT: usize = 3; // example

if req.input.len() != EXPECTED_INPUT {
    log::error!(
        "Invalid input length: got {}, expected {}",
        req.input.len(),
        EXPECTED_INPUT
    );
    return HttpResponse::BadRequest()
        .body(format!("Input must contain exactly {} values", EXPECTED_INPUT));
}

4. Handle Model Inference Errors Gracefully

Replace:

let outputs = model.run(tvec![input_tensor.into_tvalue()]).unwrap();

With:

let outputs = match model.run(tvec![input_tensor.into_tvalue()]) {
    Ok(out) => out,
    Err(e) => {
        log::error!("Inference error: {:?}", e);
        return HttpResponse::InternalServerError()
            .body(format!("Prediction error: {:?}", e));
    }
};

This avoids server crashes if the model fails.

5. Add Logging Around Response

Before returning output:

log::info!("Prediction output: {:?}", output_vec);

This helps quickly verify model quality and detect anomalies.

6. Example: Updated Predict Handler With Logs + Errors

Here’s the improved predict function (for brevity it reloads the model inside the handler; in production, keep the load-once pattern from the earlier sections):

use actix_web::{ post, web, App, HttpResponse, HttpServer, Responder };
use serde::{ Deserialize, Serialize };
use tract_onnx::prelude::*;

#[derive(Deserialize)]
struct PredictRequest {
    input: Vec<f32>,
}

#[derive(Serialize)]
struct PredictResponse {
    prediction: Vec<f32>,
}

#[post("/predict")]
async fn predict(req: web::Json<PredictRequest>) -> impl Responder {
    log::info!("Received /predict request");

    if req.input.is_empty() {
        log::warn!("Rejecting empty input array");
        return HttpResponse::BadRequest().body("Input array cannot be empty");
    }

    // Try loading model
    let model = match tract_onnx::onnx().model_for_path("model.onnx") {
        Ok(m) => m,
        Err(e) => {
            log::error!("Model load error: {}", e);
            return HttpResponse::InternalServerError().body(format!("Model load error: {}", e));
        }
    };

    // Build input shape
    let input_shape = tvec![req.input.len()];

    let model = match
        model.with_input_fact(0, InferenceFact::dt_shape(f32::datum_type(), input_shape))
    {
        Ok(m) => m,
        Err(e) => {
            log::error!("Failed to set input shape: {}", e);
            return HttpResponse::InternalServerError().body(format!("Model shape error: {}", e));
        }
    };

    let model = match model.into_optimized().and_then(|m| m.into_runnable()) {
        Ok(m) => m,
        Err(e) => {
            log::error!("Model optimization error: {}", e);
            return HttpResponse::InternalServerError().body(
                format!("Model optimization error: {}", e)
            );
        }
    };

    // Prepare input tensor
    let input_tensor: Tensor = tract_ndarray::Array1::from(req.input.clone()).into_tensor();

    // Run inference
    let outputs = match model.run(tvec![input_tensor.into_tvalue()]) {
        Ok(out) => out,
        Err(e) => {
            log::error!("Prediction failed: {:?}", e);
            return HttpResponse::InternalServerError().body(format!("Prediction error: {:?}", e));
        }
    };

    // Extract result
    let output_vec: Vec<f32> = match outputs[0].to_array_view::<f32>() {
        Ok(view) => view.iter().cloned().collect(),
        Err(e) => {
            log::error!("Failed to read output tensor: {}", e);
            return HttpResponse::InternalServerError().body(format!("Output error: {}", e));
        }
    };
    log::info!("Prediction result: {:?}", output_vec);

    HttpResponse::Ok().json(PredictResponse {
        prediction: output_vec,
    })
}

#[actix_web::main]
async fn main() -> std::io::Result<()> {
    env_logger::init();
    log::info!("Starting Rust ML prediction API on http://localhost:8082");

    HttpServer::new(|| App::new().service(predict))
        .bind(("127.0.0.1", 8082))?
        .run().await
}


Deploying the Rust ML API (Docker, Release Build, systemd, etc.)

Once your Rust ML prediction API is working locally, the next step is deploying it into a production-ready environment. This section covers three common deployment approaches:

  1. Optimized release build

  2. Docker container

  3. Linux service using systemd

1. Build a Release-Optimized Binary

Rust’s --release flag enables all compiler optimizations, producing a much faster binary.

cargo build --release

Your optimized executable appears here:

target/release/rust-ml-api

Copy this binary together with your model.onnx file to your production server:

/opt/rust-ml-api/
  ├── rust-ml-api
  └── model.onnx

Run it:

./rust-ml-api

2. Deploy Using Docker

2.1 Create a Production Dockerfile

Here is a minimal & efficient multi-stage Dockerfile:

# ========= Build Stage =========
FROM rust:1.78 AS builder

WORKDIR /app
COPY . .

RUN cargo build --release

# ========= Runtime Stage =========
FROM debian:stable-slim

# Install SSL certs and CA roots for Actix/TLS
RUN apt-get update && apt-get install -y ca-certificates && apt-get clean

WORKDIR /app

# Copy binary + model
COPY --from=builder /app/target/release/rust-ml-api /app/rust-ml-api
COPY model.onnx /app/model.onnx

EXPOSE 8082

CMD ["/app/rust-ml-api"]

2.2 Build & Run

Build:

docker build -t rust-ml-api .

Run:

docker run -p 8082:8082 rust-ml-api

Test:

curl -X POST http://localhost:8082/predict \
  -H "Content-Type: application/json" \
  -d '{"input":[0.5,1.2,-3.1]}'

3. Deploy as a systemd Service

This is ideal for VPS or bare-metal Linux servers.

3.1 Move application files

Place your binary and model:

/opt/rust-ml-api/
  ├── rust-ml-api
  └── model.onnx

3.2 Create systemd service

Create file:

sudo nano /etc/systemd/system/rust-ml-api.service

Paste:

[Unit]
Description=Rust ML Prediction API
After=network.target

[Service]
WorkingDirectory=/opt/rust-ml-api
ExecStart=/opt/rust-ml-api/rust-ml-api
Restart=always
RestartSec=3
User=www-data
Environment=RUST_LOG=info

[Install]
WantedBy=multi-user.target

Enable + start service:

sudo systemctl daemon-reload
sudo systemctl enable rust-ml-api
sudo systemctl start rust-ml-api

Check logs:

sudo journalctl -u rust-ml-api -f

4. Reverse Proxy with Nginx (Optional)

Nginx config:

server {
    listen 80;
    server_name yourdomain.com;

    location / {
        proxy_pass http://127.0.0.1:8082;
        proxy_set_header Host $host;
    }
}

Reload:

sudo systemctl reload nginx

5. Security Recommendations

✔ Run as non-root
✔ Serve behind Nginx or Caddy
✔ Enable HTTPS using Let’s Encrypt
✔ Restrict inbound ports with a firewall
✔ Store ML models read-only

6. Updating Your Deployed API

Replace binary + model:

sudo systemctl stop rust-ml-api
cp rust-ml-api /opt/rust-ml-api/
sudo systemctl start rust-ml-api

Check status:

sudo systemctl status rust-ml-api


Benchmarking the API & Production-Ready Docker Compose Setup

Now that your Rust ML API is fully working and optimized, let’s benchmark it and prepare a production-ready deployment using Docker Compose. This will help you validate performance and reliably run the service on any server.

1. Benchmarking the Rust ML API

Rust is extremely fast, but you should still benchmark to measure:

  • Requests per second

  • Latency under concurrent load

  • CPU and memory usage

Below are recommended tools and ready-to-run commands.

1.1 Using wrk (Recommended)

If you have wrk installed:

wrk -t4 -c100 -d30s http://localhost:8082/predict \
  --latency \
  -s bench.lua

Create bench.lua:

wrk.method = "POST"
wrk.body   = '{"input":[1.0,2.0,3.0]}'
wrk.headers["Content-Type"] = "application/json"

This tests:

  • 4 threads

  • 100 concurrent connections

  • 30 seconds duration

You will get results like:

  • Requests/sec

  • Latency distribution

  • Transfer/sec

1.2 Using hey (Go version of wrk)

Install:

brew install hey

Benchmark:

hey -n 10000 -c 100 -m POST \
  -H "Content-Type: application/json" \
  -d '{"input":[1.0,2.0,3.0]}' \
  http://localhost:8082/predict

1.3 Using ab (ApacheBench)

ab -n 5000 -c 50 -p body.json -T application/json \
  http://localhost:8082/predict

body.json:

{"input":[1.0,2.0,3.0]}

2. Profiling Your Rust ML API

For deeper analysis, you can run:

perf (Linux)

sudo perf record ./target/release/rust-ml-api
sudo perf report

cargo flamegraph

cargo install flamegraph
cargo flamegraph

These tools reveal bottlenecks inside ONNX inference and Actix handlers.

3. Production-Ready Docker Compose Setup

Now let’s make your deployment robust and reproducible.

3.1 Dockerfile (Optimized Release Build)

Create:

Dockerfile

# 1. Build stage
FROM rust:1.81-slim AS builder

WORKDIR /app

# Cache dependencies first
COPY Cargo.toml Cargo.lock ./
RUN mkdir src && echo "fn main() {}" > src/main.rs
RUN cargo build --release || true

# Copy real source
COPY . .

# Build release
RUN cargo build --release

# 2. Runtime stage
FROM debian:stable-slim

WORKDIR /app

COPY --from=builder /app/target/release/rust-ml-api .
COPY model.onnx .

EXPOSE 8082

CMD ["./rust-ml-api"]

3.2 Create docker-compose.yml

docker-compose.yml

version: "3.9"

services:
  rust-ml-api:
    build: .
    container_name: rust_ml_api
    restart: unless-stopped
    environment:
      RUST_LOG: info
    ports:
      - "8082:8082"
    deploy:
      resources:
        limits:
          cpus: "1.0"
          memory: "512M"

This setup provides:

  • Auto-restart

  • CPU and memory limits

  • Clean separation of build and runtime

  • Exposed API on port 8082

3.3 Running in Production

docker compose up -d --build

Check logs:

docker compose logs -f rust-ml-api

Check if it works:

curl -X POST http://localhost:8082/predict \
  -H "Content-Type: application/json" \
  -d '{"input":[1.0,2.0,3.0]}'

4. Optional: Systemd for Bare-Metal Deployment

If deploying on a VPS without Docker, create:

/etc/systemd/system/rust-ml-api.service

[Unit]
Description=Rust ML Prediction API
After=network.target

[Service]
User=www-data
WorkingDirectory=/opt/rust-ml-api
ExecStart=/opt/rust-ml-api/rust-ml-api
Restart=always
Environment=RUST_LOG=info

[Install]
WantedBy=multi-user.target

Enable and run:

sudo systemctl enable rust-ml-api
sudo systemctl start rust-ml-api


Conclusion – Building a Fast Rust ML Prediction API

You’ve now built a complete, production-ready Rust Web API capable of running machine learning model predictions using Actix Web and Tract ONNX. Throughout this tutorial, you walked through every major step required to deploy a high-performance inference service:

🔥 What You Accomplished

1. Project Setup & Dependencies

You created a clean Rust project using Actix Web and the Tract ONNX inference engine ― optimized for speed and zero-copy tensor handling.

2. Loading & Optimizing the ML Model

You learned how Tract ingests ONNX models, infers shape information, and compiles an optimized computational graph for fast execution.

3. Building a Prediction Endpoint

You implemented a /predict POST endpoint that:

  • Accepts a vector of floats in JSON format

  • Converts it to a tensor

  • Runs the ONNX model

  • Returns predictions as JSON

Your final working implementation is clean, simple, and easy to extend.

4. Creating a Minimal ONNX Model

You generated a tiny ONNX model that works with Tract, ensuring the API can run even without training a real neural network.

5. Performance Tips

You learned how to:

  • Reuse the loaded model instead of loading it per request

  • Choose release mode for real performance

  • Run the service behind a reverse proxy such as NGINX

  • Use controlled batching or request limits

6. Logging & Error Handling

You added improvements to make the API:

  • More reliable

  • Easier to debug

  • More production-ready

7. Deployment

You explored different deployment approaches:

  • Docker & multi-stage builds

  • Systemd services

  • Production optimizations for small image sizes

8. Benchmarking

You learned how to stress-test your API using tools like wrk, hey, or ab (ApacheBench) to measure:

  • Latency

  • Throughput

  • Requests per second (RPS)

🚀 Final Thoughts

Rust continues to shine in scenarios where performance, safety, and low memory usage are critical. By combining Rust with Tract ONNX and Actix Web:

  • You get an inference service that is as fast as, or faster than, a typical Python inference server,

  • With the robustness and safety you expect from Rust,

  • In a tiny deployment footprint that fits well in containers or edge devices.

Even with a small demo ONNX model, this foundation is ready for real-world ML workloads — from recommendation systems to anomaly detection to embedded AI on IoT devices.

You can find the full source code on our GitHub.

That's just the basics. If you want to go deeper into Rust, check out our other Rust tutorials and courses.

Thanks!