RustML Inference Server

A high-performance, production-grade machine learning inference server built in Rust. This project demonstrates how to deploy ONNX models using techniques common in industrial inference systems, including dynamic batching, session pooling, backpressure, and gRPC endpoints.

The architecture is model-agnostic and can be reused for any deep learning model exported to ONNX (vision, NLP, audio, multimodal, etc.). The system is optimized for CPU inference and runs the same way in local development and in containerized deployments.

Features

- gRPC API for sending inference requests
- Dynamic batching to group smaller requests into larger model runs
- Parallel session pool for multi-core CPU utilization
- Backpressure to prevent overload under high traffic
- Axum metrics endpoint for monitoring
- ONNX Runtime for executing neural network models
- Dockerfile for reproducible deployments

Although the demo uses ResNet-50, the architecture is universal and can support any ONNX model with minimal changes.

Architecture

High-level flow:

Client → gRPC Server → Dynamic Batcher → Session Pool → ONNX Runtime → Response

Components

Inference Engine
- Loads an ONNX model and creates a pool of inference sessions sized to CPU cores
- Round-robin selection distributes load evenly across sessions
- Each session runs ONNX Runtime with tuned CPU threading
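
The round-robin selection described above can be sketched with the standard library alone. This is a minimal illustration, not the project's actual code: `Session` is a hypothetical stand-in for an ONNX Runtime session, and the names `SessionPool`, `per_core`, and `next_index` are assumptions made for the example.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Mutex;

/// Hypothetical stand-in for an ONNX Runtime session; a real server would
/// wrap the runtime's own session type here.
pub struct Session {
    pub id: usize,
}

impl Session {
    /// Placeholder "inference": returns the input unchanged.
    pub fn run(&mut self, input: &[f32]) -> Vec<f32> {
        input.to_vec()
    }
}

/// Fixed-size pool that hands out sessions in round-robin order.
pub struct SessionPool {
    sessions: Vec<Mutex<Session>>,
    next: AtomicUsize,
}

impl SessionPool {
    /// Creates `size` sessions (at least one).
    pub fn new(size: usize) -> Self {
        let sessions = (0..size.max(1))
            .map(|id| Mutex::new(Session { id }))
            .collect();
        Self { sessions, next: AtomicUsize::new(0) }
    }

    /// Creates one session per available CPU core, falling back to one.
    pub fn per_core() -> Self {
        let cores = std::thread::available_parallelism()
            .map(|n| n.get())
            .unwrap_or(1);
        Self::new(cores)
    }

    /// Number of sessions in the pool.
    pub fn len(&self) -> usize {
        self.sessions.len()
    }

    /// Picks the next session index in round-robin order.
    pub fn next_index(&self) -> usize {
        self.next.fetch_add(1, Ordering::Relaxed) % self.sessions.len()
    }

    /// Runs `input` on the next session and returns (session id, output).
    pub fn run(&self, input: &[f32]) -> (usize, Vec<f32>) {
        let idx = self.next_index();
        let mut session = self.sessions[idx].lock().unwrap();
        (session.id, session.run(input))
    }
}
```

With a pool of three sessions, six consecutive calls to `run` would visit sessions 0, 1, 2, 0, 1, 2. The per-session `Mutex` keeps the sketch simple; a real pool might prefer a lock-free hand-off if contention shows up in profiling.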
Dynamic Batcher
- Collects incoming requests and combines them into batches
- Batch size and wait time are configurable
- Improves throughput for CNNs, transformers, and most feed-forward architectures
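
A size-or-timeout batching loop of the kind described above might look like the following sketch. It uses a blocking `std::sync::mpsc` channel for brevity, whereas an async server would more likely use an async channel; `BatchConfig`, `max_batch_size`, `max_wait`, and `next_batch` are hypothetical names, not taken from the project.

```rust
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::time::{Duration, Instant};

/// Hypothetical batching limits; the field names are illustrative.
pub struct BatchConfig {
    pub max_batch_size: usize,
    pub max_wait: Duration,
}

/// Blocks until one item arrives, then keeps collecting until the batch is
/// full or `max_wait` has elapsed since the first item was received.
/// Returns `None` once the channel is closed and fully drained.
pub fn next_batch<T>(rx: &Receiver<T>, cfg: &BatchConfig) -> Option<Vec<T>> {
    // Wait indefinitely for the first request; the timer starts only then.
    let first = rx.recv().ok()?;
    let deadline = Instant::now() + cfg.max_wait;
    let mut batch = vec![first];

    while batch.len() < cfg.max_batch_size {
        let remaining = deadline.saturating_duration_since(Instant::now());
        match rx.recv_timeout(remaining) {
            Ok(item) => batch.push(item),
            // Deadline reached or all senders gone: ship what we have.
            Err(RecvTimeoutError::Timeout) | Err(RecvTimeoutError::Disconnected) => break,
        }
    }
    Some(batch)
}
```

Starting the timer at the first item means an idle server adds no latency, while a busy one waits at most `max_wait` before running a partial batch. Larger values of either setting trade per-request latency for throughput.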
Backpressure
- Rejects requests when too many are in flight
- Preserves latency under traffic spikes
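
One common way to reject requests once too many are in flight is an atomic counter guarded by an RAII permit. The sketch below assumes that approach; the project may implement backpressure differently, and `InFlightLimiter`, `Permit`, and `try_acquire` are hypothetical names.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Admission gate that caps the number of in-flight requests.
#[derive(Clone)]
pub struct InFlightLimiter {
    in_flight: Arc<AtomicUsize>,
    max_in_flight: usize,
}

/// RAII permit: dropping it frees one in-flight slot.
pub struct Permit {
    in_flight: Arc<AtomicUsize>,
}

impl Drop for Permit {
    fn drop(&mut self) {
        self.in_flight.fetch_sub(1, Ordering::AcqRel);
    }
}

impl InFlightLimiter {
    pub fn new(max_in_flight: usize) -> Self {
        Self { in_flight: Arc::new(AtomicUsize::new(0)), max_in_flight }
    }

    /// Returns a permit, or `None` if the limit is reached and the request
    /// should be rejected immediately instead of being queued.
    pub fn try_acquire(&self) -> Option<Permit> {
        let max = self.max_in_flight;
        self.in_flight
            // Atomically increment only while below the limit.
            .fetch_update(Ordering::AcqRel, Ordering::Acquire, |n| {
                if n < max { Some(n + 1) } else { None }
            })
            .ok()
            .map(|_| Permit { in_flight: Arc::clone(&self.in_flight) })
    }

    /// Current number of requests holding a permit.
    pub fn in_flight(&self) -> usize {
        self.in_flight.load(Ordering::Acquire)
    }
}
```

A request handler would hold the permit for the lifetime of the request, so the slot is released even on early returns. When `try_acquire` returns `None`, a gRPC server would typically answer with a `RESOURCE_EXHAUSTED` status, though the status code this project uses is not stated here.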
gRPC Layer
- Strongly typed interface for inference
- Clients generate code from the .proto file in many languages
Metrics Server
- Axum endpoint exposing internal metrics: queue depth, batch sizes, total items, and latency stats
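
The counters behind such an endpoint can be kept in plain atomics and rendered as text on request. The following is a sketch under that assumption; the metric names and the `Metrics` type are hypothetical, and the Axum handler that would return `render()` as the response body is omitted.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical in-process counters that a `/metrics` handler could expose.
#[derive(Default)]
pub struct Metrics {
    queue_depth: AtomicU64,
    batches: AtomicU64,
    total_items: AtomicU64,
    latency_micros_sum: AtomicU64,
}

impl Metrics {
    /// Stores the current number of queued requests.
    pub fn set_queue_depth(&self, depth: u64) {
        self.queue_depth.store(depth, Ordering::Relaxed);
    }

    /// Records one completed batch with its size and latency.
    pub fn observe_batch(&self, items: u64, latency_micros: u64) {
        self.batches.fetch_add(1, Ordering::Relaxed);
        self.total_items.fetch_add(items, Ordering::Relaxed);
        self.latency_micros_sum.fetch_add(latency_micros, Ordering::Relaxed);
    }

    /// Renders the counters as plain `name value` lines.
    pub fn render(&self) -> String {
        let batches = self.batches.load(Ordering::Relaxed);
        let items = self.total_items.load(Ordering::Relaxed);
        let latency = self.latency_micros_sum.load(Ordering::Relaxed);
        // Avoid dividing by zero before the first batch completes.
        let (avg_batch, avg_latency) = if batches == 0 {
            (0.0, 0.0)
        } else {
            (items as f64 / batches as f64, latency as f64 / batches as f64)
        };
        format!(
            "queue_depth {}\nbatches_total {}\nitems_total {}\navg_batch_size {:.2}\navg_batch_latency_micros {:.2}\n",
            self.queue_depth.load(Ordering::Relaxed),
            batches,
            items,
            avg_batch,
            avg_latency
        )
    }
}
```

Sharing one `Metrics` value behind an `Arc` lets the batcher update it and the HTTP handler read it without locking. Averages are the simplest latency statistic; percentiles would need a histogram, which this sketch leaves out.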
Docker Support
- Multi-stage build creates a minimal deployment image

Project Structure

    rustml-inference/
    ├── build.rs
    ├── Cargo.toml
    ├── Dockerfile
    ├── data.json
    ├── models/
    │   └── resnet50.onnx
    ├── proto/
    │   └── inference.proto
    └── src/
        ├── main.rs
        ├── inference/
        │   └── mod.rs
        ├── batching/
        │   └── mod.rs
        ├── grpc/
        │   └── server.rs
        └── monitor/
            ├── mod.rs
            └── prometheus.rs

Running the Server

Build the Docker Image

    docker build -t rustml-inference:local .

Run the Service

    docker run --rm \
      -p 50051:50051 \
      -p 3000:3000 \
      -v $(pwd)/models:/app/models \
      rustml-inference:local

Service Endpoints

- gRPC Inference Service: localhost:50051
- Metrics HTTP server: localhost:3000/metrics
- Health check: localhost:3000/health

Example Client Usage (Python)

Generate client code:

    python -m grpc_tools.protoc \
      -I./proto \
      --python_out=. \
      --grpc_python_out=. \
      proto/inference.proto

Call the inference RPC:

    import grpc
    import inference_pb2
    import inference_pb2_grpc

    channel = grpc.insecure_channel("localhost:50051")
    stub = inference_pb2_grpc.InferenceServiceStub(channel)

    req = inference_pb2.PredictRequest(
        data=[0.0] * (1 * 3 * 224 * 224),
        shape=[1, 3, 224, 224],
    )

    res = stub.Predict(req)
    print(res.shape)
    print(res.predictions[:10])
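
Because the request carries a flat `data` buffer alongside a separate `shape`, the server has to confirm that the two agree before building a tensor. The check could be sketched as follows; `validate_tensor` is a hypothetical helper, and the `f32`/`i64` element types are assumptions, since the .proto definition is not shown here.

```rust
/// Checks that a flat tensor buffer matches its declared shape.
/// Returns the element count on success, or a description of the mismatch.
pub fn validate_tensor(data: &[f32], shape: &[i64]) -> Result<usize, String> {
    if shape.is_empty() {
        return Err("shape must have at least one dimension".to_string());
    }
    let mut expected: usize = 1;
    for &dim in shape {
        // Dynamic (negative) or zero-sized dimensions are rejected here.
        if dim <= 0 {
            return Err(format!("invalid dimension {dim} in shape {shape:?}"));
        }
        expected = expected
            .checked_mul(dim as usize)
            .ok_or_else(|| format!("shape {shape:?} overflows usize"))?;
    }
    if data.len() != expected {
        return Err(format!(
            "shape {shape:?} needs {expected} values, got {}",
            data.len()
        ));
    }
    Ok(expected)
}
```

For the ResNet-50 request above, the shape `[1, 3, 224, 224]` requires exactly 150528 values. Rejecting mismatches up front keeps malformed requests out of a shared batch.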

Load Testing

Using ghz:

    ghz --insecure \
      --proto proto/inference.proto \
      --call inference.InferenceService.Predict \
      -D data.json \
      -c 2 \
      -n 5 \
      localhost:50051
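
The repository ships its own `data.json`, but a payload for a different input shape is tedious to write by hand. The sketch below shows one way to generate an all-zero request body. `zero_payload` is a hypothetical helper, and the JSON keys `data` and `shape` are inferred from the Python client above, not from the .proto file.

```rust
/// Builds a JSON request body holding an all-zero tensor of the given shape.
/// The keys `data` and `shape` mirror the Python client and are assumptions
/// about the message definition.
pub fn zero_payload(shape: &[usize]) -> String {
    // Total number of elements is the product of all dimensions.
    let count: usize = shape.iter().product();
    let data = vec!["0.0"; count].join(",");
    let dims = shape
        .iter()
        .map(|d| d.to_string())
        .collect::<Vec<_>>()
        .join(",");
    format!("{{\"data\":[{data}],\"shape\":[{dims}]}}")
}
```

Writing the result with `std::fs::write("data.json", zero_payload(&[1, 3, 224, 224]))` would produce a file suitable for the `-D` flag, provided the field names match the service definition.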

Future Extensions

Possible enhancements for a production deployment include:

- GPU support (CUDA EP, TensorRT EP)
- Prometheus exporters with Grafana dashboards
- Distributed autoscaling (Docker Compose or Kubernetes)
- Authentication and request signing
- Model registry and hot-swappable models
- Pre-processing and post-processing pipelines
- Streaming RPC for audio/speech models

Conclusion

This project demonstrates how modern machine learning inference servers are designed in practice. By incorporating batching, pooling, backpressure, and efficient model execution, it provides a strong foundation for deploying ONNX models at scale.

The architecture is intentionally universal so that different models can be served with minimal changes, making it suitable for both research and real-world production environments.
