ONNX Runtime Web—operating your machine learning model in browser

We are introducing ONNX Runtime Web (ORT Web), a new feature in ONNX Runtime that enables JavaScript developers to run and deploy machine learning models in browsers. It also helps enable new classes of on-device computation. ORT Web will replace the soon-to-be-deprecated onnx.js, with improvements such as a more consistent developer experience between packages for server-side and client-side inferencing, as well as improved inference performance and model coverage. This blog gives you a quick overview of ORT Web, along with resources for getting started.

A glance at ONNX Runtime (ORT)

ONNX Runtime is a high-performance cross-platform inference engine for running all kinds of machine learning models. It supports all the most popular training frameworks, including TensorFlow, PyTorch, SciKit Learn, and more. ONNX Runtime aims to provide an easy-to-use experience for AI developers to run models on various hardware and software platforms. Beyond accelerating server-side inference, ONNX Runtime for Mobile has been available since ONNX Runtime 1.5. Now ORT Web is a new offering with the ONNX Runtime 1.8 release, focusing on in-browser inference.

In-browser inference with ORT Web

Running machine-learning-powered web applications in browsers has drawn a lot of attention from the AI community. It is challenging to port native AI applications to multiple platforms given the variations in programming languages and deployment environments, whereas web applications easily achieve cross-platform portability with a single implementation through the browser. Additionally, running machine learning models in browsers can improve performance by reducing server-client communications, and it simplifies the distribution experience since no additional libraries or driver installations are needed.

How does it work?

ORT Web accelerates model inference in the browser on both CPUs and GPUs, through the WebAssembly (WASM) and WebGL backends respectively. For CPU inference, ORT Web compiles the native ONNX Runtime CPU engine into the WASM backend using Emscripten. WebGL is a popular standard for accessing GPU capabilities and is adopted by ORT Web to achieve high performance on GPUs.

Figure 1: ORT Web Overview

WebAssembly (WASM) backend for CPU

WebAssembly allows you to run server-side code on the client side in the browser. Before WebAssembly, only JavaScript was available in the browser. WebAssembly has some advantages over JavaScript, such as faster load time and execution efficiency. Furthermore, WebAssembly supports multi-threading via SharedArrayBuffer and Web Worker, and SIMD128 (128-bit Single Instruction, Multiple Data) to accelerate bulk data processing. This makes WebAssembly an attractive technology for executing models at near-native speed on the web.

We leverage Emscripten, an open-source compiler toolchain, to compile ONNX Runtime C++ code into WebAssembly so that it can be loaded in browsers. This allows us to reuse the ONNX Runtime core and native CPU engine. As a result, the ORT Web WASM backend can run any ONNX model and supports most functionality that native ONNX Runtime offers, including full ONNX operator coverage, quantized ONNX models, and mini runtime. We utilize the multi-threading and SIMD features in WebAssembly to further accelerate model inferencing. Note that SIMD is a new feature and isn't yet available in all browsers with WebAssembly support. The browsers supporting new WebAssembly features can be found on the webassembly.org website.

During initialization, ORT Web checks the capabilities of the runtime environment to detect whether the multi-threading and SIMD features are available; if not, it falls back to a version suited to the environment. Taking MobileNet V2 as an example, CPU inference can be accelerated by 3.4x with 2 threads and SIMD enabled, compared with pure WebAssembly without these two features.

Figure 2: 3.4x performance acceleration on CPU with multi-threading and SIMD enabled in WebAssembly (Test machine: Processor Intel(R) Xeon(R) CPU E3-1230 v5 @ 3.40GHz, 3401 Mhz, 4 Core(s), 8 Logical Processor(s))

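As an illustration of this kind of capability check, the following sketch picks a thread count based on what the environment exposes. This is a hypothetical helper, not ORT Web's actual implementation; the function name and the heuristics (the cap of 4 threads, the fallback core count) are assumptions for the example.

```javascript
// Sketch of an environment capability check before configuring the WASM backend.
// Multi-threaded WebAssembly needs SharedArrayBuffer, which browsers only
// expose in cross-origin-isolated pages (COOP/COEP headers).
function suggestThreadCount() {
  const hasThreads =
    typeof SharedArrayBuffer !== 'undefined' &&
    (typeof crossOriginIsolated === 'undefined' || crossOriginIsolated);
  if (!hasThreads) return 1; // fall back to single-threaded WASM

  // navigator.hardwareConcurrency is unavailable in some environments;
  // assume 4 cores when it is missing.
  const cores =
    (typeof navigator !== 'undefined' && navigator.hardwareConcurrency) || 4;

  // Leave headroom for the main thread and cap the pool at 4.
  return Math.max(1, Math.min(4, cores - 1));
}

// e.g. ort.env.wasm.numThreads = suggestThreadCount();
```

ORT Web performs its own detection internally; a helper like this is only useful if you want to override `ort.env.wasm.numThreads` yourself.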
WebGL backend for GPU

WebGL is a JavaScript API that conforms to the OpenGL ES 2.0 standard, which is supported by all major browsers and on various platforms including Windows, Linux, macOS, Android, and iOS. The GPU backend of ORT Web is built on WebGL and works across this variety of supported environments, enabling users to seamlessly port their deep learning models across different platforms.

In addition to portability, the ORT WebGL backend offers superior inference performance through the following optimizations: pack mode, data cache, code cache, and node fusion. Pack mode reduces the memory footprint by up to 75 percent while improving parallelism. To avoid creating the same GPU data multiple times, ORT Web reuses GPU data (textures) as much as possible. WebGL uses the OpenGL Shading Language (GLSL) to construct shaders that execute GPU programs; however, shaders must be compiled at runtime, which introduces unacceptably high overhead. The code cache addresses this issue by ensuring each shader is compiled only once. The WebGL backend is capable of quite a few typical node fusions and plans to take advantage of the graph optimization infrastructure to support a larger collection of graph-based optimizations.

All ONNX operators are supported by the WASM backend, but only a subset by the WebGL backend. You can find the operators supported by each backend in the documentation. Below are the compatible platforms that each backend supports in ORT Web.

Figure 3: Compatible platforms that ORT Web supports.

Get started

In this section, we'll show you how to use ORT Web to build machine-learning-powered web applications.

Get an ONNX model

Thanks to the framework interoperability of ONNX, you can convert a model trained in any framework that supports ONNX to ONNX format. Torch.onnx.export is the built-in API in PyTorch for exporting models to ONNX, and TensorFlow-ONNX is a standalone tool for converting TensorFlow and TensorFlow Lite models to ONNX. Also, there are various pre-trained ONNX models covering common scenarios in the ONNX Model Zoo for a quick start.

Inference ONNX model in the browser

There are two ways to use ORT Web: through a script tag or through a bundler. The APIs in ORT Web for scoring a model are identical to the native ONNX Runtime: first create an ONNX Runtime inference session with the model, then run the session with input data. By providing a consistent development experience, we aim to save developers time and effort when integrating ML into applications and services for different platforms through ONNX Runtime.

The following code snippet shows how to call ORT Web API to inference a model with different backends.

const ort = require('onnxruntime-web');

// create an inference session, using WebGL backend. (default is 'wasm')
const session = await ort.InferenceSession.create('./model.onnx', { executionProviders: ['webgl'] });
// feed inputs and run
const results = await session.run(feeds);

Figure 4: Code snippet of ORT Web APIs.
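The snippet above uses `require`, which assumes a bundler. For the script-tag route, a minimal sketch looks like the following; the CDN URL is an illustrative assumption (pin an exact version of the onnxruntime-web distribution in production).

```html
<!-- Loading ORT Web via a script tag exposes a global `ort` object. -->
<script src="https://cdn.jsdelivr.net/npm/onnxruntime-web/dist/ort.min.js"></script>
<script>
  async function run() {
    // Same API as the bundler route: create a session, then run it.
    const session = await ort.InferenceSession.create('./model.onnx');
    // const results = await session.run(feeds); // feeds: { inputName: ort.Tensor }
  }
  run();
</script>
```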

Some advanced features can be configured by setting properties on the `ort.env` object, such as the maximum thread count and enabling/disabling SIMD.

// set maximum thread count for the WebAssembly backend; setting it to 1 disables multi-threading
ort.env.wasm.numThreads = 1;

// set flag to enable/disable SIMD (default is true)
ort.env.wasm.simd = false;

Figure 5: Code snippet of properties setting in ORT Web.

Pre- and post-processing needs to be handled in JavaScript before inputs are fed into ORT Web for inference. The ORT Web Demo shows several interesting in-browser vision scenarios powered by image models with ORT Web; you can find the source code, including image input processing and inference through ORT Web. Another end-to-end tutorial, created by the Cloud Advocate curriculum team, walks through building a cuisine recommender web app with ORT Web. It covers exporting a Scikit-Learn model to ONNX as well as running that model with ORT Web using a script tag.
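As a sketch of what such image pre-processing involves, the helper below converts RGBA canvas pixels into the planar NCHW float layout that many image models expect. The function name and the plain [0, 1] scaling are assumptions for illustration; real models often require model-specific normalization (mean/std) and sizes.

```javascript
// Convert RGBA canvas pixels (HWC layout, uint8) into a planar NCHW
// Float32Array scaled to [0, 1]; the alpha channel is dropped.
function rgbaToNCHW(pixels, width, height) {
  const plane = width * height;
  const out = new Float32Array(3 * plane);
  for (let i = 0; i < plane; i++) {
    out[i]             = pixels[i * 4]     / 255; // R plane
    out[plane + i]     = pixels[i * 4 + 1] / 255; // G plane
    out[2 * plane + i] = pixels[i * 4 + 2] / 255; // B plane
  }
  return out;
}

// Wrap the result for ORT Web, e.g.:
// const input = new ort.Tensor('float32', rgbaToNCHW(data, w, h), [1, 3, h, w]);
```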

Figure 6: A cuisine recommender web app with ORT Web.

Looking forward

We hope this has inspired you to try out ORT Web in your web applications. We would love to hear your suggestions and feedback; you can participate or leave feedback in our GitHub repo (ONNX Runtime). We continue to improve performance and model coverage, as well as to add new features. On-device training is another interesting possibility we want to explore for ORT Web. Stay tuned for our updates.

