
Rust for Machine Learning: SIMD, BLAS, and Lapack

I love Rust. But, as a data scientist, it’s still hard to use Rust on a daily basis. 90% of my programming these days is in Python.

My interest in Rust-based machine learning was sparked several months ago. But the key limitation I found was the lack of an ergonomic linear algebra library. There’s nalgebra and ndarray and a few others. Yet at the time I found none of them ergonomic to work with, nor as fast as writing the lower-level SIMD, BLAS, and Lapack code myself (I have picked up ndarray more in recent weeks and months).

While inconvenient, a few months later I’m glad I had to take things a step further. Rust is great for writing performant code, and the resources for writing quite low-level mathematics operations in Rust are good. Using blas-src and lapack-src, as well as Rust’s built-in SIMD functions, we can write fast and surprisingly portable Rust code. You can even run Rust on the GPU using, at least in part, the same underlying code.

About SIMD in Rust

Arguably the least useful of the three for machine learning, but still valuable for everyday tasks, are Single-Instruction Multiple Data (SIMD) operations, also known as vectorization (the terms are used interchangeably in the Rust documentation). On modern CPUs, SIMD allows us to perform mathematical operations up to eight times faster than standard approaches by packing numbers together and operating on them in parallel. Let’s do a quick comparison to see how SIMD looks in practice.

Consider multiplying four pairs of floats one element at a time, the standard scalar approach:

// Assumes use test::{black_box, Bencher}; on a nightly toolchain
fn standard_128_f32(bench: &mut Bencher) {
	let a_values: [f32; 4] = [8.1239412, 931.20100, 5.531, 6.030100];
	let b_values: [f32; 4] = [9.0003, 20.202, 81325.20230, 195132.00999];
	bench.iter(|| {
		for _i in 1..1000000 {
			// Four independent scalar multiplications per pass
			black_box([a_values[0] * b_values[0], a_values[1] * b_values[1],
			           a_values[2] * b_values[2], a_values[3] * b_values[3]]);
		}
	})
}

Or in SIMD:

// Assumes use std::arch::x86_64::*; for the SSE intrinsics
fn simd_128_f32(bench: &mut Bencher) {
	unsafe {
		// Pack four f32s into a single 128-bit register each
		let a_values = _mm_set_ps(8.1239412, -931.20100, 5.531, -6.030100);
		let b_values = _mm_set_ps(9.0003, -20.202, 81325.20230, 195132.00999);
		bench.iter(|| {
			for _i in 1..1000000 {
				// One instruction multiplies all four lanes at once
				black_box(_mm_mul_ps(a_values, b_values));
			}
		});
	}
}

The standard approach runs the 1M multiplications in 670,925 ns per iteration. In comparison, the SIMD approach runs in 315,533 ns per iteration, with a +/- of only 32,635 ns. So: just over two times faster, in exchange for working with a little bit of unsafe code. Generally, I find this to be reasonably consistent across SIMD functions; roughly 1.7x faster has been my typical result, once everything is transferred to the proper data structures.

The big challenge with SIMD in Rust is getting the data into the required data structures in the first place. Even in mathematically intense operations, the gain can be relatively limited if you have to convert all the data between structs, Vecs, and __m-type registers. Consider the computation of the Haversine distance between two points on the Earth’s surface. Sadly, the performance gains will be limited: I found them to be only in the range of 8%. Why is this? The cost of moving the data to and from SIMD-compatible structures in order to compute the cosine, which (since Rust does not expose the Intel SVML intrinsic _mm_cos_pd) has to be done with the standard cosine function on a plain Vec or other data structure.
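
To make that overhead concrete, here’s a minimal sketch of what the shuffling looks like (the function and its structure are my own illustration, not library code): the two angle deltas can be halved together in one SIMD multiply, but they have to be unpacked back into plain f64s before the trig calls.

use std::arch::x86_64::*;

const EARTH_RADIUS_KM: f64 = 6371.0;

// A sketch of a partially vectorized Haversine distance (x86_64 only)
unsafe fn haversine_km(lat1: f64, lon1: f64, lat2: f64, lon2: f64) -> f64 {
	// Pack both angle deltas (in radians) into one register and halve them in a single multiply
	let deltas = _mm_set_pd((lat2 - lat1).to_radians(), (lon2 - lon1).to_radians());
	let halves = _mm_mul_pd(deltas, _mm_set1_pd(0.5));

	// core::arch has no SVML trig intrinsics like _mm_cos_pd, so unpack back to plain f64s
	let mut half = [0.0f64; 2];
	_mm_storeu_pd(half.as_mut_ptr(), halves);
	let (sin_dlat, sin_dlon) = (half[1].sin(), half[0].sin());

	let a = sin_dlat * sin_dlat
		+ lat1.to_radians().cos() * lat2.to_radians().cos() * sin_dlon * sin_dlon;
	EARTH_RADIUS_KM * 2.0 * a.sqrt().atan2((1.0 - a).sqrt())
}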

For machine learning, SIMD is useful to know about. For very low-dimensional datasets, being able to structure data into multiples of four can result in noticeably better performance. It’s also useful to have this as a baseline when evaluating BLAS and Lapack.
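
As a rough illustration of structuring data into multiples of four (the function and names here are hypothetical, not from any library), you can zero-pad a pair of equal-length feature vectors and then process them four lanes at a time:

use std::arch::x86_64::*;

// A sketch: pad to a multiple of four, then multiply element-wise with SSE
unsafe fn elementwise_mul(a: &mut Vec<f32>, b: &mut Vec<f32>) -> Vec<f32> {
	// Zero-pad both inputs so their length is a multiple of four
	while a.len() % 4 != 0 {
		a.push(0.0);
		b.push(0.0);
	}
	let mut out = vec![0.0f32; a.len()];
	for i in (0..a.len()).step_by(4) {
		let va = _mm_loadu_ps(a.as_ptr().add(i));
		let vb = _mm_loadu_ps(b.as_ptr().add(i));
		_mm_storeu_ps(out.as_mut_ptr().add(i), _mm_mul_ps(va, vb));
	}
	out
}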

What’s BLAS? How do you use it in Rust?

The Basic Linear Algebra Subprograms (BLAS) are low-level routines that started out as a Fortran library in 1979 and have since become quite widespread. It’s worth noting that most platforms either can install BLAS or already have some form of it installed (on Mac, it’s bundled with Lapack in Accelerate). BLAS provides pretty standard stuff: dot products, matrix multiplication, and the like. The limitation of BLAS, however, is that it has no built-in functionality for taking a matrix inverse. It’s also worth noting that while it’s very fast, it’s far from easy to work with.

Let’s multiply two matrices using BLAS:

let mone = vec![
	2.0, 3.0,
	4.0, 5.0
];
let mtwo = vec![
	0.5, 1.0,
	1.5, 2.0
];
let mut result = vec![
	0.0, 0.0,
	0.0, 0.0,
];
unsafe {
	// C = alpha * A * B + beta * C; the flat vecs are read as column-major 2x2 matrices
	dgemm(b'N', b'N', 2, 2, 2, 1.0, &mone, 2, &mtwo, 2, 1.0, &mut result, 2);
}

Congratulations! The result of this multiplication is now stored in result. I think few of us would classify this procedure as particularly pleasant, but it’s a little less bad than it looks. The two b'N' arguments are how we instruct the stringly-typed API of BLAS not to transpose either mone or mtwo. We specify the m, n, and k dimensions along with the leading dimension of each matrix, for a grand total of six values (all 2 here). Next, we provide the alpha and beta scalars that scale the product and the existing contents of the output (both 1.0). Finally, we provide a mutable reference to our output variable.
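
Because BLAS comes from Fortran, it reads these flat vecs as column-major matrices. Here’s a quick sanity check of what should now be in result (a sketch, assuming the standard dgemm semantics):

// mone is read column-major as [[2, 4], [3, 5]] and mtwo as [[0.5, 1.5], [1.0, 2.0]],
// so the product, also laid out column-major, is:
assert_eq!(result, vec![5.0, 6.5, 11.0, 14.5]);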

After this inconvenience and a bit of dependency hell, we have finally accomplished something tremendous. Our code now runs another 2-4x faster, and we can get GPU support with little further effort by swapping in NVIDIA’s cuBLAS as the BLAS backend.

So, great! Let’s do something with this. How about computing ordinary least squares? The closed-form solution is beta = (XᵀX)⁻¹Xᵀy. Unfortunately, we can get close, but we can’t quite get there: BLAS gives us the matrix products, but we cannot easily compute the inverse with it. We’ll keep moving forward to Lapack.

All about Lapack in Rust

Lapack is great: it provides high-performance functions for many operations that are essential to machine learning. The downside is that we’re really getting off the beaten path here, and the documentation is thin. Intel provides the best documentation on the subject; its reference is often better than a Google search for finding the relevant function. And if you thought BLAS was a tad unwieldy, Lapack is a bit more so.

So let’s perform a mixed BLAS and Lapack ordinary least squares (OLS) regression. We’ll do a fairly standard setup.

let (rows, cols) = (4, 1);
let scale_by = 1.0;
// X: a 4x1 design matrix (a single feature, no intercept)
let xs = vec![
	2.0,
	3.0,
	4.0,
	5.0
];
// y: the four observed responses
let ys = vec![
	1.0,
	2.0,
	2.3,
	0.4
];
// With a single column, XᵀX, Xᵀy, and the final coefficient are all 1x1
let mut xtx = vec![
	0.0
];

let mut xty = vec![
	0.0,
];

let mut result = vec![0.0];

Then, we’ll start by doing some simple BLAS multiplication: we’ll compute X transposed X (XᵀX) and X transposed y (Xᵀy).

unsafe {
	// xtx = XᵀX and xty = Xᵀy; b'T' transposes the first operand,
	// and the leading dimensions are rows because X is stored as a 4x1 column
	dgemm(b'T', b'N', cols, cols, rows, scale_by, &xs, rows, &xs, rows, scale_by, &mut xtx, cols);
	dgemm(b'T', b'N', cols, cols, rows, scale_by, &xs, rows, &ys, rows, scale_by, &mut xty, cols);
}
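
As a quick check, the values follow directly from the data above: XᵀX = 2² + 3² + 4² + 5² = 54 and Xᵀy = 2·1.0 + 3·2.0 + 4·2.3 + 5·0.4 = 19.2.

// The two 1x1 products; compare with a tolerance because of the floating-point sum
assert!((xtx[0] - 54.0).abs() < 1e-9);
assert!((xty[0] - 19.2).abs() < 1e-9);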

Next, we’ll need to compute the LU decomposition using Lapack in order to get the inverse of X transposed X.

// IPIV holds the pivot indices from the LU decomposition; it needs min(m, n) integer entries
let mut ipiv = vec![
	0
];
 
// Then we'll need the inverse of xtx
// In production, check status after each step
let mut status = 0;
unsafe {
	// First we need to generate the LU decomposition of the XTX matrix, in place
	dgetrf_(1, 1, &mut xtx, 1, &mut ipiv, &mut status);
}

Once we have the LU decomposition, with the pivots stored in ipiv, we can compute the inverse.

unsafe {
	// Then we can generate the inverse of the matrix from the LU decomposition
	let mut workspace = vec![0.0];
	dgetri_(1, &mut xtx, 1, &ipiv, &mut workspace, 1, &mut status);
}
// XTX has now been overwritten by its inverse
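
Since XᵀX was just the 1x1 value 54, the inverse is easy to check:

// xtx now holds (XᵀX)⁻¹ = 1/54 ≈ 0.0185
assert!((xtx[0] - 1.0 / 54.0).abs() < 1e-9);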

Finally, we’ll go back to BLAS to wrap it up and get our result:

// Back to BLAS: result = (XᵀX)⁻¹ Xᵀy, a 1x1 multiplication
unsafe {
	dgemm(b'N', b'N', 1, 1, 1, scale_by, &xtx, 1, &xty, 1, scale_by, &mut result, 1);
}
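
For this toy dataset the single coefficient should come out to 19.2 / 54 ≈ 0.356, which makes an easy end-to-end check:

// beta = (XᵀX)⁻¹ Xᵀy = 19.2 / 54
assert!((result[0] - 19.2 / 54.0).abs() < 1e-9);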

Our result is now stored nicely in our vec called result. While this process may not have been simple, it is fast, and it provides an excellent foundation for understanding how the underlying operations are performed. Most performant matrix libraries in any programming language ultimately depend on Lapack and/or BLAS, and it’s good to know the fundamentals underneath our libraries.

Why should we care?

Performance matters. Moore’s law is ending, so if we want the machine learning ecosystem to keep thriving, we’ll need to learn to optimize our code. The performance achievable this way, compared to Python calling into C, is an improvement of 4-10x. That’s around six years of Moore’s law improvements on the high end.

Rust is also quite nice to debug, and the strict types encourage good practices when doing data engineering. I urge you: start trying to use Rust for your machine learning. It’ll be a bit frustrating now, but it’ll pay off in the coming years. If you want to learn more about BLAS and Lapack, I’ll be honest, there’s not much out there. If you see something, please send it my way and I’ll update this post.