Multi Signals Performance Optimization of the ACT C++ Library on CPU and GPU

[Github : ACT C++ Library]

In the previous post Performance Optimization of the ACT C++ Library on CPU and GPU we covered hardware acceleration and performance optimization to transform a single signal using the ACT algorithm. Here we will cover how to transform multiple signals in parallel in real time!

Handling multiple signals

As the performance profiling we did in the previous post showed, the bulk of the computation in the ACT algorithm is the dictionary search, which is a matrix multiplication between the dictionary matrix and the signal vector, all other operations, such as the BFGS optimization, are minor in comparison.

Given that multiple signals vectors can be represented as a matrix, we can perform the dictionary search in parallel by simply multiplying the dictionary matrix by the matrix of signals.

BLAS offers the gemm function to perform matrix multiplication for our CPU only implementation, and MLX offers the mx::matmul function to do the same on GPUs.

Here are the relevant code snippets for the CPU and GPU implementations:

CPU with BLAS

The gemm_colmajor function is a wrapper for the cblas_sgemm and cblas_dgemm functions from the BLAS library for each of the supported scalar types (float32 and double).

// gemm (column-major): C := alpha * op(A) * op(B) + beta * C
inline void gemm_colmajor(CBLAS_TRANSPOSE transA, CBLAS_TRANSPOSE transB,
                          int M, int N, int K,
                          float alpha,
                          const float* A, int lda,
                          const float* B, int ldb,
                          float beta,
                          float* C, int ldc) {
    cblas_sgemm(CblasColMajor, transA, transB, M, N, K, alpha, A, lda, B, ldb, beta, C, ldc);
}
inline void gemm_colmajor(CBLAS_TRANSPOSE transA, CBLAS_TRANSPOSE transB,
                          int M, int N, int K,
                          double alpha,
                          const double* A, int lda,
                          const double* B, int ldb,
                          double beta,
                          double* C, int ldc) {
    cblas_dgemm(CblasColMajor, transA, transB, M, N, K, alpha, A, lda, B, ldb, beta, C, ldc);
}

The search is implemented as a matrix multiplication between the dictionary matrix and the signal matrix.

    for (int it = 0; it < opts.order; ++it) {

        // Coarse scores for all signals at once: S = A^T * R
        act::blas::gemm_colmajor(CblasTrans, CblasNoTrans,
                                  n, k, m,
                                  Scalar(1),
                                  A.data(), m,
                                  R.data(), m,
                                  Scalar(0),
                                  S.data(), n);

        for (int j = 0; j < k; ++j) {

            const Scalar* Sj = S.data() + static_cast<size_t>(j) * static_cast<size_t>(n);
            int best_idx = act::blas::iamax(n, Sj, 1);

            // Refinement if requested
            Eigen::Vector4d refined = init;
            if (opts.refine) {
                refined = act.refine_params_bfgs(init, R.col(j));
            }

All signals are sequentially processed at the same order opts.order in the main for loop. For each order it, we multiply the dictionary matrix A by the signal matrix R to get the scores matrix S.

Then we find the best match for each signal in the scores matrix S using the act::blas::iamax function. The best match is used as an initial guess for the BFGS optimization.

GPU with MLX

        std::vector<float> R_rowmajor(static_cast<size_t>(m) * static_cast<size_t>(k));

        for (int it = 0; it < opts.order; ++it) {
            if (active_count == 0) break;

            // Upload R
            for (int i = 0; i < m; ++i) {
                for (int j = 0; j < k; ++j) {
                    R_rowmajor[static_cast<size_t>(i) * static_cast<size_t>(k) + static_cast<size_t>(j)] = R(i, j);
                }
            }
            mx::array R_gpu(R_rowmajor.data(), mx::Shape{m, k}, mx::float32);

            // Coarse selection on device
            auto S = mx::matmul(mx::transpose(A_gpu), R_gpu); // {n, k}
            auto idxs = mx::argmax(mx::abs(S), 0);            // {k}

            for (int j = 0; j < k; ++j) {

                int best_idx = mx::take(idxs, j).template item<int>();

                // Coarse init
                Eigen::Vector4d init;
                init[0] = P(best_idx, 0);
                init[1] = P(best_idx, 1);
                init[2] = P(best_idx, 2);
                init[3] = P(best_idx, 3);

                // Refine in double (public wrapper handles internal double optimize)
                Eigen::Vector4d refined = init;
                if (opts.refine) {
                    // Map current float residual to Eigen::Map<float> and pass; wrapper accepts Scalar=Scalar(act)
                    refined = act.refine_params_bfgs(init, R.col(j));
                }

Similar to the CPU implementation we process all of the signals at the same order opts.order in the main for loop.

The signal matrix R is uploaded to the GPU, and mx::matmul is used to perform the matrix multiplication between the transposed dictionary matrix A_gpu and the signal matrix R_gpu to get the scores matrix S.

For each signal we then use mx::take to find the best match for the signal in the scores matrix S. The best match is used as an initial guess for the BFGS optimization.

Profiling results

The profiler for the multi-signal is configured similarly to the single signal profiler, the code is in profile-act-mt.cpp. The code generates a fairly large dictionary of approximately 600k atoms with a memory footprint of 2.3GB with double precision, 1.1GB with single precision. The signal length is 2 seconds, 512 samples at 256Hz. By default 16 separate signals are generated and transformed in parallel, the performance numbers collected below use this configuration. The transform order for all signals is 10.

Mac : Apple MacBook Pro M1 Max 32GB running MacOS 14.6
PC : custom built Intel Core i9 14th gen, 64GB RAM, NVIDIA RTX 4090 running Ubuntu 24.04.
CPU : the ACT implementation using BLAS
MLX : the ACT implementation using Apple MLX
double : double precision (double|double64)
single : single precision (float|float32)
Transform : the full transform operation at order 10, this includes the dictionary search operation and the BFGS optimization at each turn
SNR : Signal to Noise Ratio, higher is better
RT Factor : Real Time Factor, higher is better, this is the ratio of the transform time to the signal length in seconds, in this specific case how many 2 seconds of EEG data can be processed in one second, which will give us a confidence factor on the real time performance of the algorithm.

Platform	Transform	SNR	RT Factor
Mac / CPU / double	Total batch: 1123.43 ms; Per-signal avg: 70.21 ms	Mean: 13.10 dB; Range: [12.74, 13.77] dB	28.5x
Mac / MLX / single	Total batch: 423.13 ms; Per-signal avg: 26.45 ms	Mean: 10.96 dB; Range: [10.56, 11.69] dB	75.6x
PC / MLX / single	Total batch: 208.20 ms; Per-signal avg: 13.01 ms	Mean: 11.15 dB; Range: [10.26, 12.20] dB	153.7x

The results clearly show that the ACT library is capable of processing 16 signals at a pace much faster than real time on all platforms.

Demo

To showcase the realtime capabilities I added an OSC receiver and parser to the ACT EEG workbench application. The OSC receiver is configured to receive data from the Muse headset and parse it into a matrix of signals, which are then transformed in real time using the ACT algorithm and played back as audio waveforms to the platform sound system. The current implementation uses the raw signal from each of the 4 EEG channels of the Muse headset without any filtering or preprocessing.

Here is a video of my brainwave signals played back as chirplets: Loom