Quick update: I got the Python benchmarking script working and added a convolution kernel as well as more matrix multiplication configurations. As far as I can tell, the new loop order this change introduces gives no performance gain (at least for the sizes and sparsity factors I tried).

