The Polybench programs had long execution times because fprintf was
called once per matrix element, up to 4000*4000 times. For most programs,
this accounted for more than half of their execution time.
The current solution is to transform the values into a stream of
nibbles stored as a char string and to print it once per row, i.e.
only up to 4000 times, using fputs instead of fprintf.
Overall, the new execution time is 47% of the previous one, with some programs as low as 5%.
The reduction was 53% on x86_64, 51% on ARM and 55% on AArch64,
which means most of the time was spent on I/O, not on the actual benchmark.
I ran this on all three architectures with small and full workloads. Checksums updated.
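For readers following the summary, the nibble encoding can be sketched roughly as below. This is a minimal illustration and not the code from the patch: the names `encode_row` and `print_row` are hypothetical, the element type is assumed to be `double`, and each nibble is mapped to one printable character by adding it to `'0'`. The bytes are taken in memory order, so the output of this sketch depends on the host's endianness.

```c
#include <stdio.h>
#include <string.h>

/* Two characters per byte: one per 4-bit nibble (assumes double elements). */
#define NIBBLES_PER_VALUE (2 * sizeof(double))

/* Hypothetical helper: encode every element of `row` into `buf`.
   `buf` must hold at least n * NIBBLES_PER_VALUE + 2 characters. */
void encode_row(const double *row, int n, char *buf)
{
    size_t pos = 0;
    for (int j = 0; j < n; j++) {
        unsigned char bytes[sizeof(double)];
        memcpy(bytes, &row[j], sizeof(double)); /* bytes in memory order */
        for (size_t k = 0; k < sizeof(double); k++) {
            buf[pos++] = (char)('0' + (bytes[k] & 0x0F)); /* low nibble  */
            buf[pos++] = (char)('0' + (bytes[k] >> 4));   /* high nibble */
        }
    }
    buf[pos++] = '\n';
    buf[pos] = '\0';
}

/* Hypothetical helper: one fputs call per row instead of one fprintf
   call per element. */
void print_row(FILE *out, const double *row, int n, char *buf)
{
    encode_row(row, n, buf);
    fputs(buf, out);
}
```

Note that every element of the row contributes to the buffer here, so the whole matrix reaches the output.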
Do I understand correctly that this code basically only prints out the values of the last row of the entire matrix (the offset is j*8)? I think we'd want whatever hash function implementation we end up with to still take all elements as input, to improve the chance of detecting a mis-compilation.
I think the hash function can be really simple - no need for anything complex or secure; but we probably should feed all matrix elements into the hash function. Maybe the straightforward solution here is to just print out the sum of all elements in a row, rather than each element in the row?
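To make the row-sum suggestion concrete, here is a rough sketch. It assumes a row-major matrix of `double`; the names `row_sum` and `print_row_sums` and the `"%0.2f\n"` format are placeholders for illustration, not taken from the Polybench sources.

```c
#include <stdio.h>

/* Sum of one row, accumulated in a fixed left-to-right order. */
double row_sum(const double *row, int n)
{
    double sum = 0.0;
    for (int j = 0; j < n; j++)
        sum += row[j];
    return sum;
}

/* Hypothetical helper: print one line per row, so every element of the
   matrix feeds into the output while the output stays at `rows` lines.
   `a` is assumed to be stored row-major with `cols` elements per row. */
void print_row_sums(FILE *out, const double *a, int rows, int cols)
{
    for (int i = 0; i < rows; i++)
        fprintf(out, "%0.2f\n", row_sum(&a[i * cols], cols));
}
```

One caveat with a plain sum: it does not depend on element order within a row, so two swapped elements, or errors that cancel each other out, would go undetected.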
Tobias may know these tests better: are we expecting bit-reproducible results for these tests? I'm guessing so, unless DATA_PRINTF_MODIFIER in the original code was chosen so that it prints out with less precision?