Polybench had long execution times because it called fprintf once per
array element, up to 4000*4000 (16 million) times. For most programs,
this accounted for more than half of the execution time.
The current solution is to convert the values into a stream of
nibbles stored as a char string and print it once per row, i.e.
at most 4000 times, using fputs instead of fprintf.
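
A minimal sketch of the per-row approach follows. It is an illustration
under stated assumptions, not the code in this patch: the helper names
encode_row and print_row are hypothetical, and mapping each nibble to a
hex digit is an assumed encoding. The output reflects the host's byte
order, so it is only comparable between machines of the same endianness.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: encode n doubles as two characters per byte.
 * out must hold at least n * sizeof(double) * 2 + 2 bytes
 * (nibbles, newline, terminating NUL). Returns the string length. */
static size_t encode_row(const double *row, size_t n, char *out)
{
    /* Assumed encoding: one hex digit per nibble. */
    static const char digits[] = "0123456789abcdef";
    size_t pos = 0;

    assert(row != NULL && out != NULL);
    for (size_t i = 0; i < n; i++) {
        unsigned char bytes[sizeof(double)];
        memcpy(bytes, &row[i], sizeof bytes);     /* raw object bytes */
        for (size_t b = 0; b < sizeof bytes; b++) {
            out[pos++] = digits[bytes[b] >> 4];   /* high nibble */
            out[pos++] = digits[bytes[b] & 0x0f]; /* low nibble */
        }
    }
    out[pos++] = '\n';
    out[pos] = '\0';
    return pos;
}

/* One fputs call per row instead of one fprintf call per element. */
static int print_row(FILE *fp, const double *row, size_t n, char *buf)
{
    encode_row(row, n, buf);
    return fputs(buf, fp);
}
```

With a 4000-element row, buf needs 4000 * sizeof(double) * 2 + 2 bytes
and can be reused for every row, so the dump makes 4000 fputs calls in
place of 16 million fprintf calls.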
Overall, the new execution time is 47% of the previous one, with some
programs as low as 5%. The reduction on x86_64 was 53%, on ARM 51% and
on AArch64 55%, which means most of the time was spent on I/O, not on
the actual benchmark.
Tested on all three architectures with small and full workloads.
Checksums have been updated.