This is an archive of the discontinued LLVM Phabricator instance.

[LNT] Reduce I/O execution time for Polybench
ClosedPublic

Authored by rengolin on Jul 7 2015, 5:58 AM.

Details

Summary

Polybench had large execution times due to successive calls to
fprintf, as many as 4000*4000 times. For most programs, this was more
than half of the total execution time.

The current solution is to transform the values into a stream of
nibbles as a char string and print it once per row, i.e.
only about 4000 times, using fputs instead of fprintf.

Overall, the new execution time is 47% of the previous one, with some benchmarks as low as 5%.
The reduction on x86_64 was 53%, on ARM 51% and on AArch64 55%,
which means most of the time was being spent on I/O, not on the actual benchmark.

I ran this on all three architectures with small and full workloads. Checksums updated.

Diff Detail

Repository
rL LLVM

Event Timeline

rengolin updated this revision to Diff 29167.Jul 7 2015, 5:58 AM
rengolin retitled this revision from to [LNT] Reduce I/O execution time for Polybench.
rengolin updated this object.
rengolin set the repository for this revision to rL LLVM.
rengolin added subscribers: llvm-commits, cmatthews.
grosser edited edge metadata.Jul 7 2015, 6:09 AM

Hi Renato,

thank you for working on this. This definitely goes in the right direction. We could possibly even compute a single output hash in Polybench and just print that hash. In the end, we only want to know whether the output changed.

> thank you for working on this. This definitely goes in the right direction. We could possibly even compute a single output hash in Polybench and just print that hash. In the end, we only want to know whether the output changed.

That was the original plan, but I'd need to add a hash function to Polybench.

So I went for an intermediate solution: a simple dump function that is quick enough not to be the hottest function in the benchmark, yet concise enough not to depend on I/O speed or to invoke printf-like functions for floating-point data.

In the end, we have one malloc+free, ~4000 inlined calls to a handful of bit operations and writes, and one memcpy, all of which are easy to optimise on almost every target.
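
For reference, a minimal sketch of the kind of per-row nibble dump described above (assuming 32-bit float data; the names print_element and printmat follow the discussion below, and the details may not match the committed patch exactly):

  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <stdint.h>

  /* Expand one 32-bit float into 8 hex characters (nibbles), MSB first,
     starting at offset pos in the row buffer. */
  static void print_element(float v, int pos, char *buf) {
    uint32_t bits;
    memcpy(&bits, &v, sizeof(bits));
    for (int k = 0; k < 8; k++)
      buf[pos + k] = "0123456789ABCDEF"[(bits >> (28 - 4 * k)) & 0xF];
  }

  /* Dump an n x n matrix one row at a time: n*8 characters per fputs call,
     instead of one fprintf call per element. */
  static void print_array(int n, float A[n][n]) {
    char *printmat = malloc(n * 8 + 1);
    printmat[n * 8] = '\0';
    for (int i = 0; i < n; i++) {
      for (int j = 0; j < n; j++)
        print_element(A[i][j], j * 8, printmat);
      fputs(printmat, stderr);
    }
    free(printmat);
  }

The key points are the single fputs per row and the absence of any floating-point formatting.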

The speed-up is already impressive; I think over-engineering this would be counter-productive. :)

cheers,
--renato

> The speed-up is already impressive; I think over-engineering this would be counter-productive. :)

Not to mention that now we'll be tracking actual performance in Polybench, not I/O randomness.

cheers,
--renato

Hi Renato,

I also don't think this needs to be over-engineered. Maybe I missed something, but could
we not just copy https://en.wikipedia.org/wiki/MurmurHash, add a call to

int32_t hash = murmur3_32(A, N * N * sizeof(float), 0);
printf("%d", hash);

and we are done as well. The solutions are almost identical, but here we really do fully eliminate the I/O cost.

In the end, the difference is probably minor. So if you prefer to go with your hand-written approach, that's fine with me as well.
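
For illustration, a hash-based dump along those lines might look like the sketch below (using a simple FNV-1a hash purely as a stand-in for MurmurHash; nothing here is part of the actual patch):

  #include <stdint.h>
  #include <stdio.h>

  /* FNV-1a, used here only as a stand-in for the MurmurHash suggestion above. */
  static uint32_t fnv1a_32(const unsigned char *p, size_t len) {
    uint32_t h = 2166136261u;
    while (len--) {
      h ^= *p++;
      h *= 16777619u;
    }
    return h;
  }

  /* Reduce an entire n x n matrix to a single value, eliminating per-element I/O. */
  static void print_hash(int n, float A[n][n]) {
    uint32_t h = fnv1a_32((const unsigned char *)A, (size_t)n * n * sizeof(float));
    fprintf(stderr, "%u\n", h);
  }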

> I also don't think this needs to be over-engineered. Maybe I missed something, but could
> we not just copy https://en.wikipedia.org/wiki/MurmurHash, add a call to

Choosing an arbitrary hash function would invite disagreement, etc. Also, the I/O cost is already negligible.

I have compiled with and without -DPOLYBENCH_DUMP_ARRAYS, and in most cases the difference is below the noise.

kristof.beyls added inline comments.Jul 7 2015, 9:50 AM
SingleSource/Benchmarks/Polybench/stencils/seidel-2d/seidel-2d.c
46

Do I understand correctly that this code basically only prints out the values of the last row of the entire matrix (the offset is j*8)? I think we'd want whatever hash function implementation we end up with to still take all elements as input, to improve the chance of detecting a mis-compilation.
I think the hash function can be really simple - no need for anything complex or secure; but we should probably feed all matrix elements into it. Maybe the straightforward solution here is to just print out the sum of all elements in a row, rather than each element in the row?

Tobias may know these tests better: are we expecting bit-reproducible results for these tests? I'm guessing so, unless DATA_PRINTF_MODIFIER in the original code was chosen so that it prints with less precision?

SingleSource/Benchmarks/Polybench/utilities/polybench.h
609–630

I'm not sure, but it looks like this may give different answers on big- versus little-endian machines (which is something the previous implementation didn't have)?
Maybe just printing out the sum of each row of a matrix (i.e. 4000 floats being printed) instead of the entire matrix (4000*4000 floats being printed) would already reduce the I/O overhead into the noise? If not, a couple of rows could be summed together to reduce the amount of I/O further. A sketch of this idea follows below.
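
For what it's worth, a row-sum dump along those lines might look like the following sketch (purely illustrative; this is not what the patch ended up doing):

  #include <stdio.h>

  /* Print one sum per row instead of every element, reducing the output
     from n*n values to n values. */
  static void print_row_sums(int n, float A[n][n]) {
    for (int i = 0; i < n; i++) {
      double row_sum = 0.0;
      for (int j = 0; j < n; j++)
        row_sum += A[i][j];
      fprintf(stderr, "%0.6f\n", row_sum);
    }
  }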

rengolin added inline comments.Jul 7 2015, 10:00 AM
SingleSource/Benchmarks/Polybench/stencils/seidel-2d/seidel-2d.c
46

No.

print_element receives a float value (4 bytes) and expands it into 8 nibbles (8 characters). So every print of A[i][j] writes into printmat starting at offset j*8; j*8 is only the initial position of an 8-character run, not the *only* position written.

SingleSource/Benchmarks/Polybench/utilities/polybench.h
609–630

Yes, it does, and that's ok. We already have many tests with different outputs for big and little endian, and we deal with it by having a file called *.reference_outputs.big-endian. I can't run it, as I don't have any big-endian boxes. If you do, and want to do it now, feel free to send me some reference outputs. If not, we can wait until someone who runs it sends them. That's how we've done it in the past.

A sum of all the elements would *also* not be perfect, as two changes that cancel each other out would go unnoticed.

I/O is not an issue any more, and the differences are indistinguishable from noise.

kristof.beyls added inline comments.Jul 7 2015, 10:09 AM
SingleSource/Benchmarks/Polybench/stencils/seidel-2d/seidel-2d.c
46

Yes, I got that, but given that this is a 2-dimensional matrix, with i indicating the row and j indicating the column, only using j to index the printed result means that every iteration of the i-loop overwrites the results written to printmat on the previous iteration, right? The malloc(n*8) also indicates there is only room to print a single row, not the entire matrix. Or maybe I'm still missing something?

rengolin added inline comments.Jul 7 2015, 10:22 AM
SingleSource/Benchmarks/Polybench/stencils/seidel-2d/seidel-2d.c
46

That's why the fputs below is inside the i loop: I'm printing one row at a time. This also saves a lot of memory, avoids thrashing the allocator, helps caching, etc.

Since the runtime is now indistinguishable from when *not* printing anything, I think it's a good trade-off.

kristof.beyls added inline comments.Jul 7 2015, 10:32 AM
SingleSource/Benchmarks/Polybench/stencils/seidel-2d/seidel-2d.c
46

D'oh - I missed that.
Cool, so we're still producing roughly the same amount of output - but way more efficiently.

Provided these tests were already checking for bit-accurate results (I'm not sure - it probably depends on the DATA_PRINTF_MODIFIER), this looks good to me.

rengolin added inline comments.Jul 7 2015, 12:14 PM
SingleSource/Benchmarks/Polybench/stencils/seidel-2d/seidel-2d.c
46

> Cool, so we're still producing roughly the same amount of output - but way more efficiently.

Precisely. :)

> Provided these tests were already checking for bit-accurate results (I'm not sure - it probably depends on the DATA_PRINTF_MODIFIER), this looks good to me.

They weren't: the modifier was mostly "%0.2f", and that's why one of the tests (fdtd-apml) didn't work with the new technique. But the gain was too small to justify any work towards that goal.

However, most of the others worked out of the box on ARM, AArch64 and x86_64. I think having stricter checking is ok, as long as we understand that this was not a requirement; though I do think it's a good thing.

I'll add a comment before print_element() to that effect, so that if people see it failing, we can revert the ones that weren't exact.

I can't see how an alternative would work without either using printf or accumulating values.

The results of Polybench are supposed to be bit-identical. However, some of the tests yield NaNs/infs (which they should not, but in some configurations may), and for those I am unsure about the bitwise identity.

> The results of Polybench are supposed to be bit-identical. However, some of the tests yield NaNs/infs (which they should not, but in some configurations may), and for those I am unsure about the bitwise identity.

Ok, so IIUC, my changes make the tests more accurate and will increase not only the benchmarking quality but also the testing quality.

I'll push these changes as they are, since they have clear value, but I'll add a note to fdtd-apml explaining why I haven't moved it. That's a job for another commit.

cheers,
--renato

rengolin accepted this revision.Jul 8 2015, 3:23 AM
rengolin added a reviewer: rengolin.
This revision is now accepted and ready to land.Jul 8 2015, 3:23 AM
rengolin closed this revision.Jul 8 2015, 3:23 AM

r241675

lhames added a subscriber: lhames.Jul 8 2015, 11:37 AM

Hi Renato,

I'm seeing intermittent failures on some of our internal arm64 builders
that I think are related to this. For example, I'm seeing:

  **** TEST (simple) 'reg_detect' FAILED! ****

Execution Context Diff:
/.../build/tools/fpcmp: files differ without tolerance allowance

  **** TEST (simple) 'reg_detect' ********

I ran "reg_detect" manually 10 times and grabbed the raw output, and I'm
seeing variation from run to run:

000000220000003300000033000000330000003300000033000000000000003300000044000000440000004400008844000000000000000000000044000000440000004400000055000000000000000000000000000000440000005500008855000000000000000000000000000000??0000<<550000<<55000000000000000000000000000000??0000000000006655exit
0

000000220000003300000033000000330000003300000033@000000000000003300000044000000440000004400008844
@000000000000000000000044000000440000004400000055@000000000000000000000000000000440000005500008855
@000000000000000000000000000000??0000<<550000<<55@000000000000000000000000000000
??0000000000006655@exit 0

000000220000003300000033000000330000003300000033@000000000000003300000044000000440000004400008844
@000000000000000000000044000000440000004400000055@000000000000000000000000000000440000005500008855
@000000000000000000000000000000??0000<<550000<<55@000000000000000000000000000000
??0000000000006655@exit 0

000000220000003300000033000000330000003300000033@000000000000003300000044000000440000004400008844
@000000000000000000000044000000440000004400000055@000000000000000000000000000000440000005500008855
@000000000000000000000000000000??0000<<550000<<55@000000000000000000000000000000
??0000000000006655@exit 0

000000220000003300000033000000330000003300000033000000000000003300000044000000440000004400008844000000000000000000000044000000440000004400000055000000000000000000000000000000440000005500008855000000000000000000000000000000??0000<<550000<<55000000000000000000000000000000??0000000000006655exit
0

000000220000003300000033000000330000003300000033@000000000000003300000044000000440000004400008844
@000000000000000000000044000000440000004400000055@000000000000000000000000000000440000005500008855
@000000000000000000000000000000??0000<<550000<<55@000000000000000000000000000000
??0000000000006655@exit 0

000000220000003300000033000000330000003300000033@000000000000003300000044000000440000004400008844
@000000000000000000000044000000440000004400000055@000000000000000000000000000000440000005500008855
@000000000000000000000000000000??0000<<550000<<55@000000000000000000000000000000
??0000000000006655@exit 0

000000220000003300000033000000330000003300000033000000000000003300000044000000440000004400008844000000000000000000000044000000440000004400000055000000000000000000000000000000440000005500008855000000000000000000000000000000??0000<<550000<<55000000000000000000000000000000??0000000000006655exit
0

000000220000003300000033000000330000003300000033@000000000000003300000044000000440000004400008844
@000000000000000000000044000000440000004400000055@000000000000000000000000000000440000005500008855
@000000000000000000000000000000??0000<<550000<<55@000000000000000000000000000000
??0000000000006655@exit 0

000000220000003300000033000000330000003300000033000000000000003300000044000000440000004400008844000000000000000000000044000000440000004400000055000000000000000000000000000000440000005500008855000000000000000000000000000000??0000<<550000<<55000000000000000000000000000000??0000000000006655exit
0

Any thoughts on what could be causing this?

Cheers,
Lang.

Hi Lang,

Yes, the tests should be bit-exact, but not all of them are. I'll revert that one test for now, and add a FIXME for someone to look into it.

cheers,
--renato

Should be fixed in r241709.

> Hi Lang,

> Yes, the tests should be bit-exact, but not all of them are. I'll revert that one test for now, and add a FIXME for someone to look into it.

-NaN, NaN and such values are not required to have the same bit value. We can probably just map them to a fixed bit pattern.

Tobias
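
A sketch of what such a mapping could look like (illustrative only, with an arbitrarily chosen quiet-NaN pattern; not part of any committed change):

  #include <math.h>
  #include <stdint.h>
  #include <string.h>

  /* Map every NaN (including -NaN) onto one fixed quiet-NaN bit pattern before
     dumping, so differing NaN encodings cannot break bit-exact comparison. */
  static float canonicalize_nan(float v) {
    if (isnan(v)) {
      uint32_t q = 0x7FC00000u;
      memcpy(&v, &q, sizeof(v));
    }
    return v;
  }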

Not even on the same architecture across multiple runs? This one seems odd...

I think it's a bug that was masked because, in the past, people set the precision
way too low to avoid variations...

Mh, interesting. But yes, you are probably right.

Polybench was known to have some issues, but this was not one I was aware of. There is also a newer version of Polybench (4.x) which we may want to upgrade to at some point; it has most of the performance issues resolved.

Hi Renato,

I'm also seeing a similar failure on gramschmidt. Could you fix or revert
this there too? Sorry!

- Lang.