This is an archive of the discontinued LLVM Phabricator instance.

TLI: Add option to generate dependent stores in scalarization.
AbandonedPublic

Authored by jvesely on Sep 19 2016, 2:16 PM.

Details

Summary

This is used by targets (like r600) that do RMW for truncating stores.
Future optimization should remove most redundant stores+loads
Fixes R600 regressions since r274397.

Diff Detail

Repository
rL LLVM

Event Timeline

jvesely updated this revision to Diff 71881.Sep 19 2016, 2:16 PM
jvesely retitled this revision from to TLI: Add option to generate dependent stores in scalarization..
jvesely updated this object.
jvesely added reviewers: tstellarAMD, bogner.
jvesely set the repository for this revision to rL LLVM.
jvesely added subscribers: llvm-commits, arsenm.

ping

I don't understand why you need this (even if you end up lowing the individual stores using a RMW sequence). Can you please explain?

I don't understand why you need this (even if you end up lowing the individual stores using a RMW sequence). Can you please explain?

the problem is if we use RMW sequence for two different elements of the same word. for example storing bytes at address A and A +1. let's assume that A is 4byte aligned. the generated code will look like this
1: r1 = LOAD A
2: r2 = {r1[8:31], x} This is a sequence of AND/OR instructions that masks of the old bits and ORs the new ones
3: STORE A, r2
4: r3 = LOAD A
5: r4 ={r3[0:8],y,r3[15:31]}
This is a sequence of AND/OR instructions that masks of the old bits and ORs the new ones
6: STORE A, r4

The original code does not have dependency between 3 and 4. so 1 and 4 are loads from the same location and get merged into single load:
1: r1 = LOAD A
2: r2 = {r1[8:31], x} This is a sequence of AND/OR instructions that masks of the old bits and ORs the new ones
3: STORE A, r2
5: r4 ={r1[0:8],y,r1[15:31]}
This is a sequence of AND/OR instructions that masks of the old bits and ORs the new ones
6: STORE A, r4
(note that sequences 2,3 and 5,6 are independent so the writes can occur in any order).
which results in data corruption at A.

This patch adds chain dependency between 3 and 4 preventing the load elimination. It should still be possible to eliminate both 3 and 4 (which in turn should enable combining of the bit ops in 2 and 5).

I don't understand why you need this (even if you end up lowing the individual stores using a RMW sequence). Can you please explain?

the problem is if we use RMW sequence for two different elements of the same word. for example storing bytes at address A and A +1. let's assume that A is 4byte aligned. the generated code will look like this
1: r1 = LOAD A
2: r2 = {r1[8:31], x} This is a sequence of AND/OR instructions that masks of the old bits and ORs the new ones
3: STORE A, r2
4: r3 = LOAD A
5: r4 ={r3[0:8],y,r3[15:31]}
This is a sequence of AND/OR instructions that masks of the old bits and ORs the new ones
6: STORE A, r4

The original code does not have dependency between 3 and 4.

Why not? I understand why the truncated stores and extending loads might be independent, but if you have code that widens those without looking at potential aliased peers and adjusting the chain, I think the problem lies there.

jvesely abandoned this revision.Sep 26 2016, 4:52 PM

I don't understand why you need this (even if you end up lowing the individual stores using a RMW sequence). Can you please explain?

the problem is if we use RMW sequence for two different elements of the same word. for example storing bytes at address A and A +1. let's assume that A is 4byte aligned. the generated code will look like this
1: r1 = LOAD A
2: r2 = {r1[8:31], x} This is a sequence of AND/OR instructions that masks of the old bits and ORs the new ones
3: STORE A, r2
4: r3 = LOAD A
5: r4 ={r3[0:8],y,r3[15:31]}
This is a sequence of AND/OR instructions that masks of the old bits and ORs the new ones
6: STORE A, r4

The original code does not have dependency between 3 and 4.

Why not? I understand why the truncated stores and extending loads might be independent, but if you have code that widens those without looking at potential aliased peers and adjusting the chain, I think the problem lies there.

OK. I guess I'll need to look into how to do that then.