This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Promote constant offset to the immediate by finding a new base with 13bit constant offset from the nearby instructions.
ClosedPublic

Authored by FarhanaAleen on Dec 10 2018, 6:20 PM.

Details

Summary

Promote constant offset to immediate by recomputing the relative 13bit offset from nearby instructions.

E.g.
s_movk_i32 s0, 0x1800
v_add_co_u32_e32 v0, vcc, s0, v2
v_addc_co_u32_e32 v1, vcc, 0, v6, vcc

s_movk_i32 s0, 0x1000
v_add_co_u32_e32 v5, vcc, s0, v2
v_addc_co_u32_e32 v6, vcc, 0, v6, vcc
global_load_dwordx2 v[5:6], v[5:6], off
global_load_dwordx2 v[0:1], v[0:1], off
=>
s_movk_i32 s0, 0x1000
v_add_co_u32_e32 v5, vcc, s0, v2
v_addc_co_u32_e32 v6, vcc, 0, v6, vcc
global_load_dwordx2 v[5:6], v[5:6], off
global_load_dwordx2 v[0:1], v[5:6], off offset:2048

Diff Detail

Repository
rL LLVM

Event Timeline

FarhanaAleen created this revision.Dec 10 2018, 6:20 PM

Why aren't these matched in the first place? These shouldn't have gotten this far

Why aren't these matched in the first place? These shouldn't have gotten this far

The offsets are promoted to immediate during the instruction selection if they are 13bit which is the allowed size for globals. This patch is trying to find a 13bit offset from base-address of the nearby instructions by recomputing the relative offset from nearby base address. So, it's more of a global optimization where it needs to traverse the whole program (currently it's limited to the basic block). The optimization also caches all the information to save the compile time. Are you suggesting to do this optimization during instruction selection phase? Wouldn't it be too complex to do during instruction selection?

Why aren't these matched in the first place? These shouldn't have gotten this far

The offsets are promoted to immediate during the instruction selection if they are 13bit which is the allowed size for globals. This patch is trying to find a 13bit offset from base-address of the nearby instructions by recomputing the relative offset from nearby base address. So, it's more of a global optimization where it needs to traverse the whole program (currently it's limited to the basic block). The optimization also caches all the information to save the compile time. Are you suggesting to do this optimization during instruction selection phase? Wouldn't it be too complex to do during instruction selection?

OK, so this is the case where the offset > 13-bits and you are folding the extra low bits. We already do something similar during selection and put the low bits into the immediate offsets and materialize the high bits as a separate constant for some other loads (e.g. see SelectMUBUFScratchOffen). Usually the common high bits get CSEd in the DAG directly or during MachineCSE. Do you get the same result if you extend that to global/flat operations for this?

Also can you reword the commit message to make it clearer what the difference is from the current offset matching?

Why aren't these matched in the first place? These shouldn't have gotten this far

The offsets are promoted to immediate during the instruction selection if they are 13bit which is the allowed size for globals. This patch is trying to find a 13bit offset from base-address of the nearby instructions by recomputing the relative offset from nearby base address. So, it's more of a global optimization where it needs to traverse the whole program (currently it's limited to the basic block). The optimization also caches all the information to save the compile time. Are you suggesting to do this optimization during instruction selection phase? Wouldn't it be too complex to do during instruction selection?

OK, so this is the case where the offset > 13-bits and you are folding the extra low bits. We already do something similar during selection and put the low bits into the immediate offsets and materialize the high bits as a separate constant for some other loads (e.g. see SelectMUBUFScratchOffen). Usually the common high bits get CSEd in the DAG directly or during MachineCSE. Do you get the same result if you extend that to global/flat operations for this?

Also can you reword the commit message to make it clearer what the difference is from the current offset matching?

I think even if the same is implemented at selection there always be cases only visible that late. I.e. both may be needed.

lib/Target/AMDGPU/SILoadStoreOptimizer.cpp
986 ↗(On Diff #177643)

OffsetOp != ...

1007 ↗(On Diff #177643)

Do you expect it to be negative? If so this should not be safe.

1073 ↗(On Diff #177643)

Can you avoid explicit dynamic allocation?

1088 ↗(On Diff #177643)

As far as can see they can be signed int<12> as well. Also this can be target dependent. I suggest using SITargetLowering::isLegalGlobalAddressingMode().

1109 ↗(On Diff #177643)

Same here.

promote-constOffset-to-imm.ll
1 ↗(On Diff #177643)

I think we will need to regenerate these checks too often. Can you factor out just specific checks?

Can we have a mir test with more than two loads? I want to see a situation where 3 loads are foldable with the same offset, but lowest address is in the middle. I.e.:

load a[1000];
load a[800];
load a[1800];

Here we could compute &a[800] and use as a base for all three. If we start with &a[1000] that will not be possible. It is interesting to see if a number of address calculations will actually go down in this case, or at least will not increase.

The other interesting situation is when a negative offset can be used:

load a[1000];
load a[3000];
load a[5000];

In general we can set base to &a[3000] and use negative offset for the first load. I think your pass does not handle it, but it is worth a test to see an ISA.

FarhanaAleen retitled this revision from [AMDGPU] Promote offset to immediate. to [AMDGPU] Promote constant offset to the immediate by finding a new base with 13bit constant offset from the nearby instructions. .Dec 12 2018, 2:36 PM
FarhanaAleen marked an inline comment as done.

Updated with the reviewer's comments.

Can we have a mir test with more than two loads? I want to see a situation where 3 loads are foldable with the same offset, but lowest address is in the middle. I.e.:

Yes, I added two more mir tests called LowestInMiddle and NegativeDistance.

load a[1000];
load a[800];
load a[1800]; << I changed this to 1400 in order to fit 13bit immediate.

Here we could compute &a[800] and use as a base for all three. If we start with &a[1000] that will not be possible. It is interesting to see if a number of address calculations will actually go down in this case, or at least will not increase.

In the best case, we should consider &a[1000] as Anchor and compute the offset. So, it should generate:
%32:vreg_64 = GLOBAL_LOAD_DWORDX2 %138:vreg_64, -1600, 0, 0, implicit $exec :: (load 8 from %ir.addr1, addrspace 1)
%37:vreg_64 = GLOBAL_LOAD_DWORDX2 %138:vreg_64, 0, 0, 0, implicit $exec :: (load 8 from %ir.addr2, addrspace 1)
%42:vreg_64 = GLOBAL_LOAD_DWORDX2 %138:vreg_64, 3200, 0, 0, implicit $exec :: (load 8 from %ir.addr3, addrspace 1)

But the current heuristic traverse first-come-first serve basis. Starting from a[1000], it looks for the longest distanced base. It will consider &a[800] as an anchor and compute &a[1000] from the anchor but will miss &a[1400] because the distance between (a[1000], a[1400]) does not fit in 13bit immediate. But the heuristic will never increase instruction count.

The other interesting situation is when a negative offset can be used:

load a[1000];
load a[3000];
load a[5000];

In general we can set base to &a[3000] and use negative offset for the first load. I think your pass does not handle it, but it is worth a test to see an ISA.

The algorithm does handle negative distance. I added a mir test called NegativeDistance, I had to modify the index count in order to maintain the 13bit distance.

Why aren't these matched in the first place? These shouldn't have gotten this far

The offsets are promoted to immediate during the instruction selection if they are 13bit which is the allowed size for globals. This patch is trying to find a 13bit offset from base-address of the nearby instructions by recomputing the relative offset from nearby base address. So, it's more of a global optimization where it needs to traverse the whole program (currently it's limited to the basic block). The optimization also caches all the information to save the compile time. Are you suggesting to do this optimization during instruction selection phase? Wouldn't it be too complex to do during instruction selection?

OK, so this is the case where the offset > 13-bits and you are folding the extra low bits. We already do something similar during selection and put the low bits into the immediate offsets and materialize the high bits as a separate constant for some other loads (e.g. see SelectMUBUFScratchOffen). Usually the common high bits get CSEd in the DAG directly or during MachineCSE. Do you get the same result if you extend that to global/flat operations for this?

Also can you reword the commit message to make it clearer what the difference is from the current offset matching?

Why aren't these matched in the first place? These shouldn't have gotten this far

The offsets are promoted to immediate during the instruction selection if they are 13bit which is the allowed size for globals. This patch is trying to find a 13bit offset from base-address of the nearby instructions by recomputing the relative offset from nearby base address. So, it's more of a global optimization where it needs to traverse the whole program (currently it's limited to the basic block). The optimization also caches all the information to save the compile time. Are you suggesting to do this optimization during instruction selection phase? Wouldn't it be too complex to do during instruction selection?

OK, so this is the case where the offset > 13-bits and you are folding the extra low bits. We already do something similar during selection and put the low bits into the immediate offsets and materialize the high bits as a separate constant for some other loads (e.g. see SelectMUBUFScratchOffen). Usually the common high bits get CSEd in the DAG directly or during MachineCSE. Do you get the same result if you extend that to global/flat operations for this?

Also can you reword the commit message to make it clearer what the difference is from the current offset matching?

Why aren't these matched in the first place? These shouldn't have gotten this far

The offsets are promoted to immediate during the instruction selection if they are 13bit which is the allowed size for globals. This patch is trying to find a 13bit offset from base-address of the nearby instructions by recomputing the relative offset from nearby base address. So, it's more of a global optimization where it needs to traverse the whole program (currently it's limited to the basic block). The optimization also caches all the information to save the compile time. Are you suggesting to do this optimization during instruction selection phase? Wouldn't it be too complex to do during instruction selection?

OK, so this is the case where the offset > 13-bits and you are folding the extra low bits. We already do something similar during selection and put the low bits into the immediate offsets and materialize the high bits as a separate constant for some other loads (e.g. see SelectMUBUFScratchOffen). Usually the common high bits get CSEd in the DAG directly or during MachineCSE. Do you get the same result if you extend that to global/flat operations for this?

This a good approach and has a very low overhead. But without the information of available bases, this approach will generate some higher bits which won't get CSEd.

Also can you reword the commit message to make it clearer what the difference is from the current offset matching?

Yes, changed it.

lib/Target/AMDGPU/SILoadStoreOptimizer.cpp
986 ↗(On Diff #177643)

So, we need to store the return value of extractConstOffset which is used later and also checking if extractConstOffset returns null. But if the checking is done suggested way, it will not store the return value.

rampitec added inline comments.Dec 12 2018, 4:30 PM
lib/Target/AMDGPU/SILoadStoreOptimizer.cpp
1102 ↗(On Diff #177965)

80 chars per line.

1112 ↗(On Diff #177965)

80 chars per line.

test/CodeGen/AMDGPU/promote-constOffset-to-imm.ll
1 ↗(On Diff #177965)

We usually use -check-prefixes=GCN,GFX8 and then use a common GCN-LABEL.

Can you try implementing the other approach first, and then applying this on top of it to show the difference more clearly?

test/CodeGen/AMDGPU/promote-constOffset-to-imm.mir
219–222 ↗(On Diff #177965)

This test can be reduced a lot. For example you can also use -run-pass=none to compact the register numbers

230 ↗(On Diff #177965)

It would be better to avoid a call in a test that doesn't need to specifically test calls

Maintained 80 chars per line, added GCN-LABEL, reduced mir tests.

FarhanaAleen marked an inline comment as done.Dec 13 2018, 3:29 PM

Can you try implementing the other approach first, and then applying this on top of it to show the difference more clearly?

Sure, we can implement the other approach. But it will take some time.

And, I am not seeing any reason why this one needs to wait on the other one. If the benefit of this approach is not clear to you, I can give you an example extracted from a real world application. This approach takes into account of available base computations, therefore, can optimize away more computations.

load1 = load(&a + 4096)
load2 = load(&a + 6144)
load3 = load(&a + 8192)
load4 = load(&a + 10240)
load5 = load(&a + 12288)
load6 = load(&a + 14336)
load7 = load(&a + 16384)
load8 = load(&a + 18432)

The other approach will generate:
load1 = load(t1: &a + 4096, 0)
load2 = load(t1, 2048)

load3 = load(t2: &a + 8192, 0)
load4 = load(t2, 2048)

load5 = load(t3: &a + 12288, 0)
load6 = load(t3, 2048)

load7 = load(t4: &a + 16384, 0)
load8 = load(t4, 2048)

This patch will generate:
load1 = load(t1: &a + 8192, -4096)
load2 = load(t1, -2048)
load3 = load(t1, 0)
load4 = load(t1, 2048)

load5 = load(t2: &a + 16384, -4096)
load6 = load(t2, -2048)
load7 = load(t2: 0)
load8 = load(t2, 2048)

test/CodeGen/AMDGPU/promote-constOffset-to-imm.mir
230 ↗(On Diff #177965)

Removing the call does not generate the expected IR.

rampitec added inline comments.Dec 13 2018, 3:53 PM
test/CodeGen/AMDGPU/promote-constOffset-to-imm.mir
230 ↗(On Diff #177965)

Call to get_global_id shall never survive until MI. You probably need to redesign the test.

Removed calls from mir tests.

rampitec added inline comments.Dec 13 2018, 10:47 PM
test/CodeGen/AMDGPU/promote-constOffset-to-imm.mir
79 ↗(On Diff #178182)

Do you still need adjust stack? This is just a mir test running a single pass. It does not have to be complete. Even the store at the end is not needed, only relevant instructions.

Removed adjust stack and stores.

This revision is now accepted and ready to land.Dec 13 2018, 11:13 PM

Can you try implementing the other approach first, and then applying this on top of it to show the difference more clearly?

Sure, we can implement the other approach. But it will take some time.

And, I am not seeing any reason why this one needs to wait on the other one. If the benefit of this approach is not clear to you, I can give you an example extracted from a real world application. This approach takes into account of available base computations, therefore, can optimize away more computations.

I mean if you implement the simpler version first, you'll see the improvements in the test diff from the more complex version

test/CodeGen/AMDGPU/promote-constOffset-to-imm.mir
230 ↗(On Diff #177965)

You only care about loads and the addressing code, so you can construct a case that doesn't have a call

This revision was automatically updated to reflect the committed changes.