Page MenuHomePhabricator

Make SROA handle addrspacecast
Needs ReviewPublic

Authored by wengxt on Jun 16 2015, 12:21 PM.

Details

Reviewers
arsenm
Summary

This is basically D4501 from Matt with fixed testcase. In GPU related
world addrspacecast instructions are commonly used to avoid overhead
when access memory with generic load and store. SROA pass currently does
not take addrspaceCast into account. This patch uses the same way that SROA
handles bitcast to process addrspacecast.

Diff Detail

Event Timeline

wengxt updated this revision to Diff 27778.Jun 16 2015, 12:21 PM
wengxt retitled this revision from to Make SROA handle addrspacecast.
wengxt updated this object.
wengxt edited the test plan for this revision. (Show Details)
wengxt added reviewers: arsenm, chandlerc.
wengxt added subscribers: jingyue, Unknown Object (MLST).
sanjoy added a subscriber: sanjoy.Jun 16 2015, 1:20 PM
sanjoy added inline comments.
lib/Transforms/Scalar/SROA.cpp
1825

Will this transform a "gep(addrspacecast X), Y)" to "addrspacecast(gep X, Y')"? If so, I think we need a target specific hook that does a legality check on the addrspacecast we're about to insert -- I don't think "addrspacecast X" is legal implies an addrspacecast on a GEP derived from X is also legal.

chandlerc edited edge metadata.Jun 16 2015, 1:35 PM

I'm actually quite alarmed at the idea of needing to handle addrspacecast of *alloca*s. I feel like their existence is a bigger problem.

I'd like to understand the use case that GPU folks have for this better. For an un-escaped alloca, why are there addrspacecasts at all? Why would they be useful? Why couldn't you lower them with arbitrarily fast loads and stores due to being unescaped allocas?

lib/Transforms/Scalar/SROA.cpp
1825

The whole point of addrspacecast was to avoid the need for target specific checks... I think we just shouldn't transform GEPs around them if that isn't legal everywhere.

Regarding your question on why these addrspacecasts even exist, the short answer is that emitting addrspacecasts after alloca eases NVPTX's address space inference which lowers access to special address spaces into fast loads/stores.

The long answer requires some background on CUDA and how the NVPTX backend works. I tried to make it brief. Let me know if it makes sense.

First of all, address spaces are not included in CUDA's type system. All pointers in a CUDA source code are generic. Only when a variable is declared, the programmer can annotate the address space that this variable lives in. For example, __shared__ int *a = xxx; means a, the pointer itself, lives in the shared address space rather than a points to an int that lives in the shared space.

So, LLVM's middle-end and/or the NVPTX backend bear the burden of optimizing the memory accesses to variables in special address spaces. Given an address-taken variable annotated with a special address space, the clang front-end simply casts the resultant address to generic (i.e. addrspace(0)) right after the declaration. The optimizer needs to eliminate unnecessary addrspacecasts and emits fast loads/stores.

Currently, the pass that performs this address space optimization in NVPTX is NVPTXFavorNonGenericAddrSpaces. It works by pushing addrspacecast (from special to generic) towards loads and stores as far as possible. After that, the codegen can simply lower load/store <ty>, <ty> addrspace(x)* to fast loads/stores in machine code. For example, if a function reads from a shared variable, the IR clang emits without optimization looks like

@a = addrspace(3) global float
define ... {
  %1 = addrspacecast float addrspace(3)* @a to float* ; cast from special to generic
  %2 = load float, float* %1
  ...
}

then NVPTXFavorNonGenericAddrSpaces optimizes it to

@a = addrspace(3) global float
define ... {
  %2 = load float, float addrspace(3)* @a
  ...
}

and finally NVPTX's codegen lowers this "load from shared" to a ld.shared.f32 instruction.

This pass works well for GlobalVariables whose address spaces are explicitly annotated. However, alloca always returns addrspace(0) pointers, so the clang front-end doesn't (need to) emit any explicit addrspacecast from local to generic in the initial IR. This creates troubles for the address space inference, because NVPTXFavorNonGenericAddrSpaces has no addrspacecasts to push for accesses to the local address space.

To address this issue, we plan to run another pass called NVPTXLowerAlloca (D10483) that emits two addrspacecasts after an alloca, one casting generic to local and the other casting local back to generic. For example,

define ...() {
  %1 = alloca float
  store float 0, float* %1
}

becomes

define ...() {
  %1 = alloca float
  %2 = addrspacecast float* %1 to float addrspace(4)*
  %3 = addrspacecast float addrspace(4)* %2 to float*
  store float 0, float* %3
}

Then, NVPTXFavorNonGenericAddrSpaces pushes the second addrspacecast as usual towards the store.

define ...() {
  %1 = alloca float
  %2 = addrspacecast float* %1 to float addrspace(4)*
  store float 0, float addrspace(4)* %2
}

This is why NVPTX emits addrspacecasts after alloca at some stage.

If having SROA to handle addrspacecast of allocas is too alarming, I think the easiest way to fix the issue on our side is to run SROA between NVPTXLowerKernelArgs (another address space inference pass, which must run before SROA) and NVPTXLowerAlloca. It will work in the short term. However, Justin Holewinski and I have some concerns on how much we depend on correctly ordering these address-space inference passes, and think we should merge all of them into one big pass. If we want that merging to happen, we won't be able to run SROA in the middle. Still, workarounds exist such as merging most of the address space inference and only running a small part before SROA.

lib/Transforms/Scalar/SROA.cpp
1825

Thanks for pointing this out, Sanjoy. I understand your concerns. Curiously, on which existing targets such transformation from gep(addrspacecast X), Y to addrspacecast(gep X, Y) is illegal? Just asking so that I can keep a concrete example in mind when I later write target-independent optimizations on addrspacecast.

arsenm edited edge metadata.Jun 17 2015, 2:46 PM

The last time this came up, I was able to get instcombine to eliminate trivial addrspacecasts of alloca which worked just as well. I just never got around to actually testing / posting the patch. The main piece was just removing an outdated check for the address space mismatch from before addrspacecast existed

sanjoy added inline comments.Jun 17 2015, 3:36 PM
lib/Transforms/Scalar/SROA.cpp
1825

In practice, I think gep(addrspacecast X), Y and addrspacecast(gep X, Y) should compute the same result in all existing targets. However, this caps how complex an addrspacecast implementation can be. For instance, targets won't be able to implement addrspacecast as

addrspacecast(P) { return abs(P) }

since with this change, gep(addrspacecast(-1), 2) != addrspacecast(gep -1, 2).

I'm okay with moving ahead with this gep transform as long as we change the langref to reflect that addrspacecast instructions cannot be arbitrarily complex value modifications (so that an abs implementation for addrspacecast can be ruled out).

arsenm resigned from this revision.Jun 3 2019, 11:14 AM