This is an archive of the discontinued LLVM Phabricator instance.

[NVPTX] Drop memory references of LDG/LDU
ClosedPublic

Authored by asavonic on Oct 25 2021, 9:56 AM.

Details

Summary

This patch fixes machine verifier errors:

    *** Bad machine code: Missing mayLoad flag ***
    - function:    foo1
    - basic block: %bb.0  (0x5560fc64ef08)
    - instruction: %4:float32regs =
	INT_PTX_LDG_GLOBAL_f32areg64 killed %3:int64regs
	:: (load (s32) from %ir.from1, addrspace 1)

LDG/LDU should not be treated as loads because they operates on
read-only memory, and therefore these instructions should not have mayLoad flag.
Machine verifier checks for memoperands to determine whether an instruction is
a load, so dropping them during lowering fixes the problem.

Diff Detail

Event Timeline

asavonic created this revision.Oct 25 2021, 9:56 AM
asavonic requested review of this revision.Oct 25 2021, 9:56 AM
Herald added a project: Restricted Project. · View Herald TranscriptOct 25 2021, 9:56 AM

It appears that mayLoad was omitted for a reason, but I cannot figure out why. If a load is done from a read-only memory, it should still be treated as a load, right? Is there anything that requires special handling for these instructions?

Git blame shows that these comments were introduced by @jlebar in https://reviews.llvm.org/D17471, but there are no details either, so maybe I'm missing something.

It appears that mayLoad was omitted for a reason, but I cannot figure out why. If a load is done from a read-only memory, it should still be treated as a load, right?

I think the notion is, if the load is from read-only memory that always has the same value for the entirety of the program, in what way is it like a "load", as opposed to e.g. "move immediate into register"?

You can safely reorder these instructions with any other memory operation.

Definitely a hack, and maybe there is today a better way of representing this in LLVM.

Loads from constant memory that is known dereferenceable are not modeled as loads/reads in the IR, e.g., we derive readnone in their presence.
At least if you have "constant mem" TBAA annotations or some other encodings that inform the analyses about the constness of the memory.
That said, if the verifier complains we need to teach the verifier or avoid this. I am inclined to suggest the former if there is no reason not to.

asavonic updated this revision to Diff 382062.Oct 25 2021, 11:16 AM

That makes sense, thanks a lot for the explanation!
Machine verifier checks for memoperands to determine whether an instruction is a load or a store. Perhaps we can just drop them for LDG/LDU if we don't want them to be counted as loads?

I cannot really comment on the verifier change.

asavonic retitled this revision from [NVPTX] Annotate LDG/LDU instructions as mayLoad to [NVPTX] Drop memory references of LDG/LDU.Nov 2 2021, 5:36 AM
asavonic edited the summary of this revision. (Show Details)
tra added a comment.Nov 2 2021, 11:52 AM

While ldu does indeed specify that it loads from read-only memory, I do not think we can treat ld.global.nc the same way.
PTX spec says Load register variable d from the location specified by the source address operand a in the global state space, and optionally cache in non-coherent texture cache. Since the cache is non-coherent, the data should be read-only within the kernel's process.

The way I read it -- it's a regular load that bypasses cache. Unlike ldu, it does not specify that it's a read-only data. While read-only data will make it work correctly, it's not the only valid use case.
E.g. data may be changed from the host and two subsequent ld.global.nc will return different values.

While ldu does indeed specify that it loads from read-only memory, I do not think we can treat ld.global.nc the same way.
PTX spec says Load register variable d from the location specified by the source address operand a in the global state space, and optionally cache in non-coherent texture cache. Since the cache is non-coherent, the data should be read-only within the kernel's process.

The way I read it -- it's a regular load that bypasses cache.

I'm not sure, but I think the spec says that the load is cached:
Load [...] from the location [...] in the global state space, and optionally cache in non-coherent texture cache.
The problem is that the cache is non-coherent and optional.

Unlike ldu, it does not specify that it's a read-only data. While read-only data will make it work correctly, it's not the only valid use case.
E.g. data may be changed from the host and two subsequent ld.global.nc will return different values.

You mean something like this, right?

ld.global.nc r1, [a]
st.global [a], r2           ; may be updated from host?
ld.global.nc r3, [a]

I think the result is undefined: r1 and r3 can be equal if the first instruction cached the result. Otherwise, since the cache is optional, r2 and r3 can be equal.
However, what happens for this code?

st.global [a], r4
ld.global.nc r5, [a]

Provided that a is not cached before the store, ld.global.nc should always load the stored value, right?
If this is correct, then ld.global.nc can behave like a "load" in some cases, and therefore should have mayLoad flag.

hliao added a subscriber: hliao.Nov 3 2021, 8:12 AM

These machine instruction should be marked with mayLoad as they are really load instructions. Check the attached patch fixing the machine verifier.

These machine instruction should be marked with mayLoad as they are really load instructions. Check the attached patch fixing the machine verifier.

This is where we started from in this review request (see the revision 382033).
I think the consensus is that LDU is not a load, because it operates on read-only memory (and therefore can be safely excluded from memory-related analysis).
I'm not sure about LDG though.

tra added a comment.Nov 3 2021, 10:23 AM

While ldu does indeed specify that it loads from read-only memory, I do not think we can treat ld.global.nc the same way.
PTX spec says Load register variable d from the location specified by the source address operand a in the global state space, and optionally cache in non-coherent texture cache. Since the cache is non-coherent, the data should be read-only within the kernel's process.

The way I read it -- it's a regular load that bypasses cache.

I'm not sure, but I think the spec says that the load is cached:
Load [...] from the location [...] in the global state space, and optionally cache in non-coherent texture cache.
The problem is that the cache is non-coherent and optional.

ld.global.nc is an odd instruction.

The optionally cache in texture cache applies only to the new variant of the instruction which specifies caching hints.

ld.global.nc{.level::eviction_priority}{.level::cache_hint}.type      d, [a]{, cache-policy};
ld.global.nc{.level::eviction_priority}{.level::cache_hint}.vec.type  d, [a]{, cache-policy};
...
Support for .level::eviction_priority and .level::cache_hint qualifiers introduced in PTX ISA version 7.4.
Support for .level::eviction_priority qualifier requires sm_70 or higher.
Support for .level::cache_hint qualifier requires sm_80 or higher.

I think the optionally-caching variants may have to be treated as separate instructions and those would be closer to ldu functionality-wise.

Unlike ldu, it does not specify that it's a read-only data. While read-only data will make it work correctly, it's not the only valid use case.
E.g. data may be changed from the host and two subsequent ld.global.nc will return different values.

You mean something like this, right?

ld.global.nc r1, [a]
st.global [a], r2           ; may be updated from host?

I mean that host could do cudaMemcpy(host->device, a, host_ptr, N).

ld.global.nc r3, [a]

I think the result is undefined: r1 and r3 can be equal if the first instruction cached the result. Otherwise, since the cache is optional, r2 and r3 can be equal.

ld.global.nc does no caching. If host copies something into 'a' between two loads, r1 and r3 will be different.

However, what happens for this code?

st.global [a], r4
ld.global.nc r5, [a]

Provided that a is not cached before the store, ld.global.nc should always load the stored value, right?
If this is correct, then ld.global.nc can behave like a "load" in some cases, and therefore should have mayLoad flag.

Result of this code is undefined. In general, writes are posted and the data may not make it to the memory by the time ld.global.nc attempts to read it.
According to PTX docs memory consistency does not apply to ld.global.nc: https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#scope-and-applicability

The memory consistency model does not apply to texture (including ld.global.nc) and surface accesses.

My example of host/device accesses may be a bad one. It would equally affect regular loads/stores, too, and right now we generally assume that only GPU can change memory.
If that's the case, then ld.global.nc should not have mayLoad -- we can't guarantee consistency with any store ops anyways, so there's no point affecting their order.

hliao added a comment.Nov 3 2021, 11:27 AM

These machine instruction should be marked with mayLoad as they are really load instructions. Check the attached patch fixing the machine verifier.

This is where we started from in this review request (see the revision 382033).
I think the consensus is that LDU is not a load, because it operates on read-only memory (and therefore can be safely excluded from memory-related analysis).
I'm not sure about LDG though.

Memory analysis already knows any loads from constant memory won't inference with other loads or stores (pointsToConstantMemory). Loads from constant memory are still a load. They needs to be marked with mayLoad.

These machine instruction should be marked with mayLoad as they are really load instructions. Check the attached patch fixing the machine verifier.

This is where we started from in this review request (see the revision 382033).
I think the consensus is that LDU is not a load, because it operates on read-only memory (and therefore can be safely excluded from memory-related analysis).
I'm not sure about LDG though.

Memory analysis already knows any loads from constant memory won't inference with other loads or stores (pointsToConstantMemory). Loads from constant memory are still a load. They needs to be marked with mayLoad.

It seems that pointsToConstantMemory is not implemented for NVPTX, so we'll have fix that.
@jlebar, does this sound like a better solution?

It seems that pointsToConstantMemory is not implemented for NVPTX, so we'll have fix that.

This is a function of AliasAnalysis and if I understand correctly there is no NVPTX-specific implementation of AA.
AMDGPU implementation has a set of rules based on pointer address spaces (e.g. pointers in global and private AS do not alias).
Does it make sense to add similar logic for NVPTX?

tra added a comment.Nov 10 2021, 12:16 PM

Does it make sense to add similar logic for NVPTX?

Yes. We can make similar no-aliasing conclusions for shared/constant/global/local/param memories.

Sorry, this got pushed back by other things, but I'd like to return to this conversation. I've uploaded a patch to add a simple AA that takes NVPTX address spaces into account and implements pointsToConstantMemory (D124787). This patch should be useful on its own, but I'm no longer sure that it is enough to safely (performace-wise) add mayLoad to LDU/LDG instructions.

The problem is that AA::pointsToConstantMemory works with LLVM IR Value (MemoryLocation), and optimizations in CodeGen may not be able to use it or it is just not enabled yet. There are also cases where heuristics treat loads separately (BranchFolding, RegAllocScore), and it is not clear if we can change this and treat constant loads as regular instructions for all targets.

I'll go over these cases in a couple of days, but for now it seems like the current patch (drop memory references) is the safest option.

Herald added a project: Restricted Project. · View Herald TranscriptMay 2 2022, 10:58 AM

I think the current patch is the way to go: we want to treat LDU and LDG as regular instructions and exclude them from memory analysis. Therefore memory references are not needed and we can drop them.

pointsToConstantMemory or other AA functions do not seem to cover this case, because they operate on memory locations (pointers), and not on instructions themselves. Pointer operands for LDU and LDG are from global address space, and if I'm reading the PTX spec right, LDU can also load from generic address space.

tra added a comment.Jan 11 2023, 12:26 PM

LGTM in principle.

llvm/lib/Target/NVPTX/NVPTXISelDAGToDAG.cpp
1660–1661

The function also handles regular loads/stores (e.g. NVPTXISD::LoadV4) and those should still mark the node as a memory reference.

asavonic added inline comments.Jan 12 2023, 4:40 AM
llvm/lib/Target/NVPTX/NVPTXISelDAGToDAG.cpp
1660–1661

Can you elaborate why we need to treat these nodes separately? A regular load can be lowered to LDG if it matches canLowerToLDG, but the final instruction is still LDG (aka ld.global.nc).

tra accepted this revision.Jan 12 2023, 10:41 AM
tra added inline comments.
llvm/lib/Target/NVPTX/NVPTXISelDAGToDAG.cpp
1660–1661

Sorry, my mistake. I only looked at the case constant and didn't pay attention that we're lowering it into an LDG instruction.

This revision is now accepted and ready to land.Jan 12 2023, 10:41 AM
This revision was landed with ongoing or failed builds.Jan 14 2023, 1:18 PM
This revision was automatically updated to reflect the committed changes.