This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU][TargetMachine] Handle case when +extended-image-insts is set, and the user forces +wave64
Needs RevisionPublic

Authored by jmmartinez on Aug 23 2023, 5:35 AM.

Details

Reviewers
arsenm
Pierre-vh
Summary

Some functions from device_libs have the attribute
"target-features"="+extended-image-insts".

On targets that default to wave32, if wave64 is forced by the user,
the wave64 feature is dropped when initializing the subtarget because
the "target-features" attribute is already set.

This results in functions marked with "target-features"="+extended-image-insts"
being compiled as wave32, although wave64 was requested.

This patch is a workaround this issue.

If "target-features" is equal to "+extended-image-insts", the global and
function features are concatenated.

In the general case, we cannot just concatenate the global and function
features since they may be incompatible: The feature
"+wavefrontsize32,+wavefrontsize64" results in 64 as wavefrontsize.

Related to SWDEV-410182.

Diff Detail

Event Timeline

jmmartinez created this revision.Aug 23 2023, 5:35 AM
Herald added a project: Restricted Project. · View Herald TranscriptAug 23 2023, 5:35 AM
jmmartinez requested review of this revision.Aug 23 2023, 5:35 AM
Herald added a project: Restricted Project. · View Herald TranscriptAug 23 2023, 5:35 AM
jmmartinez added inline comments.Aug 23 2023, 5:39 AM
llvm/test/CodeGen/AMDGPU/extended-image-insts-wave32-wave64.ll
24

Simplify the test case, I'm pretty sure it doesn't need this

arsenm requested changes to this revision.Aug 23 2023, 6:33 AM

I think changing the behavior of the global subtarget is a bad idea. We shouldn't be relying on the global subtarget in the first place, much less selectively changing the interpretation from "set default" to "append".

Using one set of IR functions and pretending they work for both wave sizes is a bad idea and simply not working. You're running into the consequences of several layers of hacks in device libs, comgr and clang to pretend this works. The wave size behaves more like a different target-cpu, rather than a feature you can simply union. I increasingly think we should move the wavefront size out from target-features and into a separate attribute (or introduce a set of wave32 calling conventions).

There may be several bugs here, and I don't think any of them should be addressed in the backend. First, last I knew comgr was not using clang/-mlink-builtin-bitcode (although maybe this was fixed?). It doesn't matter because I just performed an experiment and found that's also broken:

// builtins.cl 
// clang -O3 -target amdgcn-amd-amdhsa -c -emit-llvm -o builtin.bc builtin.cl

__attribute__((target("extended-image-insts")))
void uses_images(void) {

}

void uses_nothing(void) {

}

// clang -target amdgcn-amd-amdhsa -mcpu=gfx1031 -mwavefrontsize64 -S -emit-llvm -Xclang -mlink-builtin-bitcode -Xclang builtin.bc  -o - builtin-user.cl  -O0
// builtin-user.cl
extern void uses_images(void);
extern void uses_nothing(void);

void user_images(void) {
    uses_images();
}

void user_nothing(void) {
    uses_nothing();
}

This is broken, because the imported builtin didn't append the feature to uses_images:

define internal void @uses_images() #1 {
  ret void
}

; Function Attrs: convergent mustprogress nofree norecurse nosync nounwind willreturn memory(none)
define internal void @uses_nothing() #2 {
  ret void
}

attributes #1 = { convergent mustprogress nofree norecurse nosync nounwind willreturn memory(none) "frame-pointer"="all" "no-trapping-math"="true" "stack-protector-buffer-size"="8" "target-cpu"="gfx1031" "target-features"="+extended-image-insts" }
attributes #2 = { convergent mustprogress nofree norecurse nosync nounwind willreturn memory(none) "frame-pointer"="all" "no-trapping-math"="true" "stack-protector-buffer-size"="8" "target-cpu"="gfx1031" }

@uses_nothing isn't even getting the +wavefrontsize64, which is doubly broken. I believe this was working correctly at one point, so did this regress? clang is broken here, so that should be fixed first. If comgr is still not correctly using -mlink-builtin-bitcode, it should implement an approximately equivalent hack to append the features if there are still blockers to doing so.

llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
585–587

There's nothing special about extended-image-insts to warrant special casing it here. Don't really understand how it connects to the original failure

llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.h
34

Feature string shouldn't require adjustment

llvm/test/CodeGen/AMDGPU/extended-image-insts-wave32-wave64.ll
2

Don't need quotes or -verify-machineinstrs

This revision now requires changes to proceed.Aug 23 2023, 6:33 AM

First, last I knew comgr was not using clang/-mlink-builtin-bitcode (although maybe this was fixed?).
[...]
@uses_nothing isn't even getting the +wavefrontsize64, which is doubly broken. I believe this was working correctly at one point, so did this regress? clang is broken here, so that should be fixed first. If comgr is still not correctly using -mlink-builtin-bitcode, it should implement an approximately equivalent hack to append the features if there are still blockers to doing so.

Thanks for pointing this out. Comgr uses the Linker::linkInModule function directly. I wasn't aware that there was a difference, but I see now that using -mlink-builtin-bitcode does Gen->CGM().addDefaultFunctionDefinitionAttributes(F); and does internalization for every function in the builtin bitcode. Linker::linkInModule only keeps the attributes in the definition coming in from the builtin bitcode and ignores those in the declaration.

However, there is a TODO in that function...

void CodeGenModule::addDefaultFunctionDefinitionAttributes(llvm::Function &F) {
  llvm::AttrBuilder FuncAttrs(F.getContext());
  getDefaultFunctionAttributes(F.getName(), F.hasOptNone(),
                               /* AttrOnCallSite = */ false, FuncAttrs);
  // TODO: call GetCPUAndFeaturesAttributes?
  F.addFnAttrs(FuncAttrs);
}

If I set up a call to GetCPUAndFeaturesAttributes(GlobalDecl(), F) any target-features present in the definition are overriden (so it overrides the +extended-image-insts).


I have one question though. Why device_libs has that target specific attribute? Shouldn't it be deduced later in the optimization pipeline since the function has calls to image intrinsics?

I have one question though. Why device_libs has that target specific attribute? Shouldn't it be deduced later in the optimization pipeline since the function has calls to image intrinsics?

This isn't an optimization. It cannot be inferred. This is supposed to be a restriction. Programs trying to use this on targets without the underlying instructions should be rejected

Adding Pierre as he worked on this recently as well.