Page MenuHomePhabricator

[AMDGPU] Add remarks to output some resource usage
ClosedPublic

Authored by vangthao on Apr 15 2022, 3:19 PM.

Details

Summary

Add analyis remarks to output kernel name, register usage, occupancy,
scratch usage, spills, and LDS information.

Diff Detail

Event Timeline

vangthao created this revision.Apr 15 2022, 3:19 PM
Herald added a project: Restricted Project. · View Herald TranscriptApr 15 2022, 3:19 PM
vangthao requested review of this revision.Apr 15 2022, 3:19 PM
Herald added a project: Restricted Project. · View Herald TranscriptApr 15 2022, 3:19 PM
arsenm added inline comments.Apr 15 2022, 4:01 PM
llvm/lib/Target/AMDGPU/AMDGPUAsmPrinter.cpp
511

Needs to skip this whole block if !ORE? Also should move all of this to a separate function

513

Define the string name somewhere to avoid repeating it everywhere

llvm/lib/Target/AMDGPU/SIProgramInfo.h
52

This isn't a spill size

jtramm added a subscriber: jtramm.Apr 18 2022, 7:52 AM
scott.linder added inline comments.Apr 18 2022, 3:48 PM
llvm/lib/Target/AMDGPU/AMDGPUAsmPrinter.cpp
596

This seems like the wrong place to segment the output; could this be made a part of the emitter itself?

Or maybe better yet, could all of these values be collected into a single remark? That seems to be how e.g. llvm/lib/CodeGen/RegAllocGreedy.cpp does it:

Pass:            regalloc                                                                                Name:            SpillReloadCopies                                                                       Function:        f                                                                                       Args:                                                                                                      - NumSpills:       '1'                                                                                   - String:          ' spills '                                                                            - TotalSpillsCost: '1.000000e+00'                                                                        - String:          ' total spills cost '                                                                 - NumReloads:      '1'                                                                                   - String:          ' reloads '                                                                           - TotalReloadsCost: '1.000000e+00'                                                                       - String:          ' total reloads cost '                                                                - String:          generated in function
vangthao updated this revision to Diff 424344.Apr 21 2022, 5:33 PM

Move remarks into its own function. Skip if !ORE. Add clang frontend test. Remove LDSSpillSize.

Herald added a project: Restricted Project. · View Herald TranscriptApr 21 2022, 5:33 PM
Herald added a subscriber: cfe-commits. · View Herald Transcript
vangthao added inline comments.Apr 21 2022, 5:40 PM
llvm/lib/Target/AMDGPU/AMDGPUAsmPrinter.cpp
596

We also want to output this in a readable format for the frontend. Collecting it all into a single remark seems to break the output format since clang seems to ignore all newlines from a diagnostic remark.

scott.linder added inline comments.Apr 26 2022, 10:45 AM
llvm/lib/Target/AMDGPU/AMDGPUAsmPrinter.cpp
596

Rather than use newlines, RegAllocGreedy uses spaces; we can debate aesthetics, but I feel like we should have a compelling reason before we choose our own format.

For example, with RegAllocGreedy you can get output along the lines of:

foo/bar/baz.cpp:42:1: remark: 10 spills 100 total spills cost 7 folded spills 70 total folded spills cost 22 reloads 44 total reloads cost 77 folded reloads 120 total folded reloads cost 1 zero cost folded reloads 78 virtual registers copies 111 total copies cost
void foo()
^

RegAllocFast appears to be the original use-case that caused the machine remarks to be invented, and it has been updated recently (https://reviews.llvm.org/D100020) so I don't suspect this is just some legacy cruft.

If we follow the same approach we get something like:

void AMDGPUAsmPrinter::emitResourceUsageRemarks(
    const MachineFunction &MF, const SIProgramInfo &CurrentProgramInfo) {
  if (!ORE)
    return;

  ORE->emit([&]() {
    return MachineOptimizationRemarkAnalysis(
               "kernel-resource-usage", "ResourceUsage",
               MF.getFunction().getSubprogram(), &MF.front())
           << ore::NV("NumSGPR", CurrentProgramInfo.NumSGPR) << " SGPRs "
           << ore::NV("NumVGPR", CurrentProgramInfo.NumArchVGPR) << " VGPRs "
           << ore::NV("NumAGPR", CurrentProgramInfo.NumAccVGPR) << " AGPRs "
           << ore::NV("ScratchSize", CurrentProgramInfo.ScratchSize)
           << " scratch bytes/thread "
           << ore::NV("Occupancy", CurrentProgramInfo.Occupancy)
           << " occupancy waves/SIMD "
           << ore::NV("SGPRSpill", CurrentProgramInfo.SGPRSpill)
           << " SGPR spills "
           << ore::NV("VGPRSpill", CurrentProgramInfo.VGPRSpill)
           << " VGPR spills " << ore::NV("BytesLDS", CurrentProgramInfo.LDSSize)
           << " LDS size bytes/block ";
  });                                                                                                                                                                                                                                                                   }

which produces:

clang/test/Frontend/amdgcn-machine-analysis-remarks.cl:14:1: remark: 9 SGPRs 10 VGPRs 12 AGPRs 0 scratch bytes/thread 10 occupancy waves/SIMD 0 SGPR spills 0 VGPR spills 0 LDS size bytes/block  [-Rpass-analysis=kernel-resource-usage]
__kernel void foo() {
^

That seems reasonable to me, and avoids bloating the other formats like YAML with extra remarks, including some that have no actual content (i.e. the "KernelName" and "KernelEnd" remarks are meaningless).

Of course I'm just one opinion, and if others prefer the several-remark approach I'm fine with it.

1185–1273
arsenm added inline comments.Apr 26 2022, 2:36 PM
llvm/lib/Target/AMDGPU/AMDGPUAsmPrinter.cpp
1185–1273

I'm sort of surprised the all in one form doesn't come out in something more parsable? It might be worth looking at IR level remarks, since there's probably more usage of them. Are there any existing uses in clang that do something meaningful by parsing out different parts?

vangthao added inline comments.Apr 29 2022, 2:00 PM
llvm/lib/Target/AMDGPU/AMDGPUAsmPrinter.cpp
1185–1273

We could check if the specific remark is enabled to avoid cluttering YAML output:

const char *Name = "kernel-resource-usage";
LLVMContext &Ctx = MF.getFunction().getContext();
if (!Ctx.getDiagHandlerPtr()->isAnalysisRemarkEnabled(Name))
  return;

I do not think using spaces to format the output will work for us. Most of the IR level remarks seems to be related specifically to their pass and/or optimization thus are quite short so readability is not as impacted. We are outputting a decent chunk of information and need some readability for the user.

Parsing out different parts like this is mostly a workaround for clang ignoring newlines. I am not aware of any other uses doing it like this since many of them are short and uses spaces for formatting.

arsenm added inline comments.Apr 29 2022, 2:45 PM
llvm/lib/Target/AMDGPU/AMDGPUAsmPrinter.cpp
1229

Probably should not use the word thread. lane?

1250

Unrelated but we should reimplement VGPRSpill. It's now reporting number of lowered spill instructions rather than number of spilled values

Even with newlines forced via extra remarks, I'm not a big fan of the "-----------------------" remark; it doesn't interact well with other random remarks in the output, for example when I enable all remarks using the pattern '.*' I see:

remark: foo.cl:27:0: AMDGPU DAG->DAG Pattern Instruction Selection: Function: test_kernel: MI Instruction count changed from 0 to 6; Delta: 6
remark: foo.cl:27:0: 0 stack bytes in function
remark: foo.cl:42:0: AMDGPU DAG->DAG Pattern Instruction Selection: Function: test_func: MI Instruction count changed from 0 to 4; Delta: 4
remark: foo.cl:42:0: 0 stack bytes in function
remark: foo.cl:42:0: SI insert wait instructions: Function: test_func: MI Instruction count changed from 4 to 5; Delta: 1
remark: foo.cl:8:0: AMDGPU DAG->DAG Pattern Instruction Selection: Function: empty_kernel: MI Instruction count changed from 0 to 1; Delta: 1
remark: foo.cl:8:0: 0 stack bytes in function
remark: foo.cl:52:0: AMDGPU DAG->DAG Pattern Instruction Selection: Function: empty_func: MI Instruction count changed from 0 to 1; Delta: 1
remark: foo.cl:52:0: 0 stack bytes in function
remark: foo.cl:52:0: SI insert wait instructions: Function: empty_func: MI Instruction count changed from 1 to 2; Delta: 1
remark: <unknown>:0:0: BasicBlock:
: 2

remark: foo.cl:27:0: 6 instructions in function
remark: foo.cl:27:0: Kernel Name: test_kernel
remark: foo.cl:27:0: SGPRs: 24
remark: foo.cl:27:0: VGPRs: 9
remark: foo.cl:27:0: AGPRs: 43
remark: foo.cl:27:0: ScratchSize [bytes/thread]: 0
remark: foo.cl:27:0: Occupancy [waves/SIMD]: 5
remark: foo.cl:27:0: SGPRs Spill: 0
remark: foo.cl:27:0: VGPRs Spill: 0
remark: foo.cl:27:0: LDS Size [bytes/block]: 512
remark: foo.cl:27:0: ------------------------------
remark: <unknown>:0:0: BasicBlock:
: 2

remark: foo.cl:42:0: 5 instructions in function
remark: foo.cl:42:0: Kernel Name: test_func
remark: foo.cl:42:0: SGPRs: 0
remark: foo.cl:42:0: VGPRs: 0
remark: foo.cl:42:0: AGPRs: 0
remark: foo.cl:42:0: ScratchSize [bytes/thread]: 0
remark: foo.cl:42:0: Occupancy [waves/SIMD]: 0
remark: foo.cl:42:0: SGPRs Spill: 0
remark: foo.cl:42:0: VGPRs Spill: 0
remark: foo.cl:42:0: ------------------------------
remark: <unknown>:0:0: BasicBlock:
: 1

If we do keep the delimiter remarks, can we have one at the beginning as well? At least then other remarks don't appear to "bleed" into the new block of remarks.

llvm/lib/Target/AMDGPU/AMDGPUAsmPrinter.cpp
1185–1273

Are there any existing uses in clang that do something meaningful by parsing out different parts?

I would expect this sort of thing to happen before it gets serialized to an unstructured string, or for it to happen on a structured output like YAML.

Parsing out different parts like this is mostly a workaround for clang ignoring newlines.

Can we fix clang to respect newlines?

arsenm added inline comments.May 2 2022, 10:56 AM
llvm/lib/Target/AMDGPU/AMDGPUAsmPrinter.cpp
1271

Get rid of the ——. It’s not a remark and only kind of makes sense if you are printing these with others

afanfa added a subscriber: afanfa.May 11 2022, 8:58 AM

If possible, I would like to keep some kind of delimiter. I like the idea of having it at the beginning and at the end of the section. The best option would be to convince clang to print new lines.

If possible, I would like to keep some kind of delimiter. I like the idea of having it at the beginning and at the end of the section. The best option would be to convince clang to print new lines.

It seems like the stripping of non-printable characters is intentional, but only for diagnostics that don't use the TableGen based diagnostic formatting scheme in clang. I'm not sure exactly why, but regardless there is precedent for newlines in other messages.

To get the newlines to hit the terminal for our case the following patch is enough:

--- a/clang/lib/Basic/Diagnostic.cpp
+++ b/clang/lib/Basic/Diagnostic.cpp
@@ -812,7 +812,7 @@ FormatDiagnostic(const char *DiagStr, const char *DiagEnd,
       getArgKind(0) == DiagnosticsEngine::ak_std_string) {                                                                                                                                                                                                                  const std::string &S = getArgStdStr(0);                                                                                                                                                                                                                                 for (char c : S) {
-      if (llvm::sys::locale::isPrint(c) || c == '\t') {
+      if (llvm::sys::locale::isPrint(c) || c == '\t' || c == '\n') {
         OutStr.push_back(c);
       }
     }

To get the right indentation inserted for the extra lines the TextDiagnostic consumer needs to also be updated, but that should be a small change. Only one test breaks, and it seems useful for it to pick up the new behavior anyway.

The only other potential conflict is that in TextDiagnostic specifically there is also the Clang option -fmessage-length=N which will forcibly wrap diagnostic messages on word-boundaries (although never breaking up a word across lines). It seems to only apply to text preceding the first newline in the message, so it is likely just a non-issue?

I am not sure if allowing clang to accept newlines is a good idea. It seems like clang wants to know what type of message is being outputted. For example whether this is a remark, warning, etc. but allowing for a diagnostic to output their own newline makes it ambiguous where exactly that output is coming from.

I am not sure if allowing clang to accept newlines is a good idea. It seems like clang wants to know what type of message is being outputted. For example whether this is a remark, warning, etc. but allowing for a diagnostic to output their own newline makes it ambiguous where exactly that output is coming from.

It already supports newlines in any diagnostic which doesn't use the trivial format string "%0", and at least clang/test/Misc/diag-line-wrapping.cpp explicitly tests this behavior.

It seems reasonable that clang could add a prefix or indentation scheme while emitting multi-line diagnostics to the terminal, to help with the ambiguity issue. For example instead of the current output:

clang/test/Misc/diag-line-wrapping.cpp:8:14: error: non-static member 'f' found in multiple base-class subobjects of type 'B':
    struct DD -> struct D1 -> struct B
    struct DD -> struct D2 -> struct B

Maybe we could have something like:

clang/test/Misc/diag-line-wrapping.cpp:8:14: error: ...
...: non-static member 'f' found in multiple base-class subobjects of type 'B':
...:     struct DD -> struct D1 -> struct B
...:     struct DD -> struct D2 -> struct B

I don't know if changing this kind of output is a breaking change, though? I do know some tooling parses this output.

If possible, I would like to keep some kind of delimiter. I like the idea of having it at the beginning and at the end of the section. The best option would be to convince clang to print new lines.

But it’s not a section and no actual grouping concept here. You just happen to see this printed in order. Any delimiter should be introduced as a display function, not emitted as part of the remarks themselves

ormris removed a subscriber: ormris.May 16 2022, 10:55 AM
vangthao updated this revision to Diff 440447.Jun 27 2022, 5:46 PM

Remove "--------" delimiter. Change ScratchSize [bytes/thread] to ScratchSize [bytes/lane]. Use lambda expression to emit remarks. Do not output yaml if specific remark is not enabled. Add indentation to make it easier to tell which resource usage remark belong to which kernel.

I will let others comment, but I think this is a perfectly reasonable alternative, and I much prefer using indentation over the delimiter-remark approach. LGTM, assuming nobody else objects

vangthao marked 5 inline comments as done.Jul 6 2022, 10:32 AM

ping

arsenm added inline comments.Thu, Jul 14, 10:01 AM
llvm/lib/Target/AMDGPU/AMDGPUAsmPrinter.cpp
1222

Why is this kernel name? Do we not emit these for other functions?

vangthao added inline comments.Thu, Jul 14, 10:15 AM
llvm/lib/Target/AMDGPU/AMDGPUAsmPrinter.cpp
1222

We do emit these for other functions. Would this be better off as "Function Name" instead of kernel?

arsenm added inline comments.Thu, Jul 14, 10:22 AM
llvm/lib/Target/AMDGPU/AMDGPUAsmPrinter.cpp
1222

Yes

vangthao updated this revision to Diff 444759.Thu, Jul 14, 12:03 PM

Change "Kernel Name" to "Function Name" and rebased patch.

arsenm added inline comments.Thu, Jul 14, 12:38 PM
llvm/lib/Target/AMDGPU/AMDGPUAsmPrinter.cpp
1201

Why &&?

scott.linder added inline comments.Thu, Jul 14, 3:48 PM
llvm/lib/Target/AMDGPU/AMDGPUAsmPrinter.cpp
1201

Looking at the ore::NV constructors I would vote that this just be by value, i.e. auto Argument. Everything that can be an Argument is a small "prefer passing by value" type.

vangthao updated this revision to Diff 444844.Thu, Jul 14, 4:45 PM

Change auto &&Argument to auto Argument.

arsenm accepted this revision.Thu, Jul 14, 5:03 PM
This revision is now accepted and ready to land.Thu, Jul 14, 5:03 PM
scott.linder accepted this revision.Fri, Jul 15, 10:57 AM
This revision was landed with ongoing or failed builds.Fri, Jul 15, 11:02 AM
This revision was automatically updated to reflect the committed changes.