Page MenuHomePhabricator

[DebugInfo][InstrRef] Move instr-ref controlling flag out of TargetOptions
ClosedPublic

Authored by jmorse on Jan 7 2022, 10:11 AM.

Details

Summary

Hi,

This patch moves the flag controlling instruction referencing out of TargetOptions and into a plain old LLVM option. I don't think TargetOptions is the right place for this to be, mostly because I've tried twice [0, 1] to have instruction referencing on-by-default in clang, and it's still not working out (which is highly unfortunate). Both clang, and seemingly other frontends, populate TargetOptions with their own opinions on what codegen should look like, rather than starting with the defaults from codegen::InitTargetOptionsFromCodeGenFlags. This doesn't make it a good place to control an *optional* extra part of LLVM. Plus, it's harder to control from LTO [2].

This patch switches the truth-test for using instruction referencing from a field in TargetOptions to the added debuginfoShouldUseDebugInstrRef method, which performs the same checks on Triple + command line flags. This will actually turn on instruction referencing for x86 on all frontends, like I intended in [0] :/. I would add an end-to-end test checking that clang does the right thing (produces DBG_INSTR_REF after SelectionDAG), but there doesn't seem to be a right place to put it (cross-project-tests maybe, but it doesn't have any proper compiler tests there). CC @dblaikie in case he has an opinion on such testing.

It's massively rubbish that this has cut out a large amount of time where I thought this was getting test exposure (possibly LTO is still testing it), I just didn't understand how TargetOptions worked. That might mean this has to be reverted / disabled on the release branch (sad faces) if there's turbulence. Too bad.

[0] D114631
[1] rGea22fdd120
[2] https://lists.llvm.org/pipermail/llvm-dev/2020-December/147090.html

Diff Detail

Event Timeline

jmorse created this revision.Jan 7 2022, 10:11 AM
jmorse requested review of this revision.Jan 7 2022, 10:11 AM
Herald added a project: Restricted Project. · View Herald TranscriptJan 7 2022, 10:11 AM

(cross-project-tests maybe, but it doesn't have any proper compiler tests there

Not sure what you mean by "it doesn't have any proper compiler tests there", proper in what sense? I expect you could do something like my recent simplified template names roundtrip test ( cross-project-tests/debuginfo-tests/clang_llvm_roundtrip/simplified_template_names.cpp ) which runs clang and checks the result with llvm-dwarfdump. Presumably the same could be done for this functionality - validating that some certain locations that are a signal that instruction referencing was used.

Not sure what you mean by "it doesn't have any proper compiler tests there", proper in what sense? I expect you could do something like my recent simplified template names roundtrip test ( cross-project-tests/debuginfo-tests/clang_llvm_roundtrip/simplified_template_names.cpp ) which runs clang and checks the result with llvm-dwarfdump. Presumably the same could be done for this functionality - validating that some certain locations that are a signal that instruction referencing was used.

Thanks for the example; by "proper compiler tests" I was meaning that the existing tests involve examining or running the produced binary, rather than stopping compilation half way through and looking at internal representation details. i.e.: clang test.c -mllvm -stop-after=finalize-isel -o - | FileCheck blah. That's an easy test to write, but there's no similar prior art as far as I understand it.

I'll have a look at detecting a "signal" that the right code-paths were used, although it's ultimately a proxy for what I want to test.

Not sure what you mean by "it doesn't have any proper compiler tests there", proper in what sense? I expect you could do something like my recent simplified template names roundtrip test ( cross-project-tests/debuginfo-tests/clang_llvm_roundtrip/simplified_template_names.cpp ) which runs clang and checks the result with llvm-dwarfdump. Presumably the same could be done for this functionality - validating that some certain locations that are a signal that instruction referencing was used.

Thanks for the example; by "proper compiler tests" I was meaning that the existing tests involve examining or running the produced binary, rather than stopping compilation half way through and looking at internal representation details. i.e.: clang test.c -mllvm -stop-after=finalize-isel -o - | FileCheck blah. That's an easy test to write, but there's no similar prior art as far as I understand it.

Oh, that seems OK to test too/instead - though I'd sort of lean towards cross project tests being more end-to-end "functional" tests, like what's the external behavior we expect to see for this change (I assume that's some improved location quality/coverage/etc). It might be that this. stop-after test might be suitable to live in clang. (tests for things that aren't IR serialized are imperfect - MCoptions (like DWARF type units), etc, - either are untested in clang, or are tested by going a bit further into LLVM than we usually would for a clang test, but I think that's a limitation we live with)

Questions around testing aside, the code changes here LGTM.

Cheers Stephen; I'm leaning towards not adding or altering tests here -- there's already a test for this option, and that LLVM does the right thing in LLC. What changes is a matter of storage: instead of the flag being stored in TargetOptions, which every frontend initializes itself it seems, it's now a (global) flag to LLVM. I'd prefer to stay away from parsing MIR in clang tests!.

This revision was not accepted when it landed; it landed in state Needs Review.Jan 12 2022, 5:28 AM
This revision was landed with ongoing or failed builds.
This revision was automatically updated to reflect the committed changes.

(Apologies that this didn't arrive in the patch that said "turn instr-ref on by default", there were a few mistakes there),

Yup this enables a significantly different mechanism for tracking variable locations [0] -- if you switch to the "size-file" difference on the compile time tracker, there's a corresponding (in one sample 13.5%) increase in DWARF + location lists produced. It significantly increases our variable location coverage: on the one hand there will definitely be a performance cost caused by producing more information, on the other hand, maybe it can be faster.

[0] https://www.youtube.com/watch?v=yxuwfNnp064

... although when I tested this prior to landing [0] there weren't any double-digit instruction-count regressions, I'll check whether I've missed a patch or something.

[0] http://llvm-compile-time-tracker.com/compare.php?from=849b17949f18b5753592588a2290f2f3dde73ac0&to=f484f03178d47dfc06442b1878462e7c2b7fedf6&stat=instructions

... although when I tested this prior to landing [0] there weren't any double-digit instruction-count regressions, I'll check whether I've missed a patch or something.

^ Ignore that, turns out I can't read, this pretty much matches what I got pre-landing.

tl;dr: this is working as intended -- whether it's a worthwhile trade-off is open to discussion. For testing / profiling examples / opting out, the option is "-mllvm -experimental-debug-variable-locations=false".

nikic added a subscriber: nikic.Jan 19 2022, 8:05 AM

Yeah, the compile-time impact here is not great :( I have some hope that some of this can be mitigated. I've applied a low-hanging optimization in https://github.com/llvm/llvm-project/commit/cbaae614224211d01b385227e7efb447c87fc3ce and another is in D117575, but someone who actually understands what this code is doing algorithmically could probably identify more optimization opportunities.

Looks like I can do some work reduction for a common case that cuts time off the worst regression [0], will have that up later today, although it's not NFC.

[0] http://llvm-compile-time-tracker.com/compare.php?from=1b09d0c42b42be219dd0984e0714d68b4a36cd3e&to=ac78c9434e1a33a137c07054f1c2300083524ca0&stat=instructions

I wrote
https://github.com/llvm/llvm-project/issues/53336
about an X86 verifier error that popped up with this patch.

D117877, I'll try and cook up a few similar patches elsewhere too.

I wrote
https://github.com/llvm/llvm-project/issues/53336
about an X86 verifier error that popped up with this patch.

Cheers, will get on that next.

I ran into a regression with this patch, which triggers the error "fatal error: error in backend: unknown codeview register ST0" on code that built fine before that.

Here's a reduced reproducer:

$ cat dither.c 
a() {
  long double b;
  asm volatile("" : "=t"(b));
}
$ clang -target x86_64-w64-mingw32 -c -O2 -g -gcodeview dither.c 
fatal error: error in backend: unknown codeview register ST0
PLEASE submit a bug report to https://github.com/llvm/llvm-project/issues/ and include the crash backtrace, preprocessed source, and associated run script.
Stack dump:
0.	Program arguments: bin/clang -target x86_64-w64-mingw32 -c -O2 -g -gcodeview dither.c
1.	<eof> parser at end of file
2.	Code generation
3.	Running pass 'Function Pass Manager' on module 'dither.c'.
4.	Running pass 'X86 Assembly Printer' on function '@a'

This makes it impossible to build VLC for x86_64 windows with PDB debug info, which worked fine before.

I ran into a regression with this patch, which triggers the error "fatal error: error in backend: unknown codeview register ST0" on code that built fine before that.

Here's a reduced reproducer:

$ cat dither.c 
a() {
  long double b;
  asm volatile("" : "=t"(b));
}
$ clang -target x86_64-w64-mingw32 -c -O2 -g -gcodeview dither.c 
fatal error: error in backend: unknown codeview register ST0
PLEASE submit a bug report to https://github.com/llvm/llvm-project/issues/ and include the crash backtrace, preprocessed source, and associated run script.
Stack dump:
0.	Program arguments: bin/clang -target x86_64-w64-mingw32 -c -O2 -g -gcodeview dither.c
1.	<eof> parser at end of file
2.	Code generation
3.	Running pass 'Function Pass Manager' on module 'dither.c'.
4.	Running pass 'X86 Assembly Printer' on function '@a'

Do @rnk or @aganea have any clue about this?

I ran into a regression with this patch, which triggers the error "fatal error: error in backend: unknown codeview register ST0" on code that built fine before that.

Here's a reduced reproducer:

$ cat dither.c 
a() {
  long double b;
  asm volatile("" : "=t"(b));
}
$ clang -target x86_64-w64-mingw32 -c -O2 -g -gcodeview dither.c 
fatal error: error in backend: unknown codeview register ST0
PLEASE submit a bug report to https://github.com/llvm/llvm-project/issues/ and include the crash backtrace, preprocessed source, and associated run script.
Stack dump:
0.	Program arguments: bin/clang -target x86_64-w64-mingw32 -c -O2 -g -gcodeview dither.c
1.	<eof> parser at end of file
2.	Code generation
3.	Running pass 'Function Pass Manager' on module 'dither.c'.
4.	Running pass 'X86 Assembly Printer' on function '@a'

Do @rnk or @aganea have any clue about this?

Some mappings seem to be missing, but I don't know why the Location->Register didn't return the FPU stack ST0-ST7 before? Here: https://github.com/llvm/llvm-project/blob/main/llvm/lib/CodeGen/AsmPrinter/CodeViewDebug.cpp#L1348

diff --git a/llvm/lib/Target/X86/MCTargetDesc/X86MCTargetDesc.cpp b/llvm/lib/Target/X86/MCTargetDesc/X86MCTargetDesc.cpp
index 9da0a8129f23..c83c211b0c58 100644
--- a/llvm/lib/Target/X86/MCTargetDesc/X86MCTargetDesc.cpp
+++ b/llvm/lib/Target/X86/MCTargetDesc/X86MCTargetDesc.cpp
@@ -111,6 +111,15 @@ void X86_MC::initLLVMToSEHAndCVRegMapping(MCRegisterInfo *MRI) {

       {codeview::RegisterId::EFLAGS, X86::EFLAGS},

+      {codeview::RegisterId::ST0, X86::ST0},
+      {codeview::RegisterId::ST1, X86::ST1},
+      {codeview::RegisterId::ST2, X86::ST2},
+      {codeview::RegisterId::ST3, X86::ST3},
+      {codeview::RegisterId::ST4, X86::ST4},
+      {codeview::RegisterId::ST5, X86::ST5},
+      {codeview::RegisterId::ST6, X86::ST6},
+      {codeview::RegisterId::ST7, X86::ST7},
+
       {codeview::RegisterId::ST0, X86::FP0},
       {codeview::RegisterId::ST1, X86::FP1},
       {codeview::RegisterId::ST2, X86::FP2},

Some mappings seem to be missing, but I don't know why the Location->Register didn't return the FPU stack ST0-ST7 before?

Ah, I might have an explanation -- the x87 FP stackifier pass complete ignores debug instructions, and so DBG_VALUEs of $fp0 pass through to the DWARF/CodeView emitter untouched. Wheras the "new" variable location tracking extracts physreg locations from the instructions that define values, so it'll produce references to ST0-7 taken from "real" instructions, instead of unstackified FP0-7.

I'm no expert on x87, as I understand it the registers are referred to in a stack but the old variable location tracking paid no attention to the stack changing depth, and that probably lead to lots of wrong variable locations. I haven't tried to fix this in "instruction referencing" mode, but it's technically do-able. I was rather hoping that everyone just used the SSE registers on x86_64, but I guess that doesn't happen when people us "long double".

alexfh added a subscriber: alexfh.Jan 21 2022, 6:03 PM

Hi Jeremy,
after this patch we see significantly increased compile-time memory usage, which makes a large number of targets to fail to compile with the (already quite large) memory limit we have on our build cluster. I'll try to get specific numbers and maybe an example, but it seems like this patch has raised enough concerns to maybe flip the default back while all of these are being addressed?

I ran into a regression with this patch, which triggers the error "fatal error: error in backend: unknown codeview register ST0" on code that built fine before that.

Here's a reduced reproducer:

$ cat dither.c 
a() {
  long double b;
  asm volatile("" : "=t"(b));
}
$ clang -target x86_64-w64-mingw32 -c -O2 -g -gcodeview dither.c 
fatal error: error in backend: unknown codeview register ST0
PLEASE submit a bug report to https://github.com/llvm/llvm-project/issues/ and include the crash backtrace, preprocessed source, and associated run script.
Stack dump:
0.	Program arguments: bin/clang -target x86_64-w64-mingw32 -c -O2 -g -gcodeview dither.c
1.	<eof> parser at end of file
2.	Code generation
3.	Running pass 'Function Pass Manager' on module 'dither.c'.
4.	Running pass 'X86 Assembly Printer' on function '@a'

Do @rnk or @aganea have any clue about this?

Some mappings seem to be missing, but I don't know why the Location->Register didn't return the FPU stack ST0-ST7 before? Here: https://github.com/llvm/llvm-project/blob/main/llvm/lib/CodeGen/AsmPrinter/CodeViewDebug.cpp#L1348

diff --git a/llvm/lib/Target/X86/MCTargetDesc/X86MCTargetDesc.cpp b/llvm/lib/Target/X86/MCTargetDesc/X86MCTargetDesc.cpp
index 9da0a8129f23..c83c211b0c58 100644
--- a/llvm/lib/Target/X86/MCTargetDesc/X86MCTargetDesc.cpp
+++ b/llvm/lib/Target/X86/MCTargetDesc/X86MCTargetDesc.cpp
@@ -111,6 +111,15 @@ void X86_MC::initLLVMToSEHAndCVRegMapping(MCRegisterInfo *MRI) {

       {codeview::RegisterId::EFLAGS, X86::EFLAGS},

+      {codeview::RegisterId::ST0, X86::ST0},
+      {codeview::RegisterId::ST1, X86::ST1},
+      {codeview::RegisterId::ST2, X86::ST2},
+      {codeview::RegisterId::ST3, X86::ST3},
+      {codeview::RegisterId::ST4, X86::ST4},
+      {codeview::RegisterId::ST5, X86::ST5},
+      {codeview::RegisterId::ST6, X86::ST6},
+      {codeview::RegisterId::ST7, X86::ST7},
+
       {codeview::RegisterId::ST0, X86::FP0},
       {codeview::RegisterId::ST1, X86::FP1},
       {codeview::RegisterId::ST2, X86::FP2},

Thanks @aganea! Would you mind posting such a patch for review, or should I? I’m not sure how to create a simple enough and robust testcase for it, but maybe it’s not strictly necessary when it’s just adding more enum entries in a table?

Hi Jeremy,
after this patch we see significantly increased compile-time memory usage, which makes a large number of targets to fail to compile with the (already quite large) memory limit we have on our build cluster. I'll try to get specific numbers and maybe an example, but it seems like this patch has raised enough concerns to maybe flip the default back while all of these are being addressed?

He explained that there is nothing much what is possible to do - you get better debug info but also CT and MU is a bit worse. But if I want debug info (-g), I want the best possible one, so …

Why to pull every useful work out of trunk just because somebody has some issue? There is a flag, consider using it.

Hi Jeremy,
after this patch we see significantly increased compile-time memory usage, which makes a large number of targets to fail to compile with the (already quite large) memory limit we have on our build cluster. I'll try to get specific numbers and maybe an example, but it seems like this patch has raised enough concerns to maybe flip the default back while all of these are being addressed?

He explained that there is nothing much what is possible to do - you get better debug info but also CT and MU is a bit worse. But if I want debug info (-g), I want the best possible one, so …

Why to pull every useful work out of trunk just because somebody has some issue? There is a flag, consider using it.

For what it's worth, this comes off a bit snarky/hyperbolic.

There's certainly tradeoffs to be made - "no regression ever" isn't a useful stance (there's some amount of compile time, memory usage, etc, that folks would likely be willing to pay for some benefit), but equally there is some level of regression that's "too much".

I don't think anyone's intending to suggest this work is dead/must be thrown out, but if the regression is more than expected (that's not clear to me - I think this patch should've included more numbers about expected regressions, and those suggesting that it might be suitable to revert should be documenting the level of regressions they're seeing - hopefully demonstrating more significant regressions than what's expected) it's probably reasonable to revert, analyze, and see if there are some extreme cases that weren't encountered/understood before the patch was committed, etc.

@alexfh - do you have some numbers to help clarify how significant these regressions are?
@jmorse - do you have some numbers to help clarify what sort of regressions you were anticipating this patch causing?

it's probably reasonable to revert, analyze, and see if there are some extreme cases

Revert… So (silent) users, which are now or would be (soon new release) happy with better debug info, would just lose this feature?

Otherwise I dont know why many new bigger things in llvm go under own flag if not for situations like this one.

it's probably reasonable to revert, analyze, and see if there are some extreme cases

Revert… So (silent) users, which are now or would be (soon new release) happy with better debug info, would just lose this feature?

Similarly, they could use the flag to continue to opt in to the behavior.

Otherwise I dont know why many new bigger things in llvm go under own flag if not for situations like this one.

Generally if it ends up being a feature where a small subset of users have an esoteric need compared to the norm, they could continue to use the flag to opt out.

But if the tradeoff is more broadly unfavorable, then that probably indicates the default should reflect that.

It's unclear where we are in this balance without numbers in either direction.

But also if the observed regressions are larger than the documented/intended, that's probably reflective of some considerations that hadn't been accounted before before committing, and so worth reverting to known good and investigating those considerations to make sure they're understood before moving forward.

Similarly, they could use the flag to continue to opt in to the behavior.

Ah yeah, if you meant "revert = set this as false by default" than it is a reasonable thing to do if regressions are too large for common codebases.

Similarly, they could use the flag to continue to opt in to the behavior.

Ah yeah, if you meant "revert = set this as false by default" than it is a reasonable thing to do if regressions are too large for common codebases.

Yeah, that's what I mean - "revert the observable effect of this change". Mechanically I'm open to ideas and I don't think we need to churn this whole patch. Probably would've been good to make the behavior changing patch smaller/simpler so it'd be easy to revert/toggle off, but 'tis what it is. So, yeah, might be suitable to commit a change to the comparisons in debuginfoShouldUseDebugInstrRef to disable this functionality for now for further investigation.

I agree with @dblaikie that a bit more time for @alexfh's numbers would be good -- I'd be especially interested in the code characteristics (i.e.: is this a large C++ codebase under LTO, with things like exceptions?). If there are any runs where the build completes, a comparison of how much debug-info is in the before/after files would be great, if that's possible.

@jmorse - do you have some numbers to help clarify what sort of regressions you were anticipating this patch causing?

For compile time on large C++ codebases, something in the order of 1% regression as the clang frontend took up the vast majority of compile time; on smaller benchmarks, pretty much in line with what CTMark on the compile-time tracker was reporting. I think with the patches from @nikic (cheers!), D117877 and something else I've got cooking then the regression on CTMark should be 2-3%. Caroline kindly ran some tests too [0], unfortunately one of the algorithms changed a lot between now and then :/,

For memory consumption, I only observed negligible changes in max-rss when testing C++ codebases, CTMarks numbers for max-rss are roughly 1%. My general expectation is that the instr-ref LiveDebugValues implementation shouldn't be responsible for peak memory usage as it only fully analyses one variable at a time, instead any greater memory usage Should (TM) come from storing the location information for all variables before DWARF emissions happens. If that holds, any increase in max-rss should be somewhat proportionate to a change in the output file size / output debug-info. I can see a few outcomes here:

  • Optimistic: the increase in memory usage is in proportion with more debug-info being produced. Discussion should then move to whether we should generate all that data,
  • Pessimistic: there's little change in how much debug-info is produced, but lots of extra memory used. That's almost certainly a bug or design problem, to be reverted and investigated,
  • Complete misery: I'm aware of a few pathological scenarios that LLVM can't cope with, usually extremely large functions with lots of inlining, where describing all variables locations at the same time simply won't fit in memory. In the past that's been "fixed" by truncating the debug-info for such functions [1]. If that kind of scenario is occurring it's revert-and-write-more-code.

[0] https://lists.llvm.org/pipermail/llvm-dev/2021-September/152457.html
[1] https://reviews.llvm.org/D80662

nikic added a comment.Jan 24 2022, 4:15 AM

For memory consumption, I only observed negligible changes in max-rss when testing C++ codebases, CTMarks numbers for max-rss are roughly 1%.

Just a quick note on this: The max-rss numbers reported by the compile-time tracker are quite misleading in this case, because they add up max-rss from all compilations and the final link, while in this case only link memory usage increases. Here are the numbers from just the ReleaseLTO-g link steps:

sqlite3.link 	191MiB 	195MiB (+2.26%)
consumer-typeset.link 	155MiB 	155MiB (+0.07%)
bullet.link 	208MiB 	207MiB (-0.04%)
tramp3d-v4.link 	613MiB 	620MiB (+1.02%)
pairlocalalign.link 	99MiB 	126MiB (+26.85%)
clamscan.link 	213MiB 	216MiB (+1.23%)
lencod.link 	257MiB 	275MiB (+6.90%)
SPASS.link 	254MiB 	343MiB (+35.24%)
7zip-benchmark.link 	439MiB 	442MiB (+0.67%)

The 25% and 35% increases on pairlocalalign and SPASS stand out as outliers here and might be good (relatively small) test cases to double check.

Hi,

Just a quick note on this: The max-rss numbers reported by the compile-time tracker are quite misleading in this case, because they add up max-rss from all compilations and the final link, while in this case only link memory usage increases. Here are the numbers from just the ReleaseLTO-g link steps:

Ah, right!, that explains a few things, thanks for the tip,

SPASS.link 	254MiB 	343MiB (+35.24%)

Now that's unexpected, I'll dig in there. This certainly moves the direction of travel towards disabling/reverting.

Pushed up 74db5c8c95 that effectively reverts this patch, through disabling the on-by-default-for-x86 portion of code.

I've dug through the CTMark SPASS sample and found some now-obvious problems. I reckon they can be addressed, but not immediately, so it's best to revert for the moment.

This raises some uncertainty regarding the LLVM14 branch date; I'll post back here closer to the time.

Pushed up 74db5c8c95 that effectively reverts this patch, through disabling the on-by-default-for-x86 portion of code.

I've dug through the CTMark SPASS sample and found some now-obvious problems. I reckon they can be addressed, but not immediately, so it's best to revert for the moment.

This raises some uncertainty regarding the LLVM14 branch date; I'll post back here closer to the time.

You could default this option to just false, instead of ifdefing it out, so anybody could enable it explicitly.

You could default this option to just false, instead of ifdefing it out, so anybody could enable it explicitly.

Possibly I'm misunderstanding but: below the #if 0'd block, there's a test of the cl::opt that allows that:

return ValueTrackingVariableLocations == cl::boolOrDefault::BOU_TRUE;

So that anyone can specify -mllvm -experimental-debug-variable-locations=true to get the instr-ref behaviour. The command-line-flag defaults to "unset", causing llvm::debuginfoShouldUseDebugInstrRef to return false on "main" right now.

It's mostly that we try very hard to not have #if 0 blocks in code please :)

The XFAIL * means that it also won't show up when you try to turn it on or
if something breaks. An option will let you explicitly keep that code
tested.

Thanks!

-eric

(Block stripped out in d27f022614)

Thanks @aganea! Would you mind posting such a patch for review, or should I? I’m not sure how to create a simple enough and robust testcase for it, but maybe it’s not strictly necessary when it’s just adding more enum entries in a table?

Feel free to go ahead with a patch!

I found one particularly bad input. Here what time -v clang -fno-omit-frame-pointer -fcommon -no-canonical-prefixes -fdiagnostics-show-option -fmessage-length=0 -fno-exceptions -fbracket-depth=768 -fno-strict-aliasing -fmerge-all-constants -flax-vector-conversions=all -funsigned-char -Xclang=-fuse-ctor-homing -fcolor-diagnostics --cuda-host-only -faligned-allocation -fnew-alignment=8 -fshow-overloads=best -mllvm -disable-binop-extract-shuffle -fdebug-types-section -O2 -momit-leaf-frame-pointer -ffp-contract=off --target=x86_64-unknown-linux-gnu -fno-unique-section-names -maes -m64 -mcx16 -msse4.2 -mpclmul -mprefer-vector-width=128 -gmlt -gz=zlib -fdebug-default-version=5 -fsanitize=address,bool,bounds,builtin,float-cast-overflow,integer-divide-by-zero,null,return,returns-nonnull-attribute,shift-exponent,signed-integer-overflow,unreachable,vla-bound -fsanitize-address-globals-dead-stripping -fsanitize-trap=null -fno-sanitize-recover=all -fno-omit-frame-pointer -fsized-deallocation -fPIE -fdirect-access-external-data -ffunction-sections -fdata-sections -pthread -fno-sanitize-trap=all -std=gnu++17 -stdlib=libc++ -Wno-deprecated-declarations -Wno-inconsistent-missing-override -save-stats -c q.cc -o q.o says 1. at rG74db5c8c95e2aed40d288bb2df92eb859b87c827 (with the new behavior disabled) and 2. at rG80532ebb508d0ca62f96df5f253db8caed969397 (right before rG74db5c8c95e2aed40d288bb2df92eb859b87c827):

1
User time (seconds): 190.92
System time (seconds): 2.16
Percent of CPU this job got: 99%
Elapsed (wall clock) time (h:mm:ss or m:ss): 3:13.10
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 4926000
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 27080
Voluntary context switches: 8
Involuntary context switches: 579
Swaps: 0
File system inputs: 0
File system outputs: 54168
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
2
User time (seconds): 255.60
System time (seconds): 8.09
Percent of CPU this job got: 99%
Elapsed (wall clock) time (h:mm:ss or m:ss): 4:23.71
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 27272308
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 32213
Voluntary context switches: 8
Involuntary context switches: 779
Swaps: 0
File system inputs: 0
File system outputs: 54168
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0

The most problematic are User time (seconds): 190.92 -> 255.60 (+34%) and Maximum resident set size (kbytes): 4926000 -> 27272308 (5.5x).

As for the nature of the code, it's a large generated file (preprocessed size ~22MB) with a few kinds of boilerplate (e.g. a method with a ton of v->push_back({some, constant}); calls and other stuff). I can try to find out which of them is responsible for most of the increase.

alexfh added a comment.EditedJan 27 2022, 6:16 PM

I reduced our problematic file to a function with a couple thousand variables of a type containing a few std::function<>s. This can be demonstrated by this synthetic test case (compiled on x86-64 linux with libc++):

#include <functional>
#include <vector>
namespace n {

using F = std::function<void()>;

struct S {
  F f1;
  F f2;
  F f3;
};

std::vector<S> f() {
  std::vector<S> v;
  #define X(s) S s; v.push_back(s)
  #define XX(s) \
    X(s ## 0); \
    X(s ## 1); \
    X(s ## 2); \
    X(s ## 3); \
    X(s ## 4); \
    X(s ## 5); \
    X(s ## 6); \
    X(s ## 7); \
    X(s ## 8); \
    X(s ## 9)
  #define XXX(s) \
    XX(s ## 0); \
    XX(s ## 1); \
    XX(s ## 2); \
    XX(s ## 3); \
    XX(s ## 4); \
    XX(s ## 5); \
    XX(s ## 6); \
    XX(s ## 7); \
    XX(s ## 8); \
    XX(s ## 9)
  #define XXXX(s) \
    XXX(s ## 0); \
    XXX(s ## 1); \
    XXX(s ## 2); \
    XXX(s ## 3); \
    XXX(s ## 4); \
    XXX(s ## 5); \
    XXX(s ## 6); \
    XXX(s ## 7); \
    XXX(s ## 8); \
    XXX(s ## 9)

  XXXX(a);
  return v;
}
}
$ time -v ./clang-74db5c8c95e2aed40d288bb2df92eb859b87c827 -O1 --target=x86_64-generic-linux-gnu -g -fsanitize=address -fsized-deallocation -std=c++17 -c q.cc -o q.o
        Command being timed: "./clang-74db5c8c95e2aed40d288bb2df92eb859b87c827 -O1 --target=x86_64-generic-linux-gnu -g -fsanitize=address -fsized-deallocation -std=c++17 -c q.cc -o q.o"
        User time (seconds): 20.13
        System time (seconds): 0.31
        Percent of CPU this job got: 99%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:20.46
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 1029064
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 8101
        Voluntary context switches: 8
        Involuntary context switches: 42
        Swaps: 0
        File system inputs: 0
        File system outputs: 5104
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0
$ time -v ./clang-80532ebb508d0ca62f96df5f253db8caed969397 -O1 --target=x86_64-generic-linux-gnu -g -fsanitize=address -fsized-deallocation -std=c++17 -c q.cc -o q.o
        Command being timed: "./clang-80532ebb508d0ca62f96df5f253db8caed969397 -O1 --target=x86_64-generic-linux-gnu -g -fsanitize=address -fsized-deallocation -std=c++17 -c q.cc -o q.o"
        User time (seconds): 66.67
        System time (seconds): 2.06
        Percent of CPU this job got: 99%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 1:08.75
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 7632172
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 12537
        Voluntary context switches: 8
        Involuntary context switches: 223
        Swaps: 0
        File system inputs: 0
        File system outputs: 4872
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

Note the 3.3x run time and ~7.5x maximum rss with this patch. But what's surprising to me is the high run time and memory consumption of clang even without this patch =\

Hi @alexfh , first many thanks for coming back with details and a reproducer!,

The most problematic are User time (seconds): 190.92 -> 255.60 (+34%) and Maximum resident set size (kbytes): 4926000 -> 27272308 (5.5x).

Oooooff :/,

As for the nature of the code, it's a large generated file (preprocessed size ~22MB) with a few kinds of boilerplate (e.g. a method with a ton of v->push_back({some, constant}); calls and other stuff). I can try to find out which of them is responsible for most of the increase.

As ever, generated code can easily stimulates limitations, and in this case it's really exciting: the culprit is ASAN having thousands of live pointers to wherever it is asan stores things. They all get spilt to the stack: and because my working assumption when writing this code was that LLVM would usually seek to minimise the working set size for obvious reasons, InstrRefBasedLDV tries to track all the spilt values in every block, which in this case is thousands of slots. That leads to the ballooning memory consumption, which won't be fixed by the other improvements I've made. The cherry on the cake is that, because asan pointers don't have any variables associated with them, there's zero improvement in debug experience from instr-ref on this sample.

VarLocBasedLDV does just fine because it only tracks locations after they're assigned to a variable; instr-ref tracks everything to increase the number of locations that are known to contain a value, hence this explosion.

An easy way out would be to set a hard limit on the number of spill slots being tracked, on the (to me) fairly reasonable principle of "If your working set is over (say) 1000 Values, you probably autogenerated this code and probably won't be debugging it by hand", much like existing cut-offs in LiveDebugValues. I instinctively don't want to add arbitrary limits though.

I'll give it some more thought over the weekend. The assumption about the working set being small is pretty heavily baked into what I've been writing though.

An easy way out would be to set a hard limit on the number of spill slots being tracked, on the (to me) fairly reasonable principle of "If your working set is over (say) 1000 Values, you probably autogenerated this code and probably won't be debugging it by hand", much like existing cut-offs in LiveDebugValues. I instinctively don't want to add arbitrary limits though.

(... or maybe for large inputs reduce the (currently highly detailed) quality of information about stack slots, down to the level VarLocBasedLDV works at. You probably don't care about individual subregister-fields within slots if you have so many Values. Quickly hacking shows this reduces memory blow-up to 10%, time to 15%).

rnk added a comment.Jan 28 2022, 10:11 AM

Regarding a value cutoff, I think that makes sense. To date, LLVM has never intentionally thrown away correct debug info, because historically we've had so preciously little of it. However, as we get better at tracking debug info, I think we're going to need more heuristics like this to cutoff excessive amounts of debug information.

Anecdotally, debug info for mostly trivial inlined call frames (unique_ptr::get, operator*, etc) takes up a huge chunk of PDBs for Chrome. I think it might be time to start developing heuristics, modes, config files, or flags that discard trivial information like this to save file size and compile time.

Sitrep on this:

  • I've added a cut-off in D118601 that limits max-rss growth in the sample from @alexfh to "only" about 12%, which given its autogenerated nature, seems reasonable to me. I've got some scripts working out what the cut-off value would be for some LTO builds of large codebases, I think the best plan is "pick a reasonable number and double it".
  • Other patches in the related (largely approvedish) stack address most of the growth in max-rss for CTMark, although it still goes up by 10% for SPASS. That's in a function with 220k instructions, 50k of which are DBG_VALUEs, 90k DBG_VALUEs with instr-ref.
  • On the performance front, the regression in compile-time is now down to less than 2% geomean [0], there are still one or two tricks I can pull.

However I'm out of the (figurative) office for most of tomorrow, so not all of this is going to hit by branch time. I'll wave the situation in Tom Stellards direction and see what happens.

[0] http://llvm-compile-time-tracker.com/compare.php?from=30efee764d95d425b480758ed02682fab6426d28&to=2de5ce8245a00d52e397057d5976d89f3002220d&stat=instructions

@alexfh - while it's great to have highlighted this issue upstream (& I think it's been valuable to have motivated the improvements so far) - if you want to CC me on the internal bug or whatever, maybe we can see if __attribute__((nodebug)) or the like might be suitable on some of this large autogenerated code.

Regarding a value cutoff, I think that makes sense. To date, LLVM has never intentionally thrown away correct debug info, because historically we've had so preciously little of it. However, as we get better at tracking debug info, I think we're going to need more heuristics like this to cutoff excessive amounts of debug information.

Anecdotally, debug info for mostly trivial inlined call frames (unique_ptr::get, operator*, etc) takes up a huge chunk of PDBs for Chrome. I think it might be time to start developing heuristics, modes, config files, or flags that discard trivial information like this to save file size and compile time.

FTR, Sony downstream has implemented an option that will automatically treat certain methods and functions as nodebug. The criteria, IIRC, are: method with in-class definition; explicitly marked inline; or attribute always_inline. It gets rid of a lot of the fluff. If there's interest I can put it on my to-do list to post a patch.

Regarding a value cutoff, I think that makes sense. To date, LLVM has never intentionally thrown away correct debug info, because historically we've had so preciously little of it. However, as we get better at tracking debug info, I think we're going to need more heuristics like this to cutoff excessive amounts of debug information.

Anecdotally, debug info for mostly trivial inlined call frames (unique_ptr::get, operator*, etc) takes up a huge chunk of PDBs for Chrome. I think it might be time to start developing heuristics, modes, config files, or flags that discard trivial information like this to save file size and compile time.

FTR, Sony downstream has implemented an option that will automatically treat certain methods and functions as nodebug. The criteria, IIRC, are: method with in-class definition; explicitly marked inline; or attribute always_inline. It gets rid of a lot of the fluff. If there's interest I can put it on my to-do list to post a patch.

Hmm - I'd at least be curious to know more about the criteria used. Those seem a bit more aggressive than I'd have figured - non-trivial inline functions in headers seem like they could be quite confusing to treat as nodebug.

Hmm - I'd at least be curious to know more about the criteria used. Those seem a bit more aggressive than I'd have figured - non-trivial inline functions in headers seem like they could be quite confusing to treat as nodebug.

Well, those are the criteria we landed on. And of course it's controlled by a command-line option, so you can get everything if you want.

The argument (made by licensees; we didn't make this up out of thin air) is that in-class defined methods, and those explicitly marked as inline or always_inline in the source, do tend to be small and not the source of real bugs. Also, making it obvious from the source whether a given function would be affected was a positive point. Yes, a team could spend a lot of effort adding nodebug everywhere, but having the quick and easy ability to _get_ debug info for everything (by fiddling a command-line option, no tedious source modifications required) was also a win.

You can invent a counter example easily, but in our experience the counter-examples aren't the norm, at least for the licensees who asked for this.

Hmm - I'd at least be curious to know more about the criteria used. Those seem a bit more aggressive than I'd have figured - non-trivial inline functions in headers seem like they could be quite confusing to treat as nodebug.

Well, those are the criteria we landed on. And of course it's controlled by a command-line option, so you can get everything if you want.

The argument (made by licensees; we didn't make this up out of thin air) is that in-class defined methods, and those explicitly marked as inline or always_inline in the source, do tend to be small and not the source of real bugs. Also, making it obvious from the source whether a given function would be affected was a positive point. Yes, a team could spend a lot of effort adding nodebug everywhere, but having the quick and easy ability to _get_ debug info for everything (by fiddling a command-line option, no tedious source modifications required) was also a win.

You can invent a counter example easily, but in our experience the counter-examples aren't the norm, at least for the licensees who asked for this.

cool cool - good to know!

rnk added a comment.Feb 1 2022, 2:19 PM

The argument (made by licensees; we didn't make this up out of thin air) is that in-class defined methods, and those explicitly marked as inline or always_inline in the source, do tend to be small and not the source of real bugs. Also, making it obvious from the source whether a given function would be affected was a positive point. Yes, a team could spend a lot of effort adding nodebug everywhere, but having the quick and easy ability to _get_ debug info for everything (by fiddling a command-line option, no tedious source modifications required) was also a win.

I've heard about this behavior in the Sony compiler, but for some reason I never thought about it as a flag. A flag to get this behavior seems like something we could reasonably accept upstream, for all the reasons you describe. It's a reasonable heuristic for some codebases. However, in my experience, there are plenty of complex methods defined inline in class templates, and the existing behavior of retaining all the inlined call site info seems like a better default.

You may recall this came up in the -gno-inline-line-tables patch https://reviews.llvm.org/D67723, which was really driven by debug info size concerns rather than debug experience concerns.

Sitrep: I've re-enabled instr-ref for x86_64 in rG6e03a68b776d. I believe all the issues reported here so far are fixed / optimised further, so I've filed a cherry-pick request in https://github.com/llvm/llvm-project/issues/53548 , which I believe is the correct procedure.

Any more difficulties -> we'll take them as they come.

We noticed another crash with this change that still repros on ToT: https://github.com/llvm/llvm-project/issues/54190

@jmorse can you please check?

Herald added a project: Restricted Project. · View Herald TranscriptMar 3 2022, 6:20 PM