This is an archive of the discontinued LLVM Phabricator instance.

[CodeGen] [WinException] Only produce handler data at the end of the function if needed
ClosedPublic

Authored by mstorsjo on Sep 10 2020, 5:03 AM.

Details

Summary

If we are going to write handler data (that is written as variable length data following after the unwind info in .xdata), we need to emit the handler data immediately, but for cases where no such info is going to be written, skip emitting it right away. (Unwind info for all remaining functions that hasn't gotten it emitted directly is emitted at the end.)

This also matches GCC's assembly output, which doesn't output .seh_handlerdata unless it actually is needed.

For ARM64, the unwind info can be packed into the runtime function entry itself (leaving no data in the .xdata section at all), but that can only be done if there's no follow-on data in the .xdata section. If emission of the unwind info is triggered via EmitWinEHHandlerData (or the .seh_handlerdata directive), which implicitly switches to the .xdata section, there's a chance of the caller wanting to pass further data there, so the packed format can't be used in that case.

This turned out to be the reason why D87371 had no effect at all, as it forced .seh_handlerdata at the end of all functions. With this patch in place, D87371 has got a small effect (but very few functions in practice end up matching).

Diff Detail

Event Timeline

mstorsjo created this revision.Sep 10 2020, 5:03 AM
Herald added a project: Restricted Project. · View Herald TranscriptSep 10 2020, 5:03 AM
rnk added a comment.Sep 10 2020, 2:19 PM

Makes sense.

Unwind info for all remaining functions that hasn't gotten it emitted directly is emitted at the end.

Can we make .seh_endproc responsible for emitting the unwind info into .xdata if necessary? Would that preserve the section ordering? There is something nice about having all the .text/.pdata/.xdata sections grouped together in the object file, even though the order doesn't matter.

llvm/test/CodeGen/X86/mingw-comdats.ll
81 ↗(On Diff #290945)

Shouldn't this remain? This change seems unrelated.

In D87448#2266746, @rnk wrote:

Makes sense.

Unwind info for all remaining functions that hasn't gotten it emitted directly is emitted at the end.

Can we make .seh_endproc responsible for emitting the unwind info into .xdata if necessary?

I guess it could be done, but... Right now, EmitWinCFIEndProc is only implemented in the MCStreamer base class, while the class responsible for actually writing the .xdata contents, Win64EH::UnwindEmitter or Win64EH::ARM64UnwindEmitter) is in the subclasses X86WinCOFFStreamer and AArch64WinCOFFStreamer, so it'd require overriding EmitWinCFIEndProc in these subclasses and calling the respective UnwindEmitter methods for writing the xdata right there. Not hard, but a bit extra boilerplate needed in any target specific subclass doing this, if they're to be consistent wrt this.

Would that preserve the section ordering? There is something nice about having all the .text/.pdata/.xdata sections grouped together in the object file, even though the order doesn't matter.

I guess it would - but even right now, they're not grouped together right now. For an object file with two comdat functions, func1 and func2, GCC/binutils do generate them in the order .text$func1, .xdata$func1, .pdata$func1, .text$func2, .xdata$func2, .pdata$func2 - i.e .text/.xdata/.pdata ordered together.

Currently, LLVM generates .text$func1, .xdata$func1, .text$func2, .xdata$func2, .pdata$func1, .pdata$func2 - i.e. .text/.xdata pairwise together, but all .pdata emitted at the end. With this patch in place, it generates .text$func1, .text$func2, .xdata$func1, .xdata$func2, .pdata$func1, .pdata$func2, which admittedly is the least nice ordering.

In that sense, the current ordering isn't much to strive for either, so if we want to make an effort towards keeping that, we probably also should emit the .pdata right next after the corresponding .xdata. I don't think that's much extra work either, but still a little bigger refactoring than just "stop emitting unnecessary .seh_handlerdata and keep .xdata where it was".

llvm/test/CodeGen/X86/mingw-comdats.ll
81 ↗(On Diff #290945)

It was a mistake on my side; it does remain, but the sections have changed order, and I thought it wasn't important to the testcase (like unrelated lines in the middle of a CHECK-NEXT sequence), but I see it's part of what it actually tries to check. I've locally updated the patch to move it up between text and xdata.

I think it's called this out in the comments on the aarch64 implementation, but ultimately, I think we want specialized MCFragments for pdata/xdata. The current AArch64 implementation currently has a couple FIXMEs which can't be fixed in the current architecture (in particular, it doesn't interact with relaxation correctly).

@rnk - Can you follow up on the discussion above?

rnk accepted this revision.Sep 21 2020, 12:00 PM

@rnk - Can you follow up on the discussion above?

I agree, it sounds like the current order isn't worth preserving then. Let's do this.

This revision is now accepted and ready to land.Sep 21 2020, 12:00 PM
hans added a subscriber: hans.Nov 3 2020, 4:12 AM

Sorry for the late follow-up, but this caused big problems in Chromium.

This caused an explosion in ICF times during linking on Windows when libfuzzer
instrumentation is enabled. For a small binary we see ICF time go from ~0 to
~10 s. For a large binary it goes from ~1 s to forevert (I gave up after 30
minutes).

See https://bugs.chromium.org/p/chromium/issues/detail?id=1144476

I don't know if the real issue is with this commit, lld's icf code, or libfuzzer's instrumentation, but until that's figured out, I've reverted this patch in cbf25fbed5b46ec47e3ce2799ed9095f2f18ea8f to bring things back to a working state.

Sorry for the late follow-up, but this caused big problems in Chromium.

This caused an explosion in ICF times during linking on Windows when libfuzzer
instrumentation is enabled. For a small binary we see ICF time go from ~0 to
~10 s. For a large binary it goes from ~1 s to forevert (I gave up after 30
minutes).

See https://bugs.chromium.org/p/chromium/issues/detail?id=1144476

I don't know if the real issue is with this commit, lld's icf code, or libfuzzer's instrumentation, but until that's figured out, I've reverted this patch in cbf25fbed5b46ec47e3ce2799ed9095f2f18ea8f to bring things back to a working state.

Hmm, tricky. Functionally, afaik this commit still is ok (I haven't heard about a functional issue with the generated code), but it does indeed change the order of output sections, and it's plausible that the new order is suboptimal for some of those tools.

I'd appreciate if you or @rnk can dig into what makes it take so long (presumably in lld), and I could try to make an effort to redo this commit while preserving the section ordering. (Ideally we might even want to get text/xdata/pdata grouped as binutils does, see a comment further up, if that order works equally well.)

mstorsjo updated this revision to Diff 305467.Nov 16 2020, 4:11 AM
mstorsjo edited the summary of this revision. (Show Details)
mstorsjo added a reviewer: hans.

Updated for a retry, trying to retain the exact section ordering as before, to hopefully avoid the LLD ICF runtime explosion reported before.

This reapplies 36c64af9d7f97414d48681b74352c9684077259b in updated form, applying @rnk's earlier suggestion, emitting .xdata at .seh_endproc.

This keeps the exact same output header order for most code generated by the LLVM CodeGen layer. (Sections still change order for code built from assembly where functions lack an explicit .seh_handlerdata directive, and functions with chained unwind info.)

The practical effect should be that assembly output lacks superfluous ".seh_handlerdata; .text" pairs at the end of functions that don't handle exceptions, which allows such functions to use the AArch64 packed unwind format again.

@hans - Can you check whether this works fine with libfuzzer and LLD's ICF?

mstorsjo reopened this revision.Nov 16 2020, 4:17 AM
This revision is now accepted and ready to land.Nov 16 2020, 4:17 AM
hans added a comment.Nov 16 2020, 7:23 AM

@hans - Can you check whether this works fine with libfuzzer and LLD's ICF?

Yes, this appears to fix the problem we saw. (I tried on Chromium's base_unittests.exe, where ICF time dropped from 13 s with the bad patch to 73 ms with this patch.) Thanks!

@hans - Can you check whether this works fine with libfuzzer and LLD's ICF?

Yes, this appears to fix the problem we saw. (I tried on Chromium's base_unittests.exe, where ICF time dropped from 13 s with the bad patch to 73 ms with this patch.) Thanks!

Awesome, thanks! Do you want to have a look at the patch itself and review it, or does @rnk have time? (It's his original suggestion, to emit xdata in .seh_endproc, but the implementation is entirely trivial.)

(I also have a couple more patches that can go on top of this, that simplifies things a bit further, and emits pdata together with xdata at the end of the function.)

hans added a comment.Nov 17 2020, 1:23 AM

@hans - Can you check whether this works fine with libfuzzer and LLD's ICF?

Yes, this appears to fix the problem we saw. (I tried on Chromium's base_unittests.exe, where ICF time dropped from 13 s with the bad patch to 73 ms with this patch.) Thanks!

Awesome, thanks! Do you want to have a look at the patch itself and review it, or does @rnk have time? (It's his original suggestion, to emit xdata in .seh_endproc, but the implementation is entirely trivial.)

(I also have a couple more patches that can go on top of this, that simplifies things a bit further, and emits pdata together with xdata at the end of the function.)

@rnk knows this stuff much better than I do. Let's see if he has time, otherwise I can try to ramp up on it.

rnk added a comment.Nov 17 2020, 9:51 AM

I would like to understand why LLD ICF time is sensitive to section ordering. I took the time to get set up to measure the difference in performance, but didn't find the time to investigate. I've had several personal and non-technical work things come up over the last week, so I probably don't have time to take this on.

In D87448#2400337, @rnk wrote:

I would like to understand why LLD ICF time is sensitive to section ordering. I took the time to get set up to measure the difference in performance, but didn't find the time to investigate. I've had several personal and non-technical work things come up over the last week, so I probably don't have time to take this on.

I posted my analysis of that so far in the chromium bug thread.

Awesome, thanks! Do you want to have a look at the patch itself and review it, or does @rnk have time? (It's his original suggestion, to emit xdata in .seh_endproc, but the implementation is entirely trivial.)

@rnk knows this stuff much better than I do. Let's see if he has time, otherwise I can try to ramp up on it.

Looks like he hasn't, so @hans, can you give it a read through so I can reland this change?

hans accepted this revision.Nov 23 2020, 6:35 AM

lgtm

(I added a naming comment, but feel free to resolve that as you feel is best.)

Relatedly, if one wants to learn about Windows EH, are there any good resources besides studying the code?

llvm/lib/MC/MCStreamer.cpp
695

When I first saw this variable name I thought it was some kind of offset. But now I see it's really an index into WinFrameInfos.

Maybe CurrentProcWinFrameInfoIndex would be a better name? And should it be initialized to -1 or some other invalid value rather than 0, since I assume it should always be set here before it's used?

lgtm

(I added a naming comment, but feel free to resolve that as you feel is best.)

Thanks!

Relatedly, if one wants to learn about Windows EH, are there any good resources besides studying the code?

https://docs.microsoft.com/en-us/cpp/build/exception-handling-x64 and https://docs.microsoft.com/en-us/cpp/build/arm64-exception-handling have pretty good descriptions of the unwind info formats, https://llvm.org/docs/ExceptionHandling.html describes some things from the IR generator perspective, but I don't know of any good docs of all the other practical details of everything inbetween. (I don't know the things inbetween myself really either, I just try to compare with a known working case, when I need to poke those bits.) For reading/following code, the wine source is sometimes a good reference for things that otherwise isn't documented/visible anywhere - it might not always be exactly correct, but usually is fairly close.

llvm/lib/MC/MCStreamer.cpp
695

Initializing to -1 should be fine, this is pretty tightly checked to be called in pairs only.

I can change the name, but it isn't necessarily the index of one single frame, but the start index to a range of frames, from startindex to the end of the vector.

hans added a comment.Nov 23 2020, 7:40 AM

Thanks for the pointers!

llvm/lib/MC/MCStreamer.cpp
695

Maybe StartIndex, then? Or just leave as it is I suppose, it becomes clear when looking at how it's used.

mstorsjo added inline comments.Nov 23 2020, 7:42 AM
llvm/lib/MC/MCStreamer.cpp
695

Yeah, StartIndex sounds good to me, will change it to that form before pushing. Thanks!