This is an archive of the discontinued LLVM Phabricator instance.

CodeGen: Fix PR40605 by splitting constant struct initializers
ClosedPublic

Authored by glider on Feb 7 2019, 8:13 AM.

Details

Reviewers
jfb
Summary

When emitting initializers for local structures for code built with
-ftrivial-auto-var-init, replace constant structures with sequences of
stores.

This appears to greatly help removing dead initialization stores to those
locals that are later overwritten by other data.
This also removes a lot of .rodata constants (see PR40605), replacing most
of them with immediate values (for Linux kernel the .rodata size is
reduced by ~1.9%)

Diff Detail

Event Timeline

glider created this revision.Feb 7 2019, 8:13 AM
jfb added a comment.Feb 7 2019, 12:45 PM

Thanks for taking this on! As I said in the bug I think we definitely want something like this change.

Can you add a link to bug 40605 in the commit message, as well as the size results you quoted there?

You should have some tests that show the resulting difference in code. I'm especially interested in seeing empty structs, large structs, nested structs, unions, arrays.

Maybe a separate change: I think you should do the same thing for small arrays.

tools/clang/lib/CodeGen/CGDecl.cpp
1099 ↗(On Diff #185768)

If you do keep this forInit, you should use an enum class (instead of a bool) instead to be self-documenting.

1143 ↗(On Diff #185768)

I think you need a heuristic here, where you don't emit multiple stores if getNumOperands is greater than some number. In that case you'd keep doing memcpy instead. Note that a struct containing sub-structs (or arrays) should also count the number of sub-structs (so just checking getNumOperands isn't sufficient).

Other parts of this code chose 6 stores...

1765 ↗(On Diff #185768)

Can you explain why this case needs to be different? IIUC it's because the types can differ which makes the struct's elements not match up the ones from the constant in the loop above? I think the code above should automatically handle that case instead of passing in a flag here.

You also definitely need to test the case where that happens :)

glider marked 2 inline comments as done.Feb 8 2019, 2:27 AM
glider added inline comments.
tools/clang/lib/CodeGen/CGDecl.cpp
1143 ↗(On Diff #185768)

Do we ever need to not split these stores? Can't the MemCpyOpt pass take care of them later on?

1765 ↗(On Diff #185768)

I just thought we shouldn't break the constants that we didn't create with -ftrivial-var-auto-init.
I'll take a closer look though (indeed, this crashes on one of the structs from the Linux kernel)

glider marked 2 inline comments as not done.Feb 8 2019, 2:36 AM

Sorry, didn't mean to mark these done.

glider updated this revision to Diff 185990.Feb 8 2019, 9:31 AM
glider retitled this revision from [RFC] Split constant structures generated by -ftrivial-auto-var-init when emitting initializators to CodeGen: Fix PR40605: split constant structures generated by -ftrivial-auto-var-init when emitting initializators.
glider edited the summary of this revision. (Show Details)

Updated the patch: instead of passing around forInit, match the constant type to pointer type.

I'm not quite sure how to show the resulting difference in code. Do you mean we need Clang to support both modes and to compare the resulting assembly?

glider added a comment.Feb 8 2019, 9:58 AM

Re: case when STy != Loc.getElementType() - this is already covered by other Clang tests.

I'm still unsure about the heuristic here. I believe that for auto-initialization we want to be quite aggressive with these splits (unlike for regular constant stores).
Perhaps we should do the split in the case all bytes are pattern bytes? (This is probably another use case for forInit)

jfb added a comment.Feb 8 2019, 10:04 AM

Can you add a link to bug 40605 in the commit message?

I'm not quite sure how to show the resulting difference in code. Do you mean we need Clang to support both modes and to compare the resulting assembly?

I only meant tests that show codegen, as you've now added auto-var-init.cpp. There might be interesting cases which I wasn't testing when I wrote it, so if you think of any please do make sure you add some. You mentioned a Linux struct that used to crash? That would be a useful test (I imagine it's one with the Loc adjustment).

tools/clang/lib/CodeGen/CGDecl.cpp
1143 ↗(On Diff #185768)

You're assuming the optimizer is running at all :-)
In general your approach seems to care about -O2, and I agree that's usually what we want to tune. However, clang also support -O0 which won't touch these stores at all, and in -O0 it's usually nice to just see a copy from a global while debugging. Further, configurations such as -Os and -Oz would be hampered by excessive codegen.

You also need to consider the compile-time hit the expansion you're making would add. More code means more time to compile, and I'd expect little to no win from emitting more than a handful of stores. If someone has a large thing on the stack, and it can be optimized, then we should teach the optimizer to deal with memcpy better. I don't think e.g. a 400+ byte struct makes sense to initialize with a bunch of store.

So I think you want to check what the size of the struct is (check out when x86-64 and ARM64 inline memcpy / memset, I assume those are decent guesses for size to inline).

Further, your current code sill do something silly for code like this:

struct S { int j; char arr[8]; float f; };

Before you'd have a single memcpy, but now it'll be a store, a memset, and a store. I think you really want to have the same heuristic for arrays (so you'd have just stores), and you want to make sure that if you're going to split up a struct you do so for all the fields inside that struct.

1765 ↗(On Diff #185768)

Yeah I think it's from the Loc adjustment above. I think your change is worth doing for all cases, not just trivial var auto-init. It'll lead us to optimize all of it much better IMO (and support things like -O0 -Os and -Oz uniformly well too).

tools/clang/test/CodeGenCXX/auto-var-init.cpp
37 ↗(On Diff #185990)

How come @__const.test_small_custom.custom still gets emitted? I guess custom initialization goes through a different code path?

496 ↗(On Diff #185990)

I wonder if (maybe in a separate patch) you also want to avoid memset when something is pretty small.

664 ↗(On Diff #185990)

This one is particularly silly since it's a zero-sized memset :)

Can you add a link to bug 40605 in the commit message?

It's in the title now, doesn't that count? :)

jfb added a comment.Feb 8 2019, 10:22 AM

Can you add a link to bug 40605 in the commit message?

It's in the title now, doesn't that count? :)

Oh OK I hadn't seen that.

glider marked 4 inline comments as done.Feb 12 2019, 7:33 AM
glider added inline comments.
tools/clang/test/CodeGenCXX/auto-var-init.cpp
496 ↗(On Diff #185990)

If I'm understanding your request correctly, the same code also handles the memset() case for zero-initialization.

glider updated this revision to Diff 186469.Feb 12 2019, 7:47 AM
glider marked an inline comment as done.
glider added a subscriber: pcc.

Added a helper function to decide whether we want to break the structure or not.
Right now it only checks for optimization level and structure size.

The constant 64 is picked to match the cacheline size. Other interesting numbers I've considered were:

  • 160 - a size of a kernel structure for which the initialization was inlined without any noticeable profit (no stores to that uninit structure in the same function)
  • 256 - the maximum size for an inlined memset/memcpy with -mno-sse
  • 512 - the maximum size for an inlined memset/memcpy with SSE enabled/

I don't think we have enough data to back any of the choices though.

I've also fixed the code to skip zero-size memset() calls.

This patch shaved off 2.5% of performance slowdown on https://github.com/kamalmarhubi/linux-ipc-benchmarks/blob/master/af_inet_loopback.c on a kernel built in pattern-initialization mode.
The .rodata size went down by only 12K (0.5%, since we now split only structures smaller than 64 bytes)
The .text size went down by 19K (0.4%)

jfb added a comment.Feb 12 2019, 9:32 AM

Overall this LGTM besides a few nits, and wanting input from @rjmccall.

As follow-ups (which I can take on):

  • Handle the case where the types don't match (because Loc was adjusted).
  • Handle small arrays (and any other cases where the struct stores get broken down into components which still contain a memcpy/ memset).
tools/clang/lib/CodeGen/CGDecl.cpp
975 ↗(On Diff #186469)

You don't use Init in the function.

979 ↗(On Diff #186469)

The general 64-byte heuristic is fine with me. It's just a bit weird to special-case -O0, but we do it elsewhere too (and it keeps the current status-quo of letting the backend decide what to do). From that perspective I'm fine with it.

@rjmccall do you have a strong preference either way?

One small nit: can you use ByteSize instead of just Size? Makes it easier to figure out what's going on in code IMO.

tools/clang/test/CodeGenCXX/auto-var-init.cpp
476 ↗(On Diff #186469)

The tests aren't matching labels anymore: PATTERN-LABEL is a dead label check now. I think forking things at -O0 and -O1 is fine, but I think you want a small -O0 test (separate from this patch) which does one or two things, and then you want this test to only look at -O1. That way you don't need so much stuff changing in the test (and -O0 is still tested).

glider updated this revision to Diff 186605.Feb 13 2019, 1:53 AM
glider marked 4 inline comments as done.

Addressed the comments.

tools/clang/lib/CodeGen/CGDecl.cpp
979 ↗(On Diff #186469)

I don't have a strong opinion on variable naming, but this:

974 static bool shouldSplitStructStore(CodeGenModule &CGM,
975                                    uint64_t GlobalByteSize) {
976   // Don't break structures that occupy more than one cacheline.
977   uint64_t ByteSizeLimit = 64;
978   if (CGM.getCodeGenOpts().OptimizationLevel == 0)
979     return false;
980   if (GlobalByteSize <= ByteSizeLimit)
981     return true;
982   return false;

looks a bit more heavyweight than the previous function, in which we also have GlobalSize and SizeLimit.

tools/clang/test/CodeGenCXX/auto-var-init.cpp
476 ↗(On Diff #186469)

Why? We're passing both PATTERN and PATTERN-OX to FileCheck when testing.
I've checked that replacing @test_empty_uninit() on this line with a different function name makes the test fail.

Well, now we somewhat break padding initialization.

E.g. for the following struct:

struct s { 
  int a;
  char b;
  long int c;
};

we generate the following constant initializer with -O0:

.L__const.foo.local:
        .long   2863311530              # 0xaaaaaaaa
        .byte   170                     # 0xaa
        .zero   3   
        .quad   -6148914691236517206    # 0xaaaaaaaaaaaaaaaa
        .size   .L__const.foo.local, 16

, which overwrites the padding bytes on stack, but with a wrong constant.

OTOH with -O1 (i.e. with my patch) we generate the following sequence of stores:

movl    $-1431655766, 8(%rsp)   # imm = 0xAAAAAAAA
movb    $-86, 12(%rsp)
movabsq $-6148914691236517206, %rax # imm = 0xAAAAAAAAAAAAAAAA
movq    %rax, 16(%rsp)

... which happily skips the padding.

It occurs to me that we need to properly handle the padding in patternFor before we'll be able to split the structures.

rjmccall added inline comments.Feb 13 2019, 8:46 AM
tools/clang/lib/CodeGen/CGDecl.cpp
979 ↗(On Diff #186469)

What are the cases we actually want to make sure we emit as separate stores? Basically just really small types that happen to be wrapped up in several layers of struct for abstraction purposes, right? Maybe this heuristic should primarily consider element counts (and types) and use overall sizes at the top level.

jfb added a comment.Feb 13 2019, 8:59 AM

... which happily skips the padding.

I don't think padding is an issue right now. It's valid to either initialize it or not. It is slightly unfortunate to lose the information about padding (which the struct approach keeps). One alternative could be to store a struct when it has padding (i.e. not two stores of char+int, but a single store struct of char+int). From what I've seen the optimization pipeline doesn't do as well with that type of stores.

You might instead just choose to not do anything in this patch if the struct has padding. We can do something later.

FWIW I intend to write a separate patch which lets you opt in to padding always being set. I'll be touching this code anyways.

It occurs to me that we need to properly handle the padding in patternFor before we'll be able to split the structures.

patternFor handles padding already: it treats it as any value, and chooses something adjacent.

tools/clang/lib/CodeGen/CGDecl.cpp
979 ↗(On Diff #186469)

Yes, small types.

FWIW I think I've almost finished handling the padding: https://reviews.llvm.org/D58188
I haven't checked whether it works correctly with padding in custom initializers (like struct s local = {1, 2}), but TEST_UNINIT and TEST_BRACES are covered already.

I think (with test updates) this will be good to go once D58188 lands.

tools/clang/lib/CodeGen/CGDecl.cpp
1158 ↗(On Diff #186605)

Can you leave a FIXME here to handle STY != Loc.getElementType(), as well as another FIXME to handle non-struct aggregate types such as arrays? I'm happy to do them as a follow-up.

glider updated this revision to Diff 188548.Feb 27 2019, 8:15 AM
glider marked an inline comment as done.
glider retitled this revision from CodeGen: Fix PR40605: split constant structures generated by -ftrivial-auto-var-init when emitting initializators to CodeGen: Fix PR40605.
glider edited the summary of this revision. (Show Details)

Rebased the code, updated the tests, added FIXME.

jfb added a comment.Feb 27 2019, 11:01 AM

Can you change the title to be more descriptive?

Small test fixes, otherwise this LGTM, will check box after update.

clang/test/CodeGenCXX/auto-var-init.cpp
123

-NOT is in the wrong place above.

130

-NOT is in the wrong place above.

glider updated this revision to Diff 188694.Feb 28 2019, 2:48 AM
glider marked 3 inline comments as done.
glider retitled this revision from CodeGen: Fix PR40605 to CodeGen: Fix PR40605 by splitting constant struct initializers.
glider added inline comments.
clang/test/CodeGenCXX/auto-var-init.cpp
130

Hm, I wonder if lit could automatically check that all the patterns in the file are used in RUN lines.
(Probably no, it's hard to tell if an all-caps word is a pattern or just a comment)

jfb accepted this revision.Feb 28 2019, 4:38 PM

LGTM, thanks!

This revision is now accepted and ready to land.Feb 28 2019, 4:38 PM
glider closed this revision.Mar 1 2019, 1:00 AM

Landed r355181, thank you!