This is an archive of the discontinued LLVM Phabricator instance.

[GlobalISel][IRTranslator] Split aggregates during IR translation
ClosedPublic

Authored by aemerson on Apr 24 2018, 8:54 AM.

Details

Summary

We currently handle all aggregates by creating one large LLT, and letting the legalizer deal with splitting them up. However using this approach means that we can't support big endian code correctly.

This patch changes the way that the IRTranslator deals with aggregate values, by splitting them up into their constituent element values. To do this, parts of the translator need to be modified to deal with multiple VRegs for a single Value.

A new Value to VReg mapper is introduced to help keep compile time under control, currently there is no measurable impact on CTMark despite the extra code being generated in some cases.

Patch is based on the original work of Tim Northover.

Diff Detail

Repository
rL LLVM

Event Timeline

aemerson created this revision.Apr 24 2018, 8:54 AM
aemerson edited the summary of this revision. (Show Details)Apr 24 2018, 8:57 AM
aemerson added a reviewer: rovka.
rtereshin added inline comments.Apr 26 2018, 8:39 PM
include/llvm/CodeGen/GlobalISel/IRTranslator.h
78

This looks like over 128 bytes per Value or more. How does memory consumption change with this patch?

If it ends up being a problem, we might reuse the MachineInstrs approach with storing machine memory operands. It's a similar pattern - in absolute majority of the cases we have just one to one (in case of value to vregs, or one to zero in case of the machine instruction to memory operands) mapping, but only sometimes its one to many.

aemerson added inline comments.Apr 29 2018, 2:33 PM
include/llvm/CodeGen/GlobalISel/IRTranslator.h
78

I haven't measured memory consumption, but changing the SmallVector capacities to 1 has no impact on the compile time on CTMark so I'll do that.

Can you be more specific about the MachineInstr thing? To which bit of code are you referring to? I think at capacities of 1 this should be fine.

rtereshin added inline comments.Apr 30 2018, 11:56 AM
include/llvm/CodeGen/GlobalISel/IRTranslator.h
78

I haven't measured memory consumption, but changing the SmallVector capacities to 1 has no impact on the compile time on CTMark so I'll do that.

Makes sense.

I came up with 128+ for bytes as follows:

  1. every SmallVector has 3 * sizeof(intptr_t) worth of overhead (size, capacity, pointer to the underlying array), looks like we have 2 of them per Value, so 6 intptr_ts of overhead.
  2. We also have 2 separate maps that in very best case consume 4 intptr_ts per Value on top of that, more likely about twice that if DenseMaps keep the load factor around 0.5, I didn't check that, but AFAIK it's a common value.
  3. The actual data consume 2 intptr_ts for 4 unsigneds and 4 intptr_ts for 4 uint64_t on 64-bit platforms.

Overall it's optimistically (10 + 6) * sizeof(intptr_t) = 128+ bytes per Value on 64-bit platforms.

Reducing the default SmallVector sizes to 1 changes only (3) to 1 (due to alignment) + 1 intptr_ts with overall memory consumption (10 + 2) * sizeof(intptr_t) = 96+ bytes per Value on 64-bit platforms.

If we put a vreg and its offset in the same pair and have a single SmallVector and a single map it will go down like follows:

  1. 3 intptr_ts of overhead per Value due to SmallVector
  2. 2+ intptr_ts of overhead per Value due to the pointer to pointer map.
  3. 2 intptr_t of actual payload assuming the default size of SmallVector being 1.

overall (5 + 2) * sizeof(intptr_t) = 56+ bytes per Value on 64-bit platforms.

If the memory consumption is not a problem, that should be more than enough I think.

The memory operand's trick I was referring to lives roughly here:
https://github.com/llvm-mirror/llvm/blob/2a793f6500a1a77cb7186549a6f9245bea847cf5/include/llvm/CodeGen/MachineInstr.h#L107-L114
https://github.com/llvm-mirror/llvm/blob/2a793f6500a1a77cb7186549a6f9245bea847cf5/lib/CodeGen/MachineInstr.cpp#L318-L332
https://github.com/llvm-mirror/llvm/blob/2a793f6500a1a77cb7186549a6f9245bea847cf5/include/llvm/CodeGen/MachineInstr.h#L1304-L1313

It won't save much here on top of the "just one SmallVector and one map" suggestion, basically just the capacity part of SmallVector, AFAICT, we definitely don't need it if the memory consumption is not an issue here.

All of this is just a mere suggestion, we can keep it as it is and worry about it later if it becomes an apparent problem, I think.

UPD. I took a look at DenseMap implementation, and it appears to me that the expected value of it's load factor is 9/16, so it takes approximately twice as much memory as estimated above. Therefore estimations change from 128 -> 96 -> 56 to 160 -> 128 -> 72.

aemerson added inline comments.May 8 2018, 6:49 AM
include/llvm/CodeGen/GlobalISel/IRTranslator.h
78

Thanks for the analysis. I don't think we can have just a single SmallVector because we need the ability to dynamically grow each set of vregs and/or offsets, so storing one SmallVector for each Value is better despite the extra memory cost in terms of the design/

Hi Amara,

I think I'm about half-way through, didn't touch the {pack,unpack}Regs functions and the whole story with interfacing with legacy call-lowering yet.

This is great work, thank you for doing this!

Roman

lib/CodeGen/GlobalISel/IRTranslator.cpp
121

Maybe this way it's just a bit more straightforward:

for (unsigned I = 0, E = STy->getNumElements(); I != E; ++I)
  computeValueLLTs(DL, *STy->getElementType(I), ValueTys, Offsets,
                   StartingOffset + SL->getElementOffset(I));

It's alright as it is, of course, just in case you would like the version above better yourself.

128

I think we're trying to stick to capitalized local variables' names, including loop induction variables.

157

Looks like technically for aggregate constants getOrCreateVRegs has worst case complexity of O(N^2), for constants like { i8 0, { i8 0, { i8 0, ... } } }, but this is probably not easy to avoid and such constants hopefully don't happen often.

173

This piece in general is not exactly trivial, perhaps it makes sense to add test cases having multiple deeply nested aggregate constants sharing some parts so getOrCreateVRegs would visit an actual non-tree DAG while traversing them.

176

VRegs->front() maybe

474

Same note as for insertvalue, do we really need to build these copies here?

508

Let's say we have a large aggregate that eventually maps to N vregs, and also N insertvalues each of which replaces just a single item (vreg). Will we end up generating N^2 COPYs for the IR of initial size O(N) here?

Granted, this probably happens very rarely in practice, but what if we don't getOrCreateVRegs(U), but rather just get(U) (no implicit MIR.createGenericRegister calls), resize, and assign either *InsertedIt++ or SrcRegs[i] to it directly w/o building any explicit COPYs at all? We're just remapping values back and forth here during translation, with always constant indices, so the whole thing could be copy-propagated on the fly avoiding quite a bit of MIR-churn.

rtereshin added inline comments.May 9 2018, 12:04 AM
include/llvm/CodeGen/GlobalISel/IRTranslator.h
78

I didn't mean the same SmallVector for all values in the function, I meant one SmallVector for each Value, just as you say, but one (of std::pair<unsigned, int64_t>s, for instance), instead of two per each value (for offsets and vregs themselves).

lib/CodeGen/GlobalISel/IRTranslator.cpp
467

For better or for worse, getIndexedOffsetInType returns int64_t, not uint64_t.

494

This logic of getting indices from values is quite repeated here and in extractvalue, do you think it makes sense to refactor it out as a separate function?

1275

This could be inlined as MIRBuilder.buildInstr(TargetOpcode::G_PHI, Reg);

aemerson added inline comments.May 10 2018, 1:23 PM
include/llvm/CodeGen/GlobalISel/IRTranslator.h
78

That's possible, I'd rather do it in a follow up patch.

lib/CodeGen/GlobalISel/IRTranslator.cpp
128

To me using a capital I normally indicates an iterator while a lower case 'i' is assumed to be an integer type, and we already use lower case 'i' idiomatically everywhere including GISel.

494

Could do.

508

I don't know why you mean by get(U). What would it return?

Also, I looked briefly into https://bugs.llvm.org/show_bug.cgi?id=37397 recently, and I tried to apply this patch and see what it would do. It generated considerably more code (final assembly) comparing to itself and FastISel both (I'd eyeball the difference as 3x) and didn't seem to fix the issue with initializing upper bits of a boolean value.

Generally, I feel concerned about the quality (runtime performance) of generated code with this approach.

include/llvm/CodeGen/GlobalISel/IRTranslator.h
78

Another thing to notice is that apparently all the offsets depend on the value's type, not the value itself, therefore they could be cached (separately from vregs this time) by using the value type as a key, not the value itself.

lib/CodeGen/GlobalISel/IRTranslator.cpp
128

@bogner It's a valid point:

> grep -Ern 'for.* i\s*=\s*\d' ./ --include='*.cpp' --include='*.h' --exclude-dir=build | wc -l
   17473
> grep -Ern 'for.* I\s*=\s*\d' ./ --include='*.cpp' --include='*.h' --exclude-dir=build | wc -l
    1617
508

It was supposed to be more like allocateVRegs(U). What I mean here is that let's say we only allocate the required number of vregs for U w/o initializing them with anything, and then just assign to a DstReg[i] either *Inserted++ or SrcRegs[i] depending on the offset, like this:

MutableArrayRef<unsigned> DstRegs = allocateVRegs(U); // `VMap.insertVRegs(U)` + initializing offsets
ArrayRef<uint64_t> DstOffsets = *VMap.getOffsets(U);  // getting just initialized offsets
ArrayRef<unsigned> SrcRegs = getOrCreateVRegs(*Src);
ArrayRef<unsigned> InsertedRegs = getOrCreateVRegs(*U.getOperand(1));
auto InsertedIt = InsertedRegs.begin();

for (unsigned i = 0; i < DstRegs.size(); ++i) {
  if (DstOffsets[i] >= Offset && InsertedIt != InsertedRegs.end())
    DstRegs[i] = *InsertedIt++;
  else
    DstRegs[i], SrcRegs[i];

avoiding building any copies.

Also, if the offsets are cached by value type (as suggested in a comment above), allocateVRegs(U) won't visit U at all, as Src is already processed (assuming we IR Translate top to bottom) and we have all the offsets cached for U->getType() already and U and Src have the same type.

bogner added inline comments.May 10 2018, 2:58 PM
lib/CodeGen/GlobalISel/IRTranslator.cpp
128

The coding standards certainly imply that variables should all be capitalized, though they don't call out loop variables explicitly:

http://llvm.org/docs/CodingStandards.html#name-types-functions-variables-and-enumerators-properly

I tend to stick to capitalizing in llvm here, if only because I find it very hard to talk about code if both "I" and "i" end up being in scope at the same time. In any case, if lower case "i" is meant to be canonical someone should really patch the coding standards to say so.

Also, I looked briefly into https://bugs.llvm.org/show_bug.cgi?id=37397 recently, and I tried to apply this patch and see what it would do. It generated considerably more code (final assembly) comparing to itself and FastISel both (I'd eyeball the difference as 3x) and didn't seem to fix the issue with initializing upper bits of a boolean value.

Generally, I feel concerned about the quality (runtime performance) of generated code with this approach.

My thoughts were that we'd take a compile time hit with this, although in the end it was mostly neutral, but the code size would regress as you saw. The idea was to have a pre-legalizer combiner (Aditya had a prototype demo of one to show how the combiner API would work). Either that, or we have a clean up phase at the end of IRTranslator where we eliminate G_INSERT and G_EXTRACT pairs, perhaps with some caching of potentially redundant sets of these instructions due to pack/unpack regs as translation is happening, and then a quick pass to eliminate them without searching the whole function.

To support my notes on performance and memory consumption side of things I have tried GlobalISel and FastISel / SelectionDAG ISel before and after this patch on the following test case

(~1k (1,000) lines of LLVM IR):

GlobalISel before: Total: 55 ms / 3.5k lines of assembly, IRTranslator: 0.5M / 0.5 ms / 3k of MIR lines, Legalizer: 40M / 30.5 ms / 137k of MIR lines
GlobalISel after: Total: 920 ms / 8k lines of assembly, IRTranslator: 315M / 250 ms / 2,110k of MIR lines, Legalizer: 0.06M / 12 ms / 2,110k of MIR lines
FastISel / Selection DAG ISel: Total: 7,400 ms / 4k lines of assembly, AArch64 Instruction Selection: 2M / 7,340 ms / 4k of MIR lines

In all cases instruction selection is the worst offender in terms of memory consumption.

It’s nice to see that FastISel doesn’t really manage to select that test case, it falls back to SelectionDAG ISel that operates in a -O0 mode and ends up working for much-much longer and still producing poor assembly. It would behave the same way for x86-64. If tried in -01 mode it would burn the same amount of time, but produce just 16 lines of assembly.

However, the memory consumption side of the story looks pretty bad for GlobalISel even before this patch, and looks much worse after. It could be noticed also, that memory consumption grows as O(N^2) with the size of the input LLVM IR.

I don’t think it’s acceptable.

I've made the changes to not generate copies for the insertvalue/extractvalue, but how did you measure the memory consumption of the individual passes? With the change it now produces 4k lines of assembly on your test case, vs 3.5k assembly. The MIR is much larger but that's because of the extra GEPs and G_CONSTANTs. Compile time is now comparable to without this change.

I've made the changes to not generate copies for the insertvalue/extractvalue

That's great! Thanks!

but how did you measure the memory consumption of the individual passes?

-time-passes option of llc has a neat modifier: -track-memory. llc -time-passes -track-memory will report the diff in memory footprint for each pass as a difference before and after (so it could be negative if a pass frees a lot of memory). My guess that it also means that it doesn't report the peak memory consumption if the peak happens mid-pass. Which is rather unfortunate. But I'm not sure about that one.

Another way to look at things on a grand scale - no individual passes data - is to do llc -time-passes -time-compilations=1000 (or so) so llc loops for a long time and just watch the memory data in top. top seems to be greatly unable to catch spikes in memory consumption as well, unfortunately, it's probably a sampling tool so it misses brief increases in memory footprint. Or it least it feels that way. Or I'm using it wrong. Please share any ideas about all of this.

With the change it now produces 4k lines of assembly on your test case, vs 3.5k assembly. The MIR is much larger but that's because of the extra GEPs and G_CONSTANTs. Compile time is now comparable to without this change.

Awesome, could you please update the diff here at Phabricator?

Thanks!
Roman

aemerson updated this revision to Diff 146704.May 14 2018, 3:34 PM

Thanks. -time-passes shows that IRTranslator is now using around 2MB. It's also taking most of the compile time vs other passes like your initial analysis showed, but we trade off the legalizer doing much less work.

I haven't run this version of the patch through much testing yet.

aemerson marked 6 inline comments as done.May 14 2018, 3:36 PM

Some review comments not addressed yet, will add a test requested in a later revision.

Thanks. -time-passes shows that IRTranslator is now using around 2MB. It's also taking most of the compile time vs other passes like your initial analysis showed, but we trade off the legalizer doing much less work.

I haven't run this version of the patch through much testing yet.

Looks like there is a higher precision way of measuring the memory consumption for the whole process on macOS: time -l. Given that with this specific test IRTranslator was allocating much more memory than anything else, I think it also makes sense to measure the entire llc compile from outside:

command time -l ./bin/llc -O0 -global-isel=true -global-isel-abort=2 -mtriple aarch64-- test.ll -o /dev/null -time-passes

And the results are great:

[GlobalISel Before]
49.5M / 68ms

[FastISel / SelectionDAG ISel]
57.3M / 7,250ms

[GlobalISel After / Original Patch]
337.7M / 740ms

[GlobalISel After / Updated Patch]
29.5M / 40ms

(memory figure is "maximum resident set size" as shown by time -l, the time figure is "User+System / Total" as shown by -time-passes)

I think it's a huge win and we can stop here as far as performance and memory consumption are concerned. Thank you for the update!

lib/CodeGen/GlobalISel/IRTranslator.cpp
409

Offsets[i] here needs to be in bytes as well.

413

Ditto, MinAlign expects everything in bytes as far as I can tell.

438

Ditto, MachinePointerInfo expects everything in bytes.

442

Ditto.

Hi Amara,

Alright, I'm done with the review.

General notes:

  1. there is insufficient tests coverage for cases like PHIs and nested / shared / repeated aggregate constants
  2. there is code that is likely to be dead
  3. pretty much every use-case of computeValueLLTs looks like a hack, tbh, maybe it could be helped if we actually start mapping Types instead of Values to offsets and LLTs.

None of it is a showstopper IMO though, please address what seems to be reasonable to get addressed in the initial patch and we will be good to go with this.

Thanks!
Roman

lib/CodeGen/GlobalISel/IRTranslator.cpp
154

Just Regs.resize(SplitTys.size()) may be a littler cleaner.

912

Indentation is off

965
unsigned Res = IsSplitType
    ? MRI->createGenericVirtualRegister(getLLTForType(*CI.getType(), *DL))
    : getOrCreateVReg(CI);

may work a little nicer here.

1254

It looks like getOrCreateVReg (singular) is already implemented so it could be used here.

1307

There are two is in scope here. Do we have a test covering this?

1324

This function is rather a hack, but I guess it's not a big deal.

1462–1469

I can't see how this condition could be true. And none of the tests fail if an assertion is inserted under this condition. If it is possible, could you please add a test? Thanks!

1481

could be replaced with range-based for.

1486

How this could be not true? As above, an assertion in in else branch shows VRegs is always empty here as far as the existing tests' coverage go.

1490

ArgIt is incremented in both branches, maybe better to put it at the end of the loop.

rtereshin added inline comments.May 15 2018, 2:11 AM
lib/CodeGen/GlobalISel/IRTranslator.cpp
1307

I think I've got a couple of tests for PHIs. They both seem to translate correctly:

test_phi_diamond.ll
test_phi_loop.ll

The test_phi_loop.ll one translates rather beautifully IMO:

body:             |
  bb.1.entry:
    liveins: $w0

    %0:_(s32) = COPY $w0
    %5:_(s32) = G_CONSTANT i32 1
    %7:_(s32) = G_CONSTANT i32 0
    %9:_(s64) = G_CONSTANT i64 0
    %10:_(s64) = G_CONSTANT i64 1

  bb.2.loop:
    %1:_(s32) = G_PHI %0(s32), %bb.1, %6(s32), %bb.2
    %2:_(s64) = G_PHI %9(s64), %bb.1, %3(s64), %bb.2
    %3:_(s64) = G_PHI %10(s64), %bb.1, %4(s64), %bb.2
    %4:_(s64) = G_ADD %2, %3
    %6:_(s32) = G_SUB %1, %5
    %8:_(s1) = G_ICMP intpred(sle), %1(s32), %7
    G_BRCOND %8(s1), %bb.3
    G_BR %bb.2

  bb.3.exit:
    $x0 = COPY %2(s64)
    RET_ReallyLR implicit $x0

Nested / shared / repeated aggregate constants & multi-level index tests: test_cons.ll - also seems to be translated correctly:

./bin/llc -O0 -global-isel=true -global-isel-abort=2 -mtriple aarch64--  -stop-after instruction-select test_cons.ll -o - -simplify-mir -verify-machineinstrs
body:             |
  bb.1.entry:
    liveins: $x0

    %0:gpr64sp = COPY $x0
    %34:gpr32 = MOVi32imm 10
    %33:gpr32 = MOVi32imm 20
    %32:gpr32 = MOVi32imm 50
    STRBBui %34, %0, 0 :: (store 1 into %ir.dst)
    STRBBui %33, %0, 1 :: (store 1 into %ir.dst + 1)
    STRBBui %34, %0, 2 :: (store 1 into %ir.dst + 2)
    STRBBui %33, %0, 3 :: (store 1 into %ir.dst + 3)
    STRBBui %32, %0, 4 :: (store 1 into %ir.dst + 4)
    STRBBui %34, %0, 5 :: (store 1 into %ir.dst + 5)
    STRBBui %33, %0, 6 :: (store 1 into %ir.dst + 6)
    STRBBui %33, %0, 7 :: (store 1 into %ir.dst + 7)
    STRBBui %34, %0, 0 :: (store 1 into %ir.dst)
    STRBBui %33, %0, 1 :: (store 1 into %ir.dst + 1)
    STRBBui %34, %0, 2 :: (store 1 into %ir.dst + 2)
    STRBBui %33, %0, 3 :: (store 1 into %ir.dst + 3)
    STRBBui %33, %0, 4 :: (store 1 into %ir.dst + 4)
    STRBBui %34, %0, 5 :: (store 1 into %ir.dst + 5)
    STRBBui %33, %0, 6 :: (store 1 into %ir.dst + 6)
    STRBBui %33, %0, 7 :: (store 1 into %ir.dst + 7)
    RET_ReallyLR

(this is already changed to pass values in bytes (instead of bits) as offset and align parameters to pointer info of memory operands)

aemerson updated this revision to Diff 146938.May 15 2018, 3:29 PM

I think I've addressed most of the issues. I've tried to clean up the computeValueLLTs function a little bit, still perhaps not ideal. Dead code has been removed (I think an earlier revision of the patch required it for the tests to pass). Also added the tests you wrote, thanks.

rtereshin accepted this revision.May 15 2018, 3:39 PM

I think I've addressed most of the issues. I've tried to clean up the computeValueLLTs function a little bit, still perhaps not ideal. Dead code has been removed (I think an earlier revision of the patch required it for the tests to pass). Also added the tests you wrote, thanks.

Hi Amara,

I think it's ready to get in, thank you for working with me on this one!

Roman

This revision is now accepted and ready to land.May 15 2018, 3:39 PM
This revision was automatically updated to reflect the committed changes.