This is an archive of the discontinued LLVM Phabricator instance.

[GlobalISel] Combine (a[0]) | (a[1] << k1) | ...| (a[m] << kn) into a wide load
ClosedPublic

Authored by paquette on Jan 8 2021, 4:05 PM.

Details

Summary

This is a restricted version of the combine in DAGCombiner::MatchLoadCombine.
(See D27861)

This tries to recognize patterns like below (assuming a little-endian target):

s8* x = ...
s32 val = a[0] | (a[1] << 8) | (a[2] << 16) | (a[3] << 24)
->
s32 val = *((i32)a)

s8* x = ...
s32 val = a[3] | (a[2] << 8) | (a[1] << 16) | (a[0] << 24)
->
s32 val = BSWAP(*((s32)a))

(This patch also handles the big-endian target case as well, in which the first example above has a BSWAP, and the second example above does not.)

To recognize the pattern, this searches from the last G_OR in the expression
tree.

E.g.

Reg    Reg
   \    /
    OR_1     Reg
        \    /
        OR_2
            \   Reg
             .. /
            OR_Root

Each non-OR register in the tree is put in a list. Each register in the list is then checked to see if it's an appropriate load + shift logic.

If every register is a load + potentially a shift, the combine checks if those loads + shifts, when OR'd together, are equivalent to a wide load (possibly with a BSWAP.)

To simplify things, this patch

(1) Only handles G_ZEXTLOADs (which appear to be the common case)
(2) Only works in a single MachineBasicBlock
(3) Only handles G_SHL as the bit twiddling to stick the small load into a specific location

An IR example of this is here: https://godbolt.org/z/4sP9Pj (lifted from test/CodeGen/AArch64/load-combine.ll)

At -Os on AArch64, this is a 0.5% code size improvement for CTMark/sqlite3, and a 0.4% improvement for CTMark/7zip-benchmark.

Diff Detail

Event Timeline

paquette created this revision.Jan 8 2021, 4:05 PM
paquette requested review of this revision.Jan 8 2021, 4:05 PM
Herald added a project: Restricted Project. · View Herald TranscriptJan 8 2021, 4:05 PM
Herald added a subscriber: wdng. · View Herald Transcript

Thanks for tackling this. The logic seems sound to me, but I have some efficiency concerns. (Note, this is conjecture, I don't know if this is realistically a big problem)

G_OR trees merging data are probably fairly common. This algorithm tries to match an entire tree of them with their load/shifted operands starting from the root. The two issues I see is that:

  1. If the actual legal load+OR tree is a subtree of the root node, we essentially do the work twice. If the combine does fire then this isn't a major issue since we do end up getting a nice code improvement from it.
  2. If some part of the tree invalidates the combine, such as a leaf instruction having an invalid shift, then in the "worst-case" flattened tree list, we'll do (1/2)N^2 combine attempts until we finally exhaust all the G_OR instructions.

These aren't probably easy issues to solve without some intrusive changes. The former would need you to change the algorithm to go top-down and find the maximal mergeable subtree. The latter doesn't seem possible without also having some way to find which part of the tree is invalid.

I think 2 things would be good to know before going ahead with this patch. The first is compile time impacts if any on -O0 and maybe -Os too. The second is if you can add some debugging code to check how often we find tree candidates of say, more than 2 or 3 G_ORs, before *failing* to match.

llvm/lib/CodeGen/GlobalISel/CombinerHelper.cpp
3528

Here's some measurements:

                     O0 compile   O0 code       #Missed   #Missed    Avg regs/      | Os compile   Os code      #Missed   #Missed    Avg regs/
                     time change  size change   combines  (> 1 reg)  missed combine | time change  size change  combines  (> 1 reg)  missed combine
bullet               0.25%         0.00%        22        4          0.5            |  0.56%       0.00%        138       14         0.3
kc                   0.05%         0.00%        2         0          0.0            |  0.76%       0.00%        66        0          0.0
lencod              -0.18%         0.00%        89        20         0.6            |  0.31%       0.00%        508       34         0.2
SPASS                0.09%         0.00%        1         1          3.0            |  0.37%       0.00%        146       4          0.0
consumer-typeset    -0.27%         0.00%        369       52         0.3            |  0.17%       0.00%        625       61         0.2
clamscan            -0.09%         0.00%        179       78         1.3            |  0.21%       0.00%        493       49         0.3
sqlite3             -0.48%         0.00%        185       29         0.4            | -0.18%      -0.50%        514       91         0.6
tramp3d-v4           0.14%         0.00%        0         0          0.0            |  0.52%       0.00%        253       2          0.0
pairlocalalign      -0.04%         0.00%        0         0          0.0            |  0.49%       0.00%        54        48         2.7
7zip-benchmark       0.12%        -0.10%        336       273        2.5            |  0.75%      -0.40%        565       268        1.6
                                                                                    |
Average             -0.04%        -0.01%        118.30    45.7       0.87           |  0.40%      -0.09%        336       57         0.6
  • Compile time was measured using a single thread with 3 samples. Comparison is between geomeans of the samples.
  • # Missed combines is the total times that the combine failed
  • # Missed (>1 reg) is the number of times the combine failed with at least one register in RegsToVisit.
  • Avg regs/missed combine is the number of registers in RegsToVisit when the combine misses

Note that a lot of the time, the combine does miss with 0 registers (because you just don't find anything with a single use, for example).

Here's some measurements:

                     O0 compile   O0 code       #Missed   #Missed    Avg regs/      | Os compile   Os code      #Missed   #Missed    Avg regs/
                     time change  size change   combines  (> 1 reg)  missed combine | time change  size change  combines  (> 1 reg)  missed combine
bullet               0.25%         0.00%        22        4          0.5            |  0.56%       0.00%        138       14         0.3
kc                   0.05%         0.00%        2         0          0.0            |  0.76%       0.00%        66        0          0.0
lencod              -0.18%         0.00%        89        20         0.6            |  0.31%       0.00%        508       34         0.2
SPASS                0.09%         0.00%        1         1          3.0            |  0.37%       0.00%        146       4          0.0
consumer-typeset    -0.27%         0.00%        369       52         0.3            |  0.17%       0.00%        625       61         0.2
clamscan            -0.09%         0.00%        179       78         1.3            |  0.21%       0.00%        493       49         0.3
sqlite3             -0.48%         0.00%        185       29         0.4            | -0.18%      -0.50%        514       91         0.6
tramp3d-v4           0.14%         0.00%        0         0          0.0            |  0.52%       0.00%        253       2          0.0
pairlocalalign      -0.04%         0.00%        0         0          0.0            |  0.49%       0.00%        54        48         2.7
7zip-benchmark       0.12%        -0.10%        336       273        2.5            |  0.75%      -0.40%        565       268        1.6
                                                                                    |
Average             -0.04%        -0.01%        118.30    45.7       0.87           |  0.40%      -0.09%        336       57         0.6
  • Compile time was measured using a single thread with 3 samples. Comparison is between geomeans of the samples.
  • # Missed combines is the total times that the combine failed
  • # Missed (>1 reg) is the number of times the combine failed with at least one register in RegsToVisit.

Is this meant to say *greater* than 1?

  • Avg regs/missed combine is the number of registers in RegsToVisit when the combine misses

Note that a lot of the time, the combine does miss with 0 registers (because you just don't find anything with a single use, for example).

  • Avg regs/missed combine is the number of registers in RegsToVisit when the combine misses

Does this metric count the misses where there were no registers?

  • Avg regs/missed combine is the number of registers in RegsToVisit when the combine misses

Does this metric count the misses where there were no registers?

Nope.

Added below in brackets under Avg regs/#missed combine:

                                                                     Avg regs/  |                                                Avg regs/
                     O0 compile   O0 code       #Missed   #Missed    missed     | Os compile   Os code       #Missed   #Missed   missed
                     time change  size change   combines  (> 1 reg)  combine    | time change  size change  combines  (> 1 reg)  combine
bullet               0.25%         0.00%        22        4          0.5 (3.0)  |  0.56%       0.00%        138       14         0.3 (2.9)
kc                   0.05%         0.00%        2         0          0.0 (0.0)  |  0.76%       0.00%        66        0          0.0 (0.0)
lencod              -0.18%         0.00%        89        20         0.6 (2.8)  |  0.31%       0.00%        508       34         0.2 (2.3)
SPASS                0.09%         0.00%        1         1          3.0 (3.0)  |  0.37%       0.00%        146       4          0.0 (1.0)
consumer-typeset    -0.27%         0.00%        369       52         0.3 (2.0)  |  0.17%       0.00%        625       61         0.2 (2.2)
clamscan            -0.09%         0.00%        179       78         1.3 (3.1)  |  0.21%       0.00%        493       49         0.3 (3.0)
sqlite3             -0.48%         0.00%        185       29         0.4 (2.7)  | -0.18%      -0.50%        514       91         0.6 (3.2)
tramp3d-v4           0.14%         0.00%        0         0          0.0 (0.0)  |  0.52%       0.00%        253       2          0.0 (3.0)
pairlocalalign      -0.04%         0.00%        0         0          0.0 (0.0)  |  0.49%       0.00%        54        48         2.7 (3.0)
7zip-benchmark       0.12%        -0.10%        336       273        2.5 (3.1)  |  0.75%      -0.40%        565       268        1.6 (3.3)
                                                                                |
Average             -0.04%        -0.01%        118.30    45.7       0.87 (2.0) |  0.40%      -0.09%        336       57         0.6 (2.4)
paquette updated this revision to Diff 316256.Jan 12 2021, 3:28 PM

Fix a bug in isPredecessor exposed by this patch and add a testcase for it. It wasn't able to handle the case where DefMI is the first instruction in the block.

I don't think it can be exposed by findPostIndexCandidate, so I guess I'll just fix it here where it's testable.

We also need to check for the strict-align function attribute before we generate wider loads.

llvm/lib/CodeGen/GlobalISel/CombinerHelper.cpp
3291

Nit: I think you can use m_OneUse() on the load and remove the check below, but you may need to add a variant for m_OneNonDbgUse().

3300

I think G_ZEXTLOADs are the only valid loads we can match here? Otherwise we can't merge the values with G_OR.

paquette added inline comments.Jan 13 2021, 11:27 AM
llvm/lib/CodeGen/GlobalISel/CombinerHelper.cpp
3300

We can have a G_SEXTLOAD when it occupies the highest part of the result:

https://godbolt.org/z/1M11s7

Although I'm not sure how often that case comes up.

We also need to check for the strict-align function attribute before we generate wider loads.

Shouldn't the legalizer check alignment requirements on the load? Or is this something different?

paquette updated this revision to Diff 316738.Jan 14 2021, 12:12 PM

Use m_OneNonDBGUse (D94705)

We also need to check for the strict-align function attribute before we generate wider loads.

Shouldn't the legalizer check alignment requirements on the load? Or is this something different?

Each pass has to respect the semantics though. At the moment we fall back in the translator for strict-align, but when we do add support places like this will need to correct. Unless you add alignment checks for this combine, this might create unaligned wide loads which are illegal with strict-align, since we’re not allowed to assume that unaligned loads are safe.

paquette updated this revision to Diff 317110.Jan 15 2021, 4:25 PM
  • Add a LLT variant of allowsMemoryAccess. This is the same as the DAGCombiner check.
  • Add a testcase that shows that when we have strict-align, the combine only fires when the resulting load will be aligned.
This revision is now accepted and ready to land.Jan 16 2021, 11:59 AM
foad added a subscriber: foad.Jan 19 2021, 10:27 AM