Page MenuHomePhabricator

[DAGCombine] Match a pattern where a wide type scalar value is stored by several narrow stores

Authored by steven.zhang on May 12 2019, 8:58 PM.



This opportunity is found from spec 2017 557.xz_r. And it is used by the sha encrypt/decrypt. See sha-2/sha512.c

static void store64(u64 x, unsigned char* y)
    for(int i = 0; i != 8; ++i)
        y[i] = (x >> ((7-i) * 8)) & 255;

static u64 load64(const unsigned char* y)
    u64 res = 0;
    for(int i = 0; i != 8; ++i)
        res |= (u64)(y[i]) << ((7-i) * 8);
    return res;

The load64 has been implemented by
This patch is trying to implement the store pattern.

Match a pattern where a wide type scalar value is stored by several narrow
stores. Fold it into a single store or a BSWAP and a store if the targets
supports it.

Assuming little endian target:
i8 *p = ...
i32 val = ...
p[0] = (val >> 0) & 0xFF;
p[1] = (val >> 8) & 0xFF;
p[2] = (val >> 16) & 0xFF;
p[3] = (val >> 24) & 0xFF;


*((i32)p) = val;

i8 *p = ...
i32 val = ...
p[0] = (val >> 24) & 0xFF;
p[1] = (val >> 16) & 0xFF;
p[2] = (val >> 8) & 0xFF;
p[3] = (val >> 0) & 0xFF;


*((i32)p) = BSWAP(val);

Diff Detail


Event Timeline

steven.zhang created this revision.May 12 2019, 8:58 PM
Herald added a project: Restricted Project. · View Herald TranscriptMay 12 2019, 8:58 PM
Herald added a subscriber: hiraditya. · View Herald Transcript
lebedev.ri added a subscriber: lebedev.ri.

This looks like a middle-end optimization problem to me.
Final i64 is certainly a legal type as per datalayout, and that IR certainly doesn't loop optimal to me.

That was discussed widely when is reviewed. This is the commit log saying something about the delay.

This optimization was discussed on llvm-dev some time ago in "Load combine pass" thread. We came to the conclusion that we want to do
this transformation late in the pipeline because in presence of atomic loads load widening is irreversible transformation and it might hinder other optimizations.
Eventually we'd like to support folding patterns like this where the offset has a variable and a constant part:
i32 val = a[i] | (a[i + 1] << 8) | (a[i + 2] << 16) | (a[i + 3] << 24)

Matching the pattern above is easier at SelectionDAG level since address reassociation has already happened and the fact that the loads are adjacent is clear. Understanding that these loads are adjacent at IR level would have involved looking through geps/zexts/adds while looking at the addresses.

The general scheme is to match OR expressions by recursively calculating the origin of individual bytes which constitute the resulting OR value. If all the OR bytes come from memory verify that they are adjacent and match with little or big endian encoding of a wider value. If so and the load of the wider type (and bswap if needed) is allowed by the target generate a load and a bswap if needed.

Gentle ping ... Thank you!

RKSimon added inline comments.May 21 2019, 5:27 AM
3 ↗(On Diff #199199)

The change to the form of test needs committing beforehand, assuming the systemz guys are happy with it. Then this patch should just show any delta in the IR.

steven.zhang marked an inline comment as done.May 21 2019, 7:03 PM
steven.zhang added inline comments.
3 ↗(On Diff #199199)

ok. I will commit another patch for this and let you know if they accept it. is created to update the test llvm/test/CodeGen/SystemZ/codegenprepare-splitstore.ll

Have you considered reusing calculateByteProvider mechanism from load combining instead of matching an exact pattern?

6219–6233 ↗(On Diff #199199)

Can you add a comment outlining the pattern you are looking for? It would be easier to grasp from a short comment than from the implementation.

6251 ↗(On Diff #199199)

Typo "small" -> "smaller"

6318 ↗(On Diff #199199)

What happens to the individual stores after? Do you rely on other DAG combine rules to remove them?

steven.zhang marked 2 inline comments as done.EditedMay 29 2019, 7:22 PM

Yes, I tried but decide not to reuse it. Because, the load pattern is trying to find out the load sequences that they are SHIFT and OR together, so, they have to walk the tree recursively to collect all the loads. But the store pattern is much easier. Because, We already know the stores from the chain. What we need to do is to check if all the store values are from some fixed pattern.

6219–6233 ↗(On Diff #199199)


6318 ↗(On Diff #199199)


apilipenko accepted this revision.May 30 2019, 12:16 PM


6318 ↗(On Diff #199199)

Can you add a comment saying that?

This revision is now accepted and ready to land.May 30 2019, 12:16 PM
steven.zhang marked an inline comment as done.

Address comments.

I will hold this patch until the test case update for systemZ is accepted. D62370: [NFC] Check the endianness after the CodeGenPrepare.

6318 ↗(On Diff #199199)


This revision was automatically updated to reflect the committed changes.