Page MenuHomePhabricator

[ARM] Fix conditions for lowering to S[LR]I

Authored by PetreTudor on Apr 3 2020, 6:21 AM.



Fixed wrong conditions for generating (S[LR]I X, Y, C2) from
(or (and X, BvecC1), (lsl Y, C2)). The optimisation is also
enabled by default now.

Diff Detail

Event Timeline

PetreTudor created this revision.Apr 3 2020, 6:21 AM

Looks good.


I know you didn't write this part, but it's generally not a great idea to lower via intrinsics if it can be helped.

Can you add a ISel node, a lot like ARMISD::VSHLIMM, and plumb that through tablegen instead?

PetreTudor added inline comments.Apr 3 2020, 7:30 AM

Sure! I will start working on this right away.

Now lowering to S[LR]I via TableGen + ISel nodes.

PetreTudor marked 2 inline comments as done.Apr 14 2020, 7:03 AM

Fix styling problem caused by linter (inconsistent with local style).

dmgreen accepted this revision.Apr 14 2020, 10:47 AM

Thanks. LGTM.


You could possibly use uint64_t ShiftAmount = Op.getConstantOperandVal(3), simplify it a little. It won't handle the report_fatal_error, but this should have already been checked at a higher level, somewhere in clang to make sure the shift amount is a constant.

Same for the ElemSizeInBits check below. You could change them to asserts. Up to you.

This revision is now accepted and ready to land.Apr 14 2020, 10:47 AM

Simplified logic for converting intrinsics to AArch64ISD::VS[LR]I.

PetreTudor marked an inline comment as done.Apr 15 2020, 5:49 AM
This revision was automatically updated to reflect the committed changes.

This seems to cause misoptimizations for me:

$ cat test.c
void rgb16to24(const unsigned char *src, unsigned char *dst, int src_size) {
    unsigned char *d = dst;
    const unsigned short *s = (const unsigned short *)src;
    const unsigned short *end = s + src_size / 2;

    while (s < end) {
        unsigned short bgr = *s++;
        *d++ = ((bgr&0xF800)>>8) | ((bgr&0xF800)>>13);
        *d++ = ((bgr&0x07E0)>>3) | ((bgr&0x07E0)>> 9);
        *d++ = ((bgr&0x001F)<<3) | ((bgr&0x001F)>> 2);
$ clang -target aarch64-linux-gnu -S -O3 test.c -o test.s

There's quite a bit of difference in the output of the compiler for this function caused by this patch - I haven't yet traced it and tracked down exactly which bit of it is wrong, but the testsuite that contains this code reports unexpected values.

efriedma added inline comments.

Please generate simple tests like this with

This transform is simply wrong, as written: sli is not the vector version of bfi.

It appears that there were probably more things wrong with this optimization than just the condition.
I have reverted the patch for now so that I can take more time to make sure that I get this fix right.