This is an archive of the discontinued LLVM Phabricator instance.

[x86] convert anyext of pinsrb scalar op to subreg insert
Needs ReviewPublic

Authored by spatel on Feb 26 2019, 5:47 PM.

Details

Summary

This is trying to fix one of the potential regressions seen in D58521, but it's a problem independent of that patch as shown in the diffs here.

We are aggressively converting 'anyext' to 'zext' in isel to avoid partial reg stalls, but that shouldn't be a problem for pinsrb because the instruction only uses the low byte of the 32-bit scalar reg. AFAICT, this isn't a problem for pinsrw because we promote all 16-bit ops.

I'm not sure why we don't get load folding with some of the fast-isel tests, but I assume that's as expected or an independent problem.

Diff Detail

Event Timeline

spatel created this revision.Feb 26 2019, 5:47 PM
Herald added a project: Restricted Project. · View Herald TranscriptFeb 26 2019, 5:47 PM

I'm not convinced that the dependency logic in the frontend and the renaming portion of Intel CPUs know that this instruction only uses the lower 8 bits. Same with BT/BTC/BTR/BTS/SHLX/SHRX/SARX and probably others. On Sandy Bridge and later CPUs, probably the worst this means is that it would force an AH/BH/CH/DH merge if it had been written independently previously. Bits 63:16 and 7:0 are always together. In 64-bit mode that should never really happen since we won't consider the AH/BH/CH/DH for general use by the register allocator.

I'm not convinced that the dependency logic in the frontend and the renaming portion of Intel CPUs know that this instruction only uses the lower 8 bits. Same with BT/BTC/BTR/BTS/SHLX/SHRX/SARX and probably others. On Sandy Bridge and later CPUs, probably the worst this means is that it would force an AH/BH/CH/DH merge if it had been written independently previously. Bits 63:16 and 7:0 are always together. In 64-bit mode that should never really happen since we won't consider the AH/BH/CH/DH for general use by the register allocator.

So the movzbl here and in the other patch are preferred? I’m happy to leave this as-is if everyone’s ok with the x86 diffs in D58521. Personally, I don’t think those tests with extra instructions are worth worrying about unless there’s some evidence of a perf regression from that.

I'm not convinced that the dependency logic in the frontend and the renaming portion of Intel CPUs know that this instruction only uses the lower 8 bits. Same with BT/BTC/BTR/BTS/SHLX/SHRX/SARX and probably others. On Sandy Bridge and later CPUs, probably the worst this means is that it would force an AH/BH/CH/DH merge if it had been written independently previously. Bits 63:16 and 7:0 are always together. In 64-bit mode that should never really happen since we won't consider the AH/BH/CH/DH for general use by the register allocator.

So the movzbl here and in the other patch are preferred? I’m happy to leave this as-is if everyone’s ok with the x86 diffs in D58521. Personally, I don’t think those tests with extra instructions are worth worrying about unless there’s some evidence of a perf regression from that.

Reading this again...is the suggestion to promote 8-bit ops to 32-bit like we do for 16-bit? I actually tried that somewhere along the way, and it seemed like a reasonable alternative based on the regression test diffs. That would at least make our lives easier in the compiler by allowing for removal of what would become unlikely/unnecessary patterns.

TBH I'm more worried about the failure of those PINSRBs to fold the loads

The fast-isel loads don't fold because we run the argument lowering and the rest of the code through selectiondag separately during fast-isel. So selectiondag doesn't see them together to fold the load during isel. Then we can't fold them during peephole because there is an INSERT_SUBREG between the MOVB and the PINSRB and the memory folding code can't handle that. Prior to this patch we had a MOVB, followed by a MOVZX, then the PINSRB at the time of peephole. Peephole pass did merge the MOVB and the MOVZX, but MOVZX isn't a foldable load so we don't go any further to merge with the PINSRB.

spatel added a comment.Mar 7 2019, 4:22 PM

Ping.

Thoughts on where this should go?

I tried a few experiments with promoting some 8-bit ops to 32-bit, but I don't see a way to do that in stages. I think what we should do is promote almost all of those to 32-bit (like we do for 16-bit) because there's very little upside in using 8-bit ops, but lots of downside (this would make our codegen more like gcc from what I can tell). But that change is going to expose lots of small regressions because we've come to expect the 8-bit patterns.

Should we just aim to remove X86ISD::PINSRW/PINSRB and use INSERT_VECTOR_ELT directly? https://bugs.llvm.org/show_bug.cgi?id=39956