This is an archive of the discontinued LLVM Phabricator instance.

[x86] use zero-extending load of a byte outside of loops too
ClosedPublic

Authored by spatel on Jul 14 2022, 7:09 AM.

Details

Summary

This implements the main suggested change from issue #56498. Restricting the shorter (non-extending) instruction to -Oz ("minsize") only, rather than also using it at -Os ("optsize"), is left as a possible follow-up.

As noted in the bug report, the zero-extending load may have shorter latency/better throughput across a wide range of x86 micro-arches, and it avoids a potential false dependency. The cost is an extra instruction byte.
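For illustration, the two load forms compare like this (the register and addressing mode are arbitrary placeholders, not taken from any particular test):

    movb   (%rdi), %al      # 2 bytes; writes only AL, so it merges into the old RAX value (the potential false dependency)
    movzbl (%rdi), %eax     # 3 bytes; writes all of EAX/RAX, so it carries no dependency on the old register value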

This could cause performance swings in either direction from secondary effects, but I don't think it is possible to account for those in advance, and they will likely also depend on the exact micro-arch. This does bring LLVM x86 codegen more in line with existing gcc codegen, so if problems are exposed, they are more likely to occur for both compilers.

There are 200+ test files changed, so the patch is too big to upload to Phab with full context, but all of the test diffs that I scanned are the expected "movb" -> "movzbl".

Diff Detail

Event Timeline

spatel created this revision. · Jul 14 2022, 7:09 AM
Herald added a project: Restricted Project. · View Herald Transcript · Jul 14 2022, 7:09 AM
spatel requested review of this revision. · Jul 14 2022, 7:09 AM
Herald added a project: Restricted Project. · View Herald Transcript · Jul 14 2022, 7:09 AM
RKSimon added inline comments. · Jul 14 2022, 8:28 AM
llvm/test/CodeGen/X86/ushl_sat_vec.ll
72

Why did only 1 of these movb get extended?

craig.topper added inline comments. · Jul 14 2022, 10:02 AM
llvm/test/CodeGen/X86/ushl_sat_vec.ll
72

%ch is live from line 69.

pcordes accepted this revision. · Jul 16 2022, 10:02 AM
pcordes added inline comments.
llvm/test/CodeGen/X86/ushl_sat_vec.ll
72

It's possible to still avoid the false dependency by doing movzbl (mem), %ecx first, then movb (mem), %ch.

Reading the full CX/ECX/RCX will still need a merging uop on Intel SnB-family CPUs (which rename high-8 registers separately from the full reg), and unfortunately that merging uop has to issue in a cycle by itself. (So in terms of front-end cost, it's an extra 4 or 5 uops, not just 1 more. Back-end contention for execution units is rarely a limiting factor for uops that can run on any port.) But that merging cost is not paid until later, on the first read.

And on AMD CPUs (and Silvermont-family, like Alder Lake E-cores), there's no later merging cost: writing CH merges on the spot. So it's a nice win vs. movzbl into a temporary followed by shl $8, %tmp; or %tmp, %ecx.
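As a sketch of the sequence described above (the memory operands are placeholders, not from the actual test):

    movzbl (%rsi), %ecx     # zero-extending load writes the full ECX/RCX first, so no false dependency
    movb   1(%rsi), %ch     # byte load into CH then merges; SnB-family pays a separate merging uop on the next full-register read, AMD/Silvermont-family merge on the spot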

This revision is now accepted and ready to land. · Jul 16 2022, 10:02 AM

Oops, I think "accept revision" was stronger than I expected. I just meant "looks good to me" after reading the summary and comments, not that I'd gone through all the code to make sure it correctly did what it intends. Sorry about that.

Ping - any remaining objections/concerns?
The patch still applies cleanly as of today, but I doubt that will last with so many test diffs. :)

LGTM - although I can't see much Os and Oz test coverage?

> LGTM - although I can't see much Os and Oz test coverage?

Yes - I'll add minimal tests to check each of the attribute settings, so it's clear where we've drawn the line.

pcordes added a comment (edited). · Jul 19 2022, 11:01 AM

> LGTM - although I can't see much Os and Oz test coverage?

Good point about -Oz / -Os. This would be a code-size regression for -Oz (for common cases that don't end up needing a later movzx if we start with movb).

And since this doesn't fix issue #56498, it's *always* a code-size regression. We still redundantly do a movzbl %al, %eax after starting with a movzbl load and doing only byte operations on the bottom byte of the register, which leaves it correctly zero-extended already; e.g. the last test case (zext-logicop-shift-load.ll) shows that.
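A hypothetical shape of that leftover pattern (illustrative only; not the literal output of zext-logicop-shift-load.ll):

    movzbl (%rdi), %eax     # the load itself already zero-extends into EAX/RAX
    andb   $15, %al         # byte-only op on AL; the upper bits of EAX remain zero
    movzbl %al, %eax        # redundant: EAX is already correctly zero-extended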

This is a good first step; fixing that issue to take advantage of cases where byte values are already zero-extended can be done later.

If -Oz (and probably -Os) continue to use movb, then later code can't count on byte values being zero-extended to 64-bit. So the byte-load will need to tell the optimizer whether or not it's zero-extending, depending on options. (And ideally an optimization pass could still fold later zero-extension back into an earlier byte or word load, even with -Oz, because spending 1 extra byte in a load to save a multi-byte instruction is worth it for code-size.)
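Rough byte counts for that trade-off, using a plain register-indirect addressing mode as an example:

    movb   (%rdi), %al      # 2 bytes
    movzbl %al, %eax        # 3 bytes; 5 bytes total if the zero-extension is still needed later
    movzbl (%rdi), %eax     # 3 bytes total with the zero-extension folded into the load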

A temporary regression to code-size for -Oz and -Os is probably less important than a speedup for -O2 / -O3. (Assuming we actually do get any net speedup from this change alone!)

This revision was landed with ongoing or failed builds. · Jul 19 2022, 1:45 PM
This revision was automatically updated to reflect the committed changes.