Allow to reduce redundant shift masks.
For example:
x1 = x & 0xAB00
x2 = (x >> 8) & 0xAB
can be reduced to:
x1 = x & 0xAB00
x2 = x1 >> 8
It only allows folding when the masks and shift values are constants.
Differential D48278
[SelectionDAG] Fold redundant masking operations of shifted value dnsampaio on Jun 18 2018, 5:33 AM. Authored by
Details Allow to reduce redundant shift masks. can be reduced to:
Diff Detail Event TimelineComment Actions Hi Diogo, Some nitpick comments. Please add negative tests for shifts with multiple uses and where the shift and mask aren't by constants. cheers,
Comment Actions Replaced tests as to be not dependent in the load width reduction. Comment Actions LGTM. For future reference, and before committing, arm and aarch64 tests live in different codegen directories so please separate the test into the two sub-directories. Thanks! Comment Actions This patch was reverted with rL336453 because it caused: Beyond that, I don't understand the motivation. The patch increases the latency of a computation. Why is that beneficial? The x86 diff doesn't look like a win to me. I don't know what the ARM/AArch output looked like before this patch. Always check in the baseline tests before making a code change, so we have that as a reference (and in case the patch is reverted, we won't lose the test coverage that motivated the code patch). Comment Actions Like @spatel I'm not clear on what you're really trying to accomplish here - has the arm/arm64 codegen improved?
Comment Actions Sorry about that.
It reduces the number of computation operations, from 3 to 2, and the number of constants kept for performing the masking, from 2 to 1.
See in line comments for code change, in both x86-64, AArch64 and ARM.
Comment Actions In the examples it reduces most required ARM instructions and constants by half in this examples, as the OR and SHIFT operations can be combined. Comment Actions Ah, I see that now. But I'm not convinced this is the right approach. Why are we waiting to optimize this in the backend? This is a universally good optimization, so it should be in IR: I'm not sure exactly where that optimization belongs. Ie, is it EarlyCSE, GVN, somewhere else, or is it its own pass? But I don't see any benefit in waiting to do this in the DAG. Comment Actions This also raises a question that has come up in another review recently - D41233. If we reverse the canonicalization of shl+and, we would solve the most basic case that I showed above: define i32 @shl_first(i32 %a) { %t2 = shl i32 %a, 8 %t3 = and i32 %t2, 44032 ret i32 %t3 } define i32 @mask_first(i32 %a) { %a2 = and i32 %a, 172 %a3 = shl i32 %a2, 8 ret i32 %a3 } Comment Actions Added checks for shift opcodes as to early exit if not found. Comment Actions
Agree, I also intend to implement this transformation in the IR. But there are cases that this is only seen after some instructions have been combined in the dag, so why not here also? And indeed, it is a requirement for a future patch that detects opportunities to reduce load and store widths.
Comment Actions IMO, this is backwards, we optimize first in IR (because the sooner we can fold something like this, the more it helps later transforms). Then, only if there's reason to create redundancy (because the patterns emerge late) do we repeat a fold in the backend. Can you post a motivating example that would not be solved by IR transforms, so we can see why this is necessary for the DAG?
Comment Actions This "canonicalization" won't help to prevent even basic duplicated masked values when using a lshr: %0 = sext i16 %a to i32 %1 = lshr i32 %0, 8 %2 = and i32 %1, 172 %3 = and i32 %0, 44032 And a simplest case, that it is already in the test case, that won't be handled in the IR level: %m2 = and i32 %a, 3855 %shl = shl i32 %a, 24 %shr = lshr i32 %a, 8 %or = or i32 %shl, %shr %m1 = and i32 %or, 251658255 %or2 = or i32 %m1, %m2 ret i32 %or2 } The shl shr instructions become a ror that masks the same masked value.
Comment Actions Only accepts instructions with 2 uses (AND / SHIFT operations). So that looping through the uses is not expensive, and we avoid it in most cases.
|
Why not check that you have a supported shift here and potentially exit early?