This is an archive of the discontinued LLVM Phabricator instance.

[X86] Combine MUL+SRL+TRUNC to MULX for i32 on 64-bit
Needs ReviewPublic

Authored by pengfei on Jun 23 2023, 4:10 AM.

Details

Summary

D153576 brought _mulx_u32 to 64-bit targets. But the lowering of it
doesn't satisfy does not read or write arithmetic flags described in
intrinsic guide: https://godbolt.org/z/xb1fjf1sM

This patch completes the lowering part through combining
(i32 (trunc (shr (mul (zext (i32 A)), (zext (i32 B))), 32)))
to (umul_lohi (i32 A), (i32 B))

It is also a general optimization.

Diff Detail

Unit TestsFailed

Event Timeline

pengfei created this revision.Jun 23 2023, 4:10 AM
Herald added a project: Restricted Project. · View Herald TranscriptJun 23 2023, 4:10 AM
Herald added a subscriber: hiraditya. · View Herald Transcript
pengfei requested review of this revision.Jun 23 2023, 4:10 AM
Herald added a project: Restricted Project. · View Herald TranscriptJun 23 2023, 4:10 AM

On https://github.com/llvm/llvm-project/issues/33580 there was concern whether MULX32 on 64-bit targets is slower than the alternatives?

On https://github.com/llvm/llvm-project/issues/33580 there was concern whether MULX32 on 64-bit targets is slower than the alternatives?

From the current code base, I think we slightly prefer D80498 to D55565?

If not updating arithmetic flags is a requirement, the we should have an IR intrinsic. Relying on pattern matching that can be easily broken is not a good solution.

Most of these test are for fixed point intrinsics. Maybe we should change the lowering of those to what we want instead of using a DAGCombine to get there?

"does not read or write arithmetic flags" described in intrinsic guide

How much do we actually care about honoring this? A program written in C can't tell if an intrinsic overwrites arithmetic flags; I'd feel free to just pick the fastest lowering, regardless of what the guide says.

If not updating arithmetic flags is a requirement, the we should have an IR intrinsic. Relying on pattern matching that can be easily broken is not a good solution.

You are right. I just found it doesn't work under O0 due to we use fast ISel. _mulx_u64 works because it falls back to DAGISel.
But I don't see much necessity to add a intrinsic. So maybe just need to roll back it D153681

Most of these test are for fixed point intrinsics. Maybe we should change the lowering of those to what we want instead of using a DAGCombine to get there?

They are not affected by DAGCombine but changes in Select. Is this what you expected here?

How much do we actually care about honoring this? A program written in C can't tell if an intrinsic overwrites arithmetic flags; I'd feel free to just pick the fastest lowering, regardless of what the guide says.

I don't know. But that might explain why we implemented these intrinsics in this way. I have no preference between this and D153681, either way is good to me.

Most of these test are for fixed point intrinsics. Maybe we should change the lowering of those to what we want instead of using a DAGCombine to get there?

They are not affected by DAGCombine but changes in Select. Is this what you expected here?

Why would it be the changes to Select? If the i8 or i16 umul_logo existed before this change, wouldn’t we have failed selection?

craig.topper added inline comments.Jun 24 2023, 12:10 AM
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
5284

Shouldn't this be SMUL_LOHI?

Most of these test are for fixed point intrinsics. Maybe we should change the lowering of those to what we want instead of using a DAGCombine to get there?

They are not affected by DAGCombine but changes in Select. Is this what you expected here?

Why would it be the changes to Select? If the i8 or i16 umul_logo existed before this change, wouldn’t we have failed selection?

Not sure I understand your question here. I was thinking you mean to DAGCombine in X86ISelLowering, but now I guess you mean in DAGCombiner.
The code was there for a long time. We removed i8 and i16 several years ago due to that code.
So I think the answer to the question is we should give targets a chance to lower it themselves, hence calling TLI.isOperationLegal(ISD::UMUL_LOHI, VT) before doing that.
TBH, I didn't think that far when I changed the code, I just wanted to break the infinite combine loop between mul and mul_lohi.

llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
5284

Yes. Thanks!

pengfei updated this revision to Diff 534200.Jun 24 2023, 5:34 AM

Fix typo.