This is an archive of the discontinued LLVM Phabricator instance.

[SelectionDAG] Compute known bits of CopyFromReg
ClosedPublic

Authored by piotr on Mar 19 2019, 1:15 AM.

Details

Summary

Teach SelectionDAG how to compute known bits of ISD::CopyFromReg if
the virtual register it reads has only one def.

This can be particularly useful when calling isBaseWithConstantOffset()
with an ISD::CopyFromReg argument, as more optimizations may be enabled
as a result.
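
In outline, the new handling in SelectionDAG::computeKnownBits has roughly the following shape (a simplified sketch, not the exact code under review: reaching the single def's known bits through FunctionLoweringInfo's cached live-out info, and eliding the explicit single-def check, are assumptions made for illustration):

// Sketch only: assumes computeKnownBits can reach the current
// FunctionLoweringInfo (FLI), whose LiveOutInfo caches known bits for a
// virtual register once its single definition has been analysed.
case ISD::CopyFromReg: {
  auto *R = cast<RegisterSDNode>(Op.getOperand(1));
  const unsigned Reg = R->getReg();

  // Only virtual registers can carry this per-function information.
  if (!FLI || !TargetRegisterInfo::isVirtualRegister(Reg))
    break;

  // Bail out if there is no cached info or the bit widths do not match.
  const FunctionLoweringInfo::LiveOutInfo *LOI = FLI->GetLiveOutRegInfo(Reg);
  if (!LOI || LOI->Known.getBitWidth() != BitWidth)
    break;

  Known = LOI->Known;
  break;
}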

Also add a missing truncation on X86, found by testing of this patch.

Change-Id: Id1c9fceec862d118c54a5b53adf72ada5d6daefa

Diff Detail

Repository
rL LLVM

Event Timeline

piotr created this revision.Mar 19 2019, 1:15 AM
Herald added a project: Restricted Project.Mar 19 2019, 1:15 AM
lebedev.ri added inline comments.
lib/Target/X86/X86ISelLowering.cpp
19513 ↗(On Diff #191258)

Does it trigger any tests without the rest of the diff?
Does this *have* to be in the same diff as the rest of the change?

piotr marked an inline comment as done.Mar 19 2019, 2:29 AM
piotr added inline comments.
lib/Target/X86/X86ISelLowering.cpp
19513 ↗(On Diff #191258)

The change in X86ISelLowering.cpp, when applied on its own, does not trigger on any of the existing tests. However, when applying the rest of the diff without the change in X86ISelLowering.cpp, two assertions are triggered, in CodeGen/X86/early-ifcvt.ll and CodeGen/X86/switch.ll.

Therefore I think the rest of the diff is necessary to trigger the new condition handled here.

RKSimon added inline comments.Mar 19 2019, 5:28 AM
lib/CodeGen/SelectionDAG/SelectionDAG.cpp
3181 ↗(On Diff #191258)

auto R = cast<RegisterSDNode>(Op.getOperand(1));

lib/Target/X86/X86ISelLowering.cpp
19513 ↗(On Diff #191258)

DAG.getAnyExtOrTrunc ?
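
For reference, the suggestion is to let SelectionDAG choose between an any-extend, a truncate, or no operation based on the source and destination types, rather than hard-coding ISD::TRUNCATE. Roughly, with hypothetical value and type names:

// Instead of, e.g.:
//   Idx = DAG.getNode(ISD::TRUNCATE, DL, PtrVT, Idx);
// let SelectionDAG insert ANY_EXTEND, TRUNCATE, or nothing as needed:
Idx = DAG.getAnyExtOrTrunc(Idx, DL, PtrVT);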

test/CodeGen/AMDGPU/llvm.amdgcn.s.buffer.load.ll
146 ↗(On Diff #191258)

Please commit this with trunk's current codegen and then rebase the patch to show the diff.

piotr marked an inline comment as done.Mar 19 2019, 5:40 AM
piotr added inline comments.
test/CodeGen/AMDGPU/llvm.amdgcn.s.buffer.load.ll
146 ↗(On Diff #191258)

Just to make sure I understand what is being asked: you would like to see the raw output of the new tests with 1) current trunk and 2) the version with the patch?

RKSimon added inline comments.Mar 19 2019, 6:06 AM
test/CodeGen/AMDGPU/llvm.amdgcn.s.buffer.load.ll
146 ↗(On Diff #191258)

Yes, so you commit the new tests with checks for trunk's current codegen. Then this patch just shows the codegen delta.

piotr updated this revision to Diff 191854.Mar 22 2019, 4:07 AM

Addressed review comments, extracted new tests and rebased.

piotr marked an inline comment as done.Mar 22 2019, 4:15 AM
piotr added inline comments.
test/CodeGen/AMDGPU/llvm.amdgcn.s.buffer.load.ll
146 ↗(On Diff #191258)

https://reviews.llvm.org/D59690 created. I will rebase this patch when D59690 gets pushed.

piotr updated this revision to Diff 192574.Mar 28 2019, 12:48 AM

Rebased.

RKSimon accepted this revision.Apr 4 2019, 4:18 AM

LGTM - are you intending to do the same for ComputeNumSignBits ?

This revision is now accepted and ready to land.Apr 4 2019, 4:18 AM

LGTM - are you intending to do the same for ComputeNumSignBits ?

Don't sign bits for live ins usually become an AssertSExt?

This revision was automatically updated to reflect the committed changes.
piotr added a comment.Apr 5 2019, 12:44 AM

@RKSimon No, I do not intend to modify ComputeNumSignBits at this time.

dmgreen added a subscriber: dmgreen.Apr 8 2019, 2:23 AM

Hello

We are seeing a lot of codesize increases and performance changes from this patch, unfortunately. Consider this code, which I believe is fairly common in an embedded context:

target datalayout = "e-m:e-p:32:32-Fi8-i64:64-v128:64:128-a:0:32-n32-S64"
target triple = "thumbv6m-arm-none-eabi"

; Function Attrs: minsize nounwind optsize
define dso_local i32 @C(i32 %x, i32* nocapture %y) local_unnamed_addr #0 {
entry:
  br label %for.cond

for.cond:                                         ; preds = %B.exit, %entry
  %i.0 = phi i32 [ 0, %entry ], [ %inc, %B.exit ]
  %exitcond = icmp eq i32 %i.0, 128
  br i1 %exitcond, label %for.end, label %for.body

for.body:                                         ; preds = %for.cond
  %mul = shl i32 %i.0, 2
  %add = add i32 %mul, %x
  store volatile i32 0, i32* inttoptr (i32 1342226444 to i32*), align 4
  store volatile i32 %add, i32* inttoptr (i32 1342226436 to i32*), align 4
  store volatile i32 1, i32* inttoptr (i32 1342226448 to i32*), align 16
  tail call void @llvm.arm.isb(i32 15) #1
  br label %while.cond.i

while.cond.i:                                     ; preds = %while.cond.i, %for.body
  %0 = load volatile i32, i32* inttoptr (i32 1342226448 to i32*), align 16
  %tobool.i = icmp eq i32 %0, 0
  br i1 %tobool.i, label %B.exit, label %while.cond.i

B.exit:                                           ; preds = %while.cond.i
  %1 = load volatile i32, i32* inttoptr (i32 1342226440 to i32*), align 8
  %arrayidx = getelementptr inbounds i32, i32* %y, i32 %i.0
  store i32 %1, i32* %arrayidx, align 4
  %inc = add nuw nsw i32 %i.0, 1
  br label %for.cond

for.end:                                          ; preds = %for.cond
  ret i32 0
}

; Function Attrs: nounwind
declare void @llvm.arm.isb(i32) #1

attributes #0 = { minsize nounwind optsize "correctly-rounded-divide-sqrt-fp-math"="false" "disable-tail-calls"="false" "less-precise-fpmad"="false" "min-legal-vector-width"="0" "no-frame-pointer-elim"="true" "no-frame-pointer-elim-non-leaf" "no-infs-fp-math"="false" "no-jump-tables"="false" "no-nans-fp-math"="false" "no-signed-zeros-fp-math"="false" "no-trapping-math"="false" "stack-protector-buffer-size"="8" "target-cpu"="cortex-m0plus" "target-features"="+armv6-m,+strict-align,+thumb-mode,-crc,-dotprod,-dsp,-fp16fml,-hwdiv,-hwdiv-arm,-ras" "unsafe-fp-math"="false" "use-soft-float"="false" }
attributes #1 = { nounwind }

Before this patch, constant hoisting would take those constant addresses, which refer to memory-mapped registers/IO, and turn them into a base constant plus a number of inttoptr casts. That way there is only one base, plus a number of offsets that are turned into STR rn, [base, offset].

With this new patch, it just sees straight through the CopyFromReg though! (Technically, it turns the ADD into an OR, and from there into a constant.) So we end up with multiple bases, each of which has to be materialised from a constant pool.

There are other things that constant hoisting and CGP do in a similar manner to create base+offset pairs for architectures like Thumb, where the offset range is fairly small and materialising constants is not always easy.

So although I like the idea of this patch, it seems to be breaking some optimisations that are relying on this not happening. Why we don't have tests that show any of that, I'm not sure...

piotr added a comment.Apr 10 2019, 2:32 AM

It's indeed unfortunate that you saw a regression with my patch. The patch makes SelectionDAG see more context, so it provides more optimization opportunities. Arguably, the fact that you see worse code now is a limitation of the optimization you mentioned, which has only become visible now. Have you checked whether the problematic optimization could be restricted (possibly only on targets that do not benefit from the broader scope of SelectionDAG)?

The patch makes SelectionDAG see more context, so it provides more optimization opportunities.

I think it may be more accurate to say that this breaks a whole class of existing optimisations. They were designed under the expectation that selection dag wouldn't do this. You can argue whether that was a sensible design or not, but it's the state we are in right now.

Do you have examples of this performing useful optimisations? Most of the tests here don't show any changes from what I would call sensible code. I would expect the IR transforms to have handled most things like this already (or to be doing the split into different BBs deliberately).

I think the most sensible approach is to revert this for now (and hopefully add some more tests!). It seems that this is new info that wasn't taken into account in the creation/review of this patch, so unless you can fix this quickly we need to avoid breaking the old code. I'm a little chagrined to do that, because there are some improvements in the results I've seen; they are just outweighed by all the regressions. I'm happy to try and help out with extra testing.

piotr added a comment.Apr 10 2019, 5:15 AM

Yes, we see improvements on the AMDGPU target.

I think it may be more accurate to say that this breaks a whole class of existing optimisations.

We need a test for each of these cases. Otherwise we risk having the same problem all over again.

@dmgreen Please revert the patch and add new tests in a separate review. I would prefer if you added the new tests first, but it is not essential.

Thanks. I've reverted in r358113 and will attempt to add some extra tests in.

We need a test for each of these cases. Otherwise we risk having the same problem all over again.

Definitely agree.

I guess this is something that will be fixed in GlobalISel. In the meantime, perhaps we could do with something that is more explicit about not being looked through, more than just a bitcast. Happy to help if I can, just let me know.

OK, thanks!

Thanks. I've reverted in r358113 and will attempt to add some extra tests in.

Yes, please add tests (ideally for x86 too).

We need a test for each of these cases. Otherwise we risk having the same problem all over again.

Definitely agree.

I guess this is something that will be fixed in GlobalISel. In the meantime, perhaps we could do with something that is more explicit about not being looked through, more than just a bitcast. Happy to help if I can, just let me know.

D60294 caught my attention, that may be a related fix.

Carrot added a subscriber: Carrot.Apr 17 2019, 11:08 AM

Just FYI.

With this patch, LLVM generates wrong instructions for one of our internal applications. It also causes a crash in a debug build of the compiler, in X86DAGToDAGISel::getAddressOperands: when it computes the load address of a jump table, it gets an AM with a non-zero disp, which is unexpected by getAddressOperands. I'm not sure if it is the same problem as described in https://reviews.llvm.org/rL358113, or a separate issue.

Unfortunately I don't have a small test case to reproduce it; the original code is large and difficult to simplify.

piotr added a comment.Apr 30 2019, 5:56 AM

@Carrot Thanks for letting me know.

@dmgreen Do you have an ETA on the new tests for the problematic cases you had noticed?

Hello. I added a constant address case in rL358114, and a couple of other tests for minsize in rL358128. I was trying to look into X86 tests, but the addressing modes being so extensive makes it difficult. There will possibly be some RISCV testcases out of D60294 too.

One of the problems is that I think, because this was working on demanded bits, it was actually converting something from aaaaaaxx + xxxxxxbb -> aaaaaaxx | xxxxxxbb and from there computing aaaaaabb directly (if you understand my meaning; the x's are unknowns). If there wasn't the right separation of bits, this may not have come up. Hence the existing test cases not catching things, but it coming up quite a few times in practice.
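
To make that interaction concrete: in the case above both the hoisted base and the small offset end up with fully known bits, so the add can be rewritten as an or and then folded to a single constant. Below is a standalone sketch using llvm::KnownBits, not DAGCombiner's actual code; the base/offset split 0x5000C000 + 0xC is hypothetical, chosen so the folded value 0x5000C00C is the 1342226444 from the IR above.

// Standalone sketch, not DAGCombiner code: shows why fully known base and
// offset bits let an ADD become an OR and then fold to a plain constant.
#include "llvm/ADT/APInt.h"
#include "llvm/Support/KnownBits.h"
#include "llvm/Support/raw_ostream.h"

using namespace llvm;

int main() {
  const unsigned BW = 32;

  // Hoisted base address seen through CopyFromReg, e.g. 0x5000C000:
  // every bit is known, and in particular the low bits are known zero.
  KnownBits Base(BW);
  Base.One = APInt(BW, 0x5000C000);
  Base.Zero = ~Base.One;

  // Small offset added back by the addressing code, e.g. 0xC.
  KnownBits Off(BW);
  Off.One = APInt(BW, 0xC);
  Off.Zero = ~Off.One;

  // In every bit position at least one operand is known zero, so
  // base + off == base | off; this is what the add -> or rewrite relies on.
  bool NoCommonBits = (~(Base.Zero | Off.Zero)) == 0;

  // Known bits of the OR: one if set in either operand, zero if clear in both.
  KnownBits OrKnown(BW);
  OrKnown.One = Base.One | Off.One;
  OrKnown.Zero = Base.Zero & Off.Zero;

  bool FullyKnown = (~(OrKnown.Zero | OrKnown.One)) == 0;

  outs() << "no common bits: " << (NoCommonBits ? "yes" : "no") << "\n";
  outs() << "folds to constant: " << (FullyKnown ? "yes" : "no") << "\n";
  outs() << "value: 0x";
  outs().write_hex(OrKnown.One.getZExtValue());
  outs() << "\n";
  return 0;
}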

Hopefully the new tests are at least enough to show some problem cases. Let me know if not, or if there's any other way I can help.