This is an archive of the discontinued LLVM Phabricator instance.

Differential D49671

[SchedModel] Propagate read advance cycles to implicit operands outside instruction descriptor
Closed · Public

Authored by jonpa on Jul 23 2018, 7:58 AM.

Details

Reviewers

uweigand
MatzeB
javed.absar
hfinkel
fhahn
atrick
tstellar
rampitec
RKSimon
arsenm
kparzysz
jonpa

Summary

The SchedModel allows the addition of ReadAdvances to express that certain operands of an instruction are needed at a later point than the others. On SystemZ, this amounts to the register operand of a reg/mem instruction, given that the memory operand must first be loaded.

I discovered that in ~5% of the cases where a latency adjustment was expected, it was in effect not achieved. The problem involves the extra use operands that regalloc adds for the full register when a subregister is used, like:

After Coalescer:

1920B     %70.subreg_l32:addr64bit = MSR %70.subreg_l32:addr64bit, %70.subreg_l32:addr64bit
1952B     %70.subreg_l32:addr64bit = MSY %70.subreg_l32:addr64bit, %92:addr64bit, -12, $noreg :: (load 4 from %ir.scevgep18)

After RA:

2136B     renamable $r4l = MSR renamable $r4l, renamable $r4l, implicit killed $r4d, implicit-def $r4d
2144B     renamable $r4l = MSY renamable $r4l, renamable $r2d, -12, $noreg, implicit killed $r4d, implicit-def $r4d :: (load 4 from %ir.scevgep18)

Post-RA machine scheduler DAG:

SU(20):   renamable $r4l = MSR renamable $r4l, renamable $r4l, implicit $r4d, implicit-def $r4d
# preds left       : 3
# succs left       : 3
# rdefs left       : 0
Latency            : 6
Depth              : 2
Height             : 31
Predecessors:
SU(4): Out  Latency=0
SU(4): Data Latency=1 Reg=$r4l
SU(4): Data Latency=1 Reg=$r4d
Successors:
SU(21): Out  Latency=0
SU(21): Data Latency=2 Reg=$r4l
SU(21): Data Latency=6 Reg=$r4d
SU(21):   renamable $r4l = MSY renamable $r4l, renamable $r2d, -12, $noreg, implicit $r4d, implicit-def $r4d :: (load 4 from %ir.scevgep18)
# preds left       : 3
# succs left       : 3
# rdefs left       : 0
Latency            : 10
Depth              : 8
Height             : 25
Predecessors:
SU(20): Out  Latency=0
SU(20): Data Latency=2 Reg=$r4l
SU(20): Data Latency=6 Reg=$r4d

SU(20) has instruction latency 6, and MSY has a ReadAdvance on the first use operand of 4 cycles ($r4l). However, the $r4d operand is not covered by this, so the effective latency between the nodes is still 6!
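
To make the arithmetic concrete, here is a minimal sketch of how several data edges between the same two nodes combine. The `DataEdge` type and both function names are hypothetical and are not LLVM's `SDep` API; the numbers are the ones from the DAG dump above.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// One data-dependency edge between two scheduling units.
// Hypothetical type for illustration only, not LLVM's SDep.
struct DataEdge {
  std::string Reg;  // register carrying the dependency
  int WriteLatency; // latency of the defining instruction
  int ReadAdvance;  // cycles by which the use reads this operand late
};

// Latency of a single edge: the write latency reduced by the read advance,
// clamped at zero.
int edgeLatency(const DataEdge &E) {
  return std::max(0, E.WriteLatency - E.ReadAdvance);
}

// The scheduler has to respect every edge, so the effective latency between
// two nodes is the maximum over all edges that connect them.
int effectiveLatency(const std::vector<DataEdge> &Edges) {
  int Max = 0;
  for (const DataEdge &E : Edges)
    Max = std::max(Max, edgeLatency(E));
  return Max;
}
```

With one edge for $r4l (write latency 6, read advance 4) and one for $r4d (write latency 6, no read advance), the $r4l edge drops to 2 but the $r4d edge keeps the pair 6 cycles apart.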

It seems to me that this is a target independent problem. I am not really sure how to best handle this situation, but it seems that the patch I made here solves the problem on SystemZ.

I thought about a simpler version like "If a register use is not part of the instruction descriptor, set latency to 0, in case a subreg has a read advance". I did not dare to do this however, since I found some rare cases (not involving ReadAdvances), where it was actually the super reg that had a latency value, while its subreg had a latency of 0. I am guessing this is another situation involving super/sub regs, not quite the same as the more common one seen above.

As before, I am not really aware of the true necessities of these extra register allocator operands, but I trust they are needed somehow (explanations welcome). Given this, I suspect there may be some simpler way of achieving this result?

Event Timeline

jonpa created this revision. Jul 23 2018, 7:58 AM

Herald added a subscriber: JDevlieghere. Jul 23 2018, 7:58 AM

I'm not quite sure what gave you the idea I'd be qualified to review this. My best guess is that some minor bugfixes I made to PowerPC CodeGen years ago might still show up as relatively recent. I'll leave this to minds more capable in this area.

For context: $r4d is a super register formed from $r4l+$r4h?

This is tricky. Some comments:

Have you tried enabling subregister liveness tracking? Among other things it gets rid of the implicit-defs/uses for the full registers... (though there may be other factors influencing that decision)
What about just setting the latencies induced by the artificial implicit def-/uses[1] to 0?

[1] = in lack of a better way to identify them, that would be all implicit vreg defs/uses that are not part of the MCInstrDesc.

As before, I am not really aware of the true necessities of these extra register allocator operands, but I trust they are needed somehow (explanations welcome).

I think we mainly need these operands to make some situations explicit to the machine verifier and liveness computation. They are necessary to model some subreg liveness effects for the allocator when subregister liveness tracking is not enabled. Right now I cannot come up with the reason why we still keep them around with physregs after assignment (because when subreg liveness tracking is enabled we cannot even add them)...

In D49671#1172545, @MatzeB wrote:

As before, I am not really aware of the true necessities of these extra register allocator operands, but I trust they are needed somehow (explanations welcome).

I think we mainly need these operands to make some situations explicit to the machine verifier and liveness computation. They are necessary to model some subreg liveness effects for the allocator when subregister liveness tracking is not enabled. Right now I cannot come up with the reason why we still keep them around with physregs after assignment (because when subreg liveness tracking is enabled we cannot even add them)...

Ignore this comment, looking at the code we obviously do not have implicit defs/uses before regalloc and only add them in VirtRegRewriter. Right now I'm struggling to come up with the reasoning for their existence... Might be related to the block-live-in lists not being computed at subreg granularity...

dmgreen added a subscriber: dmgreen. Jul 24 2018, 1:36 AM

In D49671#1172423, @Florob wrote:

I'm not quite sure what gave you the idea I'd be qualified to review this. My best guess is that some minor bugfixes I made to PowerPC CodeGen years ago might still show up as relatively recent. I'll leave this to minds more capable in this area.

Sorry! I meant to add Florian Hahn...

In D49671#1172499, @MatzeB wrote:

For context: $r4d is a super register formed from $r4l+$r4h?

correct

jonpa edited reviewers, added: fhahn; removed: Florob. Jul 24 2018, 1:40 AM

In D49671#1172572, @MatzeB wrote:

In D49671#1172545, @MatzeB wrote:

As before, I am not really aware of the true necessities of these extra register allocator operands, but I trust they are needed somehow (explanations welcome).

I think we mainly need these operands to make some situations explicit to the machine verifier and liveness computation. They are necessary to model some subreg liveness effects for the allocator when subregister liveness tracking is not enabled. Right now I cannot come up with the reason why we still keep them around with physregs after assignment (because when subreg liveness tracking is enabled we cannot even add them)...

Ignore this comment, looking at the code we obviously do not have implicit defs/uses before regalloc and only add them in VirtRegRewriter. Right now I'm struggling to come up with the reasoning for their existence... Might be related to the block-live-in lists not being computed at subreg granularity...

Personally, I think there should be a REALLY good reason to keep them around, given that it makes things like this more involved after regalloc. It would be much nicer to just have the tablegen operands around in many cases... I thought it might have something to do with early-clobber of the other subreg or such things, although I don't have any experience with it.

In D49671#1172527, @MatzeB wrote:

This is tricky. Some comments:

Have you tried enabling subregister liveness tracking? Among other things it gets rid of the implicit-defs/uses for the full registers... (though there may be other factors influencing that decision)

Yes, IIRC I have tried that, but got crashes immediately which was discouraging. So for the moment, that is not something that could be the default for SystemZ, I think.

What about just setting the latencies induced by the artificial implicit def-/uses[1] to 0?

[1] = in lack of a better way to identify them, that would be all implicit vreg defs/uses that are not part of the MCInstrDesc.

Yes, that was also my idea, but as I wrote earlier, in some rare cases I noticed instructions where the actual latency was only put on that extra regalloc operand, while the explicit use op had just a unit latency!

I looked into this now a bit more, and it seems that in these cases a multiply or other instruction requires a double word register (128 bit), so a 64 bit register is coalesced into it:

Before Coalescing:

16B       %0:gr64bit = LGFRL @seedi ::
128B      undef %5.subreg_l64:gr128bit = COPY %0:gr64bit
144B      %6:gr128bit = COPY %5:gr128bit
160B      %6:gr128bit = MLGR %6:gr128bit, %3:gr64bit

After Coalescing:

16B       undef %5.subreg_l64:gr128bit = LGFRL @seedi :: (dereferenceable load 4 from @seedi)
...
144B      %6:gr128bit = COPY %5:gr128bit

After RA:

bb.0.entry:
renamable $r1d = LGFRL @seedi, implicit-def $r0q :: (dereferenceable load 4 from @seedi)
...
renamable $r4q = COPY renamable $r0q

After Post-RA pseudo instruction expansion pass:

bb.0.entry:
renamable $r1d = LGFRL @seedi, implicit-def $r0q :: (dereferenceable load 4 from @seedi)
...
$r4d = LGR $r0d, implicit $r0q
$r5d = LGR $r1d, implicit $r0q

DAG has the latency on $r0q (superreg), instead of $r0d between SU(0) and SU(3). ($r0q = $r0d + $r1d):

SU(0):   renamable $r1d = LGFRL @seedi, implicit-def $r0q :: (dereferenceable load 4 from @seedi)
# preds left       : 0
# succs left       : 9
# rdefs left       : 0
Latency            : 5
Depth              : 0
Height             : 70
Successors:
SU(4): Data Latency=5 Reg=$r1d
SU(4): Data Latency=5 Reg=$r0q
SU(3): Data Latency=5 Reg=$r0q
SU(3): Data Latency=1 Reg=$r0d
...
SU(3):   $r4d = LGR $r0d, implicit $r0q
# preds left       : 2
# succs left       : 3
# rdefs left       : 0
Latency            : 1
Depth              : 5
Height             : 65
Predecessors:
SU(0): Data Latency=5 Reg=$r0q
SU(0): Data Latency=1 Reg=$r0d
...
SU(4):   $r5d = LGR $r1d, implicit $r0q
# preds left       : 2
# succs left       : 3
# rdefs left       : 0
Latency            : 1
Depth              : 5
Height             : 65
Predecessors:
SU(0): Data Latency=5 Reg=$r1d
SU(0): Data Latency=5 Reg=$r0q
...

Seems like these are (rare) cases then where the defining instruction has an explicit def-op of a subregister, and a RegAlloc-implicit-def of the full register. The using instruction has an explicit use of the *other* subreg, and an implicit use of the full register. The latency value is set only on the super-register (RegAlloc operand).

During DAG construction in computeOperandLatency(), when handling SU(0), I saw that

OperIdx 0: $r1d -> $r0q : Wlat0, Lat:5
OperIdx 2: $r0q -> $r0d : DefIdx = 1, but there is only one WriteLatencyEntry with the correct Value of 5! So the latency here becomes '1' instead. This is another example of how difficult these extra RA operands are to deal with (and it would be really ugly to have extra SchedWrites in the tablegen file just "in case regalloc decides to add one or a few more").
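
The lookup problem described here can be modeled in a few lines. This is a simplified sketch, assuming one latency entry per def operand known to tablegen and a fallback of 1 when the def index has no entry (the value observed above); `SchedClassModel`, `DefaultDefLatency` and `lookupWriteLatency` are hypothetical names, not the real `TargetSchedModel` interface.

```cpp
#include <vector>

// Simplified stand-in for a scheduling class: one write latency per def
// operand that the tablegen description knows about. Hypothetical type.
struct SchedClassModel {
  std::vector<int> WriteLatencies;
};

// Assumed fallback when the def index has no matching entry; the value 1
// mirrors what was observed in this thread.
constexpr int DefaultDefLatency = 1;

// Return the latency entry for the DefIdx-th def, or the fallback when the
// operand was appended later and therefore has no entry.
int lookupWriteLatency(const SchedClassModel &SC, unsigned DefIdx) {
  if (DefIdx < SC.WriteLatencies.size())
    return SC.WriteLatencies[DefIdx];
  return DefaultDefLatency;
}
```

For an instruction with a single entry of 5, def index 0 yields 5, while the regalloc-added def at index 1 falls through to the default.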

So, in short: we can't just set the latency on those regalloc operands to 1 whenever we want, because in these cases that would break the SchedModel. That said, this is extremely rare (35 cases out of 1.3 million) currently on SystemZ on SPEC, arising just in this scenario with a coalesced 128 bit register required by a particular instruction. So at least currently, that probably wouldn't matter if ignored... Still, maybe it would on other targets... Of course, this would currently be much better to do on SystemZ instead of the currently missing ReadAdvances...

I guess I wish that since the SchedModel has the quite intricate mapping of SchedWrites to operands (by means of ordering), it would hopefully end there, and not get disrupted with these extra operands... Defining a SchedModel to match the instruction definition operands is hard enough, and it doesn't work well to have to deal with extra implicit regalloc operands as well...

If we have to live with them on the MIs, perhaps we could make some decision to not give them any latency values on the edges somehow, but to keep the latency values as defined by the tablegen files for the operands found there only?

That would probably be trading a "look-up" (this current patch), for another one, where an implicit operand not part of the MCInstrDesc would have to check for a subreg on the MI and get the latency from it... Not sure if that's even a good idea...

Some tests were crashing. To get all tests passing I had to:

Avoid using MI->getMF(), because e.g. machinecombiner will call with an MI that is not contained in any MBB.

Somehow make sure that this is only done post-RA (since this aims to handle the regalloc operands). Due to the fact that UseMI may not always be part of the MF, MRI is not retrievable, so instead of doing MRI->isSSA() around all of this, I checked each use operand's physical/virtual domain. I wonder how to improve on this...

New test case for SystemZ that tests that the latency adjustment of the read advance is also applied on the register allocator operand. The test case is a bit longer than expected after bugpoint reduction, but I think that's how it has to be to expose this effect of coalescing into a superregister or something like that... May be able to find a smaller test case...

jonpa edited the summary of this revision. Jul 27 2018, 7:46 AM

Patch somewhat simplified with NFC on SystemZ/SPEC. SystemZ test case fixed to use '-o -', to not write the .s file to the repo.

Is this approach acceptable, or are the extra RegAlloc super-reg operands somehow under review / redesign?

In D49671#1173217, @jonpa wrote:
In D49671#1172527, @MatzeB wrote:

This is tricky. Some comments:

Have you tried enabling subregister liveness tracking? Among other things it gets rid of the implicit-defs/uses for the full registers... (though there may be other factors influencing that decision)

Yes, IIRC I have tried that, but got crashes immediately which was discouraging. So for the moment, that is not something that could be the default for SystemZ, I think.

What about just setting the latencies induced by the artificial implicit def-/uses[1] to 0?

[1] = in lack of a better way to identify them, that would be all implicit vreg defs/uses that are not part of the MCInstrDesc.

Yes, that was also my idea, but as I wrote earlier, in some rare cases I noticed instructions where the actual latency was only put on that extra regalloc operand, while the explicit use op had just a unit latency!

I looked into this now a bit more, and it seems that in these cases a multiply or other instruction requires a double word register (128 bit), so a 64 bit register is coalesced into it:
Before Coalescing:

16B       %0:gr64bit = LGFRL @seedi ::
128B      undef %5.subreg_l64:gr128bit = COPY %0:gr64bit
144B      %6:gr128bit = COPY %5:gr128bit
160B      %6:gr128bit = MLGR %6:gr128bit, %3:gr64bit

After Coalescing:

16B       undef %5.subreg_l64:gr128bit = LGFRL @seedi :: (dereferenceable load 4 from @seedi)
...
144B      %6:gr128bit = COPY %5:gr128bit

After RA:

bb.0.entry:
renamable $r1d = LGFRL @seedi, implicit-def $r0q :: (dereferenceable load 4 from @seedi)
...
renamable $r4q = COPY renamable $r0q

After Post-RA pseudo instruction expansion pass:

bb.0.entry:
renamable $r1d = LGFRL @seedi, implicit-def $r0q :: (dereferenceable load 4 from @seedi)
...
$r4d = LGR $r0d, implicit $r0q
$r5d = LGR $r1d, implicit $r0q
DAG has the latency on $r0q (superreg), instead of $r0d between SU(0) and SU(3). ($r0q = $r0d + $r1d):
SU(0):   renamable $r1d = LGFRL @seedi, implicit-def $r0q :: (dereferenceable load 4 from @seedi)
# preds left       : 0
# succs left       : 9
# rdefs left       : 0
Latency            : 5
Depth              : 0
Height             : 70
Successors:
SU(4): Data Latency=5 Reg=$r1d
SU(4): Data Latency=5 Reg=$r0q
SU(3): Data Latency=5 Reg=$r0q
SU(3): Data Latency=1 Reg=$r0d
...
SU(3):   $r4d = LGR $r0d, implicit $r0q
# preds left       : 2
# succs left       : 3
# rdefs left       : 0
Latency            : 1
Depth              : 5
Height             : 65
Predecessors:
SU(0): Data Latency=5 Reg=$r0q
SU(0): Data Latency=1 Reg=$r0d
...
SU(4):   $r5d = LGR $r1d, implicit $r0q
# preds left       : 2
# succs left       : 3
# rdefs left       : 0
Latency            : 1
Depth              : 5
Height             : 65
Predecessors:
SU(0): Data Latency=5 Reg=$r1d
SU(0): Data Latency=5 Reg=$r0q
...
Seems like these are (rare) cases then where the defining instruction has an explicit def-op of a subregister, and a RegAlloc-implicit-def of the full register. The using instruction has an explicit use of the *other* subreg, and an implicit use of the full register. The latency value is set only on the super-register (RegAlloc operand).

The "implicit-def of the full register" is added when materializing the result of the regalloc in VirtRegMap.cpp; it is not present or necessary during regalloc itself.
In your debug output I see:

SU(0):   renamable $r1d = LGFRL @seedi, implicit-def $r0q :: (dereferenceable load 4 from @seedi)
...
Successors:
SU(4): Data Latency=5 Reg=$r1d
SU(4): Data Latency=5 Reg=$r0q
SU(3): Data Latency=5 Reg=$r0q
SU(3): Data Latency=1 Reg=$r0d

Given that the actual instruction only writes to r1d I would argue that the latencies on r0q and r0d are "fake". Hence my proposal to ignore the extra operands during schedule dag construction or force their latency to zero. Or do you actually want a latency between SU(0) and SU(3) here?

During DAG construction in computeOperandLatency(), when handling SU(0), I saw that

OperIdx 0: $r1d -> $r0q : Wlat0, Lat:5
OperIdx 2: $r0q -> $r0d : DefIdx = 1, but there is only one WriteLatencyEntry with the correct Value of 5! So the latency here becomes '1' instead. This is another example of how difficult these extra RA operands are to deal with (and it would be really ugly to have extra SchedWrites in the tablegen file just "in case regalloc decides to add one or a few more").

So, in short: we can't just set the latency on those regalloc operands to 1 whenever we want, because in these cases that would break the SchedModel. That said, this is extremely rare (35 cases out of 1.3 million) currently on SystemZ on SPEC, arising just in this scenario with a coalesced 128 bit register required by a particular instruction. So at least currently, that probably wouldn't matter if ignored... Still, maybe it would on other targets... Of course, this would currently be much better to do on SystemZ instead of the currently missing ReadAdvances...

I guess I wish that since the SchedModel has the quite intricate mapping of SchedWrites to operands (by means of ordering), it would hopefully end there, and not get disrupted with these extra operands... Defining a SchedModel to match the instruction definition operands is hard enough, and it doesn't work well to have to deal with extra implicit regalloc operands as well...

If we have to live with them on the MIs, perhaps we could make some decision to not give them any latency values on the edges somehow, but to keep the latency values as defined by the tablegen files for the operands found there only?

That would probably be trading a "look-up" (this current patch), for another one, where an implicit operand not part of the MCInstrDesc would have to check for a subreg on the MI and get the latency from it... Not sure if that's even a good idea...

Thanks for review!

Given that the actual instruction only writes to r1d I would argue that the latencies on r0q and r0d are "fake". Hence my proposal to ignore the extra operands during schedule dag construction or force their latency to zero.

I tried ignoring the operands during DAG construction, but that caused a lot of machine verifier errors, since those extra operands themselves have def-use chains that of course get corrupted if ignored by the scheduler.

I updated the patch to instead follow your second suggestion - setting the latency to zero in these cases. This gives ~25 test failures across targets, which I hope will be fairly simple to update if you think this patch is acceptable. This approach seems to also fix the issue I was seeing with the read advances.
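
As a rough illustration of what zeroing those latencies does to the two examples in this thread, here is a small self-contained sketch. The `Edge` type and the function name are hypothetical and do not correspond to LLVM's classes; the latencies are copied from the DAG dumps above.

```cpp
#include <algorithm>
#include <vector>

// A data edge between two scheduling units. Hypothetical type.
struct Edge {
  int Latency;
  // True when the edge comes from an operand that regalloc appended and that
  // is not part of the instruction descriptor.
  bool FromPseudoImplicitOperand;
};

// Zero the latency of edges created by pseudo implicit operands, then take
// the maximum over the remaining edges. Edges is taken by value so the
// caller's data is left unchanged.
int effectiveLatencyAfterZeroing(std::vector<Edge> Edges) {
  int Max = 0;
  for (Edge &E : Edges) {
    if (E.FromPseudoImplicitOperand)
      E.Latency = 0;
    Max = std::max(Max, E.Latency);
  }
  return Max;
}
```

For SU(20) -> SU(21) the edges are 2 ($r4l) and 6 ($r4d, regalloc operand), so the effective latency drops from 6 to 2 and the read advance takes effect. For SU(0) -> SU(3) the edges are 5 ($r0q, regalloc operand) and 1 ($r0d), so the result becomes 1, which matches the argument that the latency on $r0q is not a real one.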

Since this is a post-RA problem, I still wonder if it would perhaps be possible to instead remove these operands at some point before the post-RA scheduler (Still don't know what they are really for)? This would be even more simple, I think.

ping!

ping

andreadb added a subscriber: andreadb. Sep 25 2018, 4:18 AM

ping!

Sorry for slow response. LGTM, some nitpicks below but feel free to fix testcases and nitpicks at your own discretion.

lib/CodeGen/ScheduleDAGInstrs.cpp
240–241	How about reversing the condition and calling this: bool ImplicitPseudoDef = (OperIdx >= DefMIDesc->getNumOperands() && !DefMIDesc->hasImplicitDefOfPhysReg(MO.getReg()));
267–269	similar here: bool ImplicitPseudoUse = UseMIDesc && UseOp >= UseMIDesc->getNumOperands() && !UseMIDesc->hasImplicitUseOfPhysReg(*Alias);
test/CodeGen/SystemZ/misched-readadvances.mir
2	If you have the time look at: https://llvm.org/docs/MIRLangRef.html#simplifying-mir-files This smells like you can do things like dropping the IR part, not listing the successor blocks (at least for the blocks that don't use the jumptable)...
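
The ImplicitPseudoDef condition suggested in the inline comments above can be tried out in isolation with a stand-in descriptor. This is only a sketch: `InstrDescModel` is a hypothetical simplification and not LLVM's `MCInstrDesc`, and registers are plain integers.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical, simplified stand-in for an instruction descriptor.
struct InstrDescModel {
  unsigned NumOperands;               // operands declared in tablegen
  std::vector<unsigned> ImplicitDefs; // physregs the descriptor defines implicitly

  bool hasImplicitDefOfPhysReg(unsigned Reg) const {
    return std::find(ImplicitDefs.begin(), ImplicitDefs.end(), Reg) !=
           ImplicitDefs.end();
  }
};

// True when the operand lies past the declared operands and is not one of
// the implicit defs listed in the descriptor, i.e. it was added later.
bool isImplicitPseudoDef(const InstrDescModel &Desc, unsigned OperIdx,
                         unsigned Reg) {
  return OperIdx >= Desc.NumOperands && !Desc.hasImplicitDefOfPhysReg(Reg);
}
```

An operand inside the declared range is never classified as pseudo, and neither is a trailing operand whose register appears in the descriptor's implicit-def list.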

This revision is now accepted and ready to land. Oct 5 2018, 3:00 PM

Updated patch per review. Submitting again since I had to add a (int) cast to silence compiler warning:

bool ImplicitPseudoUse =
    (UseMIDesc && UseOp >= ((int)UseMIDesc->getNumOperands()) &&
     !UseMIDesc->hasImplicitUseOfPhysReg(*Alias));

I would think this should be ok, right? UseOp may be -1 for ExitSU, but that doesn't matter.
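
For reference, the difference the cast makes when the index is -1 can be shown with plain C++. The sketch assumes, as the compiler warning suggests, that the index is a signed int and the operand count is unsigned; both function names are hypothetical.

```cpp
// Without a cast, the usual arithmetic conversions turn the int into an
// unsigned value before comparing. The static_cast here spells out that
// implicit conversion: -1 becomes the largest unsigned value.
bool pastOperandsUnsignedCompare(int UseOp, unsigned NumOperands) {
  return static_cast<unsigned>(UseOp) >= NumOperands;
}

// With the operand count cast to int, the comparison stays signed and -1 is
// simply smaller than any operand count.
bool pastOperandsSignedCompare(int UseOp, unsigned NumOperands) {
  return UseOp >= static_cast<int>(NumOperands);
}
```

Both variants agree for non-negative indices; they only differ for a negative index such as -1.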

Reduced the test case further.

I tried to begin with the test updating, but found that at least the AMDGPU tests had a lot of repeated patterns, which I suspect isn't that much work if you know the assembly dialect, but for me it was easy to get lost.

Since this is a general improvement for any target that cares about operand read advances, I would like to ask if someone from each target please could apply the patch and do the test updating? Just mail me a patch and I'll apply it and put it up here.

I am not sure about the general agreement on test updating, but personally I think this makes collaborative sense, right?

LLVM :: CodeGen/AMDGPU/call-argument-types.ll
LLVM :: CodeGen/AMDGPU/call-preserved-registers.ll
LLVM :: CodeGen/AMDGPU/callee-special-input-sgprs.ll
LLVM :: CodeGen/AMDGPU/indirect-addressing-si.ll
LLVM :: CodeGen/AMDGPU/inline-asm.ll
LLVM :: CodeGen/AMDGPU/insert_vector_elt.ll
LLVM :: CodeGen/AMDGPU/misched-killflags.mir
LLVM :: CodeGen/AMDGPU/nested-calls.ll
LLVM :: CodeGen/AMDGPU/undefined-subreg-liverange.ll
LLVM :: CodeGen/ARM/Windows/chkstk-movw-movt-isel.ll
LLVM :: CodeGen/ARM/Windows/chkstk.ll
LLVM :: CodeGen/ARM/Windows/memset.ll
LLVM :: CodeGen/ARM/arm-and-tst-peephole.ll
LLVM :: CodeGen/ARM/arm-shrink-wrapping.ll
LLVM :: CodeGen/ARM/cortex-a57-misched-ldm-wrback.ll
LLVM :: CodeGen/ARM/cortex-a57-misched-ldm.ll
LLVM :: CodeGen/ARM/cortex-a57-misched-vldm-wrback.ll
LLVM :: CodeGen/ARM/cortex-a57-misched-vldm.ll
LLVM :: CodeGen/ARM/fp16-instructions.ll
LLVM :: CodeGen/ARM/select.ll
LLVM :: CodeGen/ARM/twoaddrinstr.ll
LLVM :: CodeGen/ARM/vcombine.ll
LLVM :: CodeGen/ARM/vuzp.ll
LLVM :: CodeGen/Hexagon/ps_call_nr.ll
LLVM :: CodeGen/Thumb2/umulo-128-legalisation-lowering.ll
LLVM :: CodeGen/Thumb2/umulo-64-legalisation-lowering.ll
LLVM :: CodeGen/X86/lsr-loop-exit-cond.ll
LLVM :: CodeGen/X86/phys-reg-local-regalloc.ll
LLVM :: CodeGen/X86/schedule-x86-64-shld.ll
LLVM :: CodeGen/X86/schedule-x86_32.ll

x86 test changes lgtm

Thanks for the tests patch Simon! I also made one change to phys-reg-local-regalloc.ll that you did not update, which makes all the X86 tests green. I hope that one also looks OK to you?

Herald added a subscriber: qcolombet. Oct 8 2018, 12:26 AM

RKSimon added inline comments. Oct 8 2018, 4:35 AM

test/CodeGen/X86/phys-reg-local-regalloc.ll
25	Sorry - missed that test - looks OK.

RKSimon added a reviewer: arsenm. Oct 10 2018, 3:59 AM

Herald added a subscriber: wdng. Oct 10 2018, 3:59 AM

@arsenm could you please check the tests? They seem to be mostly yours, especially with respect to calls.

I have gone through the tests as best as I can since the progress was slow. I will commit this in a few days if no one objects. Please take a look and review the test changes!

One test was beyond me: Hexagon/ps_call_nr.ll, which fails with 'LLVM ERROR: invalid instruction packet: slot error'.

First difference is after the packetizer, where it seems that a call has now been bundled for some reason, which I am guessing is wrong. Not sure at all how to fix.

 *** IR Dump After Hexagon Packetize   # *** IR Dump After Hexagon Packetize
 # Machine code for function f0: NoPHI   # Machine code for function f0: NoPHI

bb.0.b0:                                bb.0.b0:
  successors: %bb.1(0x00000001); %bb.     successors: %bb.1(0x00000001); %bb.

  renamable $r0 = L2_loadri_io undef      renamable $r0 = L2_loadri_io undef
  BUNDLE implicit-def dead $p0, impli     BUNDLE implicit-def dead $p0, impli
  renamable $p0 = C2_bitsclri kille       renamable $p0 = C2_bitsclri kille
  PS_jmprettnew internal killed $p0       PS_jmprettnew internal killed $p0
  }                                       }

bb.1.b2:                                bb.1.b2:
; predecessors: %bb.0                   ; predecessors: %bb.0

  BUNDLE implicit-def $r29, implicit-     BUNDLE implicit-def $r29, implicit-
  $r29 = S2_allocframe killed $r29(       $r29 = S2_allocframe killed $r29(
  $r3 = A2_tfrsi 0                        $r3 = A2_tfrsi 0
  $r4 = A2_tfrsi 0                        $r4 = A2_tfrsi 0
  }                                       }
  BUNDLE implicit-def $r1, implicit-d     BUNDLE implicit-def $r1, implicit-d
  renamable $r1 = L2_loadri_io kill       renamable $r1 = L2_loadri_io kill
  $r0 = A2_tfrsi @g0                      $r0 = A2_tfrsi @g0
  }                                       }
  BUNDLE implicit-def $r2, implicit-d |   BUNDLE implicit-def dead $r2, impli
  renamable $r2 = S2_extractu renam       renamable $r2 = S2_extractu renam
  renamable $r1 = S2_extractu renam       renamable $r1 = S2_extractu renam
                                      >     PS_call_nr @f1, <regmask $d8 $d9
  }                                       }
  PS_call_nr @f1, <regmask $d8 $d9 $d <

# End machine code for function f0.     # End machine code for function f0.

Herald added subscribers: eraman, nhaehnle, jvesely. Oct 26 2018, 12:48 AM

X86 test changes (still) LGTM

Hexagon packets (bundles) have 4 slots, numbered 0..3. Each one of the three instructions (2 x S2_extractu, and PS_call_nr) can only go in slots 2 or 3, so something went horribly wrong.

Could you post the output from -debug-only=packets?

As it is now, this is very bad, so please do not commit this until we figure this out.
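
The slot argument can be checked mechanically. The sketch below is not the Hexagon packetizer; it is a brute-force feasibility check under the stated assumptions of 4 slots and a per-instruction bitmask of allowed slots, with a hypothetical function name.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// AllowedSlots[i] is a bitmask for instruction i: bit s set means the
// instruction may be placed in slot s (slots 0..3). The function tries every
// assignment of instructions to distinct slots and reports whether one exists.
bool canPacketize(const std::vector<std::uint8_t> &AllowedSlots,
                  std::size_t Idx = 0, std::uint8_t UsedSlots = 0) {
  if (Idx == AllowedSlots.size())
    return true; // every instruction found a slot
  for (unsigned Slot = 0; Slot < 4; ++Slot) {
    const std::uint8_t Bit = static_cast<std::uint8_t>(1u << Slot);
    const bool Allowed = (AllowedSlots[Idx] & Bit) != 0;
    const bool Free = (UsedSlots & Bit) == 0;
    if (Allowed && Free &&
        canPacketize(AllowedSlots, Idx + 1,
                     static_cast<std::uint8_t>(UsedSlots | Bit)))
      return true;
  }
  return false; // no free allowed slot for this instruction
}
```

Three instructions that each accept only slots 2 or 3 (mask 0x0C) cannot share a packet, while any two of them can.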

In D49671#1277445, @kparzysz wrote:

Hexagon packets (bundles) have 4 slots, numbered 0..3. Each one of the three instructions (2 x S2_extractu, and PS_call_nr) can only go in slots 2 or 3, so something went horribly wrong.

Could you post the output from -debug-only=packets?

As it is now, this is very bad, so please do not commit this until we figure this out.

Sure, here are three files: the output with patch / without patch (base) / side-by-side diff

debug_packets.diff (13 KB)

debug_packets.base.txt (9 KB)

debug_packets.txt (8 KB)

In D49671#1257221, @jonpa wrote:

I tried to begin with the test updating, but found that at least the AMDGPU tests had a lot of repeated patterns, which I suspect isn't that much work if you know the assembly dialect, but for me it was easy to get lost.

Do you still need help with the AMDGPU tests?

In D49671#1277479, @tstellar wrote:

In D49671#1257221, @jonpa wrote:

I tried to begin with the test updating, but found that at least the AMDGPU tests had a lot of repeated patterns, which I suspect isn't that much work if you know the assembly dialect, but for me it was easy to get lost.

Do you still need help with the AMDGPU tests?

Yes, please look through the changes I made. Thanks!

AMDGPU tests look good, just the one comment for indirect-addressing-si.ll.

test/CodeGen/AMDGPU/indirect-addressing-si.ll
390	This [3] looks like a typo.

jonpa added inline comments. Oct 26 2018, 9:01 AM

test/CodeGen/AMDGPU/indirect-addressing-si.ll
390	:-) I know that looks weird and suspected you might not like it. The problem was that the VEC_ELT1 register did not match properly further down. IIRC, there were different matches for different subtargets, so I had to force one of the matches into v3 (I suppose I should have removed the '+'). Please help me out and check if there is a better way, or if this is acceptable.

tstellar added inline comments. Oct 26 2018, 9:04 AM

test/CodeGen/AMDGPU/indirect-addressing-si.ll
390	Ok, that's fine. I would remove the + and also the brackets.

In D49671#1277445, @kparzysz wrote:

Hexagon packets (bundles) have 4 slots, numbered 0..3. Each one of the three instructions (2 x S2_extractu, and PS_call_nr) can only go in slots 2 or 3, so something went horribly wrong.

As it is now, this is very bad, so please do not commit this until we figure this out.

I committed a patch that fixes this problem, so ps_call_nr.ll should pass now.

In D49671#1278005, @bcahoon wrote:

In D49671#1277445, @kparzysz wrote:

Hexagon packets (bundles) have 4 slots, numbered 0..3. Each one of the three instructions (2 x S2_extractu, and PS_call_nr) can only go in slots 2 or 3, so something went horribly wrong.

As it is now, this is very bad, so please do not commit this until we figure this out.

I committed a patch that fixes this problem, so ps_call_nr.ll should pass now.

Thanks, I have confirmed that the test case passes now with the patch.

Now that the X86, AMDGPU and Hexagon test failures are handled, only the ARM and Thumb2 test updates are left to be reviewed, please.

Thanks for review. r345606.

Note: Two minor last-minute regenerations of X86 tests: CodeGen/X86/memset.ll and CodeGen/X86/schedule-x86-64-shld.ll

This revision is now accepted and ready to land. Oct 30 2018, 8:12 AM

jonpa closed this revision. Oct 30 2018, 8:12 AM

This patch is causing some problems in my out-of-tree back-end. We add some MachineOperands on the fly for some uses/defs that are conditional or depend on some circumstances, like how registers were allocated, or which depth a loop is at in a loop nest. With this patch, these manually added operands don't work as we intend.

I'm wondering if we could maybe keep the old flexible way to look at MachineOperands and put the functionality which sets the latency to zero in the getOperandLatency hook instead?

I also think it's a bad idea to have some MachineOperands be special but it's not visible in the debug printouts. Now we have to read the tablegen file to understand if an operand is considered fake or not.

Apologies for being late to the party, but I am now looking into this too because we've seen some significant regressions with this change committed.
I am not blaming this commit yet, because I haven't fully understood the problem. As I am new to this area and it takes me some time to get up to speed, I wanted to dump some initial thoughts here; perhaps people can comment.

First, we found the change in test CodeGen/ARM/cortex-a57-misched-ldm-wrback.ll a bit suspicious. Latencies are changed from 1, 3, 3, and 4 to:

; CHECK-SAME:  Latency=1
; CHECK-NEXT:  Data
; CHECK-SAME:  Latency=3
; CHECK-NEXT:  Data
; CHECK-SAME:  Latency=0
; CHECK-NEXT:  Data
; CHECK-SAME:  Latency=0

The last 2 latencies are changed to 0. We are generating an LDM for this case: ldm r0!, {r1, r2, r3}, and I don't yet see why the latencies of the last 2 operands are 0.

This makes us wonder if variadic instructions and instructions with optional defs are ignored/missed in this patch?

uabelho added a subscriber: uabelho.Oct 31 2018, 7:18 AM

Ka-Ka added a subscriber: Ka-Ka.Oct 31 2018, 7:21 AM

This patch is causing some problems in my out-of-tree back-end
...
I'm wondering if we could maybe keep the old flexible way to look at MachineOperands and put the functionality which sets the latency to zero in the getOperandLatency hook instead?

This makes us wonder if variadic instructions and instructions with optional defs are ignored/missed in this patch?

Sorry to hear about the problems!

@materi: I think your ideas make sense. If you have a patch, could you post it, please?

I think @MatzeB and I (correct me if I am wrong) may have overlooked this. We were discussing whether those extra regalloc operands were needed anywhere, and how much easier life would be without them in cases like this, because it doesn't make sense to handle non-tablegen operands in the SchedModel description. So we removed the latencies for the superregs since they were redundant, and forgot about *other* pseudo implicit operands that are *not* redundant.

My first idea for this patch was to loop over the operands / read advances of the instruction in order to propagate read advances to the non-tablegen super-reg operands (see earlier patch proposal under 'History'). This is more arduous than simply clearing those latencies during DAG construction, per what was committed. I think however this should work for you as well, or?
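To make the "propagate" alternative concrete, here is a minimal standalone sketch. It is not the earlier patch from the revision history. UseOperand, its SuperReg field and effectiveReadAdvance are hypothetical names, and the SuperReg field is a toy encoding of the sub-register relation rather than anything taken from TargetRegisterInfo.

```cpp
#include <cassert>
#include <vector>

// Hypothetical, simplified view of a use operand (not llvm::MachineOperand).
struct UseOperand {
  unsigned Reg;      // register read by this operand (non-zero)
  unsigned SuperReg; // enclosing super-register, 0 if none (toy encoding)
  int ReadAdvance;   // cycles from the sched model; only set for descriptor operands
};

// Returns the read advance to apply to operand UseIdx. Operands past the
// descriptor have no read advance of their own, so they borrow the one from
// the descriptor operand that reads a sub-register of the same register.
inline int effectiveReadAdvance(const std::vector<UseOperand> &Ops,
                                unsigned NumDescOperands, unsigned UseIdx) {
  assert(UseIdx < Ops.size() && "operand index out of range");
  if (UseIdx < NumDescOperands)
    return Ops[UseIdx].ReadAdvance;
  for (unsigned I = 0; I < NumDescOperands && I < Ops.size(); ++I)
    if (Ops[I].SuperReg == Ops[UseIdx].Reg)
      return Ops[I].ReadAdvance;
  return 0; // no matching descriptor operand: no adjustment
}
```

With two descriptor operands and one extra super-register use, the extra operand inherits the read advance of the sub-register operand it covers, so both edges into the instruction see the same adjusted latency.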

I wonder if it would be enough to make a rule to clear latencies only on *implicit* extra operands, and not on explicit ones? In other words if added *explicit* operands were left alone, this would not break anything? But I am not sure if this is possible with variadic instructions, or if it's a good idea...

In D49671#1283621, @jonpa wrote:

I'm wondering if we could maybe keep the old flexible way to look at MachineOperands and put the functionality which sets the latency to zero in the getOperandLatency hook instead?

@materi: I think your ideas make sense. If you have a patch, could you post it, please?

What I meant was to have subtargets override the default latency calculation if they want to ignore operands that are not in the MCInstrDesc. That is, implement SystemZ::getOperandLatency() and have it return 0 in the cases where DefIdx or UseIdx is too large.
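A rough standalone sketch of that hook-based variant follows. LatencyModel, SubRegTargetLatencyModel and InstrInfo are hypothetical miniatures written for illustration; they are not the LLVM TargetInstrInfo interface, and the real hook has a different signature.

```cpp
#include <cassert>

// Hypothetical miniature of an instruction: only the descriptor operand count.
struct InstrInfo {
  unsigned NumDescOperands;
};

class LatencyModel {
public:
  virtual ~LatencyModel() = default;
  // Default behaviour: every operand pair gets the scheduling-model latency.
  virtual unsigned getOperandLatency(const InstrInfo &DefMI, unsigned DefIdx,
                                     const InstrInfo &UseMI, unsigned UseIdx,
                                     unsigned ModelLatency) const {
    return ModelLatency;
  }
};

// A target that opts in to ignoring operands outside the descriptor.
class SubRegTargetLatencyModel : public LatencyModel {
public:
  unsigned getOperandLatency(const InstrInfo &DefMI, unsigned DefIdx,
                             const InstrInfo &UseMI, unsigned UseIdx,
                             unsigned ModelLatency) const override {
    if (DefIdx >= DefMI.NumDescOperands || UseIdx >= UseMI.NumDescOperands)
      return 0; // operand index is "too large": not described in tablegen
    return LatencyModel::getOperandLatency(DefMI, DefIdx, UseMI, UseIdx,
                                           ModelLatency);
  }
};
```

The point of this shape is that the generic model keeps treating all operands alike, and only targets that override the hook see zero latency on extra operands.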

I don't have a patch for this and I really don't know which targets want the new behavior.

My first idea for this patch was to loop over the operands / read advances of the instruction in order to propagate read advances to the non-tablegen super-reg operands (see earlier patch proposal under 'History'). This is more arduous than simply clearing those latencies during DAG construction, per what was committed. I think however this should work for you as well, or?

I wonder if it would be enough to make a rule to clear latencies only on *implicit* extra operands, and not on explicit ones? In other words if added *explicit* operands were left alone, this would not break anything? But I am not sure if this is possible with variadic instructions, or if it's a good idea...

I don't like any implementation that has first-class and second-class MachineOperands. At least I think it's a bad idea to have this in the default implementation, doing it in a target hook makes sense though.

I don't have a patch for this and I really don't know which targets want the new behavior.

I think this behavior is wanted by all targets that define a SchedModel that includes ReadAdvances as well as using subregisters. This seems to include AArch64, ARM, X86 and SystemZ, as far as I can tell.

Therefore I think this would be worth resolving in common code...

In D49671#1283704, @materi wrote:

In D49671#1283621, @jonpa wrote:

I'm wondering if we could maybe keep the old flexible way to look at MachineOperands and put the functionality which sets the latency to zero in the getOperandLatency hook instead?

@materi: I think your ideas make sense. If you have a patch, could you post it, please?

What I meant was to have subtargets override the default latency calculation if they want to ignore operands that are not in the MCInstrDesc. That is, implement SystemZ::getOperandLatency() and have it return 0 in the cases where DefIdx or UseIdx is too large.

I don't have a patch for this and I really don't know which targets want the new behavior.

My first idea for this patch was to loop over the operands / read advances of the instruction in order to propagate read advances to the non-tablegen super-reg operands (see earlier patch proposal under 'History'). This is more arduous than simply clearing those latencies during DAG construction, per what was committed. I think however this should work for you as well, or?

I wonder if it would be enough to make a rule to clear latencies only on *implicit* extra operands, and not on explicit ones? In other words if added *explicit* operands were left alone, this would not break anything? But I am not sure if this is possible with variadic instructions, or if it's a good idea...

I don't like any implementation that has first-class and second-class MachineOperands. At least I think it's a bad idea to have this in the default implementation, doing it in a target hook makes sense though.

I don't think having a target specific hook is good enough here because some of the problematic operands are generated by generic register allocation code; you will get them as soon as you have subregisters in the mix.

Some words about the different kinds of operands:

The extra operands do make sense semantically and are necessary for our modeling of things. The thing I regret though is that just being an implicit operand can mean two things today: It's an operand that isn't explicitly emitted to assembly/encoded or it's an operand that does not correspond to a read/write access in hardware or both. In this patch we only want to catch the 2nd kind, but not purely cases of the first. While it is unfortunate to not have this modeled as two separate bits today, it feels to me like the heuristic is close enough. Should we try again with an extra MO.isImplicit() in the condition?
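As a sketch of what the extra MO.isImplicit() test would change, the two predicates can be compared side by side. OperandInfo and both function names are hypothetical; the first predicate mirrors the condition in the committed diff, and the second is only one reading of the suggestion. It assumes that the variadic operands of concern are attached as explicit operands, which the thread does not confirm.

```cpp
#include <cassert>

// Hypothetical summary of one machine operand, for illustration only.
struct OperandInfo {
  unsigned Index;              // position in the instruction's operand list
  bool IsImplicit;             // stands in for MO.isImplicit()
  bool DeclaredImplicitInDesc; // stands in for hasImplicitDefOfPhysReg /
                               // hasImplicitUseOfPhysReg on this register
};

// Condition as committed: beyond the descriptor and not a declared implicit.
inline bool isPseudoAsCommitted(const OperandInfo &Op,
                                unsigned NumDescOperands) {
  return Op.Index >= NumDescOperands && !Op.DeclaredImplicitInDesc;
}

// Suggested variant: additionally require the operand to be implicit.
inline bool isPseudoWithImplicitCheck(const OperandInfo &Op,
                                      unsigned NumDescOperands) {
  return Op.IsImplicit && isPseudoAsCommitted(Op, NumDescOperands);
}
```

Under that assumption, an explicit operand at index 5 of an instruction whose descriptor declares 5 operands is classified as pseudo by the first predicate but not by the second, while an implicit super-register operand added by regalloc is classified as pseudo by both.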

In D49671#1282086, @materi wrote:

This patch is causing some problems in my out-of-tree back-end. We add some MachineOperands on the fly for some uses/defs that are conditional or depend on some circumstances, like how registers were allocated, or which depth a loop is at in a loop nest. With this patch, these manually added operands don't work as we intend.

Would you be in a position to mark your extra operands as explicit operands since the machine does appear to be reading/writing them in your case? (Maybe you have to mark the instruction as variadic, or I would be open to invent a new MCInstrDesc flag if that helps...)

I'm wondering if we could maybe keep the old flexible way to look at MachineOperands and put the functionality which sets the latency to zero in the getOperandLatency hook instead?

I also think it's a bad idea to have some MachineOperands be special but it's not visible in the debug printouts. Now we have to read the tablegen file to understand if an operand is considered fake or not.

I don't like it either, but this fact is true independently of this patch...

In D49671#1284997, @MatzeB wrote:

In D49671#1283704, @materi wrote:

I don't like any implementation that has first-class and second-class MachineOperands. At least I think it's a bad idea to have this in the default implementation, doing it in a target hook makes sense though.

I don't think having a target specific hook is good enough here because some of the problematic operands are generated by generic register allocation code; you will get them as soon as you have subregisters in the mix.

Hmm, also if you use TracksSubRegLiveness?

Some words about the different kinds of operands:

The extra operands do make sense semantically and are necessary for our modeling of things. The thing I regret though is that just being an implicit operand can mean two things today: It's an operand that isn't explicitly emitted to assembly/encoded or it's an operand that does not correspond to a read/write access in hardware or both. In this patch we only want to catch the 2nd kind, but not purely cases of the first. While it is unfortunate to not have this modeled as two separate bits today, it feels to me like the heuristic is close enough. Should we try again with an extra MO.isImplicit() in the condition?

I agree that this overloading is unfortunate! How hard would it be to split these operands in two different types?

I also wonder if the only purpose of implicit operands added in regalloc is for liveness modeling? What about things like AH/AL on x86 where writing a subregister spills over to the neighbor?

In D49671#1284998, @MatzeB wrote:

In D49671#1282086, @materi wrote:

This patch is causing some problems in my out-of-tree back-end. We add some MachineOperands on the fly for some uses/defs that are conditional or depend on some circumstances, like how registers were allocated, or which depth a loop is at in a loop nest. With this patch, these manually added operands don't work as we intend.

Would you be in a position to mark your extra operands as explicit operands since the machine does appear to be reading/writing them in your case? (Maybe you have to mark the instruction as variadic, or I would be open to invent a new MCInstrDesc flag if that helps...)

Yes I think we could change them to explicit operands. I have not tried, but I can't think of any reason it would not work.

I'm wondering if we could maybe keep the old flexible way to look at MachineOperands and put the functionality which sets the latency to zero in the getOperandLatency hook instead?

I also think it's a bad idea to have some MachineOperands be special but it's not visible in the debug printouts. Now we have to read the tablegen file to understand if an operand is considered fake or not.

I don't like it either, but this fact is true independently of this patch...

As I understand it, once the operand is attached to the MI, LLVM treated it the same regardless of where it came from. So the semantics of the operand were obvious from looking at MI->dump().

Maybe the register allocator should add the implicit-def as an IMPLICIT_DEF instruction just before the MI instead of attaching a bogus impl def on the MI (if the MI uses parts of the register I guess the IMPLICIT_DEF needs a corresponding implicit use).

So instead of

After Coalescing:

16B       undef %5.subreg_l64:gr128bit = LGFRL @seedi :: (dereferenceable load 4 from @seedi)

After RA:

renamable $r1d = LGFRL @seedi, implicit-def $r0q :: (dereferenceable load 4 from @seedi)

you would get

After Coalescing:

16B       undef %5.subreg_l64:gr128bit = LGFRL @seedi :: (dereferenceable load 4 from @seedi)

After RA:

$r0q = IMPLICIT_DEF
renamable $r1d = LGFRL @seedi :: (dereferenceable load 4 from @seedi)

This way the definitions added to LGFRL map to the actual definitions done by the instruction. The "fake" implicit defs are separated from the LGFRL instruction, and the scheduler wouldn't see the false dependency toward LGFRL (instead it would have a DAG edge toward the IMPLICIT_DEF, but I guess that would get zero latency?). Wouldn't that help with both the problem that this ticket was aiming at solving, and the problems that @materi is describing?

So there would be a slight difference between implicit-def operands and IMPLICIT_DEF:

An implicit-def operand would be seen as a side effect performed by the instruction.
An IMPLICIT_DEF instruction would be used to model that a register is undefined (for liveness purposes).

For the record, I haven't put too much thought into the above. But afaict we can have IMPLICIT_DEF instructions after RA already today. Maybe the problem is that RA passes are adding implicit defs to real instructions, when they actually should add IMPLICIT_DEF instructions instead when modelling undef?

In D49671#1285281, @materi wrote:

In D49671#1284997, @MatzeB wrote:

Some words about the different kinds of operands:

The extra operands do make sense semantically and are necessary for our modeling of things. The thing I regret though is that just being an implicit operand can mean two things today: It's an operand that isn't explicitly emitted to assembly/encoded or it's an operand that does not correspond to a read/write access in hardware or both. In this patch we only want to catch the 2nd kind, but not purely cases of the first. While it is unfortunate to not have this modeled as two separate bits today, it feels to me like the heuristic is close enough. Should we try again with an extra MO.isImplicit() in the condition?

I agree that this overloading is unfortunate! How hard would it be to split these operands in two different types?

I tried adding a bit in MachineOperand which is set by VirtRegMap when handling SuperDefs/SuperKills/SuperDeads. That makes it possible to filter out that kind of dependency in addPhysRegDeps. It seems to work fine and avoids having to look at MCInstrDesc.

Adding a bit in MachineOperand increases the size of the class; I don't know if that's a big deal.
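The flag-bit experiment could look roughly like the sketch below. Operand, IsRegAllocSuperReg, makeSuperRegOperand and edgeLatency are hypothetical names, not LLVM API. The sketch zeroes the latency on such edges; the comment above does not say whether the experiment zeroed the latency or dropped the dependency altogether.

```cpp
#include <cassert>

// Hypothetical stand-in for a machine operand with one extra flag bit.
struct Operand {
  unsigned Reg = 0;
  bool IsDef = false;
  bool IsImplicit = false;
  bool IsRegAllocSuperReg = false; // assumed new bit, set only by regalloc
};

// Mirrors what a register rewriter might do when it adds a super-register
// def/kill/dead operand purely for liveness.
inline Operand makeSuperRegOperand(unsigned SuperReg, bool IsDef) {
  Operand MO;
  MO.Reg = SuperReg;
  MO.IsDef = IsDef;
  MO.IsImplicit = true;
  MO.IsRegAllocSuperReg = true;
  return MO;
}

// Latency for a def->use edge: zero when either end exists only for liveness,
// so the decision no longer depends on the instruction descriptor.
inline unsigned edgeLatency(const Operand &Def, const Operand &Use,
                            unsigned ModelLatency) {
  if (Def.IsRegAllocSuperReg || Use.IsRegAllocSuperReg)
    return 0;
  return ModelLatency;
}
```

Because the flag travels with the operand, operands added by hand in a backend keep their model latency unless they set the bit themselves.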

In D49671#1294302, @bjope wrote:

For the record, I haven't put too much thought into the above. But afaict we can have IMPLICIT_DEF instructions after RA already today. Maybe the problem is that RA passes are adding implicit defs to real instructions, when they actually should add IMPLICIT_DEF instructions instead when modelling undef?

I think this is an interesting idea. But does this work without enabling TracksSubRegLiveness if LivenessAfterRegalloc is enabled? (This is so complicated! Maybe it's not an issue at all?)

Anyway, I think that the current code is broken and may miscompile code with SDNPVariadic instructions.

Revision Contents

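Path                                                    Size

lib/CodeGen/ScheduleDAGInstrs.cpp                       20 lines
test/CodeGen/AMDGPU/call-argument-types.ll              22 lines
test/CodeGen/AMDGPU/call-preserved-registers.ll         36 lines
test/CodeGen/AMDGPU/callee-special-input-sgprs.ll       3 lines
test/CodeGen/AMDGPU/indirect-addressing-si.ll           6 lines
test/CodeGen/AMDGPU/inline-asm.ll                       4 lines
test/CodeGen/AMDGPU/insert_vector_elt.ll                2 lines
test/CodeGen/AMDGPU/misched-killflags.mir               12 lines
test/CodeGen/AMDGPU/nested-calls.ll                     4 lines
test/CodeGen/AMDGPU/undefined-subreg-liverange.ll       12 lines
test/CodeGen/ARM/Windows/chkstk-movw-movt-isel.ll       6 lines
test/CodeGen/ARM/Windows/chkstk.ll                      6 lines
test/CodeGen/ARM/Windows/memset.ll                      4 lines
test/CodeGen/ARM/arm-and-tst-peephole.ll                2 lines
test/CodeGen/ARM/arm-shrink-wrapping.ll                 28 lines
test/CodeGen/ARM/cortex-a57-misched-ldm-wrback.ll       4 lines
test/CodeGen/ARM/cortex-a57-misched-ldm.ll              2 lines
test/CodeGen/ARM/cortex-a57-misched-vldm-wrback.ll      4 lines
test/CodeGen/ARM/cortex-a57-misched-vldm.ll             4 lines
test/CodeGen/ARM/fp16-instructions.ll                   4 lines
test/CodeGen/ARM/select.ll                              2 lines
test/CodeGen/ARM/twoaddrinstr.ll                        4 lines
test/CodeGen/ARM/vcombine.ll                            8 lines
test/CodeGen/ARM/vuzp.ll                                242 lines
test/CodeGen/SystemZ/misched-readadvances.mir           31 lines
test/CodeGen/Thumb2/umulo-128-legalisation-lowering.ll  4 lines
test/CodeGen/Thumb2/umulo-64-legalisation-lowering.ll   4 lines
test/CodeGen/X86/lsr-loop-exit-cond.ll                  8 lines
test/CodeGen/X86/phys-reg-local-regalloc.ll             4 lines
test/CodeGen/X86/schedule-x86-64-shld.ll                4 lines
test/CodeGen/X86/schedule-x86_32.ll                     10 lines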

Diff 171255

lib/CodeGen/ScheduleDAGInstrs.cpp

(228 lines not shown)
 /// data dependencies from SU to any uses of the physical register.
 void ScheduleDAGInstrs::addPhysRegDataDeps(SUnit *SU, unsigned OperIdx) {
   const MachineOperand &MO = SU->getInstr()->getOperand(OperIdx);
   assert(MO.isDef() && "expect physreg def");

   // Ask the target if address-backscheduling is desirable, and if so how much.
   const TargetSubtargetInfo &ST = MF.getSubtarget();

+  // Only use any non-zero latency for real defs/uses, in contrast to
+  // "fake" operands added by regalloc.
+  const MCInstrDesc *DefMIDesc = &SU->getInstr()->getDesc();
+  bool ImplicitPseudoDef = (OperIdx >= DefMIDesc->getNumOperands() &&
+                            !DefMIDesc->hasImplicitDefOfPhysReg(MO.getReg()));
MatzeB (inline comment): How about reversing the condition and calling this: bool ImplicitPseudoDef = (OperIdx >= DefMIDesc->getNumOperands() && !DefMIDesc->hasImplicitDefOfPhysReg(MO.getReg()));
   for (MCRegAliasIterator Alias(MO.getReg(), TRI, true);
        Alias.isValid(); ++Alias) {
     if (!Uses.contains(*Alias))
       continue;
     for (Reg2SUnitsMap::iterator I = Uses.find(*Alias); I != Uses.end(); ++I) {
       SUnit *UseSU = I->SU;
       if (UseSU == SU)
         continue;

       // Adjust the dependence latency using operand def/use information,
       // then allow the target to perform its own adjustments.
       int UseOp = I->OpIdx;
       MachineInstr *RegUse = nullptr;
       SDep Dep;
       if (UseOp < 0)
         Dep = SDep(SU, SDep::Artificial);
       else {
         // Set the hasPhysRegDefs only for physreg defs that have a use within
         // the scheduling region.
         SU->hasPhysRegDefs = true;
         Dep = SDep(SU, SDep::Data, *Alias);
         RegUse = UseSU->getInstr();
       }
-      Dep.setLatency(
-        SchedModel.computeOperandLatency(SU->getInstr(), OperIdx, RegUse,
-                                         UseOp));
+      const MCInstrDesc *UseMIDesc =
+          (RegUse ? &UseSU->getInstr()->getDesc() : nullptr);
+      bool ImplicitPseudoUse =
+          (UseMIDesc && UseOp >= ((int)UseMIDesc->getNumOperands()) &&
+           !UseMIDesc->hasImplicitUseOfPhysReg(*Alias));
MatzeB (inline comment): similar here: bool ImplicitPseudoUse = UseMIDesc && UseOp >= UseMIDesc->getNumOperands() && !UseMIDesc->hasImplicitUseOfPhysReg(*Alias);
+      if (!ImplicitPseudoDef && !ImplicitPseudoUse) {
+        Dep.setLatency(SchedModel.computeOperandLatency(SU->getInstr(), OperIdx,
+                                                        RegUse, UseOp));
         ST.adjustSchedDependency(SU, UseSU, Dep);
+      } else
+        Dep.setLatency(0);

       UseSU->addPred(Dep);
     }
   }
 }

 /// Adds register dependencies (data, anti, and output) from this SUnit
 /// to following instructions in the same scheduling region that depend the
 /// physical register referenced at OperIdx.
(1,177 lines not shown)

test/CodeGen/AMDGPU/call-argument-types.ll

(55 lines not shown)

 ; FIXME: Should be passing -1
 ; GCN-LABEL: {{^}}test_call_external_void_func_i1_imm:
 ; MESA: s_mov_b32 s36, SCRATCH_RSRC_DWORD

 ; MESA-DAG: s_mov_b64 s[0:1], s[36:37]

+; GCN: v_mov_b32_e32 v0, 1{{$}}
+; MESA-DAG: s_mov_b64 s[2:3], s[38:39]
 ; GCN: s_getpc_b64 s{{\[}}[[PC_LO:[0-9]+]]:[[PC_HI:[0-9]+]]{{\]}}
 ; GCN-NEXT: s_add_u32 s[[PC_LO]], s[[PC_LO]], external_void_func_i1@rel32@lo+4
 ; GCN-NEXT: s_addc_u32 s[[PC_HI]], s[[PC_HI]], external_void_func_i1@rel32@hi+4
-; GCN-DAG: v_mov_b32_e32 v0, 1{{$}}
-; MESA-DAG: s_mov_b64 s[2:3], s[38:39]

 ; GCN: s_swappc_b64 s[30:31], s{{\[}}[[PC_LO]]:[[PC_HI]]{{\]}}
 ; GCN-NEXT: s_endpgm
 define amdgpu_kernel void @test_call_external_void_func_i1_imm() #0 {
 call void @external_void_func_i1(i1 true)
 ret void
 }

(41 lines not shown)
 define amdgpu_kernel void @test_call_external_void_func_i1_zeroext(i32) #0 {
 %var = load volatile i1, i1 addrspace(1)* undef
 call void @external_void_func_i1_zeroext(i1 %var)
 ret void
 }

 ; GCN-LABEL: {{^}}test_call_external_void_func_i8_imm:
 ; MESA-DAG: s_mov_b32 s33, s3{{$}}

+; GCN: v_mov_b32_e32 v0, 0x7b
+; HSA-DAG: s_mov_b32 s4, s33{{$}}
 ; GCN: s_getpc_b64 s{{\[}}[[PC_LO:[0-9]+]]:[[PC_HI:[0-9]+]]{{\]}}
 ; GCN-NEXT: s_add_u32 s[[PC_LO]], s[[PC_LO]], external_void_func_i8@rel32@lo+4
 ; GCN-NEXT: s_addc_u32 s[[PC_HI]], s[[PC_HI]], external_void_func_i8@rel32@hi+4
-; GCN-NEXT: v_mov_b32_e32 v0, 0x7b

-; HSA-DAG: s_mov_b32 s4, s33{{$}}
 ; GCN-DAG: s_mov_b32 s32, s33{{$}}

 ; GCN: s_swappc_b64 s[30:31], s{{\[}}[[PC_LO]]:[[PC_HI]]{{\]}}
 ; GCN-NEXT: s_endpgm
 define amdgpu_kernel void @test_call_external_void_func_i8_imm(i32) #0 {
 call void @external_void_func_i8(i8 123)
 ret void
 }

 ; FIXME: don't wait before call
 ; GCN-LABEL: {{^}}test_call_external_void_func_i8_signext:
 ; HSA-DAG: s_mov_b32 s33, s9{{$}}
 ; MESA-DAG: s_mov_b32 s33, s3{{$}}

 ; GCN-DAG: buffer_load_sbyte v0
+; GCN: s_mov_b32 s4, s33
 ; GCN: s_getpc_b64 s{{\[}}[[PC_LO:[0-9]+]]:[[PC_HI:[0-9]+]]{{\]}}
 ; GCN-NEXT: s_add_u32 s[[PC_LO]], s[[PC_LO]], external_void_func_i8_signext@rel32@lo+4
 ; GCN-NEXT: s_addc_u32 s[[PC_HI]], s[[PC_HI]], external_void_func_i8_signext@rel32@hi+4

-; GCN-DAG: s_mov_b32 s4, s33
 ; GCN-DAG: s_mov_b32 s32, s3

 ; GCN: s_waitcnt vmcnt(0)
 ; GCN-NEXT: s_swappc_b64 s[30:31], s{{\[}}[[PC_LO]]:[[PC_HI]]{{\]}}
 ; GCN-NEXT: s_endpgm
 define amdgpu_kernel void @test_call_external_void_func_i8_signext(i32) #0 {
 %var = load volatile i8, i8 addrspace(1)* undef
 call void @external_void_func_i8_signext(i8 %var)
 ret void
 }

 ; GCN-LABEL: {{^}}test_call_external_void_func_i8_zeroext:
 ; MESA-DAG: s_mov_b32 s33, s3{{$}}
 ; HSA-DAG: s_mov_b32 s33, s9{{$}}

 ; GCN-DAG: buffer_load_ubyte v0
+; GCN: s_mov_b32 s4, s33
 ; GCN: s_getpc_b64 s{{\[}}[[PC_LO:[0-9]+]]:[[PC_HI:[0-9]+]]{{\]}}
 ; GCN-NEXT: s_add_u32 s[[PC_LO]], s[[PC_LO]], external_void_func_i8_zeroext@rel32@lo+4
 ; GCN-NEXT: s_addc_u32 s[[PC_HI]], s[[PC_HI]], external_void_func_i8_zeroext@rel32@hi+4

-; GCN-DAG: s_mov_b32 s4, s33
 ; GCN-DAG: s_mov_b32 s32, s33

 ; GCN: s_waitcnt vmcnt(0)
 ; GCN-NEXT: s_swappc_b64 s[30:31], s{{\[}}[[PC_LO]]:[[PC_HI]]{{\]}}
 ; GCN-NEXT: s_endpgm
 define amdgpu_kernel void @test_call_external_void_func_i8_zeroext(i32) #0 {
 %var = load volatile i8, i8 addrspace(1)* undef
 call void @external_void_func_i8_zeroext(i8 %var)
(11 lines not shown)
 define amdgpu_kernel void @test_call_external_void_func_i16_imm() #0 {
 call void @external_void_func_i16(i16 123)
 ret void
 }

 ; GCN-LABEL: {{^}}test_call_external_void_func_i16_signext:
 ; MESA-DAG: s_mov_b32 s33, s3{{$}}

 ; GCN-DAG: buffer_load_sshort v0
+; GCN: s_mov_b32 s4, s33
 ; GCN: s_getpc_b64 s{{\[}}[[PC_LO:[0-9]+]]:[[PC_HI:[0-9]+]]{{\]}}
 ; GCN-NEXT: s_add_u32 s[[PC_LO]], s[[PC_LO]], external_void_func_i16_signext@rel32@lo+4
 ; GCN-NEXT: s_addc_u32 s[[PC_HI]], s[[PC_HI]], external_void_func_i16_signext@rel32@hi+4

-; GCN-DAG: s_mov_b32 s4, s33
 ; GCN-DAG: s_mov_b32 s32, s33

 ; GCN: s_waitcnt vmcnt(0)
 ; GCN-NEXT: s_swappc_b64 s[30:31], s{{\[}}[[PC_LO]]:[[PC_HI]]{{\]}}
 ; GCN-NEXT: s_endpgm
 define amdgpu_kernel void @test_call_external_void_func_i16_signext(i32) #0 {
 %var = load volatile i16, i16 addrspace(1)* undef
 call void @external_void_func_i16_signext(i16 %var)
 ret void
 }

 ; GCN-LABEL: {{^}}test_call_external_void_func_i16_zeroext:
 ; MESA-DAG: s_mov_b32 s33, s3{{$}}


 ; GCN-DAG: buffer_load_ushort v0
+; GCN: s_mov_b32 s4, s33
 ; GCN: s_getpc_b64 s{{\[}}[[PC_LO:[0-9]+]]:[[PC_HI:[0-9]+]]{{\]}}
 ; GCN-NEXT: s_add_u32 s[[PC_LO]], s[[PC_LO]], external_void_func_i16_zeroext@rel32@lo+4
 ; GCN-NEXT: s_addc_u32 s[[PC_HI]], s[[PC_HI]], external_void_func_i16_zeroext@rel32@hi+4

-; GCN-DAG: s_mov_b32 s4, s33
 ; GCN-DAG: s_mov_b32 s32, s33

 ; GCN: s_waitcnt vmcnt(0)
 ; GCN-NEXT: s_swappc_b64 s[30:31], s{{\[}}[[PC_LO]]:[[PC_HI]]{{\]}}
 ; GCN-NEXT: s_endpgm
 define amdgpu_kernel void @test_call_external_void_func_i16_zeroext(i32) #0 {
 %var = load volatile i16, i16 addrspace(1)* undef
 call void @external_void_func_i16_zeroext(i16 %var)
 ret void
 }

 ; GCN-LABEL: {{^}}test_call_external_void_func_i32_imm:
 ; MESA-DAG: s_mov_b32 s33, s3{{$}}

+; GCN: v_mov_b32_e32 v0, 42
+; GCN: s_mov_b32 s4, s33
 ; GCN: s_getpc_b64 s{{\[}}[[PC_LO:[0-9]+]]:[[PC_HI:[0-9]+]]{{\]}}
 ; GCN-NEXT: s_add_u32 s[[PC_LO]], s[[PC_LO]], external_void_func_i32@rel32@lo+4
 ; GCN-NEXT: s_addc_u32 s[[PC_HI]], s[[PC_HI]], external_void_func_i32@rel32@hi+4
-; GCN: v_mov_b32_e32 v0, 42
-; GCN-DAG: s_mov_b32 s4, s33
 ; GCN-DAG: s_mov_b32 s32, s33

 ; GCN: s_swappc_b64 s[30:31], s{{\[}}[[PC_LO]]:[[PC_HI]]{{\]}}
 ; GCN-NEXT: s_endpgm
 define amdgpu_kernel void @test_call_external_void_func_i32_imm(i32) #0 {
 call void @external_void_func_i32(i32 42)
 ret void
 }
▲ Show 20 Lines • Show All 223 Lines • ▼ Show 20 Lines	define amdgpu_kernel void @test_call_external_void_func_v2i32_imm() #0 {
call void @external_void_func_v2i32(<2 x i32> <i32 1, i32 2>)		call void @external_void_func_v2i32(<2 x i32> <i32 1, i32 2>)
ret void		ret void
}		}

; GCN-LABEL: {{^}}test_call_external_void_func_v3i32_imm:		; GCN-LABEL: {{^}}test_call_external_void_func_v3i32_imm:
; HSA-DAG: s_mov_b32 s33, s9		; HSA-DAG: s_mov_b32 s33, s9
; MESA-DAG: s_mov_b32 s33, s3{{$}}		; MESA-DAG: s_mov_b32 s33, s3{{$}}

		; GCN-NOT: v3
; GCN-DAG: v_mov_b32_e32 v0, 3		; GCN-DAG: v_mov_b32_e32 v0, 3
; GCN-DAG: v_mov_b32_e32 v1, 4		; GCN-DAG: v_mov_b32_e32 v1, 4
; GCN-DAG: v_mov_b32_e32 v2, 5		; GCN-DAG: v_mov_b32_e32 v2, 5
; GCN-NOT: v3

; GCN: s_swappc_b64		; GCN: s_swappc_b64
define amdgpu_kernel void @test_call_external_void_func_v3i32_imm(i32) #0 {		define amdgpu_kernel void @test_call_external_void_func_v3i32_imm(i32) #0 {
call void @external_void_func_v3i32(<3 x i32> <i32 3, i32 4, i32 5>)		call void @external_void_func_v3i32(<3 x i32> <i32 3, i32 4, i32 5>)
ret void		ret void
}		}

; GCN-LABEL: {{^}}test_call_external_void_func_v3i32_i32:		; GCN-LABEL: {{^}}test_call_external_void_func_v3i32_i32:
[281 lines not shown]

test/CodeGen/AMDGPU/call-preserved-registers.ll

; RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=fiji -enable-ipra=0 -verify-machineinstrs < %s \| FileCheck -check-prefix=GCN %s		; RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=fiji -enable-ipra=0 -verify-machineinstrs < %s \| FileCheck -check-prefix=GCN %s
; RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=hawaii -enable-ipra=0 -verify-machineinstrs < %s \| FileCheck -check-prefix=GCN %s		; RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=hawaii -enable-ipra=0 -verify-machineinstrs < %s \| FileCheck -check-prefix=GCN %s
; RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx900 -enable-ipra=0 -verify-machineinstrs < %s \| FileCheck -check-prefix=GCN %s		; RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx900 -enable-ipra=0 -verify-machineinstrs < %s \| FileCheck -check-prefix=GCN %s

declare void @external_void_func_void() #0		declare void @external_void_func_void() #0

; GCN-LABEL: {{^}}test_kernel_call_external_void_func_void_clobber_s30_s31_call_external_void_func_void:		; GCN-LABEL: {{^}}test_kernel_call_external_void_func_void_clobber_s30_s31_call_external_void_func_void:
; GCN: s_mov_b32 s33, s7		; GCN: s_mov_b32 s33, s7
; GCN: s_getpc_b64 s[34:35]		; GCN: s_mov_b32 s4, s33
		; GCN-NEXT: s_getpc_b64 s[34:35]
; GCN-NEXT: s_add_u32 s34, s34,		; GCN-NEXT: s_add_u32 s34, s34,
; GCN-NEXT: s_addc_u32 s35, s35,		; GCN-NEXT: s_addc_u32 s35, s35,
; GCN-NEXT: s_mov_b32 s4, s33
; GCN-NEXT: s_mov_b32 s32, s33		; GCN-NEXT: s_mov_b32 s32, s33
; GCN: s_swappc_b64 s[30:31], s[34:35]		; GCN: s_swappc_b64 s[30:31], s[34:35]

; GCN-NEXT: s_mov_b32 s4, s33		; GCN-NEXT: s_mov_b32 s4, s33
; GCN-NEXT: #ASMSTART		; GCN-NEXT: #ASMSTART
; GCN-NEXT: #ASMEND		; GCN-NEXT: #ASMEND
; GCN-NEXT: s_swappc_b64 s[30:31], s[34:35]		; GCN-NEXT: s_swappc_b64 s[30:31], s[34:35]
define amdgpu_kernel void @test_kernel_call_external_void_func_void_clobber_s30_s31_call_external_void_func_void() #0 {		define amdgpu_kernel void @test_kernel_call_external_void_func_void_clobber_s30_s31_call_external_void_func_void() #0 {
[103 lines not shown]
define amdgpu_kernel void @test_call_void_func_void_mayclobber_v31(i32 addrspace(1)* %out) #0 {
%v31 = call i32 asm sideeffect "; def $0", "={v31}"()		%v31 = call i32 asm sideeffect "; def $0", "={v31}"()
call void @external_void_func_void()		call void @external_void_func_void()
call void asm sideeffect "; use $0", "{v31}"(i32 %v31)		call void asm sideeffect "; use $0", "{v31}"(i32 %v31)
ret void		ret void
}		}

; GCN-LABEL: {{^}}test_call_void_func_void_preserves_s33:		; GCN-LABEL: {{^}}test_call_void_func_void_preserves_s33:
; GCN: s_mov_b32 s34, s9		; GCN: s_mov_b32 s34, s9
; GCN: ; def s33		; GCN: s_mov_b32 s4, s34
; GCN-NEXT: #ASMEND		; GCN-DAG: s_mov_b32 s32, s34
; GCN: s_getpc_b64 s[6:7]		; GCN-DAG: ; def s33
; GCN-NEXT: s_add_u32 s6, s6, external_void_func_void@rel32@lo+4		; GCN-DAG: #ASMEND
; GCN-NEXT: s_addc_u32 s7, s7, external_void_func_void@rel32@hi+4		; GCN-DAG: s_getpc_b64 s[6:7]
; GCN-NEXT: s_mov_b32 s4, s34		; GCN-DAG: s_add_u32 s6, s6, external_void_func_void@rel32@lo+4
; GCN-NEXT: s_mov_b32 s32, s34		; GCN-DAG: s_addc_u32 s7, s7, external_void_func_void@rel32@hi+4
; GCN-NEXT: s_swappc_b64 s[30:31], s[6:7]		; GCN-NEXT: s_swappc_b64 s[30:31], s[6:7]
; GCN-NEXT: ;;#ASMSTART		; GCN-NEXT: ;;#ASMSTART
; GCN-NEXT: ; use s33		; GCN-NEXT: ; use s33
; GCN-NEXT: ;;#ASMEND		; GCN-NEXT: ;;#ASMEND
; GCN-NEXT: s_endpgm		; GCN-NEXT: s_endpgm
define amdgpu_kernel void @test_call_void_func_void_preserves_s33(i32 addrspace(1)* %out) #0 {		define amdgpu_kernel void @test_call_void_func_void_preserves_s33(i32 addrspace(1)* %out) #0 {
%s33 = call i32 asm sideeffect "; def $0", "={s33}"()		%s33 = call i32 asm sideeffect "; def $0", "={s33}"()
call void @external_void_func_void()		call void @external_void_func_void()
call void asm sideeffect "; use $0", "{s33}"(i32 %s33)		call void asm sideeffect "; use $0", "{s33}"(i32 %s33)
ret void		ret void
}		}

; GCN-LABEL: {{^}}test_call_void_func_void_preserves_v32:		; GCN-LABEL: {{^}}test_call_void_func_void_preserves_v32:
; GCN: s_mov_b32 s33, s9		; GCN: s_mov_b32 s33, s9
; GCN: ; def v32		; GCN: s_mov_b32 s4, s33
; GCN-NEXT: #ASMEND		; GCN-DAG: s_mov_b32 s32, s33
; GCN: s_getpc_b64 s[6:7]		; GCN-DAG: ; def v32
; GCN-NEXT: s_add_u32 s6, s6, external_void_func_void@rel32@lo+4		; GCN-DAG: #ASMEND
; GCN-NEXT: s_addc_u32 s7, s7, external_void_func_void@rel32@hi+4		; GCN-DAG: s_getpc_b64 s[6:7]
; GCN-NEXT: s_mov_b32 s4, s33		; GCN-DAG: s_add_u32 s6, s6, external_void_func_void@rel32@lo+4
; GCN-NEXT: s_mov_b32 s32, s33		; GCN-DAG: s_addc_u32 s7, s7, external_void_func_void@rel32@hi+4
; GCN-NEXT: s_swappc_b64 s[30:31], s[6:7]		; GCN-NEXT: s_swappc_b64 s[30:31], s[6:7]
; GCN-NEXT: ;;#ASMSTART		; GCN-NEXT: ;;#ASMSTART
; GCN-NEXT: ; use v32		; GCN-NEXT: ; use v32
; GCN-NEXT: ;;#ASMEND		; GCN-NEXT: ;;#ASMEND
; GCN-NEXT: s_endpgm		; GCN-NEXT: s_endpgm
define amdgpu_kernel void @test_call_void_func_void_preserves_v32(i32 addrspace(1)* %out) #0 {		define amdgpu_kernel void @test_call_void_func_void_preserves_v32(i32 addrspace(1)* %out) #0 {
%v32 = call i32 asm sideeffect "; def $0", "={v32}"()		%v32 = call i32 asm sideeffect "; def $0", "={v32}"()
call void @external_void_func_void()		call void @external_void_func_void()
[10 lines not shown]
; GCN-NEXT: s_setpc_b64		; GCN-NEXT: s_setpc_b64
define void @void_func_void_clobber_s33() #2 {		define void @void_func_void_clobber_s33() #2 {
call void asm sideeffect "; clobber", "~{s33}"() #0		call void asm sideeffect "; clobber", "~{s33}"() #0
ret void		ret void
}		}

; GCN-LABEL: {{^}}test_call_void_func_void_clobber_s33:		; GCN-LABEL: {{^}}test_call_void_func_void_clobber_s33:
; GCN: s_mov_b32 s33, s7		; GCN: s_mov_b32 s33, s7
; GCN: s_getpc_b64		; GCN: s_mov_b32 s4, s33
		; GCN-NEXT: s_getpc_b64
; GCN-NEXT: s_add_u32		; GCN-NEXT: s_add_u32
; GCN-NEXT: s_addc_u32		; GCN-NEXT: s_addc_u32
; GCN-NEXT: s_mov_b32 s4, s33
; GCN-NEXT: s_mov_b32 s32, s33		; GCN-NEXT: s_mov_b32 s32, s33
; GCN: s_swappc_b64		; GCN: s_swappc_b64
; GCN-NEXT: s_endpgm		; GCN-NEXT: s_endpgm
define amdgpu_kernel void @test_call_void_func_void_clobber_s33() #0 {		define amdgpu_kernel void @test_call_void_func_void_clobber_s33() #0 {
call void @void_func_void_clobber_s33()		call void @void_func_void_clobber_s33()
ret void		ret void
}		}

[70 lines not shown]

test/CodeGen/AMDGPU/callee-special-input-sgprs.ll

[552 lines not shown]
define void @func_use_every_sgpr_input_call_use_workgroup_id_xyz() #1 {
call void asm sideeffect "; use $0", "s"(i32 %val6)		call void asm sideeffect "; use $0", "s"(i32 %val6)

call void @use_workgroup_id_xyz()		call void @use_workgroup_id_xyz()
ret void		ret void
}		}

; GCN-LABEL: {{^}}func_use_every_sgpr_input_call_use_workgroup_id_xyz_spill:		; GCN-LABEL: {{^}}func_use_every_sgpr_input_call_use_workgroup_id_xyz_spill:
; GCN: s_mov_b32 s5, s32		; GCN: s_mov_b32 s5, s32
; GCN: s_add_u32 s32, s32, 0x400
		; GCN-DAG: s_add_u32 s32, s32, 0x400

; GCN-DAG: s_mov_b32 [[SAVE_X:s[0-57-9][0-9]*]], s14		; GCN-DAG: s_mov_b32 [[SAVE_X:s[0-57-9][0-9]*]], s14
; GCN-DAG: s_mov_b32 [[SAVE_Y:s[0-68-9][0-9]*]], s15		; GCN-DAG: s_mov_b32 [[SAVE_Y:s[0-68-9][0-9]*]], s15
; GCN-DAG: s_mov_b32 [[SAVE_Z:s[0-79][0-9]*]], s16		; GCN-DAG: s_mov_b32 [[SAVE_Z:s[0-79][0-9]*]], s16
; GCN-DAG: s_mov_b64 {{s\[[0-9]+:[0-9]+\]}}, s[6:7]		; GCN-DAG: s_mov_b64 {{s\[[0-9]+:[0-9]+\]}}, s[6:7]
; GCN-DAG: s_mov_b64 {{s\[[0-9]+:[0-9]+\]}}, s[8:9]		; GCN-DAG: s_mov_b64 {{s\[[0-9]+:[0-9]+\]}}, s[8:9]
; GCN-DAG: s_mov_b64 {{s\[[0-9]+:[0-9]+\]}}, s[10:11]		; GCN-DAG: s_mov_b64 {{s\[[0-9]+:[0-9]+\]}}, s[10:11]

[67 lines not shown]

test/CodeGen/AMDGPU/indirect-addressing-si.ll

	[380 lines not shown]
	}			}

	; GCN-LABEL: {{^}}insert_vgpr_offset_multiple_in_block:			; GCN-LABEL: {{^}}insert_vgpr_offset_multiple_in_block:
	; GCN-DAG: s_load_dwordx4 s{{\[}}[[S_ELT0:[0-9]+]]:[[S_ELT3:[0-9]+]]{{\]}}			; GCN-DAG: s_load_dwordx4 s{{\[}}[[S_ELT0:[0-9]+]]:[[S_ELT3:[0-9]+]]{{\]}}
	; GCN-DAG: {{buffer\|flat\|global}}_load_dword [[IDX0:v[0-9]+]]			; GCN-DAG: {{buffer\|flat\|global}}_load_dword [[IDX0:v[0-9]+]]
	; GCN-DAG: v_mov_b32 [[INS0:v[0-9]+]], 62			; GCN-DAG: v_mov_b32 [[INS0:v[0-9]+]], 62

	; GCN-DAG: v_mov_b32_e32 v[[VEC_ELT3:[0-9]+]], s[[S_ELT3]]			; GCN-DAG: v_mov_b32_e32 v[[VEC_ELT3:[0-9]+]], s[[S_ELT3]]
	; GCN: v_mov_b32_e32 v[[VEC_ELT2:[0-9]+]], s{{[0-9]+}}			; GCN-DAG: v_mov_b32_e32 v[[VEC_ELT2:[0-9]+]], s{{[0-9]+}}
	; GCN: v_mov_b32_e32 v[[VEC_ELT1:[0-9]+]], s{{[0-9]+}}			; GCN-DAG: v_mov_b32_e32 v[[VEC_ELT1:[3]+]], s{{[0-9]+}}
				tstellar: This [3] looks like a typo.
				jonpa (author): :-) I know that looks weird and suspected you might not like it. The problem was that the VEC_ELT1 register did not match properly further down. IIRC, there were different matches for different subtargets, so I had to force one of the matches into v3 (I suppose I should have removed the '+'). Please help me out and check if there is a better way, or if this is acceptable.
				tstellar: Ok, that's fine. I would remove the + and also the brackets.
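The exchange above turns on how the capture pattern constrains the register number. As a rough illustration only, the sketch below uses Python's `re` module as a stand-in for FileCheck's regex engine; the helper `capture_vgpr` and the sample assembly lines are hypothetical and not taken from the test. It compares the original `[0-9]+`, the `[3]+` used in the patch, and the plain `3` suggested in review.

```python
import re

def capture_vgpr(pattern: str, line: str):
    """Return the VGPR number captured by `pattern` in `line`, or None.

    Hypothetical helper: mimics a check line of the form
    v_mov_b32_e32 v[[NAME:<pattern>]], s{{[0-9]+}}
    """
    m = re.search(r"v_mov_b32_e32 v(" + pattern + r"), s[0-9]+", line)
    return m.group(1) if m else None

# Made-up output lines for illustration.
line_v2 = "v_mov_b32_e32 v2, s5"
line_v3 = "v_mov_b32_e32 v3, s5"
line_v33 = "v_mov_b32_e32 v33, s5"

results = {
    "[0-9]+ on v2": capture_vgpr("[0-9]+", line_v2),  # any register matches
    "[3]+ on v2": capture_vgpr("[3]+", line_v2),      # rejected: not v3
    "[3]+ on v3": capture_vgpr("[3]+", line_v3),      # accepted
    "[3]+ on v33": capture_vgpr("[3]+", line_v33),    # also accepted, probably unintended
    "3 on v3": capture_vgpr("3", line_v3),            # accepted
    "3 on v33": capture_vgpr("3", line_v33),          # rejected
}

for label, value in results.items():
    print(f"{label}: {value}")
```

Under these assumptions, `[3]+` pins the match to registers whose number consists only of the digit 3, so it would also accept `v33`, whereas a literal `3` accepts only `v3`. That is consistent with the suggestion to drop both the `+` and the brackets.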
	; GCN: v_mov_b32_e32 v[[VEC_ELT0:[0-9]+]], s[[S_ELT0]]			; GCN-DAG: v_mov_b32_e32 v[[VEC_ELT0:[0-9]+]], s[[S_ELT0]]

	; GCN: [[LOOP0:BB[0-9]+_[0-9]+]]:			; GCN: [[LOOP0:BB[0-9]+_[0-9]+]]:
	; GCN-NEXT: s_waitcnt vmcnt(0)			; GCN-NEXT: s_waitcnt vmcnt(0)
	; GCN-NEXT: v_readfirstlane_b32 [[READLANE:s[0-9]+]], [[IDX0]]			; GCN-NEXT: v_readfirstlane_b32 [[READLANE:s[0-9]+]], [[IDX0]]
	; GCN: v_cmp_eq_u32_e32 vcc, [[READLANE]], [[IDX0]]			; GCN: v_cmp_eq_u32_e32 vcc, [[READLANE]], [[IDX0]]
	; GCN: s_and_saveexec_b64 vcc, vcc			; GCN: s_and_saveexec_b64 vcc, vcc

	; MOVREL: s_mov_b32 m0, [[READLANE]]			; MOVREL: s_mov_b32 m0, [[READLANE]]
	[254 lines not shown]

test/CodeGen/AMDGPU/inline-asm.ll

[180 lines not shown]
; separate comment
; trailing comment		; trailing comment
; extra comment		; extra comment
", ""()		", ""()
ret void		ret void
}		}

; FIXME: Should not have intermediate sgprs		; FIXME: Should not have intermediate sgprs
; CHECK-LABEL: {{^}}i64_imm_input_phys_vgpr:		; CHECK-LABEL: {{^}}i64_imm_input_phys_vgpr:
; CHECK: s_mov_b32 s1, 0		; CHECK-DAG: s_mov_b32 s1, 0
; CHECK: s_mov_b32 s0, 0x1e240		; CHECK-DAG: s_mov_b32 s0, 0x1e240
; CHECK: v_mov_b32_e32 v0, s0		; CHECK: v_mov_b32_e32 v0, s0
; CHECK: v_mov_b32_e32 v1, s1		; CHECK: v_mov_b32_e32 v1, s1
; CHECK: use v[0:1]		; CHECK: use v[0:1]
define amdgpu_kernel void @i64_imm_input_phys_vgpr() {		define amdgpu_kernel void @i64_imm_input_phys_vgpr() {
entry:		entry:
call void asm sideeffect "; use $0 ", "{v[0:1]}"(i64 123456)		call void asm sideeffect "; use $0 ", "{v[0:1]}"(i64 123456)
ret void		ret void
}		}
[66 lines not shown]

test/CodeGen/AMDGPU/insert_vector_elt.ll

	[346 lines not shown]

	; GCN-DAG: v_mov_b32_e32 v{{[0-9]+}}, s{{[0-9]+}}			; GCN-DAG: v_mov_b32_e32 v{{[0-9]+}}, s{{[0-9]+}}
	; GCN-DAG: v_mov_b32_e32 v{{[0-9]+}}, s{{[0-9]+}}			; GCN-DAG: v_mov_b32_e32 v{{[0-9]+}}, s{{[0-9]+}}
	; GCN-DAG: v_mov_b32_e32 v{{[0-9]+}}, s{{[0-9]+}}			; GCN-DAG: v_mov_b32_e32 v{{[0-9]+}}, s{{[0-9]+}}
	; GCN-DAG: v_mov_b32_e32 v{{[0-9]+}}, s{{[0-9]+}}			; GCN-DAG: v_mov_b32_e32 v{{[0-9]+}}, s{{[0-9]+}}
	; GCN-DAG: v_mov_b32_e32 [[ELT1:v[0-9]+]], 0x40200000			; GCN-DAG: v_mov_b32_e32 [[ELT1:v[0-9]+]], 0x40200000

	; GCN-DAG: s_mov_b32 m0, [[SCALEDIDX]]			; GCN-DAG: s_mov_b32 m0, [[SCALEDIDX]]
	; GCN: v_movreld_b32_e32 v{{[0-9]+}}, 0			; GCN-DAG: v_movreld_b32_e32 v{{[0-9]+}}, 0

	; Increment to next element folded into base register, but FileCheck			; Increment to next element folded into base register, but FileCheck
	; can't do math expressions			; can't do math expressions

	; FIXME: Should be able to manipulate m0 directly instead of s_lshl_b32 + copy to m0			; FIXME: Should be able to manipulate m0 directly instead of s_lshl_b32 + copy to m0

	; GCN: v_movreld_b32_e32 v{{[0-9]+}}, [[ELT1]]			; GCN: v_movreld_b32_e32 v{{[0-9]+}}, [[ELT1]]

	[86 lines not shown]

test/CodeGen/AMDGPU/misched-killflags.mir

[20 lines not shown]
bb.0:
$vgpr0 = V_MOV_B32_e32 $sgpr8, implicit $exec, implicit-def $vgpr0_vgpr1_vgpr2_vgpr3, implicit $sgpr8_sgpr9_sgpr10_sgpr11		$vgpr0 = V_MOV_B32_e32 $sgpr8, implicit $exec, implicit-def $vgpr0_vgpr1_vgpr2_vgpr3, implicit $sgpr8_sgpr9_sgpr10_sgpr11
$vgpr1 = V_MOV_B32_e32 $sgpr9, implicit $exec, implicit $sgpr8_sgpr9_sgpr10_sgpr11		$vgpr1 = V_MOV_B32_e32 $sgpr9, implicit $exec, implicit $sgpr8_sgpr9_sgpr10_sgpr11
$vgpr2 = V_MOV_B32_e32 $sgpr10, implicit $exec, implicit $sgpr8_sgpr9_sgpr10_sgpr11		$vgpr2 = V_MOV_B32_e32 $sgpr10, implicit $exec, implicit $sgpr8_sgpr9_sgpr10_sgpr11
$vgpr3 = V_MOV_B32_e32 $sgpr11, implicit $exec, implicit $sgpr8_sgpr9_sgpr10_sgpr11, implicit $exec		$vgpr3 = V_MOV_B32_e32 $sgpr11, implicit $exec, implicit $sgpr8_sgpr9_sgpr10_sgpr11, implicit $exec
S_NOP 0, implicit killed $sgpr6_sgpr7, implicit $sgpr0_sgpr1_sgpr2_sgpr3, implicit $sgpr4, implicit killed $vgpr0_vgpr1_vgpr2_vgpr3		S_NOP 0, implicit killed $sgpr6_sgpr7, implicit $sgpr0_sgpr1_sgpr2_sgpr3, implicit $sgpr4, implicit killed $vgpr0_vgpr1_vgpr2_vgpr3
S_ENDPGM		S_ENDPGM
...		...
# CHECK-LABEL: name: func0		# CHECK-LABEL: name: func0
# CHECK: $sgpr10 = S_MOV_B32 5		# CHECK-DAG: $sgpr10 = S_MOV_B32 5
# CHECK: $sgpr9 = S_MOV_B32 4		# CHECK-DAG: $sgpr9 = S_MOV_B32 4
# CHECK: $sgpr8 = S_MOV_B32 3		# CHECK-DAG: $sgpr8 = S_MOV_B32 3
# CHECK: $sgpr33 = S_MOV_B32 killed $sgpr7		# CHECK-DAG: $sgpr33 = S_MOV_B32 killed $sgpr7
# CHECK: $vgpr0 = V_MOV_B32_e32 $sgpr8, implicit $exec, implicit-def $vgpr0_vgpr1_vgpr2_vgpr3, implicit $sgpr8_sgpr9_sgpr10_sgpr11		# CHECK: $vgpr0 = V_MOV_B32_e32 $sgpr8, implicit $exec, implicit-def $vgpr0_vgpr1_vgpr2_vgpr3, implicit $sgpr8_sgpr9_sgpr10_sgpr11
		# CHECK: $sgpr32 = S_MOV_B32 $sgpr33
# CHECK: BUNDLE implicit-def $sgpr6_sgpr7, implicit-def $sgpr6, implicit-def $sgpr7, implicit-def $scc {		# CHECK: BUNDLE implicit-def $sgpr6_sgpr7, implicit-def $sgpr6, implicit-def $sgpr7, implicit-def $scc {
# CHECK: $sgpr6_sgpr7 = S_GETPC_B64		# CHECK: $sgpr6_sgpr7 = S_GETPC_B64
# CHECK: $sgpr6 = S_ADD_U32 internal $sgpr6, 0, implicit-def $scc		# CHECK: $sgpr6 = S_ADD_U32 internal $sgpr6, 0, implicit-def $scc
# CHECK: $sgpr7 = S_ADDC_U32 internal $sgpr7, 0, implicit-def $scc, implicit internal $scc		# CHECK: $sgpr7 = S_ADDC_U32 internal $sgpr7, 0, implicit-def $scc, implicit internal $scc
# CHECK: }		# CHECK: }
# CHECK: $sgpr4 = S_MOV_B32 $sgpr33		# CHECK: $sgpr4 = S_MOV_B32 killed $sgpr33
# CHECK: $vgpr1 = V_MOV_B32_e32 $sgpr9, implicit $exec, implicit $sgpr8_sgpr9_sgpr10_sgpr11		# CHECK: $vgpr1 = V_MOV_B32_e32 $sgpr9, implicit $exec, implicit $sgpr8_sgpr9_sgpr10_sgpr11
# CHECK: $vgpr2 = V_MOV_B32_e32 $sgpr10, implicit $exec, implicit $sgpr8_sgpr9_sgpr10_sgpr11		# CHECK: $vgpr2 = V_MOV_B32_e32 $sgpr10, implicit $exec, implicit $sgpr8_sgpr9_sgpr10_sgpr11
# CHECK: $vgpr3 = V_MOV_B32_e32 killed $sgpr11, implicit $exec, implicit $sgpr8_sgpr9_sgpr10_sgpr11, implicit $exec		# CHECK: $vgpr3 = V_MOV_B32_e32 killed $sgpr11, implicit $exec, implicit $sgpr8_sgpr9_sgpr10_sgpr11, implicit $exec
# CHECK: $sgpr32 = S_MOV_B32 killed $sgpr33
# CHECK: S_NOP 0, implicit killed $sgpr6_sgpr7, implicit $sgpr0_sgpr1_sgpr2_sgpr3, implicit $sgpr4, implicit killed $vgpr0_vgpr1_vgpr2_vgpr3		# CHECK: S_NOP 0, implicit killed $sgpr6_sgpr7, implicit $sgpr0_sgpr1_sgpr2_sgpr3, implicit $sgpr4, implicit killed $vgpr0_vgpr1_vgpr2_vgpr3
# CHECK: S_ENDPGM		# CHECK: S_ENDPGM

test/CodeGen/AMDGPU/nested-calls.ll

	[27 lines not shown]
	define void @test_func_call_external_void_func_i32_imm() #0 {			define void @test_func_call_external_void_func_i32_imm() #0 {
	call void @external_void_func_i32(i32 42)			call void @external_void_func_i32(i32 42)
	ret void			ret void
	}			}

	; GCN-LABEL: {{^}}test_func_call_external_void_func_i32_imm_stack_use:			; GCN-LABEL: {{^}}test_func_call_external_void_func_i32_imm_stack_use:
	; GCN: s_waitcnt			; GCN: s_waitcnt
	; GCN: s_mov_b32 s5, s32			; GCN: s_mov_b32 s5, s32
	; GCN: s_add_u32 s32, s32, 0x1400{{$}}			; GCN-DAG: s_add_u32 s32, s32, 0x1400{{$}}
	; GCN: buffer_store_dword v{{[0-9]+}}, off, s[0:3], s5 offset			; GCN-DAG: buffer_store_dword v{{[0-9]+}}, off, s[0:3], s5 offset
	; GCN: s_swappc_b64			; GCN: s_swappc_b64
	; GCN: s_sub_u32 s32, s32, 0x1400{{$}}			; GCN: s_sub_u32 s32, s32, 0x1400{{$}}
	; GCN: s_setpc_b64			; GCN: s_setpc_b64
	define void @test_func_call_external_void_func_i32_imm_stack_use() #0 {			define void @test_func_call_external_void_func_i32_imm_stack_use() #0 {
	%alloca = alloca [16 x i32], align 4, addrspace(5)			%alloca = alloca [16 x i32], align 4, addrspace(5)
	%gep0 = getelementptr inbounds [16 x i32], [16 x i32] addrspace(5)* %alloca, i32 0, i32 0			%gep0 = getelementptr inbounds [16 x i32], [16 x i32] addrspace(5)* %alloca, i32 0, i32 0
	%gep15 = getelementptr inbounds [16 x i32], [16 x i32] addrspace(5)* %alloca, i32 0, i32 16			%gep15 = getelementptr inbounds [16 x i32], [16 x i32] addrspace(5)* %alloca, i32 0, i32 16
	store volatile i32 0, i32 addrspace(5)* %gep0			store volatile i32 0, i32 addrspace(5)* %gep0
	store volatile i32 0, i32 addrspace(5)* %gep15			store volatile i32 0, i32 addrspace(5)* %gep15
	call void @external_void_func_i32(i32 42)			call void @external_void_func_i32(i32 42)
	ret void			ret void
	}			}

	attributes #0 = { nounwind }			attributes #0 = { nounwind }
	attributes #1 = { nounwind readnone }			attributes #1 = { nounwind readnone }
	attributes #2 = { nounwind noinline }			attributes #2 = { nounwind noinline }

test/CodeGen/AMDGPU/undefined-subreg-liverange.ll

[51 lines not shown]
bb11: ; preds = %bb9
store <4 x i32> %tmp2, <4 x i32> addrspace(1)* undef, align 16		store <4 x i32> %tmp2, <4 x i32> addrspace(1)* undef, align 16
ret float undef		ret float undef
}		}

; FIXME: Should be able to remove the undef copies		; FIXME: Should be able to remove the undef copies

; CHECK-LABEL: {{^}}partially_undef_copy:		; CHECK-LABEL: {{^}}partially_undef_copy:
; CHECK: v_mov_b32_e32 v5, 5		; CHECK: v_mov_b32_e32 v5, 5
; CHECK: v_mov_b32_e32 v6, 6		; CHECK-DAG: v_mov_b32_e32 v6, 6

; CHECK: v_mov_b32_e32 v[[OUTPUT_LO:[0-9]+]], v5		; CHECK-DAG: v_mov_b32_e32 v[[OUTPUT_LO:[0-9]+]], v5

; Undef copy		; Undef copy
; CHECK: v_mov_b32_e32 v1, v6		; CHECK-DAG: v_mov_b32_e32 v1, v6

; undef copy		; undef copy
; CHECK: v_mov_b32_e32 v2, v7		; CHECK-DAG: v_mov_b32_e32 v2, v7

; CHECK: v_mov_b32_e32 v[[OUTPUT_HI:[0-9]+]], v8		; CHECK-DAG: v_mov_b32_e32 v[[OUTPUT_HI:[0-9]+]], v8
; CHECK: v_mov_b32_e32 v[[OUTPUT_LO]], v6		; CHECK-DAG: v_mov_b32_e32 v[[OUTPUT_LO]], v6

; CHECK: buffer_store_dwordx4 v{{\[}}[[OUTPUT_LO]]:[[OUTPUT_HI]]{{\]}}		; CHECK: buffer_store_dwordx4 v{{\[}}[[OUTPUT_LO]]:[[OUTPUT_HI]]{{\]}}
define amdgpu_kernel void @partially_undef_copy() #0 {		define amdgpu_kernel void @partially_undef_copy() #0 {
%tmp0 = call i32 asm sideeffect "v_mov_b32_e32 v5, 5", "={v5}"()		%tmp0 = call i32 asm sideeffect "v_mov_b32_e32 v5, 5", "={v5}"()
%tmp1 = call i32 asm sideeffect "v_mov_b32_e32 v6, 6", "={v6}"()		%tmp1 = call i32 asm sideeffect "v_mov_b32_e32 v6, 6", "={v6}"()

%partially.undef.0 = insertelement <4 x i32> undef, i32 %tmp0, i32 0		%partially.undef.0 = insertelement <4 x i32> undef, i32 %tmp0, i32 0
%partially.undef.1 = insertelement <4 x i32> %partially.undef.0, i32 %tmp1, i32 0		%partially.undef.1 = insertelement <4 x i32> %partially.undef.0, i32 %tmp1, i32 0
[10 lines not shown]

test/CodeGen/ARM/Windows/chkstk-movw-movt-isel.ll

[13 lines not shown]
entry:
%rem = urem i32 %0, 4096		%rem = urem i32 %0, 4096
%arrayidx = getelementptr inbounds [4096 x i8], [4096 x i8]* %buffer, i32 0, i32 %rem		%arrayidx = getelementptr inbounds [4096 x i8], [4096 x i8]* %buffer, i32 0, i32 %rem
%1 = load volatile i8, i8* %arrayidx, align 1		%1 = load volatile i8, i8* %arrayidx, align 1
ret i8 %1		ret i8 %1
}		}

; CHECK-LABEL: isel		; CHECK-LABEL: isel
; CHECK: push {r4, r5, r6, lr}		; CHECK: push {r4, r5, r6, lr}
; CHECK: movw r12, #0		; CHECK-DAG: movw r12, #0
; CHECK: movt r12, #0		; CHECK-DAG: movt r12, #0
; CHECK: movw r4, #{{\d*}}		; CHECK-DAG: movw r4, #{{\d*}}
; CHECK: blx r12		; CHECK: blx r12
; CHECK: sub.w sp, sp, r4		; CHECK: sub.w sp, sp, r4

test/CodeGen/ARM/Windows/chkstk.ll

	[10 lines not shown]
	}			}

	; CHECK-DEFAULT-CODE-MODEL: check_watermark:			; CHECK-DEFAULT-CODE-MODEL: check_watermark:
	; CHECK-DEFAULT-CODE-MODEL: movw r4, #1024			; CHECK-DEFAULT-CODE-MODEL: movw r4, #1024
	; CHECK-DEFAULT-CODE-MODEL: bl __chkstk			; CHECK-DEFAULT-CODE-MODEL: bl __chkstk
	; CHECK-DEFAULT-CODE-MODEL: sub.w sp, sp, r4			; CHECK-DEFAULT-CODE-MODEL: sub.w sp, sp, r4

	; CHECK-LARGE-CODE-MODEL: check_watermark:			; CHECK-LARGE-CODE-MODEL: check_watermark:
	; CHECK-LARGE-CODE-MODEL: movw r12, :lower16:__chkstk			; CHECK-LARGE-CODE-MODEL-DAG: movw r12, :lower16:__chkstk
	; CHECK-LARGE-CODE-MODEL: movt r12, :upper16:__chkstk			; CHECK-LARGE-CODE-MODEL-DAG: movt r12, :upper16:__chkstk
	; CHECK-LARGE-CODE-MODEL: movw r4, #1024			; CHECK-LARGE-CODE-MODEL-DAG: movw r4, #1024
	; CHECK-LARGE-CODE-MODEL: blx r12			; CHECK-LARGE-CODE-MODEL: blx r12
	; CHECK-LARGE-CODE-MODEL: sub.w sp, sp, r4			; CHECK-LARGE-CODE-MODEL: sub.w sp, sp, r4

test/CodeGen/ARM/Windows/memset.ll

	; RUN: llc -mtriple thumbv7--windows-itanium -filetype asm -o - %s \| FileCheck %s			; RUN: llc -mtriple thumbv7--windows-itanium -filetype asm -o - %s \| FileCheck %s

	@source = common global [512 x i8] zeroinitializer, align 4			@source = common global [512 x i8] zeroinitializer, align 4

	declare void @llvm.memset.p0i8.i32(i8* nocapture, i8, i32, i1) nounwind			declare void @llvm.memset.p0i8.i32(i8* nocapture, i8, i32, i1) nounwind

	define void @function() {			define void @function() {
	entry:			entry:
	call void @llvm.memset.p0i8.i32(i8* bitcast ([512 x i8]* @source to i8*), i8 0, i32 512, i1 false)			call void @llvm.memset.p0i8.i32(i8* bitcast ([512 x i8]* @source to i8*), i8 0, i32 512, i1 false)
	unreachable			unreachable
	}			}

	; CHECK: movw r0, :lower16:source
	; CHECK: movt r0, :upper16:source
	; CHECK: movs r1, #0			; CHECK: movs r1, #0
	; CHECK: mov.w r2, #512			; CHECK: mov.w r2, #512
				; CHECK: movw r0, :lower16:source
				; CHECK: movt r0, :upper16:source
	; CHECK: memset			; CHECK: memset

test/CodeGen/ARM/arm-and-tst-peephole.ll

	[157 lines not shown]
	; THUMB-NEXT: beq .LBB2_2			; THUMB-NEXT: beq .LBB2_2
	; THUMB-NEXT: @ %bb.1:			; THUMB-NEXT: @ %bb.1:
	; THUMB-NEXT: movs r0, r2			; THUMB-NEXT: movs r0, r2
	; THUMB-NEXT: .LBB2_2:			; THUMB-NEXT: .LBB2_2:
	; THUMB-NEXT: bx lr			; THUMB-NEXT: bx lr
	;			;
	; T2-LABEL: test_tst_assessment:			; T2-LABEL: test_tst_assessment:
	; T2: @ %bb.0:			; T2: @ %bb.0:
	; T2-NEXT: lsls r1, r1, #31
	; T2-NEXT: and r0, r0, #1			; T2-NEXT: and r0, r0, #1
				; T2-NEXT: lsls r1, r1, #31
	; T2-NEXT: it ne			; T2-NEXT: it ne
	; T2-NEXT: subne r0, #1			; T2-NEXT: subne r0, #1
	; T2-NEXT: bx lr			; T2-NEXT: bx lr
	;			;
	; V8-LABEL: test_tst_assessment:			; V8-LABEL: test_tst_assessment:
	; V8: @ %bb.0:			; V8: @ %bb.0:
	; V8-NEXT: and r0, r0, #1			; V8-NEXT: and r0, r0, #1
	; V8-NEXT: lsls r1, r1, #31			; V8-NEXT: lsls r1, r1, #31
	[12 lines not shown]

test/CodeGen/ARM/arm-shrink-wrapping.ll

	[98 lines not shown]
	; SUM is in r0 because it is coalesced with the second			; SUM is in r0 because it is coalesced with the second
	; argument on the else path.			; argument on the else path.
	; CHECK: mov{{s?}} [[SUM:r0]], #0			; CHECK: mov{{s?}} [[SUM:r0]], #0
	; CHECK-NEXT: mov{{s?}} [[IV:r[0-9]+]], #10			; CHECK-NEXT: mov{{s?}} [[IV:r[0-9]+]], #10
	;			;
	; Next BB.			; Next BB.
	; CHECK: [[LOOP:LBB[0-9_]+]]: @ %for.body			; CHECK: [[LOOP:LBB[0-9_]+]]: @ %for.body
	; CHECK: mov{{(\.w)?}} [[TMP:r[0-9]+]], #1			; CHECK: mov{{(\.w)?}} [[TMP:r[0-9]+]], #1
	; ARM: subs [[IV]], [[IV]], #1			; ARM: add [[SUM]], [[TMP]], [[SUM]]
	; THUMB: subs [[IV]], #1			; THUMB: add [[SUM]], [[TMP]]
	; ARM-NEXT: add [[SUM]], [[TMP]], [[SUM]]			; ARM-NEXT: subs [[IV]], [[IV]], #1
	; THUMB-NEXT: add [[SUM]], [[TMP]]			; THUMB-NEXT: subs [[IV]], #1
	; CHECK-NEXT: bne [[LOOP]]			; CHECK-NEXT: bne [[LOOP]]
	;			;
	; Next BB.			; Next BB.
	; SUM << 3.			; SUM << 3.
	; CHECK: lsl{{s?}} [[SUM]], [[SUM]], #3			; CHECK: lsl{{s?}} [[SUM]], [[SUM]], #3
	; ENABLE-NEXT: pop {r4, r7, pc}			; ENABLE-NEXT: pop {r4, r7, pc}
	;			;
	; Duplicated epilogue.			; Duplicated epilogue.
	[45 lines not shown]
	; Make sure we save the CSR used in the inline asm: r4.			; Make sure we save the CSR used in the inline asm: r4.
	; CHECK: push {r4			; CHECK: push {r4
	; CHECK: mov{{s?}} [[SUM:r0]], #0			; CHECK: mov{{s?}} [[SUM:r0]], #0
	; CHECK-NEXT: mov{{s?}} [[IV:r[0-9]+]], #10			; CHECK-NEXT: mov{{s?}} [[IV:r[0-9]+]], #10
	; CHECK: nop			; CHECK: nop
	; Next BB.			; Next BB.
	; CHECK: [[LOOP_LABEL:LBB[0-9_]+]]: @ %for.body			; CHECK: [[LOOP_LABEL:LBB[0-9_]+]]: @ %for.body
	; CHECK: mov{{(\.w)?}} [[TMP:r[0-9]+]], #1			; CHECK: mov{{(\.w)?}} [[TMP:r[0-9]+]], #1
	; ARM: subs [[IV]], [[IV]], #1
	; THUMB: subs [[IV]], #1
	; ARM: add [[SUM]], [[TMP]], [[SUM]]			; ARM: add [[SUM]], [[TMP]], [[SUM]]
	; THUMB: add [[SUM]], [[TMP]]			; THUMB: add [[SUM]], [[TMP]]
				; ARM: subs [[IV]], [[IV]], #1
				; THUMB: subs [[IV]], #1
	; CHECK-NEXT: bne [[LOOP_LABEL]]			; CHECK-NEXT: bne [[LOOP_LABEL]]
	; Next BB.			; Next BB.
	; CHECK: @ %for.exit			; CHECK: @ %for.exit
	; CHECK: nop			; CHECK: nop
	; CHECK: pop {r4			; CHECK: pop {r4
	define i32 @freqSaveAndRestoreOutsideLoop2(i32 %cond) "no-frame-pointer-elim"="true" {			define i32 @freqSaveAndRestoreOutsideLoop2(i32 %cond) "no-frame-pointer-elim"="true" {
	entry:			entry:
	br label %for.preheader			br label %for.preheader
	[39 lines not shown]
	; SUM is in r0 because it is coalesced with the second			; SUM is in r0 because it is coalesced with the second
	; argument on the else path.			; argument on the else path.
	; CHECK: mov{{s?}} [[SUM:r0]], #0			; CHECK: mov{{s?}} [[SUM:r0]], #0
	; CHECK-NEXT: mov{{s?}} [[IV:r[0-9]+]], #10			; CHECK-NEXT: mov{{s?}} [[IV:r[0-9]+]], #10
	;			;
	; Next BB.			; Next BB.
	; CHECK: [[LOOP:LBB[0-9_]+]]: @ %for.body			; CHECK: [[LOOP:LBB[0-9_]+]]: @ %for.body
	; CHECK: mov{{(\.w)?}} [[TMP:r[0-9]+]], #1			; CHECK: mov{{(\.w)?}} [[TMP:r[0-9]+]], #1
	; ARM: subs [[IV]], [[IV]], #1			; ARM: add [[SUM]], [[TMP]], [[SUM]]
	; THUMB: subs [[IV]], #1			; THUMB: add [[SUM]], [[TMP]]
	; ARM-NEXT: add [[SUM]], [[TMP]], [[SUM]]			; ARM-NEXT: subs [[IV]], [[IV]], #1
	; THUMB-NEXT: add [[SUM]], [[TMP]]			; THUMB-NEXT: subs [[IV]], #1
	; CHECK-NEXT: bne [[LOOP]]			; CHECK-NEXT: bne [[LOOP]]
	;			;
	; Next BB.			; Next BB.
	; SUM << 3.			; SUM << 3.
	; CHECK: lsl{{s?}} [[SUM]], [[SUM]], #3			; CHECK: lsl{{s?}} [[SUM]], [[SUM]], #3
	; ENABLE: pop {r4, r7, pc}			; ENABLE: pop {r4, r7, pc}
	;			;
	; Duplicated epilogue.			; Duplicated epilogue.
	[59 lines not shown]
	; SUM is in r0 because it is coalesced with the second			; SUM is in r0 because it is coalesced with the second
	; argument on the else path.			; argument on the else path.
	; CHECK: mov{{s?}} [[SUM:r0]], #0			; CHECK: mov{{s?}} [[SUM:r0]], #0
	; CHECK-NEXT: mov{{s?}} [[IV:r[0-9]+]], #10			; CHECK-NEXT: mov{{s?}} [[IV:r[0-9]+]], #10
	;			;
	; Next BB.			; Next BB.
	; CHECK: [[LOOP:LBB[0-9_]+]]: @ %for.body			; CHECK: [[LOOP:LBB[0-9_]+]]: @ %for.body
	; CHECK: mov{{(\.w)?}} [[TMP:r[0-9]+]], #1			; CHECK: mov{{(\.w)?}} [[TMP:r[0-9]+]], #1
	; ARM: subs [[IV]], [[IV]], #1			; ARM: add [[SUM]], [[TMP]], [[SUM]]
	; THUMB: subs [[IV]], #1			; THUMB: add [[SUM]], [[TMP]]
	; ARM-NEXT: add [[SUM]], [[TMP]], [[SUM]]			; ARM-NEXT: subs [[IV]], [[IV]], #1
	; THUMB-NEXT: add [[SUM]], [[TMP]]			; THUMB-NEXT: subs [[IV]], #1
	; CHECK-NEXT: bne [[LOOP]]			; CHECK-NEXT: bne [[LOOP]]
	;			;
	; Next BB.			; Next BB.
	; SUM << 3.			; SUM << 3.
	; CHECK: lsl{{s?}} [[SUM]], [[SUM]], #3			; CHECK: lsl{{s?}} [[SUM]], [[SUM]], #3
	; ENABLE-NEXT: pop {r4, r7, pc}			; ENABLE-NEXT: pop {r4, r7, pc}
	;			;
	; Duplicated epilogue.			; Duplicated epilogue.
	[371 lines not shown]

test/CodeGen/ARM/cortex-a57-misched-ldm-wrback.ll

	[12 lines not shown]
	; CHECK: rdefs left			; CHECK: rdefs left
	; CHECK-NEXT: Latency : 4			; CHECK-NEXT: Latency : 4
	; CHECK: Successors:			; CHECK: Successors:
	; CHECK: Data			; CHECK: Data
	; CHECK-SAME: Latency=1			; CHECK-SAME: Latency=1
	; CHECK-NEXT: Data			; CHECK-NEXT: Data
	; CHECK-SAME: Latency=3			; CHECK-SAME: Latency=3
	; CHECK-NEXT: Data			; CHECK-NEXT: Data
	; CHECK-SAME: Latency=3			; CHECK-SAME: Latency=0
	; CHECK-NEXT: Data			; CHECK-NEXT: Data
	; CHECK-SAME: Latency=4			; CHECK-SAME: Latency=0
	define i32 @bar(i32 %a1, i32 %b1, i32 %c1) minsize optsize {			define i32 @bar(i32 %a1, i32 %b1, i32 %c1) minsize optsize {
	%1 = load i32, i32* @a, align 4			%1 = load i32, i32* @a, align 4
	%2 = load i32, i32* @b, align 4			%2 = load i32, i32* @b, align 4
	%3 = load i32, i32* @c, align 4			%3 = load i32, i32* @c, align 4

	%ptr_after = getelementptr i32, i32* @a, i32 3			%ptr_after = getelementptr i32, i32* @a, i32 3

	%ptr_val = ptrtoint i32* %ptr_after to i32			%ptr_val = ptrtoint i32* %ptr_after to i32
	%mul1 = mul i32 %ptr_val, %1			%mul1 = mul i32 %ptr_val, %1
	%mul2 = mul i32 %mul1, %2			%mul2 = mul i32 %mul1, %2
	%mul3 = mul i32 %mul2, %3			%mul3 = mul i32 %mul2, %3
	ret i32 %mul3			ret i32 %mul3
	}			}

test/CodeGen/ARM/cortex-a57-misched-ldm.ll

	; REQUIRES: asserts			; REQUIRES: asserts
	; RUN: llc < %s -mtriple=armv8r-eabi -mcpu=cortex-a57 -misched-postra -enable-misched -verify-misched -debug-only=machine-scheduler -o - 2>&1 > /dev/null \| FileCheck %s			; RUN: llc < %s -mtriple=armv8r-eabi -mcpu=cortex-a57 -misched-postra -enable-misched -verify-misched -debug-only=machine-scheduler -o - 2>&1 > /dev/null \| FileCheck %s

	; CHECK: ******** MI Scheduling ********			; CHECK: ******** MI Scheduling ********
	; We need second, post-ra scheduling to have LDM instruction combined from single-loads			; We need second, post-ra scheduling to have LDM instruction combined from single-loads
	; CHECK: ******** MI Scheduling ********			; CHECK: ******** MI Scheduling ********
	; CHECK: LDMIA			; CHECK: LDMIA
	; CHECK: rdefs left			; CHECK: rdefs left
	; CHECK-NEXT: Latency : 3			; CHECK-NEXT: Latency : 3
	; CHECK: Successors:			; CHECK: Successors:
	; CHECK: Data			; CHECK: Data
	; CHECK-SAME: Latency=3			; CHECK-SAME: Latency=3
	; CHECK-NEXT: Data			; CHECK-NEXT: Data
	; CHECK-SAME: Latency=3			; CHECK-SAME: Latency=0

	define i32 @foo(i32* %a) nounwind optsize {			define i32 @foo(i32* %a) nounwind optsize {
	entry:			entry:
	%b = getelementptr i32, i32* %a, i32 1			%b = getelementptr i32, i32* %a, i32 1
	%c = getelementptr i32, i32* %a, i32 2			%c = getelementptr i32, i32* %a, i32 2
	%0 = load i32, i32* %a, align 4			%0 = load i32, i32* %a, align 4
	%1 = load i32, i32* %b, align 4			%1 = load i32, i32* %b, align 4
	%2 = load i32, i32* %c, align 4			%2 = load i32, i32* %c, align 4

	%mul1 = mul i32 %0, %1			%mul1 = mul i32 %0, %1
	%mul2 = mul i32 %mul1, %2			%mul2 = mul i32 %mul1, %2
	ret i32 %mul2			ret i32 %mul2
	}			}

test/CodeGen/ARM/cortex-a57-misched-vldm-wrback.ll

	; CHECK: Successors:			; CHECK: Successors:
	; CHECK: Data			; CHECK: Data
	; CHECK-SAME: Latency=1			; CHECK-SAME: Latency=1
	; CHECK-NEXT: Data			; CHECK-NEXT: Data
	; CHECK-SAME: Latency=1			; CHECK-SAME: Latency=1
	; CHECK-NEXT: Data			; CHECK-NEXT: Data
	; CHECK-SAME: Latency=5			; CHECK-SAME: Latency=5
	; CHECK-NEXT: Data			; CHECK-NEXT: Data
	; CHECK-SAME: Latency=5			; CHECK-SAME: Latency=0
	; CHECK-NEXT: Data			; CHECK-NEXT: Data
	; CHECK-SAME: Latency=6			; CHECK-SAME: Latency=0
	define i32 @bar(i32* %iptr) minsize optsize {			define i32 @bar(i32* %iptr) minsize optsize {
	%1 = load double, double* @a, align 8			%1 = load double, double* @a, align 8
	%2 = load double, double* @b, align 8			%2 = load double, double* @b, align 8
	%3 = load double, double* @c, align 8			%3 = load double, double* @c, align 8

	%ptr_after = getelementptr double, double* @a, i32 3			%ptr_after = getelementptr double, double* @a, i32 3

	%ptr_new_ival = ptrtoint double* %ptr_after to i32			%ptr_new_ival = ptrtoint double* %ptr_after to i32

test/CodeGen/ARM/cortex-a57-misched-vldm.ll

	; REQUIRES: asserts			; REQUIRES: asserts
	; RUN: llc < %s -mtriple=armv8r-eabi -mcpu=cortex-a57 -misched-postra -enable-misched -verify-misched -debug-only=machine-scheduler -o - 2>&1 > /dev/null \| FileCheck %s			; RUN: llc < %s -mtriple=armv8r-eabi -mcpu=cortex-a57 -misched-postra -enable-misched -verify-misched -debug-only=machine-scheduler -o - 2>&1 > /dev/null \| FileCheck %s

	; CHECK: ******** MI Scheduling ********			; CHECK: ******** MI Scheduling ********
	; We need second, post-ra scheduling to have VLDM instruction combined from single-loads			; We need second, post-ra scheduling to have VLDM instruction combined from single-loads
	; CHECK: ******** MI Scheduling ********			; CHECK: ******** MI Scheduling ********
	; CHECK: VLDMDIA			; CHECK: VLDMDIA
	; CHECK: rdefs left			; CHECK: rdefs left
	; CHECK-NEXT: Latency : 6			; CHECK-NEXT: Latency : 6
	; CHECK: Successors:			; CHECK: Successors:
	; CHECK: Data			; CHECK: Data
	; CHECK-SAME: Latency=5			; CHECK-SAME: Latency=5
	; CHECK-NEXT: Data			; CHECK-NEXT: Data
	; CHECK-SAME: Latency=5			; CHECK-SAME: Latency=0
	; CHECK-NEXT: Data			; CHECK-NEXT: Data
	; CHECK-SAME: Latency=6			; CHECK-SAME: Latency=0

	define double @foo(double* %a) nounwind optsize {			define double @foo(double* %a) nounwind optsize {
	entry:			entry:
	%b = getelementptr double, double* %a, i32 1			%b = getelementptr double, double* %a, i32 1
	%c = getelementptr double, double* %a, i32 2			%c = getelementptr double, double* %a, i32 2
	%0 = load double, double* %a, align 4			%0 = load double, double* %a, align 4
	%1 = load double, double* %b, align 4			%1 = load double, double* %b, align 4
	%2 = load double, double* %c, align 4			%2 = load double, double* %c, align 4

	%mul1 = fmul double %0, %1			%mul1 = fmul double %0, %1
	%mul2 = fmul double %mul1, %2			%mul2 = fmul double %mul1, %2
	ret double %mul2			ret double %mul2
	}			}

test/CodeGen/ARM/fp16-instructions.ll

	; CHECK-SOFTFP-FP16-A32: vmrs APSR_nzcv, fpscr			; CHECK-SOFTFP-FP16-A32: vmrs APSR_nzcv, fpscr
	; CHECK-SOFTFP-FP16-A32: vmoveq.f32 [[S4]], [[S2]]			; CHECK-SOFTFP-FP16-A32: vmoveq.f32 [[S4]], [[S2]]
	; CHECK-SOFTFP-FP16-A32-NEXT: vmovvs.f32 [[S4]], [[S2]]			; CHECK-SOFTFP-FP16-A32-NEXT: vmovvs.f32 [[S4]], [[S2]]
	; CHECK-SOFTFP-FP16-A32-NEXT: vcvtb.f16.f32 s0, [[S4]]			; CHECK-SOFTFP-FP16-A32-NEXT: vcvtb.f16.f32 s0, [[S4]]

	; CHECK-SOFTFP-FP16-T32: vmov [[S6:s[0-9]]], r0			; CHECK-SOFTFP-FP16-T32: vmov [[S6:s[0-9]]], r0
	; CHECK-SOFTFP-FP16-T32: vldr s0, .LCP{{.*}}			; CHECK-SOFTFP-FP16-T32: vldr s0, .LCP{{.*}}
	; CHECK-SOFTFP-FP16-T32: vcvtb.f32.f16 [[S6]], [[S6]]			; CHECK-SOFTFP-FP16-T32: vcvtb.f32.f16 [[S6]], [[S6]]
	; CHECK-SOFTFP-FP16-T32: vmov.f32 [[S2:s[0-9]]], #-2.000000e+00
	; CHECK-SOFTFP-FP16-T32: vcmp.f32 [[S6]], s0
	; CHECK-SOFTFP-FP16-T32: vldr [[S4:s[0-9]]], .LCPI{{.*}}			; CHECK-SOFTFP-FP16-T32: vldr [[S4:s[0-9]]], .LCPI{{.*}}
				; CHECK-SOFTFP-FP16-T32: vcmp.f32 [[S6]], s0
				; CHECK-SOFTFP-FP16-T32: vmov.f32 [[S2:s[0-9]]], #-2.000000e+00
	; CHECK-SOFTFP-FP16-T32: vmrs APSR_nzcv, fpscr			; CHECK-SOFTFP-FP16-T32: vmrs APSR_nzcv, fpscr
	; CHECK-SOFTFP-FP16-T32: it eq			; CHECK-SOFTFP-FP16-T32: it eq
	; CHECK-SOFTFP-FP16-T32: vmoveq.f32 [[S4]], [[S2]]			; CHECK-SOFTFP-FP16-T32: vmoveq.f32 [[S4]], [[S2]]
	; CHECK-SOFTFP-FP16-T32: it vs			; CHECK-SOFTFP-FP16-T32: it vs
	; CHECK-SOFTFP-FP16-T32-NEXT: vmovvs.f32 [[S4]], [[S2]]			; CHECK-SOFTFP-FP16-T32-NEXT: vmovvs.f32 [[S4]], [[S2]]
	; CHECK-SOFTFP-FP16-T32-NEXT: vcvtb.f16.f32 s0, [[S4]]			; CHECK-SOFTFP-FP16-T32-NEXT: vcvtb.f16.f32 s0, [[S4]]
	}			}


test/CodeGen/ARM/select.ll

	;			;
	; We used to generate really horrible code for this function. The main cause was			; We used to generate really horrible code for this function. The main cause was
	; a lack of a custom lowering routine for an ISD::SELECT. This would result in			; a lack of a custom lowering routine for an ISD::SELECT. This would result in
	; two "it" blocks in the code: one for the "icmp" and another to move the index			; two "it" blocks in the code: one for the "icmp" and another to move the index
	; into the constant pool based on the value of the "icmp". If we have one "it"			; into the constant pool based on the value of the "icmp". If we have one "it"
	; block generated, odds are good that we have close to the ideal code for this:			; block generated, odds are good that we have close to the ideal code for this:
	;			;
	; CHECK-NEON-LABEL: f8:			; CHECK-NEON-LABEL: f8:
	; CHECK-NEON: movw [[R3:r[0-9]+]], #1123
	; CHECK-NEON: adr [[R2:r[0-9]+]], LCPI7_0			; CHECK-NEON: adr [[R2:r[0-9]+]], LCPI7_0
				; CHECK-NEON: movw [[R3:r[0-9]+]], #1123
	; CHECK-NEON-NEXT: cmp r0, [[R3]]			; CHECK-NEON-NEXT: cmp r0, [[R3]]
	; CHECK-NEON-NEXT: it eq			; CHECK-NEON-NEXT: it eq
	; CHECK-NEON-NEXT: addeq{{.*}} [[R2]], #4			; CHECK-NEON-NEXT: addeq{{.*}} [[R2]], #4
	; CHECK-NEON-NEXT: ldr			; CHECK-NEON-NEXT: ldr
	; CHECK-NEON: bx			; CHECK-NEON: bx

	define arm_apcscc float @f8(i32 %a) nounwind {			define arm_apcscc float @f8(i32 %a) nounwind {
	%tmp = icmp eq i32 %a, 1123			%tmp = icmp eq i32 %a, 1123

test/CodeGen/ARM/twoaddrinstr.ll

	; Tests for the two-address instruction pass.			; Tests for the two-address instruction pass.
	; RUN: llc -mtriple=arm-eabi -mcpu=cortex-a9 -arm-atomic-cfg-tidy=0 %s -o - \| FileCheck %s			; RUN: llc -mtriple=arm-eabi -mcpu=cortex-a9 -arm-atomic-cfg-tidy=0 %s -o - \| FileCheck %s

	define void @PR13378() nounwind {			define void @PR13378() nounwind {
	; This was orriginally a crasher trying to schedule the instructions.			; This was orriginally a crasher trying to schedule the instructions.
	; CHECK-LABEL: PR13378:			; CHECK-LABEL: PR13378:
	; CHECK: vld1.32			; CHECK: vmov.i32
	; CHECK-NEXT: vmov.i32			; CHECK-NEXT: vld1.32
	; CHECK-NEXT: vst1.32			; CHECK-NEXT: vst1.32
	; CHECK-NEXT: vst1.32			; CHECK-NEXT: vst1.32
	; CHECK-NEXT: vmov.f32			; CHECK-NEXT: vmov.f32
	; CHECK-NEXT: vmov.f32			; CHECK-NEXT: vmov.f32
	; CHECK-NEXT: vst1.32			; CHECK-NEXT: vst1.32

	entry:			entry:
	%0 = load <4 x float>, <4 x float>* undef, align 4			%0 = load <4 x float>, <4 x float>* undef, align 4
	store <4 x float> zeroinitializer, <4 x float>* undef, align 4			store <4 x float> zeroinitializer, <4 x float>* undef, align 4
	store <4 x float> %0, <4 x float>* undef, align 4			store <4 x float> %0, <4 x float>* undef, align 4
	%1 = insertelement <4 x float> %0, float 1.000000e+00, i32 3			%1 = insertelement <4 x float> %0, float 1.000000e+00, i32 3
	store <4 x float> %1, <4 x float>* undef, align 4			store <4 x float> %1, <4 x float>* undef, align 4
	unreachable			unreachable
	}			}

test/CodeGen/ARM/vcombine.ll

	}			}

	define <4 x i32> @vcombine32(<2 x i32>* %A, <2 x i32>* %B) nounwind {			define <4 x i32> @vcombine32(<2 x i32>* %A, <2 x i32>* %B) nounwind {
	; CHECK-LABEL: vcombine32			; CHECK-LABEL: vcombine32

	; CHECK-DAG: vldr [[LD0:d[0-9]+]], [r0]			; CHECK-DAG: vldr [[LD0:d[0-9]+]], [r0]
	; CHECK-DAG: vldr [[LD1:d[0-9]+]], [r1]			; CHECK-DAG: vldr [[LD1:d[0-9]+]], [r1]

	; CHECK-LE: vmov r0, r1, [[LD0]]
	; CHECK-LE: vmov r2, r3, [[LD1]]			; CHECK-LE: vmov r2, r3, [[LD1]]
				; CHECK-LE: vmov r0, r1, [[LD0]]

	; CHECK-BE: vmov r1, r0, d16			; CHECK-BE: vmov r1, r0, d16
	; CHECK-BE: vmov r3, r2, d17			; CHECK-BE: vmov r3, r2, d17
	%tmp1 = load <2 x i32>, <2 x i32>* %A			%tmp1 = load <2 x i32>, <2 x i32>* %A
	%tmp2 = load <2 x i32>, <2 x i32>* %B			%tmp2 = load <2 x i32>, <2 x i32>* %B
	%tmp3 = shufflevector <2 x i32> %tmp1, <2 x i32> %tmp2, <4 x i32> <i32 0, i32 1, i32 2, i32 3>			%tmp3 = shufflevector <2 x i32> %tmp1, <2 x i32> %tmp2, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
	ret <4 x i32> %tmp3			ret <4 x i32> %tmp3
	}			}

	define <4 x float> @vcombinefloat(<2 x float>* %A, <2 x float>* %B) nounwind {			define <4 x float> @vcombinefloat(<2 x float>* %A, <2 x float>* %B) nounwind {
	; CHECK-LABEL: vcombinefloat			; CHECK-LABEL: vcombinefloat

	; CHECK-DAG: vldr [[LD0:d[0-9]+]], [r0]			; CHECK-DAG: vldr [[LD0:d[0-9]+]], [r0]
	; CHECK-DAG: vldr [[LD1:d[0-9]+]], [r1]			; CHECK-DAG: vldr [[LD1:d[0-9]+]], [r1]

	; CHECK-LE: vmov r0, r1, [[LD0]]
	; CHECK-LE: vmov r2, r3, [[LD1]]			; CHECK-LE: vmov r2, r3, [[LD1]]
				; CHECK-LE: vmov r0, r1, [[LD0]]

	; CHECK-BE: vmov r1, r0, d16			; CHECK-BE: vmov r1, r0, d16
	; CHECK-BE: vmov r3, r2, d17			; CHECK-BE: vmov r3, r2, d17
	%tmp1 = load <2 x float>, <2 x float>* %A			%tmp1 = load <2 x float>, <2 x float>* %A
	%tmp2 = load <2 x float>, <2 x float>* %B			%tmp2 = load <2 x float>, <2 x float>* %B
	%tmp3 = shufflevector <2 x float> %tmp1, <2 x float> %tmp2, <4 x i32> <i32 0, i32 1, i32 2, i32 3>			%tmp3 = shufflevector <2 x float> %tmp1, <2 x float> %tmp2, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
	ret <4 x float> %tmp3			ret <4 x float> %tmp3
	}			}

	define <2 x i64> @vcombine64(<1 x i64>* %A, <1 x i64>* %B) nounwind {			define <2 x i64> @vcombine64(<1 x i64>* %A, <1 x i64>* %B) nounwind {
	; CHECK-LABEL: vcombine64			; CHECK-LABEL: vcombine64
	; CHECK-DAG: vldr [[LD0:d[0-9]+]], [r0]			; CHECK-DAG: vldr [[LD0:d[0-9]+]], [r0]
	; CHECK-DAG: vldr [[LD1:d[0-9]+]], [r1]			; CHECK-DAG: vldr [[LD1:d[0-9]+]], [r1]

	; CHECK-LE: vmov r0, r1, [[LD0]]
	; CHECK-LE: vmov r2, r3, [[LD1]]			; CHECK-LE: vmov r2, r3, [[LD1]]
				; CHECK-LE: vmov r0, r1, [[LD0]]

	; CHECK-BE: vmov r1, r0, [[LD0]]
	; CHECK-BE: vmov r3, r2, [[LD1]]			; CHECK-BE: vmov r3, r2, [[LD1]]
				; CHECK-BE: vmov r1, r0, [[LD0]]
	%tmp1 = load <1 x i64>, <1 x i64>* %A			%tmp1 = load <1 x i64>, <1 x i64>* %A
	%tmp2 = load <1 x i64>, <1 x i64>* %B			%tmp2 = load <1 x i64>, <1 x i64>* %B
	%tmp3 = shufflevector <1 x i64> %tmp1, <1 x i64> %tmp2, <2 x i32> <i32 0, i32 1>			%tmp3 = shufflevector <1 x i64> %tmp1, <1 x i64> %tmp2, <2 x i32> <i32 0, i32 1>
	ret <2 x i64> %tmp3			ret <2 x i64> %tmp3
	}			}

	; Check for vget_low and vget_high implemented with shufflevector. PR8411.			; Check for vget_low and vget_high implemented with shufflevector. PR8411.
	; They should not require storing to the stack.			; They should not require storing to the stack.

test/CodeGen/ARM/vuzp.ll

}		}

define <8 x i8> @cmpsel_trunc(<8 x i8> %in0, <8 x i8> %in1, <8 x i32> %cmp0, <8 x i32> %cmp1) {		define <8 x i8> @cmpsel_trunc(<8 x i8> %in0, <8 x i8> %in1, <8 x i32> %cmp0, <8 x i32> %cmp1) {
; In order to create the select we need to truncate the vcgt result from a vector of i32 to a vector of i8.		; In order to create the select we need to truncate the vcgt result from a vector of i32 to a vector of i8.
; This results in a build_vector with mismatched types. We will generate two vmovn.i32 instructions to		; This results in a build_vector with mismatched types. We will generate two vmovn.i32 instructions to
; truncate from i32 to i16 and one vmovn.i16 to perform the final truncation for i8.		; truncate from i32 to i16 and one vmovn.i16 to perform the final truncation for i8.
; CHECK-LABEL: cmpsel_trunc:		; CHECK-LABEL: cmpsel_trunc:
; CHECK: @ %bb.0:		; CHECK: @ %bb.0:
; CHECK-NEXT: add r12, sp, #16		; CHECK-NEXT: add r12, sp, #16
; CHECK-NEXT: vld1.64 {d16, d17}, [r12]		; CHECK-NEXT: vld1.64 {d16, d17}, [r12]
; CHECK-NEXT: mov r12, sp		; CHECK-NEXT: mov r12, sp
; CHECK-NEXT: vld1.64 {d18, d19}, [r12]		; CHECK-NEXT: vld1.64 {d18, d19}, [r12]
; CHECK-NEXT: add r12, sp, #48		; CHECK-NEXT: add r12, sp, #48
; CHECK-NEXT: vld1.64 {d20, d21}, [r12]		; CHECK-NEXT: vld1.64 {d20, d21}, [r12]
; CHECK-NEXT: add r12, sp, #32		; CHECK-NEXT: add r12, sp, #32
; CHECK-NEXT: vcgt.u32 q8, q10, q8		; CHECK-NEXT: vcgt.u32 q8, q10, q8
; CHECK-NEXT: vld1.64 {d20, d21}, [r12]		; CHECK-NEXT: vld1.64 {d20, d21}, [r12]
; CHECK-NEXT: vcgt.u32 q9, q10, q9		; CHECK-NEXT: vcgt.u32 q9, q10, q9
; CHECK-NEXT: vmov d20, r2, r3		; CHECK-NEXT: vmov d20, r2, r3
; CHECK-NEXT: vmovn.i32 d17, q8		; CHECK-NEXT: vmovn.i32 d17, q8
; CHECK-NEXT: vmovn.i32 d16, q9		; CHECK-NEXT: vmovn.i32 d16, q9
; CHECK-NEXT: vmov d18, r0, r1		; CHECK-NEXT: vmov d18, r0, r1
; CHECK-NEXT: vmovn.i16 d16, q8		; CHECK-NEXT: vmovn.i16 d16, q8
; CHECK-NEXT: vbsl d16, d18, d20		; CHECK-NEXT: vbsl d16, d18, d20
; CHECK-NEXT: vmov r0, r1, d16		; CHECK-NEXT: vmov r0, r1, d16
; CHECK-NEXT: mov pc, lr		; CHECK-NEXT: mov pc, lr
%c = icmp ult <8 x i32> %cmp0, %cmp1		%c = icmp ult <8 x i32> %cmp0, %cmp1
%res = select <8 x i1> %c, <8 x i8> %in0, <8 x i8> %in1		%res = select <8 x i1> %c, <8 x i8> %in0, <8 x i8> %in1
ret <8 x i8> %res		ret <8 x i8> %res
}		}

; Shuffle the result from the compare with a <4 x i8>.		; Shuffle the result from the compare with a <4 x i8>.
; We need to extend the loaded <4 x i8> to <4 x i16>. Otherwise we wouldn't be able		; We need to extend the loaded <4 x i8> to <4 x i16>. Otherwise we wouldn't be able
; to perform the vuzp and get the vbsl mask.		; to perform the vuzp and get the vbsl mask.
define <8 x i8> @vuzp_trunc_and_shuffle(<8 x i8> %tr0, <8 x i8> %tr1,		define <8 x i8> @vuzp_trunc_and_shuffle(<8 x i8> %tr0, <8 x i8> %tr1,
; CHECK-LABEL: vuzp_trunc_and_shuffle:		; CHECK-LABEL: vuzp_trunc_and_shuffle:
; CHECK: @ %bb.0:		; CHECK: @ %bb.0:
; CHECK-NEXT: .save {r11, lr}		; CHECK-NEXT: .save {r11, lr}
; CHECK-NEXT: push {r11, lr}		; CHECK-NEXT: push {r11, lr}
; CHECK-NEXT: add r12, sp, #8		; CHECK-NEXT: add r12, sp, #8
; CHECK-NEXT: add lr, sp, #24		; CHECK-NEXT: add lr, sp, #24
; CHECK-NEXT: vld1.64 {d16, d17}, [r12]		; CHECK-NEXT: vld1.64 {d16, d17}, [r12]
; CHECK-NEXT: ldr r12, [sp, #40]		; CHECK-NEXT: ldr r12, [sp, #40]
; CHECK-NEXT: vld1.64 {d18, d19}, [lr]		; CHECK-NEXT: vld1.64 {d18, d19}, [lr]
; CHECK-NEXT: vcgt.u32 q8, q9, q8		; CHECK-NEXT: vcgt.u32 q8, q9, q8
; CHECK-NEXT: vld1.32 {d18[0]}, [r12:32]		; CHECK-NEXT: vld1.32 {d18[0]}, [r12:32]
; CHECK-NEXT: vmov.i8 d19, #0x7		; CHECK-NEXT: vmov.i8 d19, #0x7
; CHECK-NEXT: vmovl.u8 q10, d18		; CHECK-NEXT: vmovl.u8 q10, d18
; CHECK-NEXT: vmovn.i32 d16, q8		; CHECK-NEXT: vmovn.i32 d16, q8
; CHECK-NEXT: vneg.s8 d17, d19		; CHECK-NEXT: vneg.s8 d17, d19
; CHECK-NEXT: vmov d18, r2, r3		; CHECK-NEXT: vmov d18, r2, r3
; CHECK-NEXT: vuzp.8 d16, d20		; CHECK-NEXT: vuzp.8 d16, d20
; CHECK-NEXT: vshl.i8 d16, d16, #7		; CHECK-NEXT: vshl.i8 d16, d16, #7
; CHECK-NEXT: vshl.s8 d16, d16, d17		; CHECK-NEXT: vshl.s8 d16, d16, d17
; CHECK-NEXT: vmov d17, r0, r1		; CHECK-NEXT: vmov d17, r0, r1
; CHECK-NEXT: vbsl d16, d17, d18		; CHECK-NEXT: vbsl d16, d17, d18
; CHECK-NEXT: vmov r0, r1, d16		; CHECK-NEXT: vmov r0, r1, d16
; CHECK-NEXT: pop {r11, lr}		; CHECK-NEXT: pop {r11, lr}
; CHECK-NEXT: mov pc, lr		; CHECK-NEXT: mov pc, lr
<4 x i32> %cmp0, <4 x i32> %cmp1, <4 x i8> *%cmp2_ptr) {		<4 x i32> %cmp0, <4 x i32> %cmp1, <4 x i8> *%cmp2_ptr) {
%cmp2_load = load <4 x i8>, <4 x i8> * %cmp2_ptr, align 4		%cmp2_load = load <4 x i8>, <4 x i8> * %cmp2_ptr, align 4
%cmp2 = trunc <4 x i8> %cmp2_load to <4 x i1>		%cmp2 = trunc <4 x i8> %cmp2_load to <4 x i1>
%c0 = icmp ult <4 x i32> %cmp0, %cmp1		%c0 = icmp ult <4 x i32> %cmp0, %cmp1
%c = shufflevector <4 x i1> %c0, <4 x i1> %cmp2, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>		%c = shufflevector <4 x i1> %c0, <4 x i1> %cmp2, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
%rv = select <8 x i1> %c, <8 x i8> %tr0, <8 x i8> %tr1		%rv = select <8 x i1> %c, <8 x i8> %tr0, <8 x i8> %tr1
ret <8 x i8> %rv		ret <8 x i8> %rv
}		}

; Use an undef value for the <4 x i8> that is being shuffled with the compare result.		; Use an undef value for the <4 x i8> that is being shuffled with the compare result.
; This produces a build_vector with some of the operands undefs.		; This produces a build_vector with some of the operands undefs.
define <8 x i8> @vuzp_trunc_and_shuffle_undef_right(<8 x i8> %tr0, <8 x i8> %tr1,		define <8 x i8> @vuzp_trunc_and_shuffle_undef_right(<8 x i8> %tr0, <8 x i8> %tr1,
; CHECK-LABEL: vuzp_trunc_and_shuffle_undef_right:		; CHECK-LABEL: vuzp_trunc_and_shuffle_undef_right:
; CHECK: @ %bb.0:		; CHECK: @ %bb.0:
; CHECK-NEXT: mov r12, sp		; CHECK-NEXT: mov r12, sp
; CHECK-NEXT: vld1.64 {d16, d17}, [r12]		; CHECK-NEXT: vld1.64 {d16, d17}, [r12]
; CHECK-NEXT: add r12, sp, #16		; CHECK-NEXT: add r12, sp, #16
; CHECK-NEXT: vld1.64 {d18, d19}, [r12]		; CHECK-NEXT: vld1.64 {d18, d19}, [r12]
; CHECK-NEXT: vcgt.u32 q8, q9, q8		; CHECK-NEXT: vcgt.u32 q8, q9, q8
; CHECK-NEXT: vmov.i8 d18, #0x7		; CHECK-NEXT: vmov.i8 d18, #0x7
; CHECK-NEXT: vmovn.i32 d16, q8		; CHECK-NEXT: vmovn.i32 d16, q8
; CHECK-NEXT: vuzp.8 d16, d17		; CHECK-NEXT: vuzp.8 d16, d17
; CHECK-NEXT: vneg.s8 d17, d18		; CHECK-NEXT: vneg.s8 d17, d18
; CHECK-NEXT: vshl.i8 d16, d16, #7		; CHECK-NEXT: vshl.i8 d16, d16, #7
; CHECK-NEXT: vmov d18, r2, r3		; CHECK-NEXT: vmov d18, r2, r3
; CHECK-NEXT: vshl.s8 d16, d16, d17		; CHECK-NEXT: vshl.s8 d16, d16, d17
; CHECK-NEXT: vmov d17, r0, r1		; CHECK-NEXT: vmov d17, r0, r1
; CHECK-NEXT: vbsl d16, d17, d18		; CHECK-NEXT: vbsl d16, d17, d18
; CHECK-NEXT: vmov r0, r1, d16		; CHECK-NEXT: vmov r0, r1, d16
; CHECK-NEXT: mov pc, lr		; CHECK-NEXT: mov pc, lr
<4 x i32> %cmp0, <4 x i32> %cmp1, <4 x i8> *%cmp2_ptr) {		<4 x i32> %cmp0, <4 x i32> %cmp1, <4 x i8> *%cmp2_ptr) {
%cmp2_load = load <4 x i8>, <4 x i8> * %cmp2_ptr, align 4		%cmp2_load = load <4 x i8>, <4 x i8> * %cmp2_ptr, align 4
%cmp2 = trunc <4 x i8> %cmp2_load to <4 x i1>		%cmp2 = trunc <4 x i8> %cmp2_load to <4 x i1>
%c0 = icmp ult <4 x i32> %cmp0, %cmp1		%c0 = icmp ult <4 x i32> %cmp0, %cmp1
%c = shufflevector <4 x i1> %c0, <4 x i1> undef, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>		%c = shufflevector <4 x i1> %c0, <4 x i1> undef, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
%rv = select <8 x i1> %c, <8 x i8> %tr0, <8 x i8> %tr1		%rv = select <8 x i1> %c, <8 x i8> %tr0, <8 x i8> %tr1
ret <8 x i8> %rv		ret <8 x i8> %rv
}		}

define <8 x i8> @vuzp_trunc_and_shuffle_undef_left(<8 x i8> %tr0, <8 x i8> %tr1,		define <8 x i8> @vuzp_trunc_and_shuffle_undef_left(<8 x i8> %tr0, <8 x i8> %tr1,
; CHECK-LABEL: vuzp_trunc_and_shuffle_undef_left:		; CHECK-LABEL: vuzp_trunc_and_shuffle_undef_left:
; CHECK: @ %bb.0:		; CHECK: @ %bb.0:
; CHECK-NEXT: mov r12, sp		; CHECK-NEXT: mov r12, sp
; CHECK-NEXT: vld1.64 {d16, d17}, [r12]		; CHECK-NEXT: vld1.64 {d16, d17}, [r12]
; CHECK-NEXT: add r12, sp, #16		; CHECK-NEXT: add r12, sp, #16
; CHECK-NEXT: vld1.64 {d18, d19}, [r12]		; CHECK-NEXT: vld1.64 {d18, d19}, [r12]
; CHECK-NEXT: vcgt.u32 q8, q9, q8		; CHECK-NEXT: vcgt.u32 q8, q9, q8
; CHECK-NEXT: vldr d18, .LCPI22_0		; CHECK-NEXT: vldr d18, .LCPI22_0
; CHECK-NEXT: vmov.i8 d19, #0x7		; CHECK-NEXT: vmov.i8 d19, #0x7
; CHECK-NEXT: vmovn.i32 d16, q8		; CHECK-NEXT: vmovn.i32 d16, q8
; CHECK-NEXT: vtbl.8 d16, {d16}, d18		; CHECK-NEXT: vtbl.8 d16, {d16}, d18
; CHECK-NEXT: vneg.s8 d17, d19		; CHECK-NEXT: vneg.s8 d17, d19
; CHECK-NEXT: vmov d18, r2, r3		; CHECK-NEXT: vmov d18, r2, r3
; CHECK-NEXT: vshl.i8 d16, d16, #7		; CHECK-NEXT: vshl.i8 d16, d16, #7
; CHECK-NEXT: vshl.s8 d16, d16, d17		; CHECK-NEXT: vshl.s8 d16, d16, d17
; CHECK-NEXT: vmov d17, r0, r1		; CHECK-NEXT: vmov d17, r0, r1
; CHECK-NEXT: vbsl d16, d17, d18		; CHECK-NEXT: vbsl d16, d17, d18
; CHECK-NEXT: vmov r0, r1, d16		; CHECK-NEXT: vmov r0, r1, d16
; CHECK-NEXT: mov pc, lr		; CHECK-NEXT: mov pc, lr
; CHECK-NEXT: .p2align 3		; CHECK-NEXT: .p2align 3
; CHECK-NEXT: @ %bb.1:		; CHECK-NEXT: @ %bb.1:
; CHECK-NEXT: .LCPI22_0:		; CHECK-NEXT: .LCPI22_0:
; CHECK-NEXT: .byte 255 @ 0xff		; CHECK-NEXT: .byte 255 @ 0xff
; CHECK-NEXT: .byte 255 @ 0xff		; CHECK-NEXT: .byte 255 @ 0xff
; CHECK-NEXT: .byte 255 @ 0xff		; CHECK-NEXT: .byte 255 @ 0xff
; CHECK-NEXT: .byte 255 @ 0xff		; CHECK-NEXT: .byte 255 @ 0xff
; CHECK-NEXT: .byte 0 @ 0x0		; CHECK-NEXT: .byte 0 @ 0x0
Show All 9 Lines	; CHECK-NEXT: .byte 6 @ 0x6
ret <8 x i8> %rv		ret <8 x i8> %rv
}		}

; We're using large data types here, and we have to fill with undef values until we		; We're using large data types here, and we have to fill with undef values until we
; get some vector size that we can represent.		; get some vector size that we can represent.
define <10 x i8> @vuzp_wide_type(<10 x i8> %tr0, <10 x i8> %tr1,		define <10 x i8> @vuzp_wide_type(<10 x i8> %tr0, <10 x i8> %tr1,
; CHECK-LABEL: vuzp_wide_type:		; CHECK-LABEL: vuzp_wide_type:
; CHECK: @ %bb.0:		; CHECK: @ %bb.0:
; CHECK-NEXT: .save {r4, lr}		; CHECK-NEXT: .save {r4, lr}
; CHECK-NEXT: push {r4, lr}		; CHECK-NEXT: push {r4, lr}
; CHECK-NEXT: add r12, sp, #32		; CHECK-NEXT: add r12, sp, #32
; CHECK-NEXT: add lr, sp, #48		; CHECK-NEXT: add lr, sp, #48
; CHECK-NEXT: vld1.32 {d17[0]}, [r12:32]		; CHECK-NEXT: vld1.32 {d17[0]}, [r12:32]
; CHECK-NEXT: add r12, sp, #24		; CHECK-NEXT: add r12, sp, #24
; CHECK-NEXT: vld1.32 {d16[0]}, [r12:32]		; CHECK-NEXT: vld1.32 {d16[0]}, [r12:32]
; CHECK-NEXT: add r12, sp, #56		; CHECK-NEXT: add r12, sp, #56
; CHECK-NEXT: vld1.32 {d19[0]}, [r12:32]		; CHECK-NEXT: vld1.32 {d19[0]}, [r12:32]
; CHECK-NEXT: ldr r12, [sp, #68]
; CHECK-NEXT: vld1.32 {d18[0]}, [lr:32]		; CHECK-NEXT: vld1.32 {d18[0]}, [lr:32]
; CHECK-NEXT: add lr, sp, #40		; CHECK-NEXT: add lr, sp, #40
; CHECK-NEXT: vld1.32 {d20[0]}, [lr:32]		; CHECK-NEXT: vld1.32 {d20[0]}, [lr:32]
		; CHECK-NEXT: ldr r12, [sp, #68]
; CHECK-NEXT: ldr r4, [r12]		; CHECK-NEXT: ldr r4, [r12]
; CHECK-NEXT: vmov.32 d23[0], r4		; CHECK-NEXT: vmov.32 d23[0], r4
; CHECK-NEXT: add r4, sp, #64		; CHECK-NEXT: add r4, sp, #64
; CHECK-NEXT: vld1.32 {d24[0]}, [r4:32]		; CHECK-NEXT: vld1.32 {d24[0]}, [r4:32]
; CHECK-NEXT: add r4, sp, #36		; CHECK-NEXT: add r4, sp, #36
		; CHECK-NEXT: vcgt.u32 q10, q12, q10
; CHECK-NEXT: vld1.32 {d17[1]}, [r4:32]		; CHECK-NEXT: vld1.32 {d17[1]}, [r4:32]
; CHECK-NEXT: add r4, sp, #28		; CHECK-NEXT: add r4, sp, #28
; CHECK-NEXT: vcgt.u32 q10, q12, q10
; CHECK-NEXT: vmov.u8 lr, d23[3]
; CHECK-NEXT: vld1.32 {d16[1]}, [r4:32]		; CHECK-NEXT: vld1.32 {d16[1]}, [r4:32]
; CHECK-NEXT: add r4, sp, #60		; CHECK-NEXT: add r4, sp, #60
; CHECK-NEXT: vld1.32 {d19[1]}, [r4:32]		; CHECK-NEXT: vld1.32 {d19[1]}, [r4:32]
; CHECK-NEXT: add r4, sp, #52		; CHECK-NEXT: add r4, sp, #52
; CHECK-NEXT: vld1.32 {d18[1]}, [r4:32]		; CHECK-NEXT: vld1.32 {d18[1]}, [r4:32]
; CHECK-NEXT: add r4, r12, #4		; CHECK-NEXT: add r4, r12, #4
; CHECK-NEXT: vcgt.u32 q8, q9, q8		; CHECK-NEXT: vcgt.u32 q8, q9, q8
; CHECK-NEXT: vmovn.i32 d19, q10		; CHECK-NEXT: vmovn.i32 d19, q10
		; CHECK-NEXT: vmov.u8 lr, d23[3]
; CHECK-NEXT: vldr d20, .LCPI23_0		; CHECK-NEXT: vldr d20, .LCPI23_0
; CHECK-NEXT: vmovn.i32 d18, q8		; CHECK-NEXT: vmovn.i32 d18, q8
; CHECK-NEXT: vmovn.i16 d22, q9		; CHECK-NEXT: vmovn.i16 d22, q9
; CHECK-NEXT: vmov.i8 q9, #0x7		; CHECK-NEXT: vmov.i8 q9, #0x7
; CHECK-NEXT: vmov.8 d17[0], lr
; CHECK-NEXT: vneg.s8 q9, q9		; CHECK-NEXT: vneg.s8 q9, q9
		; CHECK-NEXT: vmov.8 d17[0], lr
; CHECK-NEXT: vtbl.8 d16, {d22, d23}, d20		; CHECK-NEXT: vtbl.8 d16, {d22, d23}, d20
; CHECK-NEXT: vld1.8 {d17[1]}, [r4]		; CHECK-NEXT: vld1.8 {d17[1]}, [r4]
; CHECK-NEXT: add r4, sp, #8		; CHECK-NEXT: add r4, sp, #8
; CHECK-NEXT: vshl.i8 q8, q8, #7		; CHECK-NEXT: vshl.i8 q8, q8, #7
; CHECK-NEXT: vld1.64 {d20, d21}, [r4]		; CHECK-NEXT: vld1.64 {d20, d21}, [r4]
; CHECK-NEXT: vshl.s8 q8, q8, q9		; CHECK-NEXT: vshl.s8 q8, q8, q9
; CHECK-NEXT: vmov d19, r2, r3		; CHECK-NEXT: vmov d19, r2, r3
; CHECK-NEXT: vmov d18, r0, r1		; CHECK-NEXT: vmov d18, r0, r1
; CHECK-NEXT: vbsl q8, q9, q10		; CHECK-NEXT: vbsl q8, q9, q10
; CHECK-NEXT: vmov r0, r1, d16		; CHECK-NEXT: vmov r0, r1, d16
; CHECK-NEXT: vmov r2, r3, d17		; CHECK-NEXT: vmov r2, r3, d17
; CHECK-NEXT: pop {r4, lr}		; CHECK-NEXT: pop {r4, lr}
; CHECK-NEXT: mov pc, lr		; CHECK-NEXT: mov pc, lr
; CHECK-NEXT: .p2align 3		; CHECK-NEXT: .p2align 3
; CHECK-NEXT: @ %bb.1:		; CHECK-NEXT: @ %bb.1:
; CHECK-NEXT: .LCPI23_0:		; CHECK-NEXT: .LCPI23_0:
; CHECK-NEXT: .byte 0 @ 0x0		; CHECK-NEXT: .byte 0 @ 0x0
; CHECK-NEXT: .byte 1 @ 0x1		; CHECK-NEXT: .byte 1 @ 0x1
; CHECK-NEXT: .byte 2 @ 0x2		; CHECK-NEXT: .byte 2 @ 0x2
; CHECK-NEXT: .byte 3 @ 0x3		; CHECK-NEXT: .byte 3 @ 0x3
; CHECK-NEXT: .byte 4 @ 0x4		; CHECK-NEXT: .byte 4 @ 0x4

test/CodeGen/SystemZ/misched-readadvances.mir

This file was added.

				# Check that the extra operand for the full register added by RegAlloc does
				# not have a latency that interferes with the latency adjustment
				# (ReadAdvance) for the MSY register operand.
				MatzeB: If you have the time, look at: https://llvm.org/docs/MIRLangRef.html#simplifying-mir-files This smells like you can do things like dropping the IR part, not listing the successor blocks (at least for the blocks that don't use the jumptable)...

				# RUN: llc %s -mtriple=s390x-linux-gnu -mcpu=z13 -start-before=machine-scheduler \
				# RUN: -debug-only=machine-scheduler -o - 2>&1 \| FileCheck %s
				# REQUIRES: asserts

				# CHECK: ScheduleDAGMI::schedule starting
				# CHECK: SU(4): renamable $r2l = MSR renamable $r2l(tied-def 0), renamable $r2l
				# CHECK: Latency : 6
				# CHECK: SU(5): renamable $r2l = MSY renamable $r2l(tied-def 0), renamable $r1d, -4, $noreg, implicit $r2d
				# CHECK: Predecessors:
				# CHECK: SU(4): Data Latency=2 Reg=$r2l
				# CHECK: SU(4): Data Latency=0 Reg=$r2d

				---
				name: Perl_do_sv_dump
				alignment: 4
				tracksRegLiveness: true
				body: \|
				bb.0 :
				%1:addr64bit = IMPLICIT_DEF
				%2:addr64bit = IMPLICIT_DEF
				%3:vr64bit = IMPLICIT_DEF

				bb.1 :
				%2:addr64bit = ALGFI %2, 4294967291, implicit-def dead $cc
				%2.subreg_l32:addr64bit = MSR %2.subreg_l32, %2.subreg_l32
				%2.subreg_l32:addr64bit = MSY %2.subreg_l32, %1, -4, $noreg
				...
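
The CHECK lines in this test expect two data edges from SU(4) to SU(5): latency 2 on $r2l and latency 0 on the implicit $r2d. A minimal sketch of that expectation follows. It is not the code in ScheduleDAGInstrs.cpp; `EdgeModel` and `edgeLatency` are hypothetical names, and the ReadAdvance of 4 cycles is an assumption inferred from the expected values (MSR latency 6, edge latency 2).

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical, simplified model of one data-dependency edge (not LLVM's SDep).
struct EdgeModel {
  int DefLatency;         // latency of the defining instruction, e.g. 6 for MSR
  int ReadAdvance;        // cycles by which the use operand is read late (assumed)
  bool ImplicitPseudoUse; // use operand lies outside the instruction descriptor
};

// Latency placed on the edge under this model.
inline int edgeLatency(const EdgeModel &E) {
  if (E.ImplicitPseudoUse)
    return 0; // the extra full-register operand should not constrain the schedule
  return std::max(0, E.DefLatency - E.ReadAdvance);
}

// Mirrors the two predecessor edges the test checks for on SU(5).
inline void checkMisschedReadAdvanceExample() {
  assert(edgeLatency({6, 4, false}) == 2); // Reg=$r2l
  assert(edgeLatency({6, 4, true}) == 0);  // Reg=$r2d
}
```

Under this model the descriptor operand keeps its ReadAdvance-adjusted latency, while the implicit full-register use added for the subregister access no longer pulls the edge back up to the full latency of 6.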

test/CodeGen/Thumb2/umulo-128-legalisation-lowering.ll

	; THUMBV7-NEXT: movne r4, #1			; THUMBV7-NEXT: movne r4, #1
	; THUMBV7-NEXT: cmp.w r10, #0			; THUMBV7-NEXT: cmp.w r10, #0
	; THUMBV7-NEXT: and.w r1, r1, r7			; THUMBV7-NEXT: and.w r1, r1, r7
	; THUMBV7-NEXT: it ne			; THUMBV7-NEXT: it ne
	; THUMBV7-NEXT: movne.w r10, #1			; THUMBV7-NEXT: movne.w r10, #1
	; THUMBV7-NEXT: orrs r3, r2			; THUMBV7-NEXT: orrs r3, r2
	; THUMBV7-NEXT: ldr r2, [sp, #80]			; THUMBV7-NEXT: ldr r2, [sp, #80]
	; THUMBV7-NEXT: orr.w r1, r1, r4			; THUMBV7-NEXT: orr.w r1, r1, r4
				; THUMBV7-NEXT: orr.w r1, r1, r10
	; THUMBV7-NEXT: it ne			; THUMBV7-NEXT: it ne
	; THUMBV7-NEXT: movne r3, #1			; THUMBV7-NEXT: movne r3, #1
	; THUMBV7-NEXT: orr.w r1, r1, r10
	; THUMBV7-NEXT: orrs.w r7, r2, r11			; THUMBV7-NEXT: orrs.w r7, r2, r11
	; THUMBV7-NEXT: orr.w r1, r1, r9			; THUMBV7-NEXT: orr.w r1, r1, r9
	; THUMBV7-NEXT: it ne			; THUMBV7-NEXT: it ne
	; THUMBV7-NEXT: movne r7, #1			; THUMBV7-NEXT: movne r7, #1
	; THUMBV7-NEXT: orr.w r0, r0, r12
	; THUMBV7-NEXT: ands r3, r7			; THUMBV7-NEXT: ands r3, r7
				; THUMBV7-NEXT: orr.w r0, r0, r12
	; THUMBV7-NEXT: orrs r1, r3			; THUMBV7-NEXT: orrs r1, r3
	; THUMBV7-NEXT: orrs r0, r1			; THUMBV7-NEXT: orrs r0, r1
	; THUMBV7-NEXT: orr.w r0, r0, r8			; THUMBV7-NEXT: orr.w r0, r0, r8
	; THUMBV7-NEXT: and r0, r0, #1			; THUMBV7-NEXT: and r0, r0, #1
	; THUMBV7-NEXT: strb.w r0, [lr, #16]			; THUMBV7-NEXT: strb.w r0, [lr, #16]
	; THUMBV7-NEXT: add sp, #44			; THUMBV7-NEXT: add sp, #44
	; THUMBV7-NEXT: pop.w {r4, r5, r6, r7, r8, r9, r10, r11, pc}			; THUMBV7-NEXT: pop.w {r4, r5, r6, r7, r8, r9, r10, r11, pc}
	start:			start:

test/CodeGen/Thumb2/umulo-64-legalisation-lowering.ll

	; THUMBV7-NEXT: adc r2, r6, #0			; THUMBV7-NEXT: adc r2, r6, #0
	; THUMBV7-NEXT: cmp r3, #0			; THUMBV7-NEXT: cmp r3, #0
	; THUMBV7-NEXT: it ne			; THUMBV7-NEXT: it ne
	; THUMBV7-NEXT: movne r3, #1			; THUMBV7-NEXT: movne r3, #1
	; THUMBV7-NEXT: cmp r1, #0			; THUMBV7-NEXT: cmp r1, #0
	; THUMBV7-NEXT: it ne			; THUMBV7-NEXT: it ne
	; THUMBV7-NEXT: movne r1, #1			; THUMBV7-NEXT: movne r1, #1
	; THUMBV7-NEXT: cmp r5, #0			; THUMBV7-NEXT: cmp r5, #0
				; THUMBV7-NEXT: and.w r1, r1, r3
	; THUMBV7-NEXT: it ne			; THUMBV7-NEXT: it ne
	; THUMBV7-NEXT: movne r5, #1			; THUMBV7-NEXT: movne r5, #1
	; THUMBV7-NEXT: ands r1, r3			; THUMBV7-NEXT: orrs r1, r5
	; THUMBV7-NEXT: cmp.w lr, #0			; THUMBV7-NEXT: cmp.w lr, #0
	; THUMBV7-NEXT: orr.w r1, r1, r5
	; THUMBV7-NEXT: it ne			; THUMBV7-NEXT: it ne
	; THUMBV7-NEXT: movne.w lr, #1			; THUMBV7-NEXT: movne.w lr, #1
	; THUMBV7-NEXT: orr.w r1, r1, lr			; THUMBV7-NEXT: orr.w r1, r1, lr
	; THUMBV7-NEXT: orrs r2, r1			; THUMBV7-NEXT: orrs r2, r1
	; THUMBV7-NEXT: mov r1, r12			; THUMBV7-NEXT: mov r1, r12
	; THUMBV7-NEXT: pop {r4, r5, r6, pc}			; THUMBV7-NEXT: pop {r4, r5, r6, pc}
	start:			start:
	%0 = tail call { i64, i1 } @llvm.umul.with.overflow.i64(i64 %l, i64 %r) #2			%0 = tail call { i64, i1 } @llvm.umul.with.overflow.i64(i64 %l, i64 %r) #2

test/CodeGen/X86/lsr-loop-exit-cond.ll

	;			;
	; ATOM-LABEL: t:			; ATOM-LABEL: t:
	; ATOM: ## %bb.0: ## %entry			; ATOM: ## %bb.0: ## %entry
	; ATOM-NEXT: pushq %rbp			; ATOM-NEXT: pushq %rbp
	; ATOM-NEXT: pushq %r15			; ATOM-NEXT: pushq %r15
	; ATOM-NEXT: pushq %r14			; ATOM-NEXT: pushq %r14
	; ATOM-NEXT: pushq %rbx			; ATOM-NEXT: pushq %rbx
	; ATOM-NEXT: ## kill: def $ecx killed $ecx def $rcx			; ATOM-NEXT: ## kill: def $ecx killed $ecx def $rcx
	; ATOM-NEXT: movl 4(%rdx), %eax
	; ATOM-NEXT: movl (%rdx), %r15d			; ATOM-NEXT: movl (%rdx), %r15d
				; ATOM-NEXT: movl 4(%rdx), %eax
	; ATOM-NEXT: leaq 20(%rdx), %r14			; ATOM-NEXT: leaq 20(%rdx), %r14
	; ATOM-NEXT: movq _Te0@{{.*}}(%rip), %r9			; ATOM-NEXT: movq _Te0@{{.*}}(%rip), %r9
	; ATOM-NEXT: movq _Te1@{{.*}}(%rip), %r8			; ATOM-NEXT: movq _Te1@{{.*}}(%rip), %r8
	; ATOM-NEXT: movq _Te3@{{.*}}(%rip), %r10			; ATOM-NEXT: movq _Te3@{{.*}}(%rip), %r10
	; ATOM-NEXT: decl %ecx			; ATOM-NEXT: decl %ecx
	; ATOM-NEXT: movq %rcx, %r11			; ATOM-NEXT: movq %rcx, %r11
	; ATOM-NEXT: jmp LBB0_1			; ATOM-NEXT: jmp LBB0_1
	; ATOM-NEXT: .p2align 4, 0x90			; ATOM-NEXT: .p2align 4, 0x90
	; ATOM-NEXT: LBB0_2: ## %bb1			; ATOM-NEXT: LBB0_2: ## %bb1
	; ATOM-NEXT: ## in Loop: Header=BB0_1 Depth=1			; ATOM-NEXT: ## in Loop: Header=BB0_1 Depth=1
	; ATOM-NEXT: shrl $16, %eax			; ATOM-NEXT: shrl $16, %eax
	; ATOM-NEXT: shrl $24, %edi			; ATOM-NEXT: shrl $24, %edi
	; ATOM-NEXT: decq %r11			; ATOM-NEXT: decq %r11
	; ATOM-NEXT: movzbl %al, %ebp			; ATOM-NEXT: movzbl %al, %ebp
	; ATOM-NEXT: movzbl %bl, %eax			; ATOM-NEXT: movzbl %bl, %eax
	; ATOM-NEXT: movl (%r10,%rax,4), %eax			; ATOM-NEXT: movl (%r10,%rax,4), %eax
	; ATOM-NEXT: xorl (%r8,%rbp,4), %r15d			; ATOM-NEXT: xorl (%r8,%rbp,4), %r15d
	; ATOM-NEXT: xorl -4(%r14), %r15d
	; ATOM-NEXT: xorl (%r9,%rdi,4), %eax			; ATOM-NEXT: xorl (%r9,%rdi,4), %eax
				; ATOM-NEXT: xorl -4(%r14), %r15d
	; ATOM-NEXT: xorl (%r14), %eax			; ATOM-NEXT: xorl (%r14), %eax
	; ATOM-NEXT: addq $16, %r14			; ATOM-NEXT: addq $16, %r14
	; ATOM-NEXT: LBB0_1: ## %bb			; ATOM-NEXT: LBB0_1: ## %bb
	; ATOM-NEXT: ## =>This Inner Loop Header: Depth=1			; ATOM-NEXT: ## =>This Inner Loop Header: Depth=1
	; ATOM-NEXT: movl %eax, %edi			; ATOM-NEXT: movl %eax, %edi
	; ATOM-NEXT: movl %r15d, %ebp			; ATOM-NEXT: movl %r15d, %ebp
	; ATOM-NEXT: shrl $24, %eax			; ATOM-NEXT: shrl $24, %eax
	; ATOM-NEXT: shrl $16, %edi			; ATOM-NEXT: shrl $16, %edi
	; ATOM-NEXT: shrl $24, %ebp			; ATOM-NEXT: shrl $24, %ebp
	; ATOM-NEXT: movzbl %dil, %edi			; ATOM-NEXT: movzbl %dil, %edi
	; ATOM-NEXT: movl (%r8,%rdi,4), %ebx			; ATOM-NEXT: movl (%r8,%rdi,4), %ebx
	; ATOM-NEXT: movzbl %r15b, %edi			; ATOM-NEXT: movzbl %r15b, %edi
	; ATOM-NEXT: movl (%r10,%rdi,4), %edi
	; ATOM-NEXT: xorl (%r9,%rbp,4), %ebx			; ATOM-NEXT: xorl (%r9,%rbp,4), %ebx
				; ATOM-NEXT: movl (%r10,%rdi,4), %edi
	; ATOM-NEXT: xorl -12(%r14), %ebx			; ATOM-NEXT: xorl -12(%r14), %ebx
	; ATOM-NEXT: xorl (%r9,%rax,4), %edi			; ATOM-NEXT: xorl (%r9,%rax,4), %edi
	; ATOM-NEXT: movl %ebx, %eax			; ATOM-NEXT: movl %ebx, %eax
				; ATOM-NEXT: xorl -8(%r14), %edi
	; ATOM-NEXT: shrl $24, %eax			; ATOM-NEXT: shrl $24, %eax
	; ATOM-NEXT: movl (%r9,%rax,4), %r15d			; ATOM-NEXT: movl (%r9,%rax,4), %r15d
	; ATOM-NEXT: xorl -8(%r14), %edi
	; ATOM-NEXT: testq %r11, %r11			; ATOM-NEXT: testq %r11, %r11
	; ATOM-NEXT: movl %edi, %eax			; ATOM-NEXT: movl %edi, %eax
	; ATOM-NEXT: jne LBB0_2			; ATOM-NEXT: jne LBB0_2
	; ATOM-NEXT: ## %bb.3: ## %bb2			; ATOM-NEXT: ## %bb.3: ## %bb2
	; ATOM-NEXT: shrl $16, %eax			; ATOM-NEXT: shrl $16, %eax
	; ATOM-NEXT: shrl $8, %edi			; ATOM-NEXT: shrl $8, %edi
	; ATOM-NEXT: movzbl %bl, %ebp			; ATOM-NEXT: movzbl %bl, %ebp
	; ATOM-NEXT: andl $-16777216, %r15d ## imm = 0xFF000000			; ATOM-NEXT: andl $-16777216, %r15d ## imm = 0xFF000000
	(236 lines not shown)

test/CodeGen/X86/phys-reg-local-regalloc.ll

	(14 lines not shown)
	; CHECK-NOT: movl			; CHECK-NOT: movl
	; CHECK: movl %ebx, 40(%esp)			; CHECK: movl %ebx, 40(%esp)
	; CHECK-NOT: movl			; CHECK-NOT: movl
	; CHECK: addl %ebx, %eax			; CHECK: addl %ebx, %eax

	; On Intel Atom the scheduler moves a movl instruction			; On Intel Atom the scheduler moves a movl instruction
	; used for the printf call to follow movl 24(%esp), %eax			; used for the printf call to follow movl 24(%esp), %eax
	; ATOM: movl 24(%esp), %eax			; ATOM: movl 24(%esp), %eax
	; ATOM: movl
	; ATOM: movl %eax, 36(%esp)
	; ATOM-NOT: movl			; ATOM-NOT: movl
				; ATOM: movl %eax, 36(%esp)
				; ATOM: movl
				RKSimon: Sorry - missed that test - looks OK.
	; ATOM: movl 28(%esp), %ebx			; ATOM: movl 28(%esp), %ebx
	; ATOM-NOT: movl			; ATOM-NOT: movl
	; ATOM: movl %ebx, 40(%esp)			; ATOM: movl %ebx, 40(%esp)
	; ATOM-NOT: movl			; ATOM-NOT: movl
	; ATOM: addl %ebx, %eax			; ATOM: addl %ebx, %eax

	%retval = alloca i32 ; <i32*> [#uses=2]			%retval = alloca i32 ; <i32*> [#uses=2]
	%"%ebx" = alloca i32 ; <i32*> [#uses=1]			%"%ebx" = alloca i32 ; <i32*> [#uses=1]
	(32 lines not shown)

test/CodeGen/X86/schedule-x86-64-shld.ll

	(128 lines not shown)
	; GENERIC-NEXT: movq %rdx, %rcx # sched: [1:0.33]			; GENERIC-NEXT: movq %rdx, %rcx # sched: [1:0.33]
	; GENERIC-NEXT: movq %rdi, %rax # sched: [1:0.33]			; GENERIC-NEXT: movq %rdi, %rax # sched: [1:0.33]
	; GENERIC-NEXT: # kill: def $cl killed $cl killed $rcx			; GENERIC-NEXT: # kill: def $cl killed $cl killed $rcx
	; GENERIC-NEXT: shldq %cl, %rsi, %rax # sched: [4:1.50]			; GENERIC-NEXT: shldq %cl, %rsi, %rax # sched: [4:1.50]
	; GENERIC-NEXT: retq # sched: [1:1.00]			; GENERIC-NEXT: retq # sched: [1:1.00]
	;			;
	; BTVER2-LABEL: lshift_cl_optsize:			; BTVER2-LABEL: lshift_cl_optsize:
	; BTVER2: # %bb.0: # %entry			; BTVER2: # %bb.0: # %entry
	; BTVER2-NEXT: movq %rdx, %rcx # sched: [1:0.50]
	; BTVER2-NEXT: movq %rdi, %rax # sched: [1:0.50]			; BTVER2-NEXT: movq %rdi, %rax # sched: [1:0.50]
				; BTVER2-NEXT: movq %rdx, %rcx # sched: [1:0.50]
	; BTVER2-NEXT: # kill: def $cl killed $cl killed $rcx			; BTVER2-NEXT: # kill: def $cl killed $cl killed $rcx
	; BTVER2-NEXT: shldq %cl, %rsi, %rax # sched: [4:4.00]			; BTVER2-NEXT: shldq %cl, %rsi, %rax # sched: [4:4.00]
	; BTVER2-NEXT: retq # sched: [4:1.00]			; BTVER2-NEXT: retq # sched: [4:1.00]
	;			;
	; BDVER1-LABEL: lshift_cl_optsize:			; BDVER1-LABEL: lshift_cl_optsize:
	; BDVER1: # %bb.0: # %entry			; BDVER1: # %bb.0: # %entry
	; BDVER1-NEXT: movq %rdx, %rcx			; BDVER1-NEXT: movq %rdx, %rcx
	; BDVER1-NEXT: movq %rdi, %rax			; BDVER1-NEXT: movq %rdi, %rax
	(58 lines not shown)
	; GENERIC-NEXT: movq %rdx, %rcx # sched: [1:0.33]			; GENERIC-NEXT: movq %rdx, %rcx # sched: [1:0.33]
	; GENERIC-NEXT: movq %rdi, %rax # sched: [1:0.33]			; GENERIC-NEXT: movq %rdi, %rax # sched: [1:0.33]
	; GENERIC-NEXT: # kill: def $cl killed $cl killed $rcx			; GENERIC-NEXT: # kill: def $cl killed $cl killed $rcx
	; GENERIC-NEXT: shrdq %cl, %rsi, %rax # sched: [4:1.50]			; GENERIC-NEXT: shrdq %cl, %rsi, %rax # sched: [4:1.50]
	; GENERIC-NEXT: retq # sched: [1:1.00]			; GENERIC-NEXT: retq # sched: [1:1.00]
	;			;
	; BTVER2-LABEL: rshift_cl_optsize:			; BTVER2-LABEL: rshift_cl_optsize:
	; BTVER2: # %bb.0: # %entry			; BTVER2: # %bb.0: # %entry
	; BTVER2-NEXT: movq %rdx, %rcx # sched: [1:0.50]
	; BTVER2-NEXT: movq %rdi, %rax # sched: [1:0.50]			; BTVER2-NEXT: movq %rdi, %rax # sched: [1:0.50]
				; BTVER2-NEXT: movq %rdx, %rcx # sched: [1:0.50]
	; BTVER2-NEXT: # kill: def $cl killed $cl killed $rcx			; BTVER2-NEXT: # kill: def $cl killed $cl killed $rcx
	; BTVER2-NEXT: shrdq %cl, %rsi, %rax # sched: [4:4.00]			; BTVER2-NEXT: shrdq %cl, %rsi, %rax # sched: [4:4.00]
	; BTVER2-NEXT: retq # sched: [4:1.00]			; BTVER2-NEXT: retq # sched: [4:1.00]
	;			;
	; BDVER1-LABEL: rshift_cl_optsize:			; BDVER1-LABEL: rshift_cl_optsize:
	; BDVER1: # %bb.0: # %entry			; BDVER1: # %bb.0: # %entry
	; BDVER1-NEXT: movq %rdx, %rcx			; BDVER1-NEXT: movq %rdx, %rcx
	; BDVER1-NEXT: movq %rdi, %rax			; BDVER1-NEXT: movq %rdi, %rax
	(248 lines not shown)

test/CodeGen/X86/schedule-x86_32.ll

	(445 lines not shown)
	; BTVER2-NEXT: movl {{[0-9]+}}(%esp), %ecx # sched: [5:1.00]			; BTVER2-NEXT: movl {{[0-9]+}}(%esp), %ecx # sched: [5:1.00]
	; BTVER2-NEXT: #APP			; BTVER2-NEXT: #APP
	; BTVER2-NEXT: arpl %ax, (%ecx) # sched: [100:0.50]			; BTVER2-NEXT: arpl %ax, (%ecx) # sched: [100:0.50]
	; BTVER2-NEXT: #NO_APP			; BTVER2-NEXT: #NO_APP
	; BTVER2-NEXT: retl # sched: [4:1.00]			; BTVER2-NEXT: retl # sched: [4:1.00]
	;			;
	; ZNVER1-LABEL: test_arpl:			; ZNVER1-LABEL: test_arpl:
	; ZNVER1: # %bb.0:			; ZNVER1: # %bb.0:
	; ZNVER1-NEXT: movl {{[0-9]+}}(%esp), %ecx # sched: [8:0.50]
	; ZNVER1-NEXT: movzwl {{[0-9]+}}(%esp), %eax # sched: [8:0.50]			; ZNVER1-NEXT: movzwl {{[0-9]+}}(%esp), %eax # sched: [8:0.50]
				; ZNVER1-NEXT: movl {{[0-9]+}}(%esp), %ecx # sched: [8:0.50]
	; ZNVER1-NEXT: #APP			; ZNVER1-NEXT: #APP
	; ZNVER1-NEXT: arpl %ax, (%ecx) # sched: [100:0.25]			; ZNVER1-NEXT: arpl %ax, (%ecx) # sched: [100:0.25]
	; ZNVER1-NEXT: #NO_APP			; ZNVER1-NEXT: #NO_APP
	; ZNVER1-NEXT: retl # sched: [1:0.50]			; ZNVER1-NEXT: retl # sched: [1:0.50]
	call void asm sideeffect "arpl $0, $1", "r,m"(i16 %a0, i16 %a1)			call void asm sideeffect "arpl $0, $1", "r,m"(i16 %a0, i16 %a1)
	ret void			ret void
	}			}

	(151 lines not shown)
	; BTVER2-NEXT: .cfi_def_cfa_offset 4			; BTVER2-NEXT: .cfi_def_cfa_offset 4
	; BTVER2-NEXT: retl # sched: [4:1.00]			; BTVER2-NEXT: retl # sched: [4:1.00]
	;			;
	; ZNVER1-LABEL: test_bound:			; ZNVER1-LABEL: test_bound:
	; ZNVER1: # %bb.0:			; ZNVER1: # %bb.0:
	; ZNVER1-NEXT: pushl %esi # sched: [1:0.50]			; ZNVER1-NEXT: pushl %esi # sched: [1:0.50]
	; ZNVER1-NEXT: .cfi_def_cfa_offset 8			; ZNVER1-NEXT: .cfi_def_cfa_offset 8
	; ZNVER1-NEXT: .cfi_offset %esi, -8			; ZNVER1-NEXT: .cfi_offset %esi, -8
				; ZNVER1-NEXT: movzwl {{[0-9]+}}(%esp), %eax # sched: [8:0.50]
	; ZNVER1-NEXT: movl {{[0-9]+}}(%esp), %ecx # sched: [8:0.50]			; ZNVER1-NEXT: movl {{[0-9]+}}(%esp), %ecx # sched: [8:0.50]
	; ZNVER1-NEXT: movl {{[0-9]+}}(%esp), %edx # sched: [8:0.50]			; ZNVER1-NEXT: movl {{[0-9]+}}(%esp), %edx # sched: [8:0.50]
	; ZNVER1-NEXT: movl {{[0-9]+}}(%esp), %esi # sched: [8:0.50]			; ZNVER1-NEXT: movl {{[0-9]+}}(%esp), %esi # sched: [8:0.50]
	; ZNVER1-NEXT: movzwl {{[0-9]+}}(%esp), %eax # sched: [8:0.50]
	; ZNVER1-NEXT: #APP			; ZNVER1-NEXT: #APP
	; ZNVER1-NEXT: bound %ax, (%esi) # sched: [100:0.25]			; ZNVER1-NEXT: bound %ax, (%esi) # sched: [100:0.25]
	; ZNVER1-NEXT: bound %ecx, (%edx) # sched: [100:0.25]			; ZNVER1-NEXT: bound %ecx, (%edx) # sched: [100:0.25]
	; ZNVER1-NEXT: #NO_APP			; ZNVER1-NEXT: #NO_APP
	; ZNVER1-NEXT: popl %esi # sched: [8:0.50]			; ZNVER1-NEXT: popl %esi # sched: [8:0.50]
	; ZNVER1-NEXT: .cfi_def_cfa_offset 4			; ZNVER1-NEXT: .cfi_def_cfa_offset 4
	; ZNVER1-NEXT: retl # sched: [1:0.50]			; ZNVER1-NEXT: retl # sched: [1:0.50]
	call void asm sideeffect "bound $0, $1 \0A\09 bound $2, $3", "r,m,r,m"(i16 %a0, i16 %a1, i32 %a2, i32 %a3)			call void asm sideeffect "bound $0, $1 \0A\09 bound $2, $3", "r,m,r,m"(i16 %a0, i16 %a1, i32 %a2, i32 %a3)
	(258 lines not shown)
	; BTVER2-NEXT: #APP			; BTVER2-NEXT: #APP
	; BTVER2-NEXT: decw %ax # sched: [1:0.50]			; BTVER2-NEXT: decw %ax # sched: [1:0.50]
	; BTVER2-NEXT: decw (%ecx) # sched: [5:1.00]			; BTVER2-NEXT: decw (%ecx) # sched: [5:1.00]
	; BTVER2-NEXT: #NO_APP			; BTVER2-NEXT: #NO_APP
	; BTVER2-NEXT: retl # sched: [4:1.00]			; BTVER2-NEXT: retl # sched: [4:1.00]
	;			;
	; ZNVER1-LABEL: test_dec16:			; ZNVER1-LABEL: test_dec16:
	; ZNVER1: # %bb.0:			; ZNVER1: # %bb.0:
	; ZNVER1-NEXT: movl {{[0-9]+}}(%esp), %ecx # sched: [8:0.50]
	; ZNVER1-NEXT: movzwl {{[0-9]+}}(%esp), %eax # sched: [8:0.50]			; ZNVER1-NEXT: movzwl {{[0-9]+}}(%esp), %eax # sched: [8:0.50]
				; ZNVER1-NEXT: movl {{[0-9]+}}(%esp), %ecx # sched: [8:0.50]
	; ZNVER1-NEXT: #APP			; ZNVER1-NEXT: #APP
	; ZNVER1-NEXT: decw %ax # sched: [1:0.25]			; ZNVER1-NEXT: decw %ax # sched: [1:0.25]
	; ZNVER1-NEXT: decw (%ecx) # sched: [5:0.50]			; ZNVER1-NEXT: decw (%ecx) # sched: [5:0.50]
	; ZNVER1-NEXT: #NO_APP			; ZNVER1-NEXT: #NO_APP
	; ZNVER1-NEXT: retl # sched: [1:0.50]			; ZNVER1-NEXT: retl # sched: [1:0.50]
	tail call void asm "decw $0 \0A\09 decw $1", "r,m"(i16 %a0, i16 %a1) nounwind			tail call void asm "decw $0 \0A\09 decw $1", "r,m"(i16 %a0, i16 %a1) nounwind
	ret void			ret void
	}			}
	(189 lines not shown)
	; BTVER2-NEXT: #APP			; BTVER2-NEXT: #APP
	; BTVER2-NEXT: incw %ax # sched: [1:0.50]			; BTVER2-NEXT: incw %ax # sched: [1:0.50]
	; BTVER2-NEXT: incw (%ecx) # sched: [5:1.00]			; BTVER2-NEXT: incw (%ecx) # sched: [5:1.00]
	; BTVER2-NEXT: #NO_APP			; BTVER2-NEXT: #NO_APP
	; BTVER2-NEXT: retl # sched: [4:1.00]			; BTVER2-NEXT: retl # sched: [4:1.00]
	;			;
	; ZNVER1-LABEL: test_inc16:			; ZNVER1-LABEL: test_inc16:
	; ZNVER1: # %bb.0:			; ZNVER1: # %bb.0:
	; ZNVER1-NEXT: movl {{[0-9]+}}(%esp), %ecx # sched: [8:0.50]
	; ZNVER1-NEXT: movzwl {{[0-9]+}}(%esp), %eax # sched: [8:0.50]			; ZNVER1-NEXT: movzwl {{[0-9]+}}(%esp), %eax # sched: [8:0.50]
				; ZNVER1-NEXT: movl {{[0-9]+}}(%esp), %ecx # sched: [8:0.50]
	; ZNVER1-NEXT: #APP			; ZNVER1-NEXT: #APP
	; ZNVER1-NEXT: incw %ax # sched: [1:0.25]			; ZNVER1-NEXT: incw %ax # sched: [1:0.25]
	; ZNVER1-NEXT: incw (%ecx) # sched: [5:0.50]			; ZNVER1-NEXT: incw (%ecx) # sched: [5:0.50]
	; ZNVER1-NEXT: #NO_APP			; ZNVER1-NEXT: #NO_APP
	; ZNVER1-NEXT: retl # sched: [1:0.50]			; ZNVER1-NEXT: retl # sched: [1:0.50]
	tail call void asm "incw $0 \0A\09 incw $1", "r,m"(i16 %a0, i16 %a1) nounwind			tail call void asm "incw $0 \0A\09 incw $1", "r,m"(i16 %a0, i16 %a1) nounwind
	ret void			ret void
	}			}
	(654 lines not shown)
	; BTVER2-NEXT: pushw $4095 # imm = 0xFFF			; BTVER2-NEXT: pushw $4095 # imm = 0xFFF
	; BTVER2-NEXT: # sched: [1:1.00]			; BTVER2-NEXT: # sched: [1:1.00]
	; BTVER2-NEXT: pushw $7 # sched: [1:1.00]			; BTVER2-NEXT: pushw $7 # sched: [1:1.00]
	; BTVER2-NEXT: #NO_APP			; BTVER2-NEXT: #NO_APP
	; BTVER2-NEXT: retl # sched: [4:1.00]			; BTVER2-NEXT: retl # sched: [4:1.00]
	;			;
	; ZNVER1-LABEL: test_pop_push_16:			; ZNVER1-LABEL: test_pop_push_16:
	; ZNVER1: # %bb.0:			; ZNVER1: # %bb.0:
	; ZNVER1-NEXT: movl {{[0-9]+}}(%esp), %ecx # sched: [8:0.50]
	; ZNVER1-NEXT: movzwl {{[0-9]+}}(%esp), %eax # sched: [8:0.50]			; ZNVER1-NEXT: movzwl {{[0-9]+}}(%esp), %eax # sched: [8:0.50]
				; ZNVER1-NEXT: movl {{[0-9]+}}(%esp), %ecx # sched: [8:0.50]
	; ZNVER1-NEXT: #APP			; ZNVER1-NEXT: #APP
	; ZNVER1-NEXT: popw %ax # sched: [8:0.50]			; ZNVER1-NEXT: popw %ax # sched: [8:0.50]
	; ZNVER1-NEXT: popw (%ecx) # sched: [5:0.50]			; ZNVER1-NEXT: popw (%ecx) # sched: [5:0.50]
	; ZNVER1-NEXT: pushw %ax # sched: [1:0.50]			; ZNVER1-NEXT: pushw %ax # sched: [1:0.50]
	; ZNVER1-NEXT: pushw (%ecx) # sched: [4:0.50]			; ZNVER1-NEXT: pushw (%ecx) # sched: [4:0.50]
	; ZNVER1-NEXT: pushw $4095 # imm = 0xFFF			; ZNVER1-NEXT: pushw $4095 # imm = 0xFFF
	; ZNVER1-NEXT: # sched: [1:0.50]			; ZNVER1-NEXT: # sched: [1:0.50]
	; ZNVER1-NEXT: pushw $7 # sched: [1:0.50]			; ZNVER1-NEXT: pushw $7 # sched: [1:0.50]
	(584 lines not shown)

Revision Contents

Diff 171255

lib/CodeGen/ScheduleDAGInstrs.cpp

test/CodeGen/AMDGPU/call-argument-types.ll

test/CodeGen/AMDGPU/call-preserved-registers.ll

test/CodeGen/AMDGPU/callee-special-input-sgprs.ll

test/CodeGen/AMDGPU/indirect-addressing-si.ll

test/CodeGen/AMDGPU/inline-asm.ll

test/CodeGen/AMDGPU/insert_vector_elt.ll

test/CodeGen/AMDGPU/misched-killflags.mir

test/CodeGen/AMDGPU/nested-calls.ll

test/CodeGen/AMDGPU/undefined-subreg-liverange.ll

test/CodeGen/ARM/Windows/chkstk-movw-movt-isel.ll

test/CodeGen/ARM/Windows/chkstk.ll

test/CodeGen/ARM/Windows/memset.ll

test/CodeGen/ARM/arm-and-tst-peephole.ll

test/CodeGen/ARM/arm-shrink-wrapping.ll

test/CodeGen/ARM/cortex-a57-misched-ldm-wrback.ll

test/CodeGen/ARM/cortex-a57-misched-ldm.ll

test/CodeGen/ARM/cortex-a57-misched-vldm-wrback.ll

test/CodeGen/ARM/cortex-a57-misched-vldm.ll

test/CodeGen/ARM/fp16-instructions.ll

test/CodeGen/ARM/select.ll

test/CodeGen/ARM/twoaddrinstr.ll

test/CodeGen/ARM/vcombine.ll

test/CodeGen/ARM/vuzp.ll

test/CodeGen/SystemZ/misched-readadvances.mir

test/CodeGen/Thumb2/umulo-128-legalisation-lowering.ll

test/CodeGen/Thumb2/umulo-64-legalisation-lowering.ll

test/CodeGen/X86/lsr-loop-exit-cond.ll

test/CodeGen/X86/phys-reg-local-regalloc.ll

test/CodeGen/X86/schedule-x86-64-shld.ll

test/CodeGen/X86/schedule-x86_32.ll
