
[MachineCombiner] Support for floating-point FMA on ARM64
ClosedPublic

Authored by Gerolf on Apr 3 2016, 11:46 PM.

Details

Summary

Patch to evaluate fmul+fadd -> fmadd combines and similar code sequences in the
machine combiner. It adds support for float and double similar to the existing
integer implementation. The key features of this patch are:

  • DAGCombiner checks whether it should combine greedily or let the machine combiner do the evaluation. This is (initially) only supported for ARM64.
  • It gives preference to throughput over latency: the heuristic used is to always combine in loops, but this can be chosen by the target. For in-order cores, latency over throughput might be the better choice.
  • Support for fmadd, f(n)msub, fmla, fmls
  • On by default at O3 fast-math
  • Performance: (mostly) single-digit gains on kernels and SPEC2006 fp

Diff Detail

Event Timeline

Gerolf updated this revision to Diff 52522.Apr 3 2016, 11:46 PM
Gerolf retitled this revision from to [MachineCombiner] Support for floating-point FMA on ARM64.
Gerolf updated this object.
Gerolf added reviewers: qcolombet, t.p.northover, spatel.
Gerolf added a subscriber: llvm-commits.
jmolloy added a subscriber: jmolloy.Apr 4 2016, 4:58 AM

Hi Gerolf,

At a high level, could you please explain in what situations you expect *not* combining FMUL+FADD->FMA to be a benefit? They use the same resource types on every chip I know of, and FMA has a shorter latency than FMUL+FADD on every chip I know of.

Cheers,

James

include/llvm/CodeGen/MachineCombinerPattern.h
42

For the future: the pattern list is starting to grow quite large. I wonder if in the future we should consider moving the MachineCombinerPatterns to be table-generated?

Gerolf added a subscriber: Gerolf.Apr 4 2016, 1:26 PM

Hi James,

Sure, sorry I missed that. I looked at this too long, I guess :-). It is principally the same 'better ILP' story as for integers. The prototypical idea is this: imagine the results of two fmuls feeding the fadd. When the two fmuls can execute in parallel, it can be faster to issue fmul, fmul, fadd rather than fmul, fmadd.

Cheers
Gerolf

> Sure, sorry I missed that. I looked at this too long, I guess :-). It is principally the same 'better ILP' story as for integers. The prototypical idea is this: imagine the results of two fmuls feeding the fadd. When the two fmuls can execute in parallel, it can be faster to issue fmul, fmul, fadd rather than fmul, fmadd.

I think the effect of this optimization depends on the microarchitecture implementation. If an out-of-order microarchitecture can crack an fmadd into smaller uops like fmul and fadd, this optimization is not worthwhile for that kind of microarchitecture. (It is also not good for code size, which means more instruction-fetch overhead.)

How about adding a flag for controlling this optimization, selected per microarchitecture or core?

flyingforyou added inline comments.Apr 5 2016, 6:34 PM
test/CodeGen/AArch64/arm64-fma-combines.ll
122

Please, remove trailing whitespace.

Hello,

I did just a very quick scan through the change-set
and added a few minor comments related to styling.

I am not ready to comment on the general idea of the changes yet.

Thank you,
Slava

include/llvm/CodeGen/SelectionDAGTargetInfo.h
142

There are several misprints here. 'effiicent'->'efficient', 'generated'->'generate',
'instead on'->'instead of'.

Regarding the name of the function. Shouldn't it start with a lower case letter?
Also, the name of the function would be a bit more informative if it was
fuseMulAddInMachineCombiner()
or
generateFMAsInMachineCombiner()

The word 'Evaluate' initially made me think that the code did the replacement of FMAs with MUL/ADD. (Yes, in many cases such un-fusing may be very efficient).

lib/Target/AArch64/AArch64InstrInfo.cpp
2626–2631

I think it is good practice to add a descriptive comment telling what a function does,
including a description of its parameters and return value, at least for newly added functions.

2641

A single statement following an if-statement does not need braces { }.

2807

In my opinion, the function name is a bit confusing.
The comment talks about MADD operations (i.e. MUL and ADD), but the function name has only the word 'Multiply'.
The current name makes me think that the function is doing some sort of fusion of a few MULs into a bigger operation, e.g. a*b*c*d -> some_operation(a,b,c,d).

include/llvm/CodeGen/MachineCombinerPattern.h
42

Yes, I agree. And more generally we need to decide which pattern should be handled by DAGCombiner/Global Isel and what should be moved into the MachineCombiner.

include/llvm/CodeGen/SelectionDAGTargetInfo.h
142

Agree & done, except for the lower case/upper case issue. I picked the spelling consistent with other names in the corresponding header files. I will change all names following the current naming convention in a follow up commit.

lib/Target/AArch64/AArch64InstrInfo.cpp
2626–2631

Done.

2641

Done.

2807

Changed to getFMAPatterns()

test/CodeGen/AArch64/arm64-fma-combines.ll
122

done.

Gerolf updated this revision to Diff 53839.Apr 14 2016, 9:27 PM

Addresses minor issues like spelling, function names, comments. Is there any
major issue that needs attention?

v_klochkov added a comment.EditedApr 15 2016, 1:50 PM

I worked on X86 FMA optimization in another compiler and switched to the LLVM project just recently.

It is unlikely that anything written below can be implemented in this change-set.
It is good that the new interface asks the target for permission to generate or not generate
FMAs in DAGCombiner, but that does not help to solve the problems described below.

This should not stop you from adding these changes to LLVM.
Also, do not consider me a reviewer; the primary reviewers should give their opinion.

Here I just wanted to add some notes regarding the latency-vs-throughput problem on X86
so that other developers have them in view when they add latency-vs-throughput fixes.

My biggest concern regarding latency-vs-throughput decisions is that
such decisions are often made using just one pattern or DAG rather than whole-loop
analysis (perhaps I am missing something in LLVM).

I provided 4 examples having quite similar code in them.

Example1 - shows that FMAs can be very harmful for performance on Haswell.
Example2 - is similar to Example1, shows that FMAs can be harmful on Haswell and newer CPUs like Skylake.
           It also shows that it is often enough to replace only 1 FMA to fix the problem and leave other FMAs.
Example3 - shows that solutions for Example1 and Example2 can easily be wrong.
Example4 - shows that there is no single solution like "tune for throughput" or "tune for latency";
           the tuning may be different for different DAGs in one loop.

Ok, let's start...

Fusing MUL+ADD into FMA can easily be inefficient on out-of-order CPUs.
The following trivial loop runs about 60-70% slower on Haswell (-march=core-avx2) if FMA is generated.

Example1:
!NOTE: Please assume that the C code below only represents the structure of the final ASM code
(i.e. the loop is not unrolled, etc.)

  // LOOP1
  for (unsigned i = 0; i < N; i++) {
    accu = a[i] * b + accu;// ACCU = FMA(a[i],b,ACCU)
  }
  With FMAs: the latency of the whole loop on Haswell is N*Latency(FMA) = N*5.
  Without FMAs: the latency of the whole loop on Haswell is N*Latency(ADD) = N*3.
              The MUL operation adds nothing because it is computed out of order,
              i.e. the result of the MUL is always available by the time it is consumed by the ADD.

Having FMAs in such a loop may result in a (N*5)/(N*3) = 5/3 = 1.67x slowdown
compared to the code without FMAs.

On Skylake (CPUs with AVX-512), both versions of LOOP1 (with and without FMA) would
take the same time because the latency of ADD is equal to the latency of FMA there.

Example2:
The same problem can still easily be reproduced on Skylake even though the
latencies of MUL/ADD/FMA are all equal there:

// LOOP2
for (unsigned i = 0; i < N; i++) {
  accu = a[i] * b + c[i] * d + accu;
}

There are at least 3 different sequences for LOOP2:
S1: 2xFMAs: ACCU = FMA(a[i],b,FMA(c[i],d,ACCU)); LATENCY = 2xLAT(FMA) = 2*4
S2: 0xFMAs: ACCU = ADD(ADD(MUL(a[i],b),MUL(c[i],d)),ACCU); LATENCY = 2xLAT(ADD) = 2*4
S3: 1xFMA:  ACCU = ADD(ACCU, FMA(a[i],b,MUL(c[i],d))); LATENCY = 1xLAT(ADD) = 4

In (S3) the MUL and FMA operations add nothing to the latency of the whole expression
because an out-of-order CPU has enough execution units to prepare the results of the MUL and the FMA
by the time they are consumed by the ADD.
So (S3) would be about 2 times faster on Skylake and up to 3.3 times faster on Haswell.

Example3:
It shows that the heuristics that could be implemented for Example1 and Example2
may be wrong if applied without whole-loop analysis.

// LOOP3
for (unsigned i = 0; i < N; i++) {
  accu1 = a1[i] * b + c1[i] * d + accu1;
  accu2 = a2[i] * b + c2[i] * d + accu2;
  accu3 = a3[i] * b + c3[i] * d + accu3;
  accu4 = a4[i] * b + c4[i] * d + accu4;
  accu5 = a5[i] * b + c5[i] * d + accu5;
  accu6 = a6[i] * b + c6[i] * d + accu6;
  accu7 = a7[i] * b + c7[i] * d + accu7;
  accu8 = a8[i] * b + c8[i] * d + accu8;
}

This loop must be tuned for throughput because there are many independent DAGs
putting high pressure on the CPU execution units.
The sequence (S1) from Example2 is the best solution for all accumulators in LOOP3:
"ACCUi = FMA(ai[i], b, FMA(ci[i], d, ACCUi))".
It works faster because the loop is bounded by throughput.

On SkyLake:

T = approximate throughput of the loop in clock ticks = N * 16 operations / 2 execution units = N*8
L = latency of the loop = N * 2*Lat(FMA) = N*2*4 = N*8

The time spent in such loop is MAX(L,T) = MAX(N*8, N*8).

Attempts to replace FMAs with MUL and ADD may reduce (L) but will increase (T),
so the time spent in the loop, MAX(L,T), will only get bigger.

Example4:
There may be mixed tuning, i.e. for both throughput and latency in one loop:

// LOOP4
for (unsigned i = 0; i < N; i++) {
  accu1 = a1[i] * b + c1[i] * d + accu1; // tune for latency
  accu2 = a2[i] * b + accu2; // tune for throughput
  accu3 = a3[i] * b + accu3; // tune for throughput
  accu4 = a4[i] * b + accu4; // tune for throughput
  accu5 = a5[i] * b + accu5; // tune for throughput
  accu6 = a6[i] * b + accu6; // tune for throughput
}

On Haswell:
If we generate 2 FMAs for ACCU1 and 1 FMA for each of ACCU2..6, then

Latency of the loop: L = N*2*Latency(FMA) = N*2*5 = N*10
Throughput: T = N * 7 operations / 2 execution units = N*3.5
MAX(L,T) = N*10

Using 1xMUL+1xFMA+1xADD for ACCU1 will reduce the latency L from N*2*5 to

L = N*Latency(FMA) = N*5,
and will only slightly increase T from N*3.5 to 
T = N * 8 operations / 2 execution units = N*4

As a result using sequence (S3) will reduce MAX(L,T) from N*10 to MAX(N*5,N*4) = N*5.

Splitting FMAs in ACCU2,..6 will only increase MAX(L,T).

L = N*Latency(ADD) = N*3
T = N * 13 operations / 2 = N*6.5
MAX(L,T) = MAX(N*3, N*6.5) = N*6.5.

So, the best solution in Example4 is to split 1 FMA for ACCU1, but keep all other FMAs.

qcolombet edited edge metadata.Apr 15 2016, 3:26 PM

Hi Gerolf,

I haven’t looked into either the machine combiner or the AArch64-specific changes, but here are some thoughts on the DAG-related parts.

Let me know if you want me to look into the other parts as well.

Cheers,
-Quentin

include/llvm/CodeGen/SelectionDAGTargetInfo.h
145

Shouldn’t the comment say something like: return true if we want to delegate the generation of fma to the machine combiner?
With the current comment I feel like there is a gap between what the comment says and what the hook is used for:

  • From the name of the method, I would expect the decision to generate fma is delegated to the machine combiner.
  • From the comment, it seems I need to return true if my target supports FMA, and the fact that this combining is done by the machine combiner is almost incidental here.
145

Take the optimization level as input of the hook to give more control to the target.

145

A remark for the future.

Would it make sense to have this hook take an enum or something for the combines we want to delegate to the machine combiner?
We don’t have to decide now, but I believe the approach could be generalized to other hooks, and we do not want one hook per instruction/combine.

include/llvm/Target/TargetInstrInfo.h
824

Could we have a property on the pattern instead?
What are the other kinds of patterns?

The machine combiner optimizes either for throughput or latency; is there anything else?

lib/CodeGen/MachineCombiner.cpp
415

Capital letter for the first letter per the coding standard.

418

No curly braces per the coding standard.

lib/CodeGen/SelectionDAG/DAGCombiner.cpp
89

Since this is already accessible through the DAG and we use it only twice, I am not sure it is worth adding a field for it.
I don’t feel strongly about it though.

7864

Query the opt level from the STI hook.
That way the target has better control over when it wants to move the combine to the machine combiner.

Gerolf updated this revision to Diff 54452.Apr 20 2016, 7:53 PM
Gerolf edited edge metadata.

Comments and local handling of optlevel as Quentin suggested.

qcolombet accepted this revision.Apr 21 2016, 11:05 AM
qcolombet edited edge metadata.

Hi Gerolf,

LGTM for the non-target, non-machine combiner changes.

Make sure to follow LLVM coding style (the few things I’ve highlighted previously) and please run opt -instnamer on the test cases.

Cheers,
-Quentin

include/llvm/Target/TargetInstrInfo.h
824

For the sake of history summarizing the discussion Gerolf and I had offline.
Long term, we should be smart enough to compute this information by ourselves using the scheduling model. To move forward in the meantime, we have this hook to have the targets emulate that with their own heuristics.

This revision is now accepted and ready to land.Apr 21 2016, 11:05 AM
Gerolf marked 3 inline comments as done.Apr 21 2016, 7:20 PM

Thanks, Quentin! I should have addressed all your concerns.

include/llvm/CodeGen/SelectionDAGTargetInfo.h
145

Done.

include/llvm/Target/TargetInstrInfo.h
824

Yes, thanks for that note! I filed rdar://25867823 for the throughput API.

lib/CodeGen/SelectionDAG/DAGCombiner.cpp
89

No preference either. I just made it look like TLI and expect more accesses to come.

Gerolf marked an inline comment as done.Apr 23 2016, 10:26 PM

Thanks, Quentin! I should have addressed all your concerns.

include/llvm/CodeGen/SelectionDAGTargetInfo.h
145

That should have been Agree. TBD.

lib/CodeGen/SelectionDAG/DAGCombiner.cpp
89

For the record, I removed the field and access it through the DAG. It turned out that for some targets the field could result in a null reference.

spatel closed this revision.Apr 26 2016, 11:45 AM
spatel edited edge metadata.

Closing. The patch was committed after a bug fix at:
http://reviews.llvm.org/rL267328

evandro added a subscriber: evandro.Oct 3 2016, 8:15 AM