Summary

Have been experimenting with enabling MachineCombiner.

It seemed better to let the MachineCombiner work on reg/reg opcodes, as it didn't work at all when the input had ADBs. It expects the opcodes to be the same for LHS/RHS, and of course that didn't work when the RHS was a memory operand.
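To illustrate the opcode-matching point, here is a toy model in plain C++ (not LLVM code; `Opcode`, `countReassocCandidates` and the chain representation are hypothetical names made up for this sketch). It assumes a simplified rule where a link in an add chain is a reassociation candidate only if both instructions use the same opcode, so a reg/mem ADB in the chain breaks the match.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-ins for the reg/reg and reg/mem FP add opcodes.
enum class Opcode { WFADB, ADB };

// Toy model: Chain[I] consumes the result of Chain[I - 1]. A link is counted
// as a reassociation candidate only when both instructions share an opcode.
unsigned countReassocCandidates(const std::vector<Opcode> &Chain) {
  unsigned Count = 0;
  for (std::size_t I = 1; I < Chain.size(); ++I)
    if (Chain[I] == Chain[I - 1])
      ++Count;
  return Count;
}
```

In this model a chain of four WFADBs yields three candidate links, while a chain alternating WFADB and ADB yields none.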

To do this, I first tried always selecting FP64 adds directly to WFADB. The MachineCombiner could now do its work, and afterwards the Peephole optimizer helps with folding the loads back into ADB:

main <> patched: isel + peep of f64 adds, machine combiner disabled.

adb            :                10796                 9057    -1739
...
wfadb          :                10733                11691     +958
adbr           :                 4345                 5203     +858
...

Spill|Reload   :               642581               641631     -950
Copies         :              1011928              1011420     -508

Most of the loads are folded, but not all (84%). Spilling/copies however improve, so I am not sure if it's worth further effort to fold everything, or even if that would be better.

Preliminary benchmarking shows 20% (!) improvement on LBM - twice as much as with the SLP vectorizer (confirmed with a preliminary "full" run also).

Slight improvement on namd and imagick (~1%), without any reassociation, just with the add reg/mem change, which makes it seem like this worked out pretty well.

A complication with this, though: changing VL64;WFADB -> ADB introduces a CC clobber. Scanning forward from the WFADB often led to a quick termination when a new CC def was encountered. There were however also many longer searches.
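The forward scan can be sketched with a toy model in plain C++ (not the LLVM API; `CCInstr` and `isCCDeadAfter` are hypothetical names). It mirrors the order of checks in the patch: a reader of CC before any new def means CC is live, a new def ends the search early, and reaching the end of the block falls back to the block's live-out information.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical, minimal view of an instruction: only its CC behaviour.
struct CCInstr {
  bool ReadsCC;
  bool DefsCC;
};

// Returns true if CC is dead right after Block[Pos], i.e. the instruction at
// Pos could be replaced by one that clobbers CC.
bool isCCDeadAfter(const std::vector<CCInstr> &Block, std::size_t Pos,
                   bool CCLiveOut) {
  assert(Pos < Block.size() && "position must be inside the block");
  for (std::size_t I = Pos + 1; I < Block.size(); ++I) {
    // Check the read first: an instruction may both read and redefine CC.
    if (Block[I].ReadsCC)
      return false;
    if (Block[I].DefsCC)
      return true; // Quick termination on a new CC def.
  }
  // No def or use found: CC is dead only if it is not live out of the block.
  return !CCLiveOut;
}
```

The cost of the scan is the distance to the next CC def or use, which is why blocks without any later CC def lead to the longer searches.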

Measuring compile time for the Peephole pass shows a slight average increase, with a bad worst case or two:

main <> patch (without running machine combiner)

Num stats: 3465         Num stats: 3465
Average Wall: 0.49%   | Average Wall: 0.50%
Wall %   Count          Wall %   Count
4.0      1            | 9.3      1
1.5      1            | 4.3      1
1.4      4            | 3.3      1
                      > 2.7      1
                      > 1.5      2
                      > 1.4      5
1.3      4              1.3      4
1.2      3            | 1.2      2
1.1      4            | 1.1      6
1.0      10           | 1.0      11
0.9      58           | 0.9      69
0.8      152          | 0.8      148
0.7      270          | 0.7      281
0.6      586          | 0.6      611
0.5      894          | 0.5      921
0.4      842          | 0.4      801
0.3      480          | 0.3      460
0.2      139          | 0.2      121
0.1      14           | 0.1      10

This might be acceptable perhaps.

Selecting WFADB (the real instruction, not a pseudo) directly in Select() led to one file having a few cases of LOC instructions moved around in the block, crossing the places where ADB would then no longer be possible. This happened in just one file, but a dozen or so times inside a loop, so not quite ideal. Using an isel pseudo that clobbers CC remedied this, as it was the isel scheduler that was causing this difference. With this, there are no cases on SPEC where CC clobbering prevents optimizeLoadInstr(), but the scan to make sure still has to be made of course.

One idea I tried was to simply add the CC def to WFADB, which then would be trivially replaceable with ADB. It did cause some changes in ~50 files, but I am not sure if that would make a difference.

A compromise would be to have a WFADB_CC pseudo live all the way through optimizeLoadInstr() and then convert it to the real opcode before scheduling. That would however seem to require a new pass, as *all* of them then have to be handled, while optimizeLoadInstr() only sees the candidates for load folding. If that new pass had LiveIntervals available, the scan for CC would not be necessary, as that query should then be available.

In short, this seems promising performance-wise even compared to the SLP vectorizer generating reductions, but I'm not quite sure which way to handle the CC clobbering problem is the best fit.

Diff Detail

Unit Tests: Failed

Time	Test
60,050 ms	x64 debian > MLIR.Examples/standalone::test.toy

Event Timeline

jonpa created this revision. Apr 7 2023, 11:36 AM

Herald added a project: Restricted Project. Apr 7 2023, 11:36 AM

Herald added subscribers: pengfei, hiraditya.

jonpa requested review of this revision. Apr 7 2023, 11:36 AM

Herald added a subscriber: llvm-commits.

Harbormaster completed remote builds in B224260: Diff 511746. Apr 7 2023, 12:34 PM

Using a pseudo reg/reg that pretends to clobber CC to simplify the later reg/mem folding. This seems to work well.
Handling all floating point add, sub and mul, which are the set of operations that depend on the reassociable/nsz flags.
Having scheduler info for the _CCPseudo:s turned out to be important as MachineCombiner looks at the latencies.
New target hook "processFunctionAfterPeepholeOpt()", which lets target do any post-processing after peephole optimizations. This is needed in order to substitute the _CCPseudo instructions that did not get optimized (into reg/mem), with the target instruction without the CC operand.
Using a pattern for selecting the _CCPseudo instructions instead of doing it manually in select(). This change made it obvious that the NoFPExcept flag is to be added here, which the pattern based selector does. This looks ok to me, but not quite sure if it would make more sense to instead predicate the reg/mem patterns or not.
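The post-peephole substitution described above can be pictured with a toy model in plain C++ (not the actual hook; `PseudoInstr`, `Opcode` and `lowerRemainingPseudos` are hypothetical names). It assumes the only work needed is to swap the opcode of every leftover pseudo and drop its CC def, leaving already-folded reg/mem instructions alone.

```cpp
#include <cassert>
#include <vector>

// Hypothetical opcodes: real reg/reg add, its CC-clobbering pseudo, reg/mem add.
enum class Opcode { WFADB, WFADB_CCPseudo, ADB };

struct PseudoInstr {
  Opcode Opc;
  bool HasCCDef; // Models the implicit CC def operand.
};

// Rewrites every pseudo that was not folded into a reg/mem instruction to the
// real reg/reg opcode and removes its CC def. Returns the number rewritten.
unsigned lowerRemainingPseudos(std::vector<PseudoInstr> &Func) {
  unsigned NumLowered = 0;
  for (PseudoInstr &MI : Func) {
    if (MI.Opc != Opcode::WFADB_CCPseudo)
      continue;
    assert(MI.HasCCDef && "pseudo is expected to carry a CC def");
    MI.Opc = Opcode::WFADB;
    MI.HasCCDef = false;
    ++NumLowered;
  }
  return NumLowered;
}
```

Unlike the load-folding hook, which only sees folding candidates, this kind of walk has to visit every instruction in the function.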

FMin, FMax and the integer instructions are the remaining instructions that need to be handled.

Preliminary benchmarking looks good - lbm improved another percent with the subtractions (now 21% improvement).

I have not yet looked further into any effects of doing the instruction selection this way, with later peephole folding of loads. One thing I noticed is some cases where MDEBR is not used, but instead WLDEB + WFMDB. It seems two separate instructions are slower in this case. Not sure if it would be worth handling that in processFunctionAfterPeepholeOpt(), or perhaps just not handling those cases with reassociation (relatively few cases):

main <> patched

mdebr          :                  170                   16     -154
wldeb          :                  661                  793     +132
ldebr          :                 8803                 8790      -13
ldeb           :                 5599                 5596       -3

Harbormaster completed remote builds in B229397: Diff 518670. May 2 2023, 4:36 AM

MachineSink behaved differently with the new pseudos that clobber CC - fixed with a patch in MachineSink plus making sure to mark the CC def as dead on the newly created instructions in MachineCombiner.

Previously tried selecting _CCPseudo:s by using a pattern with added complexity for them, and then also an even higher complexity for MDEBR. Seemed better to instead predicate the reg/mem pattern with "no reassociation flags", and selecting the _CCPseudo in case of "reassociation flags", or else the target instruction, which the patch now does. These two alternatives gave identical output on SPEC.
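The selection rule described here can be summarized by a small decision function. This is a sketch in plain C++ under stated assumptions, not the TableGen predicates from the patch: `FPFlags`, `Selected` and `selectFPBinOp` are hypothetical names, and "reassociation flags" is taken to mean both the reassoc and nsz flags, as in the earlier isel code of this revision.

```cpp
// Hypothetical result of instruction selection for an FP add/sub/mul.
enum class Selected { RegMem, RegReg, CCPseudo };

// Hypothetical subset of the fast-math flags on the node.
struct FPFlags {
  bool Reassoc;
  bool NoSignedZeros;
};

// With reassociation flags, delay load folding and pick the CC-clobbering
// pseudo. Otherwise fold the load right away if there is one, or else use
// the plain target instruction.
Selected selectFPBinOp(FPFlags Flags, bool RHSIsFoldableLoad) {
  if (Flags.Reassoc && Flags.NoSignedZeros)
    return Selected::CCPseudo;
  return RHSIsFoldableLoad ? Selected::RegMem : Selected::RegReg;
}
```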

Experimented with MDEBR, but it seemed that those cases are rare and no additional reassociation was done on benchmarks, so I removed the folding I had working in FinalizeReassociation (ldebr; wfmdb -> mdebr).

PeepholeOptimizer does not fold loads across basic blocks but it seems good to fold them in FinalizeReassociation. Tried doing this first with only loads from constant pool, but it seemed to be even better to do it on any load.

Removed the check in optimizeLoadInstr() that required no other user in MBB when folding into reg/mem. With this restriction in place:

                                 main                patch
mdb            :                 9667                 5507    -4160
meeb           :                 8838                 4831    -4007
adb            :                10787                 8591    -2196
aeb            :                 7322                 5534    -1788
sdb            :                 4271                 4409     +138
seb            :                 4706                 4094     -612
Copies         :              1006170               999666    -6504

Without it (as the patch is now), the number of reg/mem instructions is much closer to main:

mdb            :                 9667                 9061     -606
meeb           :                 8838                 8210     -628
adb            :                10787                 9467    -1320
aeb            :                 7322                 6992     -330
sdb            :                 4271                 4637     +366
seb            :                 4706                 4697       -9
Copies         :              1006170              1006497     +327

As seen in the number of register moves (copies) in the output, folding into 2-address reg/mem instructions has the price of copying the source reg. The lower number of copies didn't seem to matter for performance. With the extra folding I see a great improvement in f538.imagick_r (~15%), which is probably the same improvement as when disabling the pre-RA machine scheduler, so it seems that the increased spilling there is also avoided this way. LBM also gains another 2% with this (now ~20%), so it looks preferable at least for the moment. If the scheduler is improved to consistently reduce register pressure, perhaps this could be reevaluated.

With nightly full runs I now see three big improvements on z15:

Improvements:
0.794: f519.lbm_r 
0.855: f538.imagick_r 
0.906: f510.parest_r
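For reading the numbers above, a minimal helper in plain C++ can convert a ratio into a percentage. This assumes the listed values are patched/main run-time ratios (an assumption, although it is consistent with the roughly 20% quoted for LBM); `improvementPercent` is a hypothetical name.

```cpp
#include <cassert>

// Converts a patched/main run-time ratio into a percentage improvement.
// Assumption: values below 1.0 mean the patched build is faster.
double improvementPercent(double Ratio) {
  assert(Ratio > 0.0 && "ratio must be positive");
  return (1.0 - Ratio) * 100.0;
}
```

Under that assumption, 0.794 corresponds to about 20.6%, 0.855 to about 14.5% and 0.906 to about 9.4%.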

Will now give it a try to find further improvements with fused add/sub and multiply.

Harbormaster completed remote builds in B232798: Diff 523291. May 18 2023, 2:33 AM

Better not reject reg/mem folding during isel (like adb) for older machines.

Harbormaster completed remote builds in B232868: Diff 523381. May 18 2023, 8:50 AM

The Add/Sub/Multiply reassociation seems to work well, so this is adding a handling of FMAs on top of that. This is still under development as it is not yet clear exactly what is the best approach.

SystemZReassocAdditions.cpp: New experimental optimization that finds chains of FMAs and Adds and reorders computations to minimize stalls. This should help MachineCombiner reassociation but could also perhaps be good on its own. I am a little unsure of how aggressive this pass should/could be. Right now it assumes that if the FmContract flag is passed in addition to FmReassoc and FmNsz, it is ok to take apart FMAs/Adds and form a new FMA at the top of the chain.

Reg/mem handling for WFMADB and WFMASB the same way as for the binary instructions.

Experimental FMA patterns added with tests. Two of them are inspired by the PPC ILP patterns. Have not yet tried these on benchmarks so still mostly unclear what will be the best in the end. One issue might be with longer FMA chains where it might be good to have MachineCombiner step back and revisit the resulting chain of Adds. If it could do that, using FMA2 or FM4 should be the same on a chain of 4 FMAs, for example.

The current idea is to let MachineCombiner decide if a transformation is beneficial, looking at the actual depths of the MachineInstrs using the SchedModel. For the binops, TargetInstrInfo reassociation tries two variants to let MachineCombiner decide which one is improving the Critical Path. For FMA patterns, this is not done so it is kind of arbitrary if a pattern is beneficial - it depends on the incoming operand latencies of the multiplication factors. Maybe that could be done on smaller patterns, like having three or four variants of it, or perhaps the target hook could use the BlockTrace to decide that directly instead. Compile time has been mentioned in comments, so perhaps that would be a good idea (if useful). Then again, the ReassocAdditions pass would be the alternative to doing this, if that turns out to be of good use.
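The idea of letting operand depths decide between variants can be sketched with a toy model in plain C++ (not the MachineCombiner or SchedModel API; `addDepth`, `Choice`, `chooseVariant` and the latency values used below are hypothetical). It compares the original (A + B) + C with the reassociated (B + C) + A and switches only when the critical path gets strictly shorter.

```cpp
#include <algorithm>
#include <cassert>

// Depth at which an add finishes, given the depths at which its operands
// become available and a fixed add latency.
unsigned addDepth(unsigned LHS, unsigned RHS, unsigned Latency) {
  return std::max(LHS, RHS) + Latency;
}

struct Choice {
  bool Reassociate;
  unsigned Depth;
};

// A, B and C are the depths at which the three inputs become available.
Choice chooseVariant(unsigned A, unsigned B, unsigned C, unsigned Latency) {
  assert(Latency > 0 && "toy model expects a non-zero latency");
  unsigned Original = addDepth(addDepth(A, B, Latency), C, Latency);
  unsigned Variant = addDepth(addDepth(B, C, Latency), A, Latency);
  if (Variant < Original)
    return {true, Variant};
  return {false, Original};
}
```

With a late A (depth 10), early B and C (depth 0) and a latency of 6, the variant finishes at depth 16 instead of 22. If C is the late operand instead, the original order is already the better one, which is the kind of operand dependence the FMA patterns currently do not take into account.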

Harbormaster completed remote builds in B236364: Diff 528098. Jun 3 2023, 6:35 AM

Diff 511746

llvm/include/llvm/CodeGen/TargetInstrInfo.h

Show First 20 Lines • Show All 1,617 Lines • ▼ Show 20 Lines	public:
/// Try to remove the load by folding it to a register operand at the use.		/// Try to remove the load by folding it to a register operand at the use.
/// We fold the load instructions if and only if the		/// We fold the load instructions if and only if the
/// def and use are in the same BB. We only look at one load and see		/// def and use are in the same BB. We only look at one load and see
/// whether it can be folded into MI. FoldAsLoadDefReg is the virtual register		/// whether it can be folded into MI. FoldAsLoadDefReg is the virtual register
/// defined by the load we are trying to fold. DefMI returns the machine		/// defined by the load we are trying to fold. DefMI returns the machine
/// instruction that defines FoldAsLoadDefReg, and the function returns		/// instruction that defines FoldAsLoadDefReg, and the function returns
/// the machine instruction generated due to folding.		/// the machine instruction generated due to folding.
virtual MachineInstr *optimizeLoadInstr(MachineInstr &MI,		virtual MachineInstr *optimizeLoadInstr(MachineInstr &MI,
const MachineRegisterInfo *MRI,		MachineRegisterInfo *MRI,
Register &FoldAsLoadDefReg,		Register &FoldAsLoadDefReg,
MachineInstr *&DefMI) const {		MachineInstr *&DefMI) const {
return nullptr;		return nullptr;
}		}

/// 'Reg' is known to be defined by a move immediate instruction,		/// 'Reg' is known to be defined by a move immediate instruction,
/// try to fold the immediate into the use instruction.		/// try to fold the immediate into the use instruction.
/// If MRI->hasOneNonDBGUse(Reg) is true, and this function returns true,		/// If MRI->hasOneNonDBGUse(Reg) is true, and this function returns true,

llvm/lib/Target/SystemZ/SystemZISelDAGToDAG.cpp

Show First 20 Lines • Show All 1,666 Lines • ▼ Show 20 Lines	if (ElemBitSize == 32) {
if (tryScatter(Store, SystemZ::VSCEF))		if (tryScatter(Store, SystemZ::VSCEF))
return;		return;
} else if (ElemBitSize == 64) {		} else if (ElemBitSize == 64) {
if (tryScatter(Store, SystemZ::VSCEG))		if (tryScatter(Store, SystemZ::VSCEG))
return;		return;
}		}
break;		break;
}		}

		case ISD::FADD: {
		// Wait with reg/mem folding if reassociation is allowed. Use a pseudo
		// that clobbers CC during isel to help later load folding into ADB.
		if (Node->getValueType(0) == MVT::f64 &&
		Node->getFlags().hasAllowReassociation() &&
		Node->getFlags().hasNoSignedZeros()) {
		CurDAG->SelectNodeTo(Node, SystemZ::WFADB_CCPseudo, MVT::f64,
		Node->getOperand(0), Node->getOperand(1));
		return;
		}
		}
}		}

SelectCode(Node);		SelectCode(Node);
}		}

bool SystemZDAGToDAGISel::		bool SystemZDAGToDAGISel::
SelectInlineAsmMemoryOperand(const SDValue &Op,		SelectInlineAsmMemoryOperand(const SDValue &Op,
unsigned ConstraintID,		unsigned ConstraintID,

llvm/lib/Target/SystemZ/SystemZISelLowering.cpp


Show First 20 Lines • Show All 9,034 Lines • ▼ Show 20 Lines	MachineBasicBlock *SystemZTargetLowering::EmitInstrWithCustomInserter(

case SystemZ::PROBED_ALLOCA:		case SystemZ::PROBED_ALLOCA:
return emitProbedAlloca(MI, MBB);		return emitProbedAlloca(MI, MBB);

case TargetOpcode::STACKMAP:		case TargetOpcode::STACKMAP:
case TargetOpcode::PATCHPOINT:		case TargetOpcode::PATCHPOINT:
return emitPatchPoint(MI, MBB);		return emitPatchPoint(MI, MBB);

		case SystemZ::WFADB_CCPseudo:
		MI.setDesc(Subtarget.getInstrInfo()->get(SystemZ::WFADB));
		MI.removeOperand(3); // CC
		return MBB;

default:		default:
llvm_unreachable("Unexpected instr type to insert");		llvm_unreachable("Unexpected instr type to insert");
}		}
}		}

// This is only used by the isel schedulers, and is needed only to prevent		// This is only used by the isel schedulers, and is needed only to prevent
// compiler from crashing when list-ilp is used.		// compiler from crashing when list-ilp is used.
const TargetRegisterClass *		const TargetRegisterClass *

llvm/lib/Target/SystemZ/SystemZInstrFormats.td

Show First 20 Lines • Show All 5,382 Lines • ▼ Show 20 Lines	multiclass StringRRE<string mnemonic, bits<16> opcode,
let Uses = [R0L] in		let Uses = [R0L] in
def "" : SideEffectBinaryMemMemRRE<mnemonic, opcode, GR64, GR64>;		def "" : SideEffectBinaryMemMemRRE<mnemonic, opcode, GR64, GR64>;
let usesCustomInserter = 1, hasNoSchedulingInfo = 1 in		let usesCustomInserter = 1, hasNoSchedulingInfo = 1 in
def Loop : Pseudo<(outs GR64:$end),		def Loop : Pseudo<(outs GR64:$end),
(ins GR64:$start1, GR64:$start2, GR32:$char),		(ins GR64:$start1, GR64:$start2, GR32:$char),
[(set GR64:$end, (operator GR64:$start1, GR64:$start2,		[(set GR64:$end, (operator GR64:$start1, GR64:$start2,
GR32:$char))]>;		GR32:$char))]>;
}		}

		multiclass BinaryVRRcAndCCPseudo<string mnemonic, bits<16> opcode,
		SDPatternOperator operator, TypedReg tr1,
		TypedReg tr2, bits<4> type = 0, bits<4> m5 = 0,
		bits<4> m6 = 0, string fp_mnemonic = ""> {
		def "" : BinaryVRRc<mnemonic, opcode, operator, tr1, tr2, type, m5, m6,
		fp_mnemonic>;
		let Defs = [CC], usesCustomInserter = 1, hasNoSchedulingInfo = 1 in
		def _CCPseudo : Pseudo<(outs tr1.op:$V1), (ins tr2.op:$V2, tr2.op:$V3), []>;
		}

llvm/lib/Target/SystemZ/SystemZInstrInfo.h

Show First 20 Lines • Show All 237 Lines • ▼ Show 20 Lines	bool analyzeCompare(const MachineInstr &MI, Register &SrcReg,
int64_t &Value) const override;		int64_t &Value) const override;
bool canInsertSelect(const MachineBasicBlock &, ArrayRef<MachineOperand> Cond,		bool canInsertSelect(const MachineBasicBlock &, ArrayRef<MachineOperand> Cond,
Register, Register, Register, int &, int &,		Register, Register, Register, int &, int &,
int &) const override;		int &) const override;
void insertSelect(MachineBasicBlock &MBB, MachineBasicBlock::iterator MI,		void insertSelect(MachineBasicBlock &MBB, MachineBasicBlock::iterator MI,
const DebugLoc &DL, Register DstReg,		const DebugLoc &DL, Register DstReg,
ArrayRef<MachineOperand> Cond, Register TrueReg,		ArrayRef<MachineOperand> Cond, Register TrueReg,
Register FalseReg) const override;		Register FalseReg) const override;
		MachineInstr *optimizeLoadInstr(MachineInstr &MI,
		MachineRegisterInfo *MRI,
		Register &FoldAsLoadDefReg,
		MachineInstr *&DefMI) const override;
bool FoldImmediate(MachineInstr &UseMI, MachineInstr &DefMI, Register Reg,		bool FoldImmediate(MachineInstr &UseMI, MachineInstr &DefMI, Register Reg,
MachineRegisterInfo *MRI) const override;		MachineRegisterInfo *MRI) const override;
bool isPredicable(const MachineInstr &MI) const override;		bool isPredicable(const MachineInstr &MI) const override;
bool isProfitableToIfCvt(MachineBasicBlock &MBB, unsigned NumCycles,		bool isProfitableToIfCvt(MachineBasicBlock &MBB, unsigned NumCycles,
unsigned ExtraPredCycles,		unsigned ExtraPredCycles,
BranchProbability Probability) const override;		BranchProbability Probability) const override;
bool isProfitableToIfCvt(MachineBasicBlock &TMBB,		bool isProfitableToIfCvt(MachineBasicBlock &TMBB,
unsigned NumCyclesT, unsigned ExtraPredCyclesT,		unsigned NumCyclesT, unsigned ExtraPredCyclesT,
Show All 15 Lines	void storeRegToStackSlot(MachineBasicBlock &MBB,
Register VReg) const override;		Register VReg) const override;
void loadRegFromStackSlot(MachineBasicBlock &MBB,		void loadRegFromStackSlot(MachineBasicBlock &MBB,
MachineBasicBlock::iterator MBBI, Register DestReg,		MachineBasicBlock::iterator MBBI, Register DestReg,
int FrameIdx, const TargetRegisterClass *RC,		int FrameIdx, const TargetRegisterClass *RC,
const TargetRegisterInfo *TRI,		const TargetRegisterInfo *TRI,
Register VReg) const override;		Register VReg) const override;
MachineInstr *convertToThreeAddress(MachineInstr &MI, LiveVariables *LV,		MachineInstr *convertToThreeAddress(MachineInstr &MI, LiveVariables *LV,
LiveIntervals *LIS) const override;		LiveIntervals *LIS) const override;

		bool useMachineCombiner() const override { return true; }
		bool isAssociativeAndCommutative(const MachineInstr &Inst,
		bool Invert) const override;

MachineInstr *		MachineInstr *
foldMemoryOperandImpl(MachineFunction &MF, MachineInstr &MI,		foldMemoryOperandImpl(MachineFunction &MF, MachineInstr &MI,
ArrayRef<unsigned> Ops,		ArrayRef<unsigned> Ops,
MachineBasicBlock::iterator InsertPt, int FrameIndex,		MachineBasicBlock::iterator InsertPt, int FrameIndex,
LiveIntervals *LIS = nullptr,		LiveIntervals *LIS = nullptr,
VirtRegMap *VRM = nullptr) const override;		VirtRegMap *VRM = nullptr) const override;
MachineInstr *foldMemoryOperandImpl(		MachineInstr *foldMemoryOperandImpl(
MachineFunction &MF, MachineInstr &MI, ArrayRef<unsigned> Ops,		MachineFunction &MF, MachineInstr &MI, ArrayRef<unsigned> Ops,

llvm/lib/Target/SystemZ/SystemZInstrInfo.cpp

Show First 20 Lines • Show All 604 Lines • ▼ Show 20 Lines	void SystemZInstrInfo::insertSelect(MachineBasicBlock &MBB,
} else		} else
llvm_unreachable("Invalid register class");		llvm_unreachable("Invalid register class");

BuildMI(MBB, I, DL, get(Opc), DstReg)		BuildMI(MBB, I, DL, get(Opc), DstReg)
.addReg(FalseReg).addReg(TrueReg)		.addReg(FalseReg).addReg(TrueReg)
.addImm(CCValid).addImm(CCMask);		.addImm(CCValid).addImm(CCMask);
}		}

		static void transferDeadCC(MachineInstr *OldMI, MachineInstr *NewMI) {
		if (OldMI->registerDefIsDead(SystemZ::CC)) {
		MachineOperand *CCDef = NewMI->findRegisterDefOperand(SystemZ::CC);
		if (CCDef != nullptr)
		CCDef->setIsDead(true);
		}
		}

		static void transferMIFlag(MachineInstr *OldMI, MachineInstr *NewMI,
		MachineInstr::MIFlag Flag) {
		if (OldMI->getFlag(Flag))
		NewMI->setFlag(Flag);
		}

		MachineInstr *SystemZInstrInfo::optimizeLoadInstr(MachineInstr &MI,
		MachineRegisterInfo *MRI,
		Register &FoldAsLoadDefReg,
		MachineInstr *&DefMI) const {
		// Check whether we can move DefMI here.
		DefMI = MRI->getVRegDef(FoldAsLoadDefReg);
		assert(DefMI);
		bool SawStore = false;
		if (!DefMI->isSafeToMove(nullptr, SawStore))
		return nullptr;

		// For reassociatable FP additions, any loads have been purposefully been
		// left unfolded so that MachineCombiner can do its work on reg/reg
		// opcodes. After that has been done as many loads as possible are now
		// folded into reg/mem instructions.
		if (MI.getOpcode() == SystemZ::WFADB && DefMI->getOpcode() == SystemZ::VL64 &&
		MRI->hasOneNonDBGUse(FoldAsLoadDefReg)) {
		MachineBasicBlock *MBB = MI.getParent();
		Register DstReg = MI.getOperand(0).getReg();
		MachineOperand LHS = MI.getOperand(1);
		MachineOperand RHS = MI.getOperand(2);
		MachineOperand &SrcMO = LHS.getReg() == FoldAsLoadDefReg ? RHS : LHS;
		// Only use the 2-address ADB if there is no other use of SrcMO in MBB.
		for (auto &UseMI : MRI->use_nodbg_instructions(SrcMO.getReg()))
		if (UseMI.getParent() == MBB && &UseMI != &MI)
		return nullptr;

		// Make sure CC is not live at this point as ADB clobbers it.
		MachineBasicBlock::iterator I = std::next(MI.getIterator());
		for (; I != MBB->end(); ++I) {
		if (I->readsRegister(SystemZ::CC))
		return nullptr;
		if (I->modifiesRegister(SystemZ::CC))
		break;
		}
		if (I == MBB->end()) {
		LivePhysRegs LiveRegs(RI);
		LiveRegs.addLiveOuts(*MBB);
		if (LiveRegs.contains(SystemZ::CC))
		return nullptr;
		}

		MachineInstrBuilder MIB =
		BuildMI(*MI.getParent(), MI, MI.getDebugLoc(), get(SystemZ::ADB), DstReg)
		.add(SrcMO)
		.add(DefMI->getOperand(1))
		.add(DefMI->getOperand(2))
		.add(DefMI->getOperand(3))
		.addMemOperand(*DefMI->memoperands_begin());
		transferMIFlag(&MI, MIB, MachineInstr::NoFPExcept);
		MRI->setRegClass(SrcMO.getReg(), &SystemZ::FP64BitRegClass);
		MRI->setRegClass(DstReg, &SystemZ::FP64BitRegClass);
		MIB->getOperand(5).setIsDead(); // CC implicit def
		return MIB;
		}

		return nullptr;
		}

bool SystemZInstrInfo::FoldImmediate(MachineInstr &UseMI, MachineInstr &DefMI,		bool SystemZInstrInfo::FoldImmediate(MachineInstr &UseMI, MachineInstr &DefMI,
Register Reg,		Register Reg,
MachineRegisterInfo *MRI) const {		MachineRegisterInfo *MRI) const {
unsigned DefOpc = DefMI.getOpcode();		unsigned DefOpc = DefMI.getOpcode();
if (DefOpc != SystemZ::LHIMux && DefOpc != SystemZ::LHI &&		if (DefOpc != SystemZ::LHIMux && DefOpc != SystemZ::LHI &&
DefOpc != SystemZ::LGHI)		DefOpc != SystemZ::LGHI)
return false;		return false;
if (DefMI.getOperand(0).getReg() != Reg)		if (DefMI.getOperand(0).getReg() != Reg)
▲ Show 20 Lines • Show All 311 Lines • ▼ Show 20 Lines	static LogicOp interpretAndImmediate(unsigned Opcode) {
case SystemZ::NIHH64: return LogicOp(64, 48, 16);		case SystemZ::NIHH64: return LogicOp(64, 48, 16);
case SystemZ::NIFMux: return LogicOp(32, 0, 32);		case SystemZ::NIFMux: return LogicOp(32, 0, 32);
case SystemZ::NILF64: return LogicOp(64, 0, 32);		case SystemZ::NILF64: return LogicOp(64, 0, 32);
case SystemZ::NIHF64: return LogicOp(64, 32, 32);		case SystemZ::NIHF64: return LogicOp(64, 32, 32);
default: return LogicOp();		default: return LogicOp();
}		}
}		}

static void transferDeadCC(MachineInstr *OldMI, MachineInstr *NewMI) {
if (OldMI->registerDefIsDead(SystemZ::CC)) {
MachineOperand *CCDef = NewMI->findRegisterDefOperand(SystemZ::CC);
if (CCDef != nullptr)
CCDef->setIsDead(true);
}
}

static void transferMIFlag(MachineInstr *OldMI, MachineInstr *NewMI,
MachineInstr::MIFlag Flag) {
if (OldMI->getFlag(Flag))
NewMI->setFlag(Flag);
}

MachineInstr *		MachineInstr *
SystemZInstrInfo::convertToThreeAddress(MachineInstr &MI, LiveVariables *LV,		SystemZInstrInfo::convertToThreeAddress(MachineInstr &MI, LiveVariables *LV,
LiveIntervals *LIS) const {		LiveIntervals *LIS) const {
MachineBasicBlock *MBB = MI.getParent();		MachineBasicBlock *MBB = MI.getParent();

// Try to convert an AND into an RISBG-type instruction.		// Try to convert an AND into an RISBG-type instruction.
// TODO: It might be beneficial to select RISBG and shorten to AND instead.		// TODO: It might be beneficial to select RISBG and shorten to AND instead.
if (LogicOp And = interpretAndImmediate(MI.getOpcode())) {		if (LogicOp And = interpretAndImmediate(MI.getOpcode())) {
Show All 36 Lines	if (isRxSBGMask(Imm, And.RegSize, Start, End)) {
LIS->ReplaceMachineInstrInMaps(MI, *MIB);		LIS->ReplaceMachineInstrInMaps(MI, *MIB);
transferDeadCC(&MI, MIB);		transferDeadCC(&MI, MIB);
return MIB;		return MIB;
}		}
}		}
return nullptr;		return nullptr;
}		}

		bool SystemZInstrInfo::isAssociativeAndCommutative(const MachineInstr &Inst,
		bool Invert) const {
		if (Invert)
		return false; // TODO..?

		switch (Inst.getOpcode()) {
		default: break;
		// TODO: Other opcodes.
		case SystemZ::WFADB:
		return Inst.getFlag(MachineInstr::MIFlag::FmReassoc) &&
		Inst.getFlag(MachineInstr::MIFlag::FmNsz);
		}

		return false;
		}

MachineInstr *SystemZInstrInfo::foldMemoryOperandImpl(		MachineInstr *SystemZInstrInfo::foldMemoryOperandImpl(
MachineFunction &MF, MachineInstr &MI, ArrayRef<unsigned> Ops,		MachineFunction &MF, MachineInstr &MI, ArrayRef<unsigned> Ops,
MachineBasicBlock::iterator InsertPt, int FrameIndex,		MachineBasicBlock::iterator InsertPt, int FrameIndex,
LiveIntervals *LIS, VirtRegMap *VRM) const {		LiveIntervals *LIS, VirtRegMap *VRM) const {
const TargetRegisterInfo *TRI = MF.getSubtarget().getRegisterInfo();		const TargetRegisterInfo *TRI = MF.getSubtarget().getRegisterInfo();
MachineRegisterInfo &MRI = MF.getRegInfo();		MachineRegisterInfo &MRI = MF.getRegInfo();
const MachineFrameInfo &MFI = MF.getFrameInfo();		const MachineFrameInfo &MFI = MF.getFrameInfo();
unsigned Size = MFI.getObjectSize(FrameIndex);		unsigned Size = MFI.getObjectSize(FrameIndex);

llvm/lib/Target/SystemZ/SystemZInstrVector.td

Show First 20 Lines • Show All 133 Lines • ▼ Show 20 Lines	def : Pat<(v4f32 (z_replicate_loadf32 bdxaddr12only:$addr)),
(VLREPF bdxaddr12only:$addr)>;		(VLREPF bdxaddr12only:$addr)>;
def : Pat<(v2f64 (z_replicate_loadf64 bdxaddr12only:$addr)),		def : Pat<(v2f64 (z_replicate_loadf64 bdxaddr12only:$addr)),
(VLREPG bdxaddr12only:$addr)>;		(VLREPG bdxaddr12only:$addr)>;

// Use VLREP to load subvectors. These patterns use "12pair" because		// Use VLREP to load subvectors. These patterns use "12pair" because
// LEY and LDY offer full 20-bit displacement fields. It's often better		// LEY and LDY offer full 20-bit displacement fields. It's often better
// to use those instructions rather than force a 20-bit displacement		// to use those instructions rather than force a 20-bit displacement
// into a GPR temporary.		// into a GPR temporary.
let mayLoad = 1 in {		let mayLoad = 1, canFoldAsLoad = 1 in {
def VL32 : UnaryAliasVRX<load, v32sb, bdxaddr12pair>;		def VL32 : UnaryAliasVRX<load, v32sb, bdxaddr12pair>;
def VL64 : UnaryAliasVRX<load, v64db, bdxaddr12pair>;		def VL64 : UnaryAliasVRX<load, v64db, bdxaddr12pair>;
}		}

// Load logical element and zero.		// Load logical element and zero.
def VLLEZ : UnaryVRXGeneric<"vllez", 0xE704>;		def VLLEZ : UnaryVRXGeneric<"vllez", 0xE704>;
def VLLEZB : UnaryVRX<"vllezb", 0xE704, z_vllezi8, v128b, 1, 0>;		def VLLEZB : UnaryVRX<"vllezb", 0xE704, z_vllezi8, v128b, 1, 0>;
def VLLEZH : UnaryVRX<"vllezh", 0xE704, z_vllezi16, v128h, 2, 1>;		def VLLEZH : UnaryVRX<"vllezh", 0xE704, z_vllezi16, v128h, 2, 1>;
▲ Show 20 Lines • Show All 891 Lines • ▼ Show 20 Lines	multiclass VectorRounding<Instruction insn, TypedReg tr> {
def : FPConversion<insn, any_fround, tr, tr, 4, 1>;		def : FPConversion<insn, any_fround, tr, tr, 4, 1>;
}		}

let Predicates = [FeatureVector] in {		let Predicates = [FeatureVector] in {
// Add.		// Add.
let Uses = [FPC], mayRaiseFPException = 1, isCommutable = 1 in {		let Uses = [FPC], mayRaiseFPException = 1, isCommutable = 1 in {
def VFA : BinaryVRRcFloatGeneric<"vfa", 0xE7E3>;		def VFA : BinaryVRRcFloatGeneric<"vfa", 0xE7E3>;
def VFADB : BinaryVRRc<"vfadb", 0xE7E3, any_fadd, v128db, v128db, 3, 0>;		def VFADB : BinaryVRRc<"vfadb", 0xE7E3, any_fadd, v128db, v128db, 3, 0>;
def WFADB : BinaryVRRc<"wfadb", 0xE7E3, any_fadd, v64db, v64db, 3, 8, 0,		defm WFADB : BinaryVRRcAndCCPseudo<"wfadb", 0xE7E3, any_fadd, v64db, v64db,
"adbr">;		3, 8, 0, "adbr">;
let Predicates = [FeatureVectorEnhancements1] in {		let Predicates = [FeatureVectorEnhancements1] in {
def VFASB : BinaryVRRc<"vfasb", 0xE7E3, any_fadd, v128sb, v128sb, 2, 0>;		def VFASB : BinaryVRRc<"vfasb", 0xE7E3, any_fadd, v128sb, v128sb, 2, 0>;
def WFASB : BinaryVRRc<"wfasb", 0xE7E3, any_fadd, v32sb, v32sb, 2, 8, 0,		def WFASB : BinaryVRRc<"wfasb", 0xE7E3, any_fadd, v32sb, v32sb, 2, 8, 0,
"aebr">;		"aebr">;
def WFAXB : BinaryVRRc<"wfaxb", 0xE7E3, any_fadd, v128xb, v128xb, 4, 8>;		def WFAXB : BinaryVRRc<"wfaxb", 0xE7E3, any_fadd, v128xb, v128xb, 4, 8>;
}		}
}		}


llvm/lib/Target/SystemZ/SystemZTargetMachine.cpp

[... 24 lines not shown ...]
 #include "llvm/Support/CodeGen.h"
 #include "llvm/Target/TargetLoweringObjectFile.h"
 #include "llvm/Transforms/Scalar.h"
 #include <optional>
 #include <string>

 using namespace llvm;

+static cl::opt<bool>
+EnableMachineCombinerPass("systemz-machine-combiner",
+                          cl::desc("Enable the machine combiner pass"),
+                          cl::init(true), cl::Hidden);
+
 extern "C" LLVM_EXTERNAL_VISIBILITY void LLVMInitializeSystemZTarget() {
   // Register the target.
   RegisterTargetMachine<SystemZTargetMachine> X(getTheSystemZTarget());
   auto &PR = *PassRegistry::getPassRegistry();
   initializeSystemZElimComparePass(PR);
   initializeSystemZShortenInstPass(PR);
   initializeSystemZLongBranchPass(PR);
   initializeSystemZLDCleanupPass(PR);
[... 194 lines not shown ...]
   if (getOptLevel() != CodeGenOpt::None)
     addPass(createSystemZLDCleanupPass(getSystemZTargetMachine()));

   return false;
 }

 bool SystemZPassConfig::addILPOpts() {
   addPass(&EarlyIfConverterID);
+
+  if (EnableMachineCombinerPass)
+    addPass(&MachineCombinerID);
+
   return true;
 }

 void SystemZPassConfig::addPreRegAlloc() {
   addPass(createSystemZCopyPhysRegsPass(getSystemZTargetMachine()));
 }

 void SystemZPassConfig::addPostRewrite() {
[... 71 lines not shown ...]

llvm/lib/Target/X86/X86InstrInfo.h

[... 535 lines not shown; context: public: ...]
   /// optimizeLoadInstr - Try to remove the load by folding it to a register
   /// operand at the use. We fold the load instructions if and only if the
   /// def and use are in the same BB. We only look at one load and see
   /// whether it can be folded into MI. FoldAsLoadDefReg is the virtual register
   /// defined by the load we are trying to fold. DefMI returns the machine
   /// instruction that defines FoldAsLoadDefReg, and the function returns
   /// the machine instruction generated due to folding.
   MachineInstr *optimizeLoadInstr(MachineInstr &MI,
-                                  const MachineRegisterInfo *MRI,
+                                  MachineRegisterInfo *MRI,
                                   Register &FoldAsLoadDefReg,
                                   MachineInstr *&DefMI) const override;

   std::pair<unsigned, unsigned>
   decomposeMachineOperandsTargetFlags(unsigned TF) const override;

   ArrayRef<std::pair<unsigned, const char *>>
   getSerializableDirectMachineOperandTargetFlags() const override;
[... 116 lines not shown ...]

llvm/lib/Target/X86/X86InstrInfo.cpp


[... 4,672 lines not shown; context: bool X86InstrInfo::optimizeCompareInstr(MachineInstr &CmpInstr, Register SrcReg, ...]
   return true;
 }

 /// Try to remove the load by folding it to a register
 /// operand at the use. We fold the load instructions if load defines a virtual
 /// register, the virtual register is used once in the same BB, and the
 /// instructions in-between do not load or store, and have no side effects.
 MachineInstr *X86InstrInfo::optimizeLoadInstr(MachineInstr &MI,
-                                              const MachineRegisterInfo *MRI,
+                                              MachineRegisterInfo *MRI,
                                               Register &FoldAsLoadDefReg,
                                               MachineInstr *&DefMI) const {
   // Check whether we can move DefMI here.
   DefMI = MRI->getVRegDef(FoldAsLoadDefReg);
   assert(DefMI);
   bool SawStore = false;
   if (!DefMI->isSafeToMove(nullptr, SawStore))
     return nullptr;
[... 5,064 lines not shown ...]
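
The folding condition stated in the comment on `optimizeLoadInstr` (the load defines a virtual register, that register is used once in the same block, and nothing in between loads, stores, or has side effects) can be sketched as a toy check. This is only an illustration of the comment's wording, not the X86 or SystemZ implementation: `ToyInstr` and `canFoldLoad` are hypothetical names, and the model looks at uses inside a single block only.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical, simplified instruction record. This is not an LLVM type.
struct ToyInstr {
  int DefReg = -1;           // virtual register defined, or -1 if none
  std::vector<int> UseRegs;  // virtual registers read by this instruction
  bool MayLoad = false;
  bool MayStore = false;
  bool HasSideEffects = false;
};

// Returns true if the load at LoadIdx could be folded into the instruction
// at UseIdx under the conditions listed in the comment.
bool canFoldLoad(const std::vector<ToyInstr> &BB, std::size_t LoadIdx,
                 std::size_t UseIdx) {
  assert(LoadIdx < UseIdx && UseIdx < BB.size() &&
         "load must precede its use inside the block");
  const ToyInstr &Load = BB[LoadIdx];
  if (!Load.MayLoad || Load.DefReg < 0)
    return false;

  // The loaded register must be read exactly once, and only by the
  // instruction we want to fold into.
  int NumUses = 0;
  for (std::size_t I = 0; I < BB.size(); ++I)
    for (int Reg : BB[I].UseRegs)
      if (Reg == Load.DefReg) {
        if (I != UseIdx)
          return false;
        ++NumUses;
      }
  if (NumUses != 1)
    return false;

  // Nothing between the load and its use may touch memory or have
  // side effects, otherwise moving the load down is not safe.
  for (std::size_t I = LoadIdx + 1; I < UseIdx; ++I)
    if (BB[I].MayLoad || BB[I].MayStore || BB[I].HasSideEffects)
      return false;
  return true;
}
```

In this model a store placed between the load and the add blocks the fold, which is the conservative behavior the comment describes.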

llvm/test/CodeGen/SystemZ/fp-add-reassoc-01.ll

This file was added.

+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 2
+; RUN: llc < %s -mtriple=s390x-linux-gnu -mcpu=z15 -verify-machineinstrs | FileCheck %s
+
+define double @fun(ptr %x) {
+; CHECK-LABEL: fun:
+; CHECK: # %bb.0: # %entry
+; CHECK-NEXT: ld %f0, 0(%r2)
+; CHECK-NEXT: adb %f0, 8(%r2)
+; CHECK-NEXT: ld %f1, 24(%r2)
+; CHECK-NEXT: adb %f1, 16(%r2)
+; CHECK-NEXT: adbr %f0, %f1
+; CHECK-NEXT: ld %f1, 40(%r2)
+; CHECK-NEXT: adb %f1, 32(%r2)
+; CHECK-NEXT: adb %f1, 48(%r2)
+; CHECK-NEXT: adbr %f0, %f1
+; CHECK-NEXT: adb %f0, 56(%r2)
+; CHECK-NEXT: br %r14
+entry:
+  %0 = load double, ptr %x, align 8
+  %arrayidx1 = getelementptr inbounds double, ptr %x, i64 1
+  %1 = load double, ptr %arrayidx1, align 8
+  %add = fadd reassoc nsz arcp contract afn double %1, %0
+  %arrayidx2 = getelementptr inbounds double, ptr %x, i64 2
+  %2 = load double, ptr %arrayidx2, align 8
+  %add3 = fadd reassoc nsz arcp contract afn double %add, %2
+  %arrayidx4 = getelementptr inbounds double, ptr %x, i64 3
+  %3 = load double, ptr %arrayidx4, align 8
+  %add5 = fadd reassoc nsz arcp contract afn double %add3, %3
+  %arrayidx6 = getelementptr inbounds double, ptr %x, i64 4
+  %4 = load double, ptr %arrayidx6, align 8
+  %add7 = fadd reassoc nsz arcp contract afn double %add5, %4
+  %arrayidx8 = getelementptr inbounds double, ptr %x, i64 5
+  %5 = load double, ptr %arrayidx8, align 8
+  %add9 = fadd reassoc nsz arcp contract afn double %add7, %5
+  %arrayidx10 = getelementptr inbounds double, ptr %x, i64 6
+  %6 = load double, ptr %arrayidx10, align 8
+  %add11 = fadd reassoc nsz arcp contract afn double %add9, %6
+  %arrayidx12 = getelementptr inbounds double, ptr %x, i64 7
+  %7 = load double, ptr %arrayidx12, align 8
+  %add13 = fadd reassoc nsz arcp contract afn double %add11, %7
+  ret double %add13
+}
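
The test above sums eight doubles in source order, with the `reassoc` flag permitting the adds to be regrouped. As a rough sketch of why regrouping helps, the C++ below compares a serial chain with a balanced tree over the same values. The helper names (`sumChain`, `sumTree`, `chainDepth`, `treeDepth`) are made up for illustration, and the balanced tree is an idealized shape: what the MachineCombiner actually emits depends on its latency model and need not match it.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sum in source order: (((x0 + x1) + x2) + x3) + ...
// Every add waits for the previous one, so N values form a chain of N-1 adds.
double sumChain(const std::vector<double> &X) {
  assert(!X.empty() && "need at least one value");
  double Acc = X[0];
  for (std::size_t I = 1; I < X.size(); ++I)
    Acc += X[I];
  return Acc;
}

// Sum pairwise: (x0 + x1) + (x2 + x3), and so on, level by level.
// Adds on the same level do not depend on each other.
double sumTree(std::vector<double> X) {
  assert(!X.empty() && "need at least one value");
  while (X.size() > 1) {
    std::vector<double> Next;
    for (std::size_t I = 0; I + 1 < X.size(); I += 2)
      Next.push_back(X[I] + X[I + 1]);
    if (X.size() % 2 != 0)
      Next.push_back(X.back());  // odd element is carried to the next level
    X = Next;
  }
  return X[0];
}

// Number of dependent adds on the longest path for N inputs.
std::size_t chainDepth(std::size_t N) { return N == 0 ? 0 : N - 1; }

std::size_t treeDepth(std::size_t N) {
  std::size_t Depth = 0;
  while (N > 1) {
    N = (N + 1) / 2;
    ++Depth;
  }
  return Depth;
}
```

For eight inputs the chain has seven dependent adds and the tree has three. The two sums can differ in the last bits for general inputs because floating-point addition is not associative, which is why the IR needs the `reassoc` fast-math flag before this transformation is allowed.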

This is an archive of the discontinued LLVM Phabricator instance.

[SystemZ] Enable MachineCombiner for FP reassociation.
Needs Review · Public

Unit Tests: Failed

Revision Contents

Diff 511746

llvm/include/llvm/CodeGen/TargetInstrInfo.h

llvm/lib/Target/SystemZ/SystemZISelDAGToDAG.cpp

llvm/lib/Target/SystemZ/SystemZISelLowering.cpp

llvm/lib/Target/SystemZ/SystemZInstrFormats.td

llvm/lib/Target/SystemZ/SystemZInstrInfo.h

llvm/lib/Target/SystemZ/SystemZInstrInfo.cpp

llvm/lib/Target/SystemZ/SystemZInstrVector.td

llvm/lib/Target/SystemZ/SystemZTargetMachine.cpp

llvm/lib/Target/X86/X86InstrInfo.h

llvm/lib/Target/X86/X86InstrInfo.cpp

llvm/test/CodeGen/SystemZ/fp-add-reassoc-01.ll
