This is an archive of the discontinued LLVM Phabricator instance.

[X86] Remove DenseMap for storing FMA3 grouping information
AbandonedPublic

Authored by craig.topper on Aug 25 2016, 11:41 PM.

Details

Summary

This patch removes the DenseMap for keeping track of FMA3 grouping information avoiding the startup cost of populating the map and the associated memory usage.

Three new bits are added to the instruction TSFlags for keeping track of the form(132, 213, 231) and whether it is a scalar instrinsic instruction.

The 3 different forms of each instruction are combined into groups similar to the current code. But each of the groups are stored into static tables. Each table is sorted by the opcode of each form. Since opcodes encodings are assigned alphabetically and each form is named the same except for the 132, 213, or 231, when one form is sorted the other two forms are sorted. With the tables sorted, we can find the group for a given opcode by getting the form from the TSFlags and doing a binary search through the appropriate column of the table.

There are 6 tables, split by the evex.b bit, memory/register, and masked/unmasked. The two evex.b tables contain masked and unmasked together. The masked/unmasked split for non evex.b makes it easy to populate the load folding tables. The instructions that use evex.b cannot be folded. For the tables without evex.b the register tables are the same size as their equivalent memory table and the opcodes are in the same order. Converting from register form to memory form is as simple as finding the row in one table and looking up the same row in the opposite table. Determining which table an opcode is in can be determined from other TSFlags bits.

Currently the getFMA3Group function is a private function in X86InstrInfo.cpp, but could be made a static function in the X86InstrInfo class if it becomes needed outside this file.

The load folding table creation as well as the commuting code has been updated to use the new interface. The commuting code makes use of new and existing TSFlags to determine additional information about the opcodes beyond which group they are in.

Diff Detail

Event Timeline

craig.topper retitled this revision from to [X86] Remove DenseMap for storing FMA3 grouping information.
craig.topper updated this object.
craig.topper added a subscriber: llvm-commits.
v_klochkov edited edge metadata.EditedAug 29 2016, 4:23 PM

`Hi Craig,

Unfortunately, you were not aware of my bigger plans for the classes X86InstrFMA3Info/X86InstrFMA3Group (you planned to remove them in your change-set).

I like some parts of your patch,
for example, the usage of TSFlags instead of KMergeMasked, KZeroMasked bits in X86InstrFMA3Group::Attrubutes,

There are though some serious concerns from my side as your patch would ruin my plans to re-use the classes X86InstrFMA3Info/X86InstrFMA3Group (added previously in https://reviews.llvm.org/rL278431)
in some new very advanced FMA optimization I am developing right now.
Please, see my concerns in (2) and especially (3) below.

Also, adding 3 bits to TSFlags specifically for FMA3 opcodes probably would require X86 component owner's attention/review.

At the end of this message I added some ideas how to make the current classes X86InstrFMA3{Info/Group} a bit more efficient and better usable in future.

  1. Minor. It is good that the DenseMap is replaced with static arrays and there is no initialization overhead now.

    The opcodes in the arrays are sorted right after initialization, and you rely on the order in which the opcodes are created for *.td files. That seems a bit hacky/risky to me, but is Ok as you added is_sorted() checks.

    Having explicit call to std::sort()-like functions to eliminate such assumptions would add the overhead and would kill the main point in having these changes.
  1. Major. FMA3 is just a small sub-set of X86 instructions. I think FMA3 instructions do not deserve 3 bits in TSFlags. You used bits #55,56,57. Only 6 bits left for some probably more important features, after that the bit field should be extended and that would affect all opcodes.

    a) 2 bits: {NotFMA3, FMA3_132, FMA3_213, FMA3_231}, I don't have very strong opinion about these 2 bits for FMA3 but think this is too many bits for just FMA3 opcodes.

    b) The 1 bit 'FMA3Intrinsic' is definitely too big gift for FMA3 opcodes. I would rather vote for 'IsIntrinsic' bit, which would be set for all *_Int opcodes such as 'ADDSDrr_Int', etc.
  1. MAJOR. I am implementing a new advanced FMA optimization and planned to add more bits to X86InstrFMA3Group::Attributes (being removed in change-set).

    a) I planned to add information about signs (i.e. FMA, FMS, FNMA, FNMS, FMADDSUB, FMSUBADD) Adding new fields to X86InstrFMA3Group::Attributes is quite natural and Ok, but adding 3 more bits to TSFlags is not an option. b) The optimization also needs to operate with MVT and I planned to add an MVT field to X86InstrFMA3Group::Attributes as well. Currently, I don't see how to extract information from Opcode about the number and size of processed elements. (i.e. f32, f64, v4f32, v2f64, ...).

    So, with (a) and (b) that optimization would do requests to X86InstrFMA3Info to get FMA opcodes: unsigned getFMA213Opcode(bool IsEVEX, bool MulSign, bool AddSign, MVT VT, bool KMasked=false, bool KZMasked=false); or alternatively: unsigned getFMA213Opcode(bool IsEVEX, MVT VT, bool KMasked=false, bool KZMasked=false); unsigned getFMS213Opcode(bool IsEVEX, MVT VT, bool KMasked=false, bool KZMasked=false); unsigned getFNMA213Opcode(bool IsEVEX, MVT VT, bool KMasked=false, bool KZMasked=false); unsigned getFNMS213Opcode(bool IsEVEX, MVT VT, bool KMasked=false, bool KZMasked=false);

    for example, bool UseEVEX = HasEVEX && HasVL; unsigned Opc213 = getFMA213Opcode(UseEVEX, true, false, MVT::v4f64); // Opc213 is initialized with (UseEVEX ? VFNMADD213PDZ256r : VFNMADD213PDYr).

Some ideas:
I was thinking about using a different approach to reduce the number
of small static arrays/FMA3Groups (it was one of your major concerns in https://reviews.llvm.org/rL278431)
FMA3Groups could be consisting of the bigger number of fields, i.e.:

// VEX (AVX2) opcodes.
uint16_t VEX_Reg132, VEX_Reg213, VEX_Reg231;
uint16_t VEX_Mem132, VEX_Mem213, VEX_Mem231;

// EVEX opcodes.
uint16_t EVEX_Reg132, EVEX_Reg213, EVEX_Reg231;
uint16_t EVEX_Mem132, EVEX_Mem213, EVEX_Mem231;

// k-masked opcodes.
uint16_t EVEX_KReg132, EVEX_KReg213, EVEX_KReg231;
uint16_t EVEX_KMem132, EVEX_KMem213, EVEX_KMem231;

// k-zero-masked opcodes.
uint16_t EVEX_KZReg132, EVEX_KZReg213, EVEX_KZReg231;
uint16_t EVEX_KZMem132, EVEX_KZMem213, EVEX_KZMem231;

// EVEX.B: Opcodes with explicit round control and with broadcast.
uint16_t EVEX_ERound132, EVEX_ERound213, EVEX_ERound231;
uint16_t EVEX_Broadcast132, EVEX_Broadcast213, EVEX_Broadcast231;

unsigned Attributes; // MVT + {FMA,FMS,FNMA,FNMS,FMADDSUB,FMSUBADD} + IsIntrisic + ...

Also, FMA3Groups could even include *_Int opcodes as well.
(i.e. uint16_t VEX_Reg132, VEX_Reg213,..., VEX_Reg132_Int, VEX_Reg213_Int,...;)

Making groups bigger would improve the search of the FMA opcodes mentioned in (3) above
(i.e. getFMA213Opcode() method), because there would be quite small number of FMA groups then, and even linear search would probably be Ok.

Thank you,
Vyacheslav Klochkov
`

craig.topper planned changes to this revision.Aug 29 2016, 8:28 PM

Hi Vyacheslav,

I suspected this would interfere with your plans. We had spoke a little about your optimization plans, but I wasn't sure if your optimization would be before or after isel.

I agree the FMA3 intrinsic bit was bad. I did remove 4 TSFlags bit before I made this patch so I did pay for myself, but you're right.

I believe you can get single precision or double precision from the VEX_W bit in TSFlags. But I don't think you can get packed vs scalar. Or add vs sub vs addsub vs subadd, etc without just hardcoding all the opcodes. There does appear to be at least some pattern to it. For instance all 132 opcodes are 0x96-0x9f, and all 213 opcodes are 0xa6-0xaf, and all 231 opcodes are 0xb6-0xbf. So you can infer the form from the first nibble. Not sure if we should rely on that without someway to check it.

I think I like some of your suggestion. I'll go see what I can come up with based on that and with your other requirements in mind.

`Craig,

Your comment regarding using the base opcode byte was a good surprise for me.
I did not realize that it is possible to use it and that it is always available in TSFlags.

I put FMA3 opcodes into a table.
The 9,a,b - columns are the senior 4 bits of the opcode byte.
The 6,7,8,9,a,b,c,d,e,f - rows are the lower 4 bits of the opcode byte.
For example, fmadd132ps opcode has the base opcode byte = 0x98.

			9					a					b
		6	fmaddsub132ps/pd	fmaddsub213ps/pd	fmaddsub231ps/pd
		7	fmsubadd132ps/pd	fmsubadd213ps/pd	fmsubadd231ps/pd
		8	fmadd132ps/pd		fmadd213ps/pd		fmadd231ps/pd
		9	fmadd132ss/sd		fmadd213ss/sd		fmadd231ss/sd
		a	fmsub132ps/pd		fmsub213ps/pd		fmsub231ps/pd
		b	fmsub132ss/sd		fmsub213ss/sd		fmsub231ss/sd
		c	fnmadd132ps/pd		fnmadd213ps/pd		fnmadd231ps/pd
		d	fnmadd132ss/sd		fnmadd213ss/sd		fnmadd231ss/sd
		e	fnmsub132ps/pd		fnmsub213ps/pd		fnmsub231ps/pd
		f	fnmsub132ss/sd		fnmsub213ss/sd		fnmsub231ss/sd

It would be possible to have functions something similar to these below:
(I did not check there that the 'Opcode' is FMA3.

		// If this implementation is not safe, then we could have the same binary search through static arrays with FMA opcodes.
		bool isFMA3(uint8_t Opcode) { 
		<Do some additional checks here to avoid possible(?) overlapping
		with other/future opcodes having the same opcode byte, but different prefixes/attributes.
		uint8_t LowB = Opcode && 0xf;
		uint8_t HighB = Opcode >> 8;
		return HighB >= 0x9 && HighB <= 0xb &&
				LowB >= 0x6 /* && LowB <= 0xf*/;
		}

		bool isFMA3Form132(uint8_t Opcode) { return (Opcode >> 8) == 0x9; }
		bool isFMA3Form213(uint8_t Opcode) { return (Opcode >> 8) == 0xa; }
		bool isFMA3Form231(uint8_t Opcode) { return (Opcode >> 8) == 0xb; }
	
		bool isFMA3Scalar(uint8_t Opcode) {
		uint8_t B = Opcode & 0x0f;
		return B == 0x9 || B == 0xb || B == 0xd || B == 0xf;
		}

		bool isFMA3_FMADD(uint8_t Opcode) {
		uint8_t B = Opcode & 0xf;
		return B == 0x8 || B == 0x9;
		}
		bool isFMA3_FMSUB(uint8_t Opcode) {
		uint8_t B = Opcode & 0xf;
		return B == 0xa || B == 0xb;
		}
		bool isFMA3_FNMADD(uint8_t Opcode) {
		uint8_t B = Opcode & 0xf;
		return B == 0xc || B == 0xd;
		}
		bool isFMA3_FNMSUB(uint8_t Opcode) {
		uint8_t B = Opcode & 0xf;
		return B == 0xe || B == 0xf;
		}
		bool isFMA3_FMADDSUB(uint8_t Opcode) { return Opcode & 0xf) == 0x6; }
		bool isFMA3_FMSUBADD(uint8_t Opcode) { return Opcode & 0xf) == 0x7; }

With such solution you would not need to add those 2 bits to TSFlags for 132/213/231 forms.

We also would have the methods that can distinguish 132 vs 213 vs 231
and FMA vs FMS vs FNMA vs NMMS vs FMADDSUB vs FMSUBADD.

Unfortunately, I still would have quite long linear search for methods like this:

		unsigned getFMA213Opcode(bool IsEVEX, MVT VT) {
		// !!! Linear search through about 300-400 opcodes in FMA3RegOpcodes table.
		Opc = <iterate through entries of FMA3RegOpcodes table>;
		uint8_t BaseOpcode = ...getBaseOpcodeFor(TSFlags);
		
		if (isFMA3_FMADD(BaseOpcode) &&
			<check HasEVEX and VT here>)
		  return Opc;

There are at least 2 ways how to workaround that:

  1. Reduce the number of entries in FMA3RegOpcodes by having 1 static array of structures like I mentioned in the previous comment (i.e. describing bigger FMA families/groups, where 1 group would include {VEX}x{Reg,Mem} + {EVEX}x{Reg,Mem,KReg,KMem,KZReg,KZMem,Broadcast,KBroadcast,KZBroadcast,Round,KRound,KZRound}.
  1. Do some sort of hashing in my new FMA optimization:
		// Just the idea.
		unsigned FMAOpc = <iterate through FMA3RegOpcodes>;
		uint8_t FMAOpcByte = <extract opcode byte from TSFlags>;
		if (isFMA3FMADD(FMAOpcByte))
			FMADDOperations[FMADDOperationsIndex++] = <current index in FMA3RegOpcodes>;
		else if (isFMA3FMSUB(FMAOpcByte))
			FMADDOperations[FMSUBOperationsIndex++] = <current index in FMA3RegOpcodes>;
		else if (isFMA3FMSUB(FMAOpcByte))
			FMADDOperations[FMSUBOperationsIndex++] = <current index in FMA3RegOpcodes>;
		...

Even with (2), I think having bigger FMA3 groups may be useful in future.
For example, it would be convenient if/when need to add support for broadcast+operation folding at Peephole opt:
t1 = vbroadcastss <mem>
t2 = VFMADD213PSZr a, b, t1;
-->
t2 = VFMADD213PSZmb a, b, [broadcast <mem>]
Bigger groups would just easily return VFMADD213PSZmb for VFMADD213PSZr.

Thank you,
Slava
`

v_klochkov edited edge metadata.Aug 30 2016, 3:37 PM
v_klochkov added subscribers: DavidKreitzer, qcolombet.
RKSimon edited edge metadata.Feb 23 2017, 2:39 PM

@v_klochkov @craig.topper It's been almost 6 months - is anything happening with this?

craig.topper abandoned this revision.Feb 23 2017, 9:32 PM
lib/Target/X86/X86InstrFMA3Info.h