I fixed some issues covered in the title.
Details
Diff Detail
Event Timeline
lib/Target/X86/X86ScheduleBtVer2.td | ||
---|---|---|
202 | My understanding is that ResourceCycles has to be 4 because you want to be able to compute a reciprocal throughput of 2. According to the amd documents, the pipes used by a float variable blend are "FPA|FPM". I wonder whether we could have [JFPU0, JFPU1] instead of [JFPU01], and then change ResourceCycles to [2, 1]. A variable blend is 3 uOps, and internally, it is likely to be implemented as the sequence {VAND,VANDN,VOR}, where VAND and VANDN can execute in parallel. It may be worthy to run some experiments to see which approach is better. That being said, your approach is not wrong, and I don't have a strong opinion on this. | |
202–206 | let NumMicroOps = 3; | |
218–225 | Variable blend instructions are 3 macro ops. |
lib/Target/X86/X86ScheduleBtVer2.td | ||
---|---|---|
202 | I have AMD laptop to make experiments but I don't have any perf test using a variable blend. Could anyone to help me with such a test? |
I made updates required by andreadb except the question about JFPU01: if we need perf experiments please help me with perf test(s). If we agree to use the current implementation then I could commit the patch if you get me LGTM.
I don't think it is worthy to run experiments just for the blendv case.
I just wanted to point out that there may be an alternative solution for that particular case. Having a resource cycle of 4 for the blendv is a bit odd (at least to me)..
But, as I wrote said, I really don't have a strong opinion on what numbers should go there (neither solution is ideal).
If you fix the remaining NumMicroOps issue I pointed out, then the patch looks good to me.
lib/Target/X86/X86ScheduleBtVer2.td | ||
---|---|---|
652–655 | let NumMicroOps = 3; | |
659–662 | Same. |
My understanding is that ResourceCycles has to be 4 because you want to be able to compute a reciprocal throughput of 2. According to the amd documents, the pipes used by a float variable blend are "FPA|FPM".
I wonder whether we could have [JFPU0, JFPU1] instead of [JFPU01], and then change ResourceCycles to [2, 1].
A variable blend is 3 uOps, and internally, it is likely to be implemented as the sequence {VAND,VANDN,VOR}, where VAND and VANDN can execute in parallel.
It may be worthy to run some experiments to see which approach is better. That being said, your approach is not wrong, and I don't have a strong opinion on this.