
HaohaiWen (Haohai, Wen)
User

Projects

User does not belong to any projects.

User Details

User Since
Jul 21 2021, 11:13 PM (26 w, 2 d)

Recent Activity

Dec 16 2021

HaohaiWen accepted D115752: [X86] Adjust some IceLake fp shuffle schedule classes (PR48110).
Dec 16 2021, 4:20 AM · Restricted Project
HaohaiWen added a comment to D115752: [X86] Adjust some IceLake fp shuffle schedule classes (PR48110).

LGTM

Dec 16 2021, 4:20 AM · Restricted Project

Dec 13 2021

HaohaiWen added inline comments to D115547: [X86] Adjust some IceLake integer shuffle schedule classes (PR48110).
Dec 13 2021, 11:03 PM · Restricted Project
HaohaiWen accepted D115547: [X86] Adjust some IceLake integer shuffle schedule classes (PR48110).

LGTM

Dec 13 2021, 10:56 PM · Restricted Project
HaohaiWen added a comment to D115547: [X86] Adjust some IceLake integer shuffle schedule classes (PR48110).

Here's the diff I found: https://www.textcompare.org/?id=61b7080d8668ef0015be12a9
The left window is before this patch; the right is after.
It seems the ports for VPBROADCASTBZ128rm, VPCMOVYrmr, and VPPERMrmr change from [5, 23] to [15, 23], but we don't have a lit test for them.
BTW, uops.info provides lat/tpt for icelake, not icelake-server.

Dec 13 2021, 12:52 AM · Restricted Project

Dec 7 2021

HaohaiWen committed rGd2c093e79d14: [CostModel][X86] Add i64 mul cost for avx512 as 1cy (authored by HaohaiWen).
Dec 7 2021, 7:40 PM
HaohaiWen closed D115016: [CostModel][X86] Add i64 mul cost for avx512 as 1cy.
Dec 7 2021, 7:40 PM · Restricted Project

Dec 6 2021

HaohaiWen retitled D115016: [CostModel][X86] Add i64 mul cost for avx512 as 1cy from [CostModel][X86] Add i64 mul cost for avx as 1cy to [CostModel][X86] Add i64 mul cost for avx512 as 1cy.
Dec 6 2021, 6:15 AM · Restricted Project

Dec 5 2021

HaohaiWen added a comment to D115016: [CostModel][X86] Add i64 mul cost for avx512 as 1cy.

Another question: do we need to add a check for 64-bit mode?

On a 32-bit target, TLI->getTypeLegalizationCost(DL, Ty) returns i32 as the legalized type for i64. Therefore mul i64, i64 on a 32-bit target will fall through to BaseT::getArithmeticInstrCost(Opcode, Ty, CostKind, Op1Info, Op2Info).

Dec 5 2021, 11:34 PM · Restricted Project
HaohaiWen added a comment to D115016: [CostModel][X86] Add i64 mul cost for avx512 as 1cy.

Can we first fix the i64 imul cost for avx512 before merging D46276? It's 1 cy on all avx512 targets.

A more critical feature would be to 'add' costs correctly for throughput cost kinds - depending on pipe usage etc. - instead of just numerically adding the costs together, which only makes sense for latency / size cost kinds.

We may need to fix resource_cycles in many x86 schedmodels (e.g. SKL, SKX) to get correct throughput. Most instructions set resource_cycles to 1 for each uop, which yields the wrong throughput.
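
To illustrate the point about 'adding' throughput costs, here is a minimal Python sketch. It assumes a simplified model in which each instruction is described only by the cycles it occupies on each pipe; the function name and the port names `p0` and `p5` are hypothetical and do not come from LLVM's cost model or schedmodels.

```python
def combined_rthroughput(instructions):
    """Combine per-instruction pipe usage into one reciprocal throughput.

    `instructions` is a list of dicts mapping a pipe name to the cycles
    the instruction occupies on that pipe (an assumed, simplified model).
    The busiest pipe bounds the steady-state throughput of the sequence.
    """
    pressure = {}
    for usage in instructions:
        for pipe, cycles in usage.items():
            pressure[pipe] = pressure.get(pipe, 0) + cycles
    return max(pressure.values(), default=0)


# Two instructions on different pipes can overlap, so the combined
# cost is not the numeric sum of the individual costs.
different_pipes = combined_rthroughput([{"p0": 1}, {"p5": 1}])

# Two instructions on the same pipe serialize, so here the sum is right.
same_pipe = combined_rthroughput([{"p0": 1}, {"p0": 1}])
```

Under this assumption, numeric addition only matches the pipe-aware result when the instructions compete for the same pipe.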

Dec 5 2021, 11:26 PM · Restricted Project
HaohaiWen updated the diff for D115016: [CostModel][X86] Add i64 mul cost for avx512 as 1cy.

Move to avx512 cost table and rebase

Dec 5 2021, 11:05 PM · Restricted Project

Dec 3 2021

HaohaiWen added a comment to D115016: [CostModel][X86] Add i64 mul cost for avx512 as 1cy.

This is a classic example of where having cost tables driven from (accurate) scheduler models would be much more useful (D46276), but there has been very little interest in helping polish those models to make that a viable option.

Dec 3 2021, 6:13 AM · Restricted Project

Dec 2 2021

HaohaiWen updated the summary of D115016: [CostModel][X86] Add i64 mul cost for avx512 as 1cy.
Dec 2 2021, 6:01 PM · Restricted Project
HaohaiWen requested review of D115016: [CostModel][X86] Add i64 mul cost for avx512 as 1cy.
Dec 2 2021, 5:55 PM · Restricted Project

Jul 23 2021

HaohaiWen added a comment to D103695: [WIP][RFC][Utils] Helper script to check sanity of cost tables vs scheduler models.

I don't think we can completely trust the reciprocal throughput reported by llvm-mca, since the Rthroughput of some instructions is not defined correctly in the schedmodel.
e.g.

$ ./llvm_utils_check_cost_tables.py --cpulevel=avx512 --stop-on-diff
double fdiv double: cost (4.0 - 4.0) vs recipthroughput (3 - 3)
skylake-avx512 : 4.0 vs 3

It is defined in X86SchedSkylakeServer.td:

def SKXWriteResGroup184 : SchedWriteRes<[SKXPort0,SKXFPDivider]> {
  let Latency = 14;
  let NumMicroOps = 1;
  let ResourceCycles = [1,3];
}
def : SchedAlias<WriteFDiv64,  SKXWriteResGroup184>; // TODO - convert to ZnWriteResFpuPair

However, its measured tpt from uops.info is 4, and the llvm-exegesis tpt result is also 4.
I think uops.info/agner.org should be more accurate.
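
As a rough illustration of why the schedmodel entry above yields 3, the following Python sketch derives a reciprocal throughput from ResourceCycles. It assumes a simplified model (the busiest resource, divided by its unit count, bounds throughput) and is not llvm-mca's actual implementation; the function name and the one-unit-per-resource default are assumptions.

```python
from fractions import Fraction


def rthroughput_from_resource_cycles(resource_cycles, units=None):
    """Estimate reciprocal throughput from a schedmodel-style entry.

    `resource_cycles` maps a resource name to the cycles it is held.
    `units` maps a resource name to how many identical units exist
    (assumed to be 1 when not given). Simplified model, not llvm-mca.
    """
    units = units or {}
    return max(
        Fraction(cycles, units.get(name, 1))
        for name, cycles in resource_cycles.items()
    )


# SKXWriteResGroup184 as written: ResourceCycles = [1,3]
as_modeled = rthroughput_from_resource_cycles(
    {"SKXPort0": 1, "SKXFPDivider": 3}
)

# Hypothetical adjustment that would match the measured tpt of 4
adjusted = rthroughput_from_resource_cycles(
    {"SKXPort0": 1, "SKXFPDivider": 4}
)
```

In this model the divider's cycle count alone determines the result, so a schedmodel value of 3 cannot reproduce a measured value of 4.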

Jul 23 2021, 1:36 AM · Restricted Project