On Cortex-A57, an AES chain achieves best performance if the accumulation register is kept the same through the entire chain. This patch utilizes the A57 FP load balancing pass to enforce that we emit such AES instruction sequence.
Details
Diff Detail
Event Timeline
The code looks ok, but I'm not sure about the architecture decisions. I'd rather Tim or James had a look.
cheers,
--renato
lib/Target/AArch64/AArch64A57FPLoadBalancing.cpp | ||
---|---|---|
566 | You could have moved isAES up and made Change one line | |
test/CodeGen/AArch64/aes-load-balancing.ll | ||
15 | can you use CHECK variables here? Like: ;CHECK: aese v[[accum:0-9]].16b, ... |
Hi Zhao,
There are some subtleties here that I don't think you're handling correctly. The major one is tied operands, which are not handled right and are the reason I removed handling of MLAv2f32 from this pass.
Consider a chain where the end instruction is the same as the kill instruction (which means the end dest register cannot be changed). The fixup code currently will look at the last instruction and see something like:
%D0<def,tied0> = AES... %D0<kill,tied0>, %D1...
It will look at the uses first, totally ignore the tied constraint and change %D0 to be whatever the chain register is. It will then not look at the def, because it knows it is illegal to change the def. We then have a broken instruction.
When you add proper handling for this (along with a decent amount of tests, because this is incredibly fiddly to get right), we can re-add support for MLAv2f32. Which would be nice :)
Secondly I think your algorithm is going the wrong way. It is chaining the instructions going forward through the instruction stream instead of backwards. Consider, it's possible that we can't change the dest operand of the last instruction in the chain. As the operand is tied, the accumulator has to be the same as the dest. So choosing the chaining register as the dest of the last instruction seems the right thing to do. Currently your algorithm could get to the last instruction and not be able to change it. If you implement this algorithm, you actually don't need to handle tied operands specially (although I still would like to see proper assertions in place that they're not corrupted), because you never get to the situation where you can change an accum but not a dest.
Thirdly, are AES instruction statically scheduled by the core? My belief is "no", which is fine, but they should not contribute to the computed static parity. If "yes", then they should probably be sorted to appear in the ready queue before FMADDs.
Cheers,
James
James,
Thanks for the feed back. To make sure I understand correctly: the way we build chains (mul-mla or aes) is forward, which is fine.
But it's better to color the chains backward for the G->getLast() to handle tied dest register. This would requires us to iterate from the last instructions to the first of a chain in ColorChain() and make appropriate changes to scavengeRegister().
Is this the plan you envisioned to add back support for MLAv2f32?
Thanks,
Zhaoshi
You could have moved isAES up and made Change one line