Hi,
This is an optimized version of memset for AArch64, improving on the general implementation.
I believe there is still room for improvement in the generated code, but I suggest I look at that in follow-ups.
Things I'd like to look at are:
- a different Bump for single-operand operations like SplatSet: if you only need to align a single pointer, I believe it is simpler to just mask away the bottom bits;
- introducing a DoLoop that uses a do-while loop rather than a for loop, for situations where we know there is at least one iteration. Ideally the compiler would figure this out through value range analysis, but it currently doesn't, so maybe we can help it out;
- changing Chained to do the Tail last, thus creating chains of stores and loads that are contiguous, as I suspect that could help with fetching behaviour in loops;
- trying to get the loop in memset that does not use dc zva to use stores with post-increment, as that could reduce the loop to two stores, a cmp and a branch. This will probably require compiler changes though.
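To make the first two ideas concrete, here is a rough sketch of what I have in mind. It is not the code in this patch: `align_down`, `store_block` and `set_large` are hypothetical names made up for illustration, and the plain byte loop in `store_block` stands in for whatever wide store the real elements would emit. It assumes the caller guarantees `count >= 2 * Block`, which is what makes the do-while safe.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: round a single pointer down to a power-of-two
// boundary by masking away the bottom bits.
template <size_t Alignment>
inline char *align_down(char *ptr) {
  static_assert((Alignment & (Alignment - 1)) == 0,
                "Alignment must be a power of two");
  const uintptr_t mask = ~(static_cast<uintptr_t>(Alignment) - 1);
  return reinterpret_cast<char *>(reinterpret_cast<uintptr_t>(ptr) & mask);
}

// Hypothetical stand-in for a fixed-size block store.
template <size_t Size>
inline void store_block(char *dst, unsigned char value) {
  for (size_t i = 0; i < Size; ++i)
    dst[i] = static_cast<char>(value);
}

// Sketch: unaligned head, aligned do-while body, unaligned tail.
// Assumption: count >= 2 * Block, so the first loop iteration is
// always in bounds and no up-front loop test is needed.
template <size_t Block>
inline void set_large(char *dst, unsigned char value, size_t count) {
  assert(count >= 2 * Block);
  char *const end = dst + count;
  store_block<Block>(dst, value);             // unaligned head
  char *cur = align_down<Block>(dst) + Block; // in (dst, dst + Block]
  do {
    store_block<Block>(cur, value);           // aligned stores
    cur += Block;
  } while (cur < end - Block);
  store_block<Block>(end - Block, value);     // unaligned tail, done last
}
```

The head and tail stores may overlap the aligned body, which is harmless for a set operation. Whether this actually produces better code than the current Bump and Loop would need to be checked against the generated assembly.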
Hmm, the unaligned Tail<_64> is preventing the reuse of the Loop logic...
I'll update the code to separate the looping element from the tail element.