This is a 'no functional change intended' patch. It removes one FIXME, but it serves as a delivery mechanism for several more. :)
Motivation: we have a FeatureFastUAMem attribute that may be too general. It is used to determine whether a misaligned memory access of any size under 32 bytes is 'fast'. At some point around Nehalem for Intel and Bobcat for AMD, all scalar and SSE unaligned accesses apparently became fast enough that we can happily use them whenever we want. From the added FIXME comments, however, you can see that we're not consistent about this. Changing the name of the attribute makes the logic holes easier to see, IMO.
Further motivation: this is a preliminary step for PR24449 ( https://llvm.org/bugs/show_bug.cgi?id=24449 ). I'm hoping to answer a few questions about this seemingly simple test case:
#include <string.h>
void foo(char *x) { memset(x, 0, 32); }
Both of these:
$ clang -O2 memset.c -S -o -
$ clang -O2 -mavx memset.c -S -o -
Produce:
movq $0, 24(%rdi)
movq $0, 16(%rdi)
movq $0, 8(%rdi)
movq $0, (%rdi)
- Is it ok to generate misaligned 8-byte stores by default?
- Is it better to generate misaligned 16-byte SSE stores for the default case? (The default CPU is Core2/Merom.)
- Is it better to generate a misaligned 32-byte AVX store for the AVX case?
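For reference, here is roughly what the second and third alternatives could look like. These are hand-written AT&T-syntax sketches to illustrate the questions, not actual clang output; register choices and scheduling are assumptions:

Misaligned 16-byte SSE stores:

xorps   %xmm0, %xmm0          # zero a 16-byte register
movups  %xmm0, 16(%rdi)       # unaligned 16-byte store to x+16
movups  %xmm0, (%rdi)         # unaligned 16-byte store to x

Misaligned 32-byte AVX store:

vxorps  %xmm0, %xmm0, %xmm0   # zero xmm0 (upper bits of ymm0 are zeroed too)
vmovups %ymm0, (%rdi)         # single unaligned 32-byte store to x
vzeroupper                    # avoid AVX/SSE transition penalties on return

Fewer instructions and less code size in both cases, but whether that's a win depends on how the target handles the misaligned (and possibly cache-line-crossing) accesses - which is exactly what this attribute is supposed to tell us.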
You can drop FeatureSlowUAMem for BD targets - the AMD 15h SOG confirms that unaligned accesses should perform the same as aligned accesses when the address happens to be aligned, and cost only +1cy when it is actually misaligned. Cache-line-crossing accesses might be more complex, but most targets will suffer there, not just BD.