Pass MemCpyOpt doesn't check if a store instruction is nontemporal.
As a consequence, adjacent nontemporal stores may be wrongly merged into memset calls.
Example:
  define void @foo(<4 x float>* nocapture %dst) {
  entry:
    store <4 x float> zeroinitializer, <4 x float>* %dst, align 16, !nontemporal !0
    %ptr1 = getelementptr inbounds <4 x float>, <4 x float>* %dst, i64 1
    store <4 x float> zeroinitializer, <4 x float>* %ptr1, align 16, !nontemporal !0
    ret void
  }

  !0 = !{i32 1}
In this example, the two nontemporal stores are combined into a memset of zero, which does not preserve the nontemporal hint. Later on, the backend (tested on my x86-64 corei7) expands that memset call into a sequence of two normal 16-byte aligned vector stores.
opt -memcpyopt foo.ll -S -o - | llc -mcpu=corei7 -o -
Before:
  xorps  %xmm0, %xmm0
  movaps %xmm0, 16(%rdi)
  movaps %xmm0, (%rdi)
With this patch, we no longer merge nontemporal stores into calls to memset.
In this example, llc correctly expands the two stores into two movntps:
  xorps   %xmm0, %xmm0
  movntps %xmm0, 16(%rdi)
  movntps %xmm0, (%rdi)
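To illustrate the intended behavior outside of LLVM, here is a minimal, self-contained sketch of the idea. It is a toy model, not the actual MemCpyOpt code: ToyStore, ToyMemset, isMergeCandidate and formMemsets are all hypothetical names invented for this example, and the real pass does considerably more analysis. The only point it shows is that a store carrying the nontemporal hint is excluded from memset formation.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy stand-in for a store instruction (hypothetical; NOT llvm::StoreInst).
struct ToyStore {
    std::int64_t offset;     // byte offset from the base pointer
    std::int64_t size;       // number of bytes written
    std::uint8_t byteValue;  // repeated byte being stored (0 for zeroinitializer)
    bool nontemporal;        // models the !nontemporal metadata
};

// A merged byte range that a pass could lower to a single memset call.
struct ToyMemset {
    std::int64_t offset;
    std::int64_t size;
    std::uint8_t byteValue;
};

// A store may take part in memset formation only if it has no
// nontemporal hint; merging would silently drop that hint.
bool isMergeCandidate(const ToyStore &s) { return !s.nontemporal; }

// Greedily merges runs of two or more contiguous, same-valued candidate
// stores. Stores that are not candidates are left alone.
std::vector<ToyMemset> formMemsets(const std::vector<ToyStore> &stores) {
    std::vector<ToyMemset> out;
    std::size_t i = 0;
    while (i < stores.size()) {
        if (!isMergeCandidate(stores[i])) {
            ++i;
            continue;
        }
        ToyMemset run{stores[i].offset, stores[i].size, stores[i].byteValue};
        std::size_t j = i + 1;
        while (j < stores.size() && isMergeCandidate(stores[j]) &&
               stores[j].byteValue == run.byteValue &&
               stores[j].offset == run.offset + run.size) {
            run.size += stores[j].size;
            ++j;
        }
        if (j - i >= 2)
            out.push_back(run);  // only a run of 2+ stores becomes a memset
        i = j;
    }
    return out;
}
```

In this model, the two 16-byte nontemporal stores from the example above produce no memset at all, while the same two stores without the hint are merged into one 32-byte range.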
Please let me know if okay to submit.
Thanks,
Andrea