Currently getOptimalMemOpType returns i32 for large enough sizes without checking for alignment, leading to poor code generation when misaligned accesses aren't permitted as we generate a word store then later split it up into byte stores. This means we inadvertantly go over the MaxStoresPerMemcpy limit and for memset we splat the memset value into a word then immediately split it up again.
Fix this by leaving it up to FindOptimalMemOpLowering to figure out which type to use, but also fix a bug there where it wasn't correctly checking if misaligned memory accesses are allowed.
The use of getPointerTy() here seems dubious. On many architectures, you can use getPointerTy as a rough proxy for the largest legal integer type, but that's not universal, and the usage of DstAS doesn't make any sense.
Maybe just loop over {i64, i32, i16}.