By manipulating a local variable in the loop instead of the __end_ member,
the loop can be fully optimized away when the element type has no
non-trivial destructor, and __end_ is then updated separately afterwards.
This results in a substantial improvement in the generated code.
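The change amounts to decrementing a local copy of the end pointer inside the
destruction loop and writing __end_ once at the end. A minimal sketch of that
pattern (illustrative only, not the exact library code; the class, member, and
function names below are made up for the example):

  #include <memory>

  template <class T>
  struct VectorSketch {          // illustrative stand-in, not the real class
      T* begin_;
      T* end_;

      // Before: the member end_ is decremented on every iteration, so the
      // compiler has to materialize its final value from the loop trip count
      // even when ~T() is trivial.
      void destruct_at_end_before(T* new_last) {
          while (new_last != end_)
              std::destroy_at(--end_);
      }

      // After: only a local copy is touched inside the loop; with a trivial
      // ~T() the loop folds away and a single store to end_ remains.
      void destruct_at_end_after(T* new_last) {
          T* soon_to_be_end = end_;
          while (new_last != soon_to_be_end)
              std::destroy_at(--soon_to_be_end);
          end_ = new_last;
      }
  };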
Prior to this change, the following would be generated (on x86_64):
  movq (%rdi), %rdx
  movq 8(%rdi), %rcx
  cmpq %rdx, %rcx
  je LBB2_2
  leaq -12(%rcx), %rax
  subq %rdx, %rax
  movabsq $-6148914691236517205, %rdx ## imm = 0xAAAAAAAAAAAAAAAB
  mulq %rdx
  shrq $3, %rdx
  notq %rdx
  leaq (%rdx,%rdx,2), %rax
  leaq (%rcx,%rax,4), %rax
  movq %rax, 8(%rdi)
And after:
  movq (%rdi), %rax
  movq %rax, 8(%rdi)
This brings the generated code in line with what other implementations produce.
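For context, codegen like the above might be observed when clearing a vector of
a trivially destructible element type; the 12-byte struct and the function name
below are assumptions based on the -12 offset and *12 stride visible in the
"before" assembly, not taken from the original patch:

  #include <vector>

  // Hypothetical 12-byte, trivially destructible element type.
  struct Elem { int a, b, c; };

  // Hypothetical wrapper whose codegen can be compared before and after the
  // change, e.g. at -O2 on x86_64.
  void clear_vec(std::vector<Elem>& v) {
      v.clear();   // destroys the elements and resets the end pointer
  }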