Looks like the NoRegister has some effect on the final code that is generated. My guess is that some optimization kicks in at the end?
When I use -S to dump the assembly I get the correct version with 'shrq $3, %r8':
movq %r9, %r8 shrq $3, %r8 movsbl 2147450880(%r8), %r8d
But, when I disassemble the final binary I get RAX in stead of R8:
mov %r9,%r8 shr $0x3,%rax movsbl 0x7fff8000(%r8),%r8d