[ELF][X86_64] Fix corrupted LD -> LE optimization for TLS without PLT
The LD -> LE optimization for Thread-Local Storage without PLT requires
an additional "66" prefix, otherwise the next instruction will be
corrupted, causing runtime misbehavior (crashes) of the linked object.
The other (GD -> IE/LD) optimizations are the same with or without PLT,
but add tests for completeness. The instructions are copied from
https://raw.githubusercontent.com/wiki/hjl-tools/x86-psABI/x86-64-psABI-1.0.pdf#subsection.11.1.2
This does not try to address ILP32 (x32) support.
Fixes https://bugs.llvm.org/show_bug.cgi?id=37303
Reviewed By: ruiu
Differential Revision: https://reviews.llvm.org/D56779