Page MenuHomePhabricator

[X86] Don't zero/sign-extend i1 or i8 return values to 32 bits (PR22532)
ClosedPublic

Authored by hans on Feb 4 2016, 4:29 PM.

Details

Summary

This brings us in line with GCC and MSVC behaviour, and saves on code size.

We were already not extending i1 return values to on x86_64 after [0].

The ABI docs are unclear about this situation. The new i386 psABI [1] clearly states (Table 2.4, page 14) that i1, i8, and i16 return values do not need to be extended beyond 8 bits (GCC and MSVC do extend 16-bit values though, so I'm holding off on that). The old i386 psABI [2] does not mention this. I can't find mention of this in the x86_64 ABI [3], but there is a proposal on [4] that matches the behaviour in this patch.

Please let me know what you think.

[0]. http://llvm.org/viewvc/llvm-project?view=revision&revision=127766
[1]. https://01.org/sites/default/files/file_attach/intel386-psabi-1.0.pdf
[2]. https://refspecs.linuxfoundation.org/elf/abi386-4.pdf
[3]. https://refspecs.linuxfoundation.org/elf/x86_64-abi-0.98.pdf
[4]. https://gcc.gnu.org/bugzilla/show_bug.cgi?id=46942#c4

Diff Detail

Event Timeline

hans updated this revision to Diff 46970.Feb 4 2016, 4:29 PM
hans retitled this revision from to [X86] Don't zero/sign-extend i1 or i8 return values to 32 bits (PR22532).
hans updated this object.
hans added reviewers: majnemer, rjmccall, rsmith, jyknight.
hans added a subscriber: llvm-commits.
hans added a subscriber: hans.Feb 4 2016, 4:30 PM

+Cameron who touched this last.

mkuper added a subscriber: mkuper.Feb 4 2016, 4:57 PM
spatel added a subscriber: spatel.Feb 4 2016, 5:23 PM
hans added a comment.Feb 4 2016, 5:48 PM

I can't find mention of this in the x86_64 ABI.

After looking some more, it *is* mentioned in a newer version of the document: [1, Footnote 16 on Page 23]: "Other bits are left unspecified, hence the consumer side of those values can rely on it being 0 or 1 when truncated to 8 bit.", so I suppose that's where LLVM's current special-casing of i1 on x86_64 comes from.

[1]. https://github.com/hjl-tools/x86-psABI/wiki/x86-64-psABI-r249.pdf

cc'ing some Intel folks both for clarity on the ABI doc and for the potential perf impact.

hans added a comment.Feb 5 2016, 2:18 PM

cc'ing some Intel folks both for clarity on the ABI doc and for the potential perf impact.

Thanks. I've also raised this on the x86-64-abi mailing list: https://groups.google.com/d/msg/x86-64-abi/E8O33onbnGQ/gecUVGYzDQAJ

But for this specific patch, GCC's and MSVC's behaviours seem very clear, so I'm hoping this one is fairly non-controversial :-)

majnemer edited edge metadata.Feb 5 2016, 2:30 PM

This looks fine to me.
It would be good to have another pair of eyes confirm my opinion :)

spatel added a comment.Feb 5 2016, 2:53 PM
In D16907#345285, @hans wrote:

cc'ing some Intel folks both for clarity on the ABI doc and for the potential perf impact.

Thanks. I've also raised this on the x86-64-abi mailing list: https://groups.google.com/d/msg/x86-64-abi/E8O33onbnGQ/gecUVGYzDQAJ

But for this specific patch, GCC's and MSVC's behaviours seem very clear, so I'm hoping this one is fairly non-controversial :-)

I hope it's non-controversial too, but I'm very curious to know if this impacts big-core Intel perf. :)

If I'm reading H.J.'s proposal correctly, we should treat shorts the same as char/bool. Should it all be fixed in one shot?

test/CodeGen/X86/tail-call-attrs.ll
13–16

This comment doesn't apply anymore?

jyknight edited edge metadata.Feb 5 2016, 2:59 PM

Why do you say GCC extends 16-bit numbers?

Given:

short global1, global2;
short bar() {
  return global1 + global2;
}

$ gcc -march=i386 -O2 -S -o - -m32

Results in:

bar:
      movw    global2, %ax
      addw    global1, %ax
      ret

If you use -march=i686, then it results in:

bar:
      movzwl  global2, %eax
      addw    global1, %ax
      ret
hans added a comment.Feb 5 2016, 6:17 PM

Why do you say GCC extends 16-bit numbers?

Given:

short global1, global2;
short bar() {
  return global1 + global2;
}

$ gcc -march=i386 -O2 -S -o - -m32

Results in:

bar:
      movw    global2, %ax
      addw    global1, %ax
      ret

If you use -march=i686, then it results in:

bar:
      movzwl  global2, %eax
      addw    global1, %ax
      ret

D'oh, I was holding it wrong.

I was using the same test case I used for bool and char, something like "return x == y", and GCC would use "sete" and then extend that to 32-bit, because that's easy and they have to extend the result anyway. Your test case is obviously better.

David and I also observed MSVC not extending 16-bit return values for this code:

unsigned short f(unsigned short x) {
  return x;
}

where they generate:

00000000: 66 8B 44 24 04     mov         ax,word ptr [esp+4]
00000005: C3                 ret

i.e. they're leaving the high 16 bits of eax undefined.

I'll update the patch to do shorts as well.

(Another interesting note is that MSVC's behaviour is contradictory to what they say in this document: https://msdn.microsoft.com/en-us/library/984x0h58.aspx "On x86 plaftorms, all arguments are widened to 32 bits when they are passed. Return values are also *widened to 32 bits and returned in the EAX register*.")

If I'm reading H.J.'s proposal correctly, we should treat shorts the same as char/bool. Should it all be fixed in one shot?

Yes, since James pointed out my doubts about i16 behaviour were unfounded, let's do them all in one shot.

test/CodeGen/X86/tail-call-attrs.ll
13–16

Right. I'll add a FIXME here. The tail call lowering needs an update.

hans updated this revision to Diff 47071.Feb 5 2016, 6:18 PM
hans edited edge metadata.

Updating the patch to do i16 as well.

(I didn't update the title because I think it tends to break the email thread.)

zansari edited edge metadata.Feb 7 2016, 12:35 AM

cc'ing some Intel folks both for clarity on the ABI doc and for the potential perf impact.

Well, there's always the potential for the non-extending move into the partial return register to create a false dependency with any prior writes to the whole register, however, you could argue that this isn't the place to deal with that since that's a more general issue ... maybe.

For example, take something like this:

short foo(void);
int bar(short, short);
short x, y = 123;
int A;

main()
{
  int i;
  for (i = 0; i <= 0xfffffff; i++) {
    A = bar(x, y);
    x = foo();
  }
  printf("\nA = %d, x = %x.\n", A, x);
  return 0;
}
-----------------------------
extern short x;
extern short y;

int bar(short a, short b)
{
  return a / b;
}

short foo(void)
{
  return x + y;
}

Foo will be slowed down a bit on a write to %al, whereas, there will be no dependency with a movzbl into %eax (I tried this real quick on an IVB, and it's ~50% faster with the movz, or an xor of eax before the movb).

We could play it safe and only do this for opt/min-size, or go ahead with this if the impact is low and deal with any potential performance issue in a more general way. Have we done any perf tests on this to see if there's any impact?

spatel accepted this revision.Feb 7 2016, 9:57 AM
spatel added a reviewer: spatel.

We could play it safe and only do this for opt/min-size, or go ahead with this if the impact is low and deal with any potential performance issue in a more general way. Have we done any perf tests on this to see if there's any impact?

I haven't run any perf tests, but I thought it would be good to alert people about this change since there might be some diffs.

I agree that dealing with any perf issue as a separate pass is the better option, and Kevin mentioned some progress on that here:
http://lists.llvm.org/pipermail/llvm-dev/2016-February/094745.html

This patch LGTM; it's just trying to match the ABI's mandates and as noted, makes LLVM behave more like GCC/MSVC by default.

This revision is now accepted and ready to land.Feb 7 2016, 9:57 AM
hans closed this revision.Feb 8 2016, 11:42 AM

Committed in r260133.