Page MenuHomePhabricator

[OpenMP] Add flag for specifying the target device architecture for OpenMP device offloading
ClosedPublic

Authored by gtbercea on Jun 28 2017, 3:44 PM.

Details

Summary

OpenMP has the ability to offload target regions to devices which may have different architectures.

A new -fopenmp-target-arch flag is introduced to specify the device architecture.

In this patch I use the new flag to specify the compute capability of the underlying NVIDIA architecture for the OpenMP offloading CUDA tool chain.

Only a host-offloading test is provided since full device offloading capability will only be available when D29654 lands.

Event Timeline

gtbercea updated this revision to Diff 104532.Jun 28 2017, 3:44 PM
gtbercea created this revision.
hfinkel edited edge metadata.Jun 29 2017, 8:28 AM

What happens if you have multiple targets? Maybe this should be -fopenmp-targets-arch=foo,bar,whatever?

Once this all lands, please make sure that you add additional test cases here. Make sure that the arch is passed through to the ptx and cuda tools as it should be. Make sure that the defaults work. Make sure that something reasonable happens if the user specifies the option more than once (if they're all the same).

What happens if you have multiple targets? Maybe this should be -fopenmp-targets-arch=foo,bar,whatever?

Once this all lands, please make sure that you add additional test cases here. Make sure that the arch is passed through to the ptx and cuda tools as it should be. Make sure that the defaults work. Make sure that something reasonable happens if the user specifies the option more than once (if they're all the same).

Hi Hal,

At the moment only one arch is supported and it would apply to all the target triples under -fopenmp-targets.

I was planning to address the multiple archs problem in a future patch.

I am assuming that in the case of multiple archs, each arch in -fopenmp-targets-arch=A1,A2,A3 will bind to a corresponding triple in -fopenmp-targets=T1,T2,T3 like so: T1 with A1, T2 with A2 etc. Is this a practical interpretation of what should happen?

Regarding tests: more tests can be added as a separate patch once offloading is enabled by the patch following this one (i.e. D29654). There actually is a test in D29654 where I check that the arch is passed to ptxas and nvlink correctly using this flag. I will add some more test cases to cover the other situations you mentioned.

Thanks,

--Doru

What happens if you have multiple targets? Maybe this should be -fopenmp-targets-arch=foo,bar,whatever?

Once this all lands, please make sure that you add additional test cases here. Make sure that the arch is passed through to the ptx and cuda tools as it should be. Make sure that the defaults work. Make sure that something reasonable happens if the user specifies the option more than once (if they're all the same).

Hi Hal,

At the moment only one arch is supported and it would apply to all the target triples under -fopenmp-targets.

I was planning to address the multiple archs problem in a future patch.

I am assuming that in the case of multiple archs, each arch in -fopenmp-targets-arch=A1,A2,A3 will bind to a corresponding triple in -fopenmp-targets=T1,T2,T3 like so: T1 with A1, T2 with A2 etc. Is this a practical interpretation of what should happen?

Yea, that's what I was thinking. I'm a bit concerned that none of this generalizes well. To take a step back, under what circumstances do we support multiple targets right now?

Regarding tests: more tests can be added as a separate patch once offloading is enabled by the patch following this one (i.e. D29654). There actually is a test in D29654 where I check that the arch is passed to ptxas and nvlink correctly using this flag. I will add some more test cases to cover the other situations you mentioned.

Sounds good.

Thanks,

--Doru

What happens if you have multiple targets? Maybe this should be -fopenmp-targets-arch=foo,bar,whatever?

Once this all lands, please make sure that you add additional test cases here. Make sure that the arch is passed through to the ptx and cuda tools as it should be. Make sure that the defaults work. Make sure that something reasonable happens if the user specifies the option more than once (if they're all the same).

Hi Hal,

At the moment only one arch is supported and it would apply to all the target triples under -fopenmp-targets.

I was planning to address the multiple archs problem in a future patch.

I am assuming that in the case of multiple archs, each arch in -fopenmp-targets-arch=A1,A2,A3 will bind to a corresponding triple in -fopenmp-targets=T1,T2,T3 like so: T1 with A1, T2 with A2 etc. Is this a practical interpretation of what should happen?

Yea, that's what I was thinking. I'm a bit concerned that none of this generalizes well. To take a step back, under what circumstances do we support multiple targets right now?

We allow -fopenmp-targets to get a list of triples. I am not aware of any limitations in terms of how many of these triples you can have. Even in the test file of this patch we have the following: "-targets=openmp-powerpc64le-ibm-linux-gnu,openmp-x86_64-pc-linux-gnu,host-powerpc64le--linux"

Regarding tests: more tests can be added as a separate patch once offloading is enabled by the patch following this one (i.e. D29654). There actually is a test in D29654 where I check that the arch is passed to ptxas and nvlink correctly using this flag. I will add some more test cases to cover the other situations you mentioned.

Sounds good.

Thanks,

--Doru

In our previous solution there might be a problem. The same triple might be used multiple times just so that you can have several archs in the other flag (T1 and T2 being the same). There are some alternatives which I have discussed with @ABataev.

One solution could be to associate an arch with each triple to avoid positional matching of triples in one flag with archs in another flag:

-fopenmp-targets=T1:A1,T2,T3:A2

":A1" is optional, also, in the future, we can pass other things to the toolchain such as "-L/a/b/c/d":

-fopenmp-targets=T1:A1: -L/a/b/c/d,T2,T3:A2

An actual example:

-fopenmp-targets=nvptx64-nvidia-cuda:sm_35,openmp-powerpc64le-ibm-linux-gnu

What happens if you have multiple targets? Maybe this should be -fopenmp-targets-arch=foo,bar,whatever?

Once this all lands, please make sure that you add additional test cases here. Make sure that the arch is passed through to the ptx and cuda tools as it should be. Make sure that the defaults work. Make sure that something reasonable happens if the user specifies the option more than once (if they're all the same).

Hi Hal,

At the moment only one arch is supported and it would apply to all the target triples under -fopenmp-targets.

I was planning to address the multiple archs problem in a future patch.

I am assuming that in the case of multiple archs, each arch in -fopenmp-targets-arch=A1,A2,A3 will bind to a corresponding triple in -fopenmp-targets=T1,T2,T3 like so: T1 with A1, T2 with A2 etc. Is this a practical interpretation of what should happen?

Yea, that's what I was thinking. I'm a bit concerned that none of this generalizes well. To take a step back, under what circumstances do we support multiple targets right now?

We allow -fopenmp-targets to get a list of triples. I am not aware of any limitations in terms of how many of these triples you can have. Even in the test file of this patch we have the following: "-targets=openmp-powerpc64le-ibm-linux-gnu,openmp-x86_64-pc-linux-gnu,host-powerpc64le--linux"

Regarding tests: more tests can be added as a separate patch once offloading is enabled by the patch following this one (i.e. D29654). There actually is a test in D29654 where I check that the arch is passed to ptxas and nvlink correctly using this flag. I will add some more test cases to cover the other situations you mentioned.

Sounds good.

Thanks,

--Doru

In our previous solution there might be a problem. The same triple might be used multiple times just so that you can have several archs in the other flag (T1 and T2 being the same). There are some alternatives which I have discussed with @ABataev.

One solution could be to associate an arch with each triple to avoid positional matching of triples in one flag with archs in another flag:

-fopenmp-targets=T1:A1,T2,T3:A2

":A1" is optional, also, in the future, we can pass other things to the toolchain such as "-L/a/b/c/d":

-fopenmp-targets=T1:A1: -L/a/b/c/d,T2,T3:A2

Okay, good, this is exactly where I was going when I said I was worried about generalization. -march seems like one of many flags I might want to pass to the target compilation. Moreover, it doesn't seem special in what regard.

We have -Xclang and -mllvm, etc. to pass flags through to other stages of compilation. Could we do something similar here? Maybe something like: `-Xopenmp-target:openmp-powerpc64le-ibm-linux-gnu -march=pwr7`. That's unfortunately long, but if there's only one target, we could omit the triple?

An actual example:

-fopenmp-targets=nvptx64-nvidia-cuda:sm_35,openmp-powerpc64le-ibm-linux-gnu

What happens if you have multiple targets? Maybe this should be -fopenmp-targets-arch=foo,bar,whatever?

Once this all lands, please make sure that you add additional test cases here. Make sure that the arch is passed through to the ptx and cuda tools as it should be. Make sure that the defaults work. Make sure that something reasonable happens if the user specifies the option more than once (if they're all the same).

Hi Hal,

At the moment only one arch is supported and it would apply to all the target triples under -fopenmp-targets.

I was planning to address the multiple archs problem in a future patch.

I am assuming that in the case of multiple archs, each arch in -fopenmp-targets-arch=A1,A2,A3 will bind to a corresponding triple in -fopenmp-targets=T1,T2,T3 like so: T1 with A1, T2 with A2 etc. Is this a practical interpretation of what should happen?

Yea, that's what I was thinking. I'm a bit concerned that none of this generalizes well. To take a step back, under what circumstances do we support multiple targets right now?

We allow -fopenmp-targets to get a list of triples. I am not aware of any limitations in terms of how many of these triples you can have. Even in the test file of this patch we have the following: "-targets=openmp-powerpc64le-ibm-linux-gnu,openmp-x86_64-pc-linux-gnu,host-powerpc64le--linux"

Regarding tests: more tests can be added as a separate patch once offloading is enabled by the patch following this one (i.e. D29654). There actually is a test in D29654 where I check that the arch is passed to ptxas and nvlink correctly using this flag. I will add some more test cases to cover the other situations you mentioned.

Sounds good.

Thanks,

--Doru

In our previous solution there might be a problem. The same triple might be used multiple times just so that you can have several archs in the other flag (T1 and T2 being the same). There are some alternatives which I have discussed with @ABataev.

One solution could be to associate an arch with each triple to avoid positional matching of triples in one flag with archs in another flag:

-fopenmp-targets=T1:A1,T2,T3:A2

":A1" is optional, also, in the future, we can pass other things to the toolchain such as "-L/a/b/c/d":

-fopenmp-targets=T1:A1: -L/a/b/c/d,T2,T3:A2

Okay, good, this is exactly where I was going when I said I was worried about generalization. -march seems like one of many flags I might want to pass to the target compilation. Moreover, it doesn't seem special in what regard.

We have -Xclang and -mllvm, etc. to pass flags through to other stages of compilation. Could we do something similar here? Maybe something like: `-Xopenmp-target:openmp-powerpc64le-ibm-linux-gnu -march=pwr7`. That's unfortunately long, but if there's only one target, we could omit the triple?

The triple could be omitted, absolutely.

If you have the following:

-fopenmp-targets=openmp-powerpc64le-ibm-linux-gnu `-Xopenmp-target:openmp-powerpc64le-ibm-linux-gnu -march=pwr7 -Xopenmp-target:openmp-powerpc64le-ibm-linux-gnu -march=pwr8`

This would end up having a toolchain called for each one of the -Xopenmp-target sets of flags even though a single triple was specified under the -fopenmp-targets. Would this be ok?

An actual example:

-fopenmp-targets=nvptx64-nvidia-cuda:sm_35,openmp-powerpc64le-ibm-linux-gnu

What happens if you have multiple targets? Maybe this should be -fopenmp-targets-arch=foo,bar,whatever?

Once this all lands, please make sure that you add additional test cases here. Make sure that the arch is passed through to the ptx and cuda tools as it should be. Make sure that the defaults work. Make sure that something reasonable happens if the user specifies the option more than once (if they're all the same).

Hi Hal,

At the moment only one arch is supported and it would apply to all the target triples under -fopenmp-targets.

I was planning to address the multiple archs problem in a future patch.

I am assuming that in the case of multiple archs, each arch in -fopenmp-targets-arch=A1,A2,A3 will bind to a corresponding triple in -fopenmp-targets=T1,T2,T3 like so: T1 with A1, T2 with A2 etc. Is this a practical interpretation of what should happen?

Yea, that's what I was thinking. I'm a bit concerned that none of this generalizes well. To take a step back, under what circumstances do we support multiple targets right now?

We allow -fopenmp-targets to get a list of triples. I am not aware of any limitations in terms of how many of these triples you can have. Even in the test file of this patch we have the following: "-targets=openmp-powerpc64le-ibm-linux-gnu,openmp-x86_64-pc-linux-gnu,host-powerpc64le--linux"

Regarding tests: more tests can be added as a separate patch once offloading is enabled by the patch following this one (i.e. D29654). There actually is a test in D29654 where I check that the arch is passed to ptxas and nvlink correctly using this flag. I will add some more test cases to cover the other situations you mentioned.

Sounds good.

Thanks,

--Doru

In our previous solution there might be a problem. The same triple might be used multiple times just so that you can have several archs in the other flag (T1 and T2 being the same). There are some alternatives which I have discussed with @ABataev.

One solution could be to associate an arch with each triple to avoid positional matching of triples in one flag with archs in another flag:

-fopenmp-targets=T1:A1,T2,T3:A2

":A1" is optional, also, in the future, we can pass other things to the toolchain such as "-L/a/b/c/d":

-fopenmp-targets=T1:A1: -L/a/b/c/d,T2,T3:A2

Okay, good, this is exactly where I was going when I said I was worried about generalization. -march seems like one of many flags I might want to pass to the target compilation. Moreover, it doesn't seem special in what regard.

We have -Xclang and -mllvm, etc. to pass flags through to other stages of compilation. Could we do something similar here? Maybe something like: `-Xopenmp-target:openmp-powerpc64le-ibm-linux-gnu -march=pwr7`. That's unfortunately long, but if there's only one target, we could omit the triple?

The triple could be omitted, absolutely.

If you have the following:

-fopenmp-targets=openmp-powerpc64le-ibm-linux-gnu `-Xopenmp-target:openmp-powerpc64le-ibm-linux-gnu -march=pwr7 -Xopenmp-target:openmp-powerpc64le-ibm-linux-gnu -march=pwr8`

This would end up having a toolchain called for each one of the -Xopenmp-target sets of flags even though a single triple was specified under the -fopenmp-targets. Would this be ok?

Why? That does not sound desirable. And could you even use these multiple outputs? I think you'd want to pass all of the arguments for each target triple to the one toolchain invocation for that target triple. Is that possible?

An actual example:

-fopenmp-targets=nvptx64-nvidia-cuda:sm_35,openmp-powerpc64le-ibm-linux-gnu

What happens if you have multiple targets? Maybe this should be -fopenmp-targets-arch=foo,bar,whatever?

Once this all lands, please make sure that you add additional test cases here. Make sure that the arch is passed through to the ptx and cuda tools as it should be. Make sure that the defaults work. Make sure that something reasonable happens if the user specifies the option more than once (if they're all the same).

Hi Hal,

At the moment only one arch is supported and it would apply to all the target triples under -fopenmp-targets.

I was planning to address the multiple archs problem in a future patch.

I am assuming that in the case of multiple archs, each arch in -fopenmp-targets-arch=A1,A2,A3 will bind to a corresponding triple in -fopenmp-targets=T1,T2,T3 like so: T1 with A1, T2 with A2 etc. Is this a practical interpretation of what should happen?

Yea, that's what I was thinking. I'm a bit concerned that none of this generalizes well. To take a step back, under what circumstances do we support multiple targets right now?

We allow -fopenmp-targets to get a list of triples. I am not aware of any limitations in terms of how many of these triples you can have. Even in the test file of this patch we have the following: "-targets=openmp-powerpc64le-ibm-linux-gnu,openmp-x86_64-pc-linux-gnu,host-powerpc64le--linux"

Regarding tests: more tests can be added as a separate patch once offloading is enabled by the patch following this one (i.e. D29654). There actually is a test in D29654 where I check that the arch is passed to ptxas and nvlink correctly using this flag. I will add some more test cases to cover the other situations you mentioned.

Sounds good.

Thanks,

--Doru

In our previous solution there might be a problem. The same triple might be used multiple times just so that you can have several archs in the other flag (T1 and T2 being the same). There are some alternatives which I have discussed with @ABataev.

One solution could be to associate an arch with each triple to avoid positional matching of triples in one flag with archs in another flag:

-fopenmp-targets=T1:A1,T2,T3:A2

":A1" is optional, also, in the future, we can pass other things to the toolchain such as "-L/a/b/c/d":

-fopenmp-targets=T1:A1: -L/a/b/c/d,T2,T3:A2

Okay, good, this is exactly where I was going when I said I was worried about generalization. -march seems like one of many flags I might want to pass to the target compilation. Moreover, it doesn't seem special in what regard.

We have -Xclang and -mllvm, etc. to pass flags through to other stages of compilation. Could we do something similar here? Maybe something like: `-Xopenmp-target:openmp-powerpc64le-ibm-linux-gnu -march=pwr7`. That's unfortunately long, but if there's only one target, we could omit the triple?

The triple could be omitted, absolutely.

If you have the following:

-fopenmp-targets=openmp-powerpc64le-ibm-linux-gnu `-Xopenmp-target:openmp-powerpc64le-ibm-linux-gnu -march=pwr7 -Xopenmp-target:openmp-powerpc64le-ibm-linux-gnu -march=pwr8`

This would end up having a toolchain called for each one of the -Xopenmp-target sets of flags even though a single triple was specified under the -fopenmp-targets. Would this be ok?

Why? That does not sound desirable. And could you even use these multiple outputs? I think you'd want to pass all of the arguments for each target triple to the one toolchain invocation for that target triple. Is that possible?

I agree, I don't think that is something we want (i.e. having one triple lead to two toolchains), with the current flags you can't do that today. I wanted to check with you though that's why i mentioned it.

I think appending all options for a particular triple together is more desirable.

An actual example:

-fopenmp-targets=nvptx64-nvidia-cuda:sm_35,openmp-powerpc64le-ibm-linux-gnu

...

Okay, good, this is exactly where I was going when I said I was worried about generalization. -march seems like one of many flags I might want to pass to the target compilation. Moreover, it doesn't seem special in what regard.

We have -Xclang and -mllvm, etc. to pass flags through to other stages of compilation. Could we do something similar here? Maybe something like: `-Xopenmp-target:openmp-powerpc64le-ibm-linux-gnu -march=pwr7`. That's unfortunately long, but if there's only one target, we could omit the triple?

The triple could be omitted, absolutely.

If you have the following:

-fopenmp-targets=openmp-powerpc64le-ibm-linux-gnu `-Xopenmp-target:openmp-powerpc64le-ibm-linux-gnu -march=pwr7 -Xopenmp-target:openmp-powerpc64le-ibm-linux-gnu -march=pwr8`

This would end up having a toolchain called for each one of the -Xopenmp-target sets of flags even though a single triple was specified under the -fopenmp-targets. Would this be ok?

Why? That does not sound desirable. And could you even use these multiple outputs? I think you'd want to pass all of the arguments for each target triple to the one toolchain invocation for that target triple. Is that possible?

I agree, I don't think that is something we want (i.e. having one triple lead to two toolchains), with the current flags you can't do that today. I wanted to check with you though that's why i mentioned it.

I think appending all options for a particular triple together is more desirable.

Good, let's do that.

gtbercea updated this revision to Diff 104960.Jun 30 2017, 4:43 PM
gtbercea retitled this revision from [OpenMP] Add flag for specifying the target device architecture for OpenMP device offloading to [OpenMP] Add flag for specifying the target device architecture for OpenMP device offloading.

Pass OpenMP target options.

gtbercea updated this revision to Diff 104962.Jun 30 2017, 4:55 PM

Check -fopenmp-targets has one entry when using default toolchain in -Xopenmp-target.

@hfinkel I've add the flag as suggested. There is one minor change, I used "=" instead of ":" when specifying the toolchain/triple. I also support the triple being omitted when there is only one offloading toolchain specified with -fopenmp-targets.

hfinkel added inline comments.Jun 30 2017, 5:51 PM
include/clang/Driver/Options.td
463

Can this be?

HelpText<"Pass <arg> to the target offloading toolchain.">,
MetaVarName<"<arg>">;
465
HelpText<"Pass <arg> to the specified target offloading toolchain. The triple that identifies the toolchain must be provided after the equals sign.">, MetaVarName<"<arg>">;
lib/Driver/ToolChains/Cuda.cpp
432

Is this first sentence accurate?

433

This comment should be moved down to where the sm_20 default is added.

435

Why is this logic in this function? Don't you need the same logic in Generic_GCC::TranslateArgs to handle non-CUDA offloading toolchains?

458

A user can trigger this assert, right? Please make this a diagnostic error instead.

463

Shouldn't you be adding all of the options, not just the -march= ones?

467

Can a user hit this? If so, it must be an actual diagnostic.

test/Driver/openmp-offload.c
607

I don't see why you'd check that the arguments are unused. They should be used. One exception might be that you might want to force -Xopenmp-target=foo to be unused if foo is not a currently-targeted offloading triple. There could be a separate test case for that.

Otherwise, I think you should be able to check the relevant backend commands, no? (something like where CHK-COMMANDS is used above in this file).

gtbercea updated this revision to Diff 105280.Jul 5 2017, 9:05 AM
gtbercea marked 3 inline comments as done.

Address comments.

lib/Driver/ToolChains/Cuda.cpp
432

Fixed. It should be -Xopenmp-target

435

I would imagine that each toolchain needs to parse the list of flags since, given a toolchain, the flag may need to be passed to more than one tool and different tools may require different flags for passing the same information.

463

I thought that that would be the case but there are a few issues:

  1. PTXAS and NVLINK each use a different flag for specifying the arch, and, in turn, each flag is different from -march.
  1. -Xopenmp-target passes a flag to the entire toolchain not to individual components of the toolchain so a translation of flags is required in some cases to adapt the flag to the actual tool. -march is one example, I'm not sure if there are others.
  1. At this point in the code, in order to add a flag and its value to the DAL list, I need to be able to specify the option type (i.e. options::OPT_march_EQ). I therefore need to manually recognize the flag in the string representing the value of -Xopenmp-target or -Xopenmp-target=triple.
  1. This patch handles the passing of the arch and can be extended to pass other flags (as is stands, no other flags are passed through to the CUDA toolchain). This can be extended on a flag by flag basis for flags that need translating to a particular tool's flag. If the flag doesn't need translating then the flag and it's value can be appended to the command line as they are.
gtbercea marked 2 inline comments as done.Jul 5 2017, 9:07 AM
gtbercea added inline comments.Jul 5 2017, 11:33 AM
test/Driver/openmp-offload.c
607

Only the CUDA toolchain currently contains code which considers the value of the -Xopenmp-target flag. The CUDA toolchain is not capable of offloading until the next patch lands so any test for how the flag propagates to the CUDA toolchain will have to wait.

Passing a flag to some other toolchain again doesn't work because the other toolchains have not been instructed to look at this flag so they won't contain the passed flag in their respective command lines.

For a lack of a better test, what I wanted to show is that the usage of this flag doesn't throw an error such as unknown flag and is correctly recognized: "-Xopenmp-target=powerpc64le-ibm-linux-gnu -march=pwr8".

gtbercea updated this revision to Diff 105354.Jul 5 2017, 4:33 PM

Address comments.

gtbercea updated this revision to Diff 105355.Jul 5 2017, 4:34 PM

Address Comments.

gtbercea added inline comments.Jul 5 2017, 4:37 PM
lib/Driver/ToolChains/Cuda.cpp
467

A user cannot hit this now, -Xopenmp-target does not lead to duplicate -march flags in DAL anymore.

hfinkel added inline comments.Jul 6 2017, 4:17 PM
lib/Driver/ToolChains/Cuda.cpp
463
  1. PTXAS and NVLINK each use a different flag for specifying the arch, and, in turn, each flag is different from -march.

I don't understand why this is relevant. Don't NVPTX::Assembler::ConstructJob and NVPTX::Linker::ConstructJob handle that in either case?

This seems to be the same comment to point 2 as well.

  1. At this point in the code, in order to add a flag and its value to the DAL list, I need to be able to specify the option type (i.e. options::OPT_march_EQ). I therefore need to manually recognize the flag in the string representing the value of -Xopenmp-target or -Xopenmp-target=triple.

I don't understand why this is true. Doesn't the code just below this, which handles -Xarch, do the general thing (it calls Opts.ParseOneArg and then adds it to the list of derived arguments)? Can't we handle this like -Xarch?

This patch handles the passing of the arch and can be extended to pass other flags (as is stands, no other flags are passed through to the CUDA toolchain). This can be extended on a flag by flag basis for flags that need translating to a particular tool's flag. If the flag doesn't need translating then the flag and it's value can be appended to the command line as they are.

I don't understand this either. If we need to extend this on a flag-by-flag basis, then it seems fundamentally broken. How could we append a flag to the command line without it also affecting the host compile?

test/Driver/openmp-offload.c
607

Passing a flag to some other toolchain again doesn't work because the other toolchains have not been instructed to look at this flag so they won't contain the passed flag in their respective command lines.

I think, however, that we need to refactor this so that it works for all toolchains. If you convince me otherwise, then this will be fine as well :-)

gtbercea added inline comments.Jul 10 2017, 10:08 AM
lib/Driver/ToolChains/Cuda.cpp
463

@hfinkel

The problem is that when using -Xopenmp-target=<triple> -opt=val the value of this flag is a list of two strings:

['<triple>', '-opt=val']

What needs to happen next is to parse the string containing "-opt=val". The reason I have to do this is because if I use -march, I can't pass -march as is to PTXAS and NVLINK which have different flags for expressing the arch. I need to translate the -march=sm_60 flag. I will have to do this for all flags which require translation. There is no way I can just append this string to the PTXAS and NVLINK commands because the flags for the 2 tools are different. A flag which works for one of them, will not work for the other.

So I need to actually parse that value to check whether it is a "-march" and create an Arg object with the OPT_march_EQ identifier and the sm_60 value. When invoking the commands for PTXAS and NVLINK, the dervied arg list will be travered and every -march=sm_60 option will be transformed into "--gpu-name" "sm_60" for PTXAS and into "-arch" "sm_60" for NVLINK.

In the case of -Xarch, you will see that after we have traversed the entire arg list we still have to special case -march and add it is manually added to the DAL.

Let me know your thoughts on this.

Thanks,

--Doru

hfinkel added inline comments.Jul 10 2017, 12:11 PM
lib/Driver/ToolChains/Cuda.cpp
463

What needs to happen next is to parse the string containing "-opt=val".

Yes, that's what ParseOneArg will do.

The reason I have to do this is because if I use -march, I can't pass -march as is to PTXAS and NVLINK which have different flags for expressing the arch. I need to translate the -march=sm_60 flag. I will have to do this for all flags which require translation. There is no way I can just append this string to the PTXAS and NVLINK commands because the flags for the 2 tools are different. A flag which works for one of them, will not work for the other.

So I need to actually parse that value to check whether it is a "-march" and create an Arg object with the OPT_march_EQ identifier and the sm_60 value. When invoking the commands for PTXAS and NVLINK, the dervied arg list will be travered and every -march=sm_60 option will be transformed into "--gpu-name" "sm_60" for PTXAS and into "-arch" "sm_60" for NVLINK.

We still seem to be talking past each other. Maybe I'm misreading the code, but it looks like TranslateArgs is called (by Compilation::getArgsForToolChain) and the *translated arguments* are what are processed by NVPTX::Assembler::ConstructJob (for ptxas) and void NVPTX::Linker::ConstructJob to construct the command lines for the relevant tools. So, while I understand that those tools take specific arguments, their respective ConstructJob routines are still responsible for doing the tool-specific translation (as they do currently). Thus, I believe you can just add all arguments here and they'll be interpreted by each tool's ConstructJob function later as necessary.

In the case of -Xarch, you will see that after we have traversed the entire arg list we still have to special case -march and add it is manually added to the DAL.

Yes, but not way you seem to imply. The Xarch march handling special case is only doing this:

if (!BoundArch.empty()) {
  DAL->eraseArg(options::OPT_march_EQ);
  DAL->AddJoinedArg(nullptr, Opts.getOption(options::OPT_march_EQ), BoundArch);
}

and that's just overriding the current derived -march if we have BoundArch set (and this special case would apply in addition to the proposed logic for -Xopenmp-target as well).

gtbercea updated this revision to Diff 105932.Jul 10 2017, 4:12 PM

Address comments.

@hfinkel

I think I have something that works which is similar to what you were requesting. Please let me know your thoughts!

Thanks,
--Doru

This is much closer to what I had in mind, thanks! Now I think we're in a position to make this work for more than just the CUDA target. It looks like the added code is now:

  1. Remove -march from the translated arguments (because any existing -march would apply only for the host compilation).
  2. Process the -Xopenmp-target flags and add those arguments to the list.
  3. If we don't have an -march in the translated arguments, then add -march=sm_20 so that there's a suitable default (noting that this default must be higher than the regular CUDA default).

I propose the following:

(1) is good, but should be more general. It is not just the host's -march that should not apply to any arbitrary toolchain, but any of the -m<foo> options. You should remove all options that are in the m_Group options group (which, as noted in Options.td, are "Target-dependent compilation options"). I believe that you can iterate over them all using something like:

for (const Arg *A : Args.filtered(options::OPT_m_Group)) {

and that might help. This should be in toolchain-independent code, and I'd prefer that we always remove these options whenever the host and target toolchain differ, but leave them when they're the same.

(2) is good, but, along with (1), should be in toolchain-independent code. I recommend that we add a new member function to ToolChain, called, to make a specific suggestion, TranslateOpenMPTargetArgs, and put the logic from (1) and (2) in this function. Then, we can augment Compilation::getArgsForToolChain to do something like the following:

const DerivedArgList &
Compilation::getArgsForToolChain(const ToolChain *TC, StringRef BoundArch,
                                 Action::OffloadKind DeviceOffloadKind) {
  if (!TC)
    TC = &DefaultToolChain;

  DerivedArgList *&Entry = TCArgs[{TC, BoundArch, DeviceOffloadKind}];
  if (!Entry) {
    // First, translate OpenMP toolchain arguments provided via the -Xopenmp-toolchain flags.
    Entry = TranslateOpenMPTargetArgs(*TranslatedArgs, BoundArch, DeviceOffloadKind);
    if (!Entry)
      Entry = TranslatedArgs;

    Entry = TC->TranslateArgs(*Entry, BoundArch, DeviceOffloadKind);
    if (!Entry)
      Entry = TranslatedArgs;
  }

  return *Entry;
}

And then (3) we leave as it is (where it is).

lib/Driver/ToolChains/Cuda.cpp
478

This default is only for OpenMP, right? Please explain in the comment why this is the default for OpenMP.

guansong added a project: Restricted Project.Jul 19 2017, 7:29 AM
gtbercea updated this revision to Diff 109903.Aug 5 2017, 4:45 PM
New way to handle OpenMP target flags.
gtbercea updated this revision to Diff 109904.Aug 5 2017, 5:33 PM
Don't exclude flags when host matches offload toolchain.
hfinkel added inline comments.Aug 5 2017, 6:01 PM
lib/Driver/ToolChain.cpp
828

Please include {} around this else-if code, even though it is not necessary, because the other blocks require it.

840

Is this covered by a test case?

847

Is this covered by a test case?

854

Why is -march special in this regard? Shouldn't the consumers just take the last one specified (e.g., use getLastArgValue in the ToolChain code)?

test/Driver/openmp-offload.c
615

Now that this is in common code, why are these arguments still unused?

gtbercea updated this revision to Diff 109938.Aug 6 2017, 2:22 PM

Address comments.

gtbercea added inline comments.Aug 6 2017, 2:27 PM
lib/Driver/ToolChain.cpp
828

Done

840

Done

847

Done

test/Driver/openmp-offload.c
615

Fixed.

gtbercea updated this revision to Diff 109939.EditedAug 6 2017, 2:42 PM

Fix -march special casing.

hfinkel accepted this revision.Aug 6 2017, 3:54 PM

LGTM. Thanks for all of your work on this!

test/Driver/openmp-offload.c
603

Comment should say pwr7, not pwr8, to match the test.

611

Comment should say pwr7, not pwr8, to match the test.

This revision is now accepted and ready to land.Aug 6 2017, 3:54 PM
gtbercea updated this revision to Diff 110007.Aug 7 2017, 8:35 AM

Fix test comments.

gtbercea closed this revision.Aug 7 2017, 8:39 AM
Hahnfeld added inline comments.Aug 14 2017, 12:56 AM
lib/Driver/ToolChain.cpp
851

This is a memory leak that is currently triggered in tests/Driver/openmp-offload-gpu.c and found by ASan. How to fix this? I'm not really familiar with OptTable...

fjricci added a subscriber: fjricci.Sep 8 2017, 2:59 PM
fjricci added inline comments.
lib/Driver/ToolChain.cpp
851

Even with the follow-up patch to fix the memory leak, I'm still seeing this pointer leaked (on Darwin with ASan and detect_leaks=1). I've tried playing around with a few fixes myself, but haven't been able to get anything working.