This is an archive of the discontinued LLVM Phabricator instance.

[OpenMP5.0] map item can be non-contiguous for target update
AbandonedPublic

Authored by cchen on May 14 2020, 3:40 PM.

Details

Reviewers
ABataev
jdoerfert
Summary

In order not to modify the tgt_target_data_update information but still be
able to pass the extra information for non-contiguous map item (offset,
count, and stride for each dimension), this patch overload arg when
the maptype is set as OMP_MAP_DESCRIPTOR. The origin arg is for
passing the pointer information, however, the overloaded arg is an
array of descriptor_dim:

struct descriptor_dim {
  int64_t offset;
  int64_t count;
  int64_t stride
};

and the array size is the same as dimension size. In addition, since we
have count and stride information in descriptor_dim, we can replace/overload the
arg_size parameter by using dimension size.

More details can be found here: https://github.com/chichunchen/openmp-50-design/blob/master/target_update_noncontiguous.pptx

Edit:
The runtime implementation I'm thinking of is to convert the non-contiguous data into several chunks of contiguous.

For example:

int arr[3][3][3];
#pragma omp target update to (arr[1:2][1:2][0:2])

We can visualize the noncontiguous data as below (X is the data we want to transfer, O is the data want don't bother with):

Dim 0 = {Offset: 0, Count: 1, Stride: 4 bytes (int)}
XXO

Dim 1 = {Offset: 1, Count: 2, Stride: 12 bytes (4 * 3 - since Dim 0 has 3 elements)
OOO
XXO
XXO

Dim 2 = {Offset: 1, Count: 2, Stride: 36 bytes (12 * 3 since Dim 1 has 3 elements)
OOO OOO OOO
OOO XXO XXO
OOO XXO XXO

For the visualization, we know that we want to transfer 4 contiguous chunks and the runtime code could be something similar to:

// we expect this loop to transfer 4 contiguous chunks:
// arr[1][1][0:2]
// arr[1][2][0:2]
// arr[2][1][0:2]
// arr[2][2][0:2]
for (int i = Dim[2].offset; i < Dim[2].count; i++) {
  for (int j = Dim[1].offset; j < Dim[1].count; j++) {
    ptr = bast_ptr + Dim[2].stride * i + Dim[1].stride * j + Dim[2].stride * Dim[0].offset;
    size = Dim[0].count * Dim[0].stride;  // we can hoist it I think
    transfer(ptr, size, /*flag or some other stuff...*/);
  }
}

For this design, we can support strides by just adding an extra dimension. For instance:

int arr[5][5][5]
#pragma omp target update to(arr[0:2:2][1:2:1][0:2:2])

Dim 0 = {offset: 0, count: 1, stride: 4 bytes (int) } // the extra dimension for supporting stride
XO

Dim 1 = {offset: 0, count: 3, stride 8 bytes (4 * 2) }
XOXOX

Dim 2 = {offset: 0, count: 2, stride: 40 bytes (8 * 5) }
OOOOO
XOXOX
XOXOX
OOOOO
OOOOO

Dim 3 = {offset: 0, count: 2, stride: 200 bytes (40 * 5) }
...

Diff Detail

Event Timeline

cchen created this revision.May 14 2020, 3:40 PM
Herald added a project: Restricted Project. · View Herald Transcript
cchen updated this revision to Diff 265562.May 21 2020, 11:48 AM

Remove redundant code

How are you going to pass this non-contiguous data in the runtime? Are you going to map it in a loop or convert this non-contiguous data into the contiguous and map it as a contiguous chunk of data? Your presentation provides interface only interface changes but has nothing about implementation in the runtime.

cchen added a comment.May 27 2020, 2:00 PM

How are you going to pass this non-contiguous data in the runtime? Are you going to map it in a loop or convert this non-contiguous data into the contiguous and map it as a contiguous chunk of data? Your presentation provides interface only interface changes but has nothing about implementation in the runtime.

Hi Alexey, thanks for asking. The runtime implementation I'm thinking of is to convert the non-contiguous data into several chunks of contiguous.

For example:

int arr[3][3][3];

#pragma omp target update to (arr[1:2][1:2][0:2])

We can visualize the noncontiguous data as below (X is the data we want to transfer, O is the data want don't bother with):

Dim 0 = {Offset: 0, Count: 1, Stride: 4bytes (int)}
XXO

Dim 1 = {Offset: 1, Count: 2, Stride: 12bytes (4 * 3 - since Dim 0 has 3 elements)
OOO
XXO
XXO

Dim 2 = {Offset: 1, Count: 2, Stride: 36 bytes (12 * 3 since Dim 1 has 3 elements)
OOO
OOO
OOO
\\\\\
OOO
XXO
XXO
\\\\\
OOO
XXO
XXO

For the visualization, we know that we want to transfer 4 contiguous chunks and the runtime code could be something similar to:

// we expect this loop to transfer 4 contiguous chunks:
// arr[1][1][0:2]
// arr[1][2][0:2]
// arr[2][1][0:2]
// arr[2][2][0:2]
for (int i = Dim[2].offset; i < Dim[2].count; i++) {
  for (int j = Dim[1].offset; j < Dim[1].count; j++) {
    ptr = bast_ptr + Dim[2].stride * i + Dim[1].stride * j + Dim[2].stride * Dim[0].offset;
    size = Dim[0].count * Dim[0].stride;  // we can hoist it I think
    transfer(ptr, size, /*flag or some other stuff...*/);
  }
}

Is my guess correct that for OpenMP >= 50 for target update directive we always emit possibly non-continuous runtime calls?

clang/include/clang/AST/OpenMPClause.h
5346

Why do you need this bool flag? Seems to me, it is set to true always if OpenMP >= 50 && Directive == OMPD_target_update. Could check it during the codegen rather than introduce this new extra data here?

cchen marked an inline comment as done.May 27 2020, 2:38 PM
cchen added inline comments.
clang/include/clang/AST/OpenMPClause.h
5346

You're right, I shouldn't add bool here since we only need it in OMPToClause and OMPFromClause. I was adding it since I'm assuming they should have the same type for the inherited TrailingObject.

cchen edited the summary of this revision. (Show Details)May 27 2020, 2:46 PM
cchen added a comment.May 27 2020, 2:48 PM

Is my guess correct that for OpenMP >= 50 for target update directive we always emit possibly non-continuous runtime calls?

My intent is to emit possibly non-contiguous runtime calls only if the analysis in Sema set the IsNonContiguous flag to true.

cchen edited the summary of this revision. (Show Details)May 27 2020, 2:51 PM
ABataev added a comment.EditedMay 27 2020, 3:26 PM

Is my guess correct that for OpenMP >= 50 for target update directive we always emit possibly non-continuous runtime calls?

My intent is to emit possibly non-contiguous runtime calls only if the analysis in Sema set the IsNonContiguous flag to true.

But this analysis only checks for the directive and the version,nothing else.

cchen added a comment.May 27 2020, 3:40 PM

Is my guess correct that for OpenMP >= 50 for target update directive we always emit possibly non-continuous runtime calls?

My intent is to emit possibly non-contiguous runtime calls only if the analysis in Sema set the IsNonContiguous flag to true.

But this analysis only checks for the directive and the version,nothing else.

The context of the checks for the directive and version:

bool NotWhole =
  checkArrayExpressionDoesNotReferToWholeSize(SemaRef, OASE, CurType);
bool NotUnity =
  checkArrayExpressionDoesNotReferToUnitySize(SemaRef, OASE, CurType);

if (AllowWholeSizeArraySection) {
  // Any array section is currently allowed. Allowing a whole size array
  // section implies allowing a unity array section as well.
  //
  // If this array section refers to the whole dimension we can still
  // accept other array sections before this one, except if the base is a
  // pointer. Otherwise, only unitary sections are accepted.
  if (NotWhole || IsPointer)
    AllowWholeSizeArraySection = false;
} else if (DKind == OMPD_target_update &&
           SemaRef.getLangOpts().OpenMP >= 50) {
  IsNonContiguousRef = true;
} else if (AllowUnitySizeArraySection && NotUnity) {
  // A unity or whole array section is not allowed and that is not
  // compatible with the properties of the current array section.
  SemaRef.Diag(
    ELoc, diag::err_array_section_does_not_specify_contiguous_storage)
    << OASE->getSourceRange();
  return false;
}

The original analysis checks for non-contiguous by finding if there is more than one "array-section" expression with length greater than one. Therefore, I added my check there to allow more than one array-section with length greater than one by depending on the existing analysis (and also set IsNonContiguous to true so that we can pass it to codegen rather than doing analysis in codegen). This change allows me to pass all the existing lit test but still emit the "non-contiguous" runtime.

cchen marked an inline comment as done.May 27 2020, 3:48 PM
cchen added inline comments.
clang/lib/Sema/SemaOpenMP.cpp
16763

@ABataev , I guess you're saying the condition should be !AllowWholeSizeArraySection && DKind == OMPD_target_update && SemaRef.getLangOpts().OpenMP >= 50?

Did you think about implementing it in the compiler instead of the runtime?

clang/lib/CodeGen/CGOpenMPRuntime.cpp
7635–7638

Do you really need to count DimSize for array shaping operators and array subscript expressions? I don't see tests for it.

clang/lib/Sema/SemaOpenMP.cpp
16763

No, what I want is to try to simplify the code. I see now why do you need this flag. I'm just thinking can we avoid adding this flag to the clause and save some mem space?

cchen updated this revision to Diff 266921.May 28 2020, 9:47 AM

Fix based on feedback

cchen marked 2 inline comments as done.May 28 2020, 9:54 AM
cchen added inline comments.
clang/lib/CodeGen/CGOpenMPRuntime.cpp
7635–7638

You're right, I don't need to count DimSize for array shaping and array subscript.

clang/lib/Sema/SemaOpenMP.cpp
16763

But we also don't want to do the analysis in codegen I guess? Also, if we emit non-contiguous runtime for every target update call, we need to change tons of stuff (tons of lit tests, runtime implementation, etc...).

ABataev added inline comments.May 28 2020, 10:09 AM
clang/lib/Sema/SemaOpenMP.cpp
16763

Maybe make it a part of MappableComponent, if possible, and put it into PointerIntPair<Expr *, 1, bool> AssociatedExpression;?

cchen updated this revision to Diff 267725.Jun 1 2020, 1:53 PM
  • Use PointerIntPair to pass non-contiguous information in AST
  • Error out in Sema if we don't have enough size information for cases involving pointers
  • Allows *arr[N][M] since we don't need size information for the last dimension
  • Add more test cases

Still: Did you think about implementing it in the compiler instead of the runtime?

clang/include/clang/AST/OpenMPClause.h
4752–4753

I would suggest to pass Expr * and bool as separate parameters here rather than as PointerIntPair

4766

isNonContiguous()

clang/lib/Sema/SemaOpenMP.cpp
16584

Add default initializer

16765–16769

Remove braces here, they are not needed.

clang/lib/Serialization/ASTWriter.cpp
6600

There is a member function writeBool

6625

Same, use writeBool

cchen added a comment.Jun 2 2020, 11:57 AM

Still: Did you think about implementing it in the compiler instead of the runtime?

I'm not sure I understand your question, which part of code are you asking?
The main work compiler needs to do is to send the {offset, count, stride} struct to runtime.

Still: Did you think about implementing it in the compiler instead of the runtime?

I'm not sure I understand your question, which part of code are you asking?
The main work compiler needs to do is to send the {offset, count, stride} struct to runtime.

I mean did you think about calling __tgt_target_data_update function in a loop in the compiler-generated code instead of putting it into the runtime?

cchen added a comment.Jun 2 2020, 12:12 PM

Still: Did you think about implementing it in the compiler instead of the runtime?

I'm not sure I understand your question, which part of code are you asking?
The main work compiler needs to do is to send the {offset, count, stride} struct to runtime.

I mean did you think about calling __tgt_target_data_update function in a loop in the compiler-generated code instead of putting it into the runtime?

Oh, I would prefer to call tgt_target_data_update once in the compiler and I'm also doing it now.

Still: Did you think about implementing it in the compiler instead of the runtime?

I'm not sure I understand your question, which part of code are you asking?
The main work compiler needs to do is to send the {offset, count, stride} struct to runtime.

I mean did you think about calling __tgt_target_data_update function in a loop in the compiler-generated code instead of putting it into the runtime?

Oh, I would prefer to call tgt_target_data_update once in the compiler and I'm also doing it now.

I was not quite correct. What I mean, is to generate the array with the array section as VLA in the compiler, and fill it in the loop generated by the compiler for non-contiguous sections but not in the runtime?
Say, we have the code:

int arr[3][3]
...
 #pragma omp update to(arr[1:2][1:2]

In this case, we're going to transfer the next elements:

000
0xx
0xx

In the compiler-generated code we emit something like this:

void *bptr[<n>];
void *ptr[<n>];
int64 sizes[<n>];
int64 maptypes[<n>];
for (int i = 0; i < <n>; ++i) {
  bptr[i] = &arr[1+i][1];
  ptr[i] = &arr[1+i][1];
  sizes[i] = ...;'
  maptypes[i] = ...;
}
call void @__tgt_target_data_update(i64 -1, i32 <n>, bptr, ptr, sizes, maptypes);

With this solution, you won't need to modify the runtime and add a new mapping flag.

cchen updated this revision to Diff 268009.Jun 2 2020, 3:35 PM

Fix based on feedback

cchen added a comment.Jun 2 2020, 4:24 PM

Still: Did you think about implementing it in the compiler instead of the runtime?

I'm not sure I understand your question, which part of code are you asking?
The main work compiler needs to do is to send the {offset, count, stride} struct to runtime.

I mean did you think about calling __tgt_target_data_update function in a loop in the compiler-generated code instead of putting it into the runtime?

Oh, I would prefer to call tgt_target_data_update once in the compiler and I'm also doing it now.

I was not quite correct. What I mean, is to generate the array with the array section as VLA in the compiler, and fill it in the loop generated by the compiler for non-contiguous sections but not in the runtime?
Say, we have the code:

int arr[3][3]
...
 #pragma omp update to(arr[1:2][1:2]

In this case, we're going to transfer the next elements:

000
0xx
0xx

In the compiler-generated code we emit something like this:

void *bptr[<n>];
void *ptr[<n>];
int64 sizes[<n>];
int64 maptypes[<n>];
for (int i = 0; i < <n>; ++i) {
  bptr[i] = &arr[1+i][1];
  ptr[i] = &arr[1+i][1];
  sizes[i] = ...;'
  maptypes[i] = ...;
}
call void @__tgt_target_data_update(i64 -1, i32 <n>, bptr, ptr, sizes, maptypes);

With this solution, you won't need to modify the runtime and add a new mapping flag.

For my current implementation, we have discussed in the bi-weekly meeting several weeks back, and there was a general consensus that it was an acceptable approach.

The major advantage of sending a descriptor to runtime can be elaborated in the following example:

#define N 10000
int a[N][2];
…
#pragma amp target update to (a[0:N][0:1])

This would require passing through O(N) entries in the tgt_target_data_update call, or 10000 entries. The current implementation only require a descriptor with 2 entries. I think this could be a real concern -
splitting out the transfers in compiler-generated code results in a list containing one entry per non-contiguous chunk (easily hitting scaling issues), while the descriptor approach is bounded by the number of dimensions.
That seems like a pretty compelling reason to use the descriptor - it’s much more space efficient.

Also, the descriptor idea is very similar to how Cray supported Fortran dope vectors for years (we send in a pointer to a dope vector rather than a pointer to the data, and a flag to indicate it’s a dope vector, and the runtime library handles it as a dope vector).
I think the runtime library changes will not be very extensive or difficult at all and we’re very willing to implement the runtime for non-contiguous.

Do you have a test for mapping of something like arr[0][:n], where the base is an array subscript and the remaining part is an array section?

clang/include/clang/AST/OpenMPClause.h
4756–4757

I think you can initialize AssociatedExpressionNonContiguousPr using just AssociatedExpressionNonContiguousPr(AssociatedExpression, IsNonContiguous) form, no?

clang/lib/CodeGen/CGOpenMPRuntime.cpp
7122–7123

Restore original formatting

7613–7614

Better to convert it to !IsNonContiguous && isFinalArraySectionExpression(I->getAssociatedExpression()).

7633

Use prefix form ++DimSize.

7690

No need for parameter name comment here, it is required only if the true|false constants are used

7747

Same, comment not required

7940–7954

Can we merge the functionality in this new function with the existing ones somehow? It is not the best idea to duplicate functionality using copy-paste if any.

8665–8667

Why removed the comment?

9061

Same question as before - can we merge this functionality with the existing functions?

clang/lib/Sema/SemaOpenMP.cpp
18611–18614

Use .emplace_back(SimpleRefExpr, D, false);

clang/lib/Serialization/ASTReader.cpp
12530–12532

.emplace_back(AssociatedExprPr, AssociatedDecl, /*IsNonContiguous=*/false);

12648–12649

Same, use emplace_back()

12698–12699

Same, use emplace_back()

12747–12748

Same, use emplace_back()

12833–12834

Same, use emplace_back()

clang/test/OpenMP/target_update_to_messages.cpp
147

Delete this extra line

cchen updated this revision to Diff 269663.Jun 9 2020, 2:07 PM

Fix based on feedback

cchen added a comment.Jun 9 2020, 2:12 PM

Do you have a test for mapping of something like arr[0][:n], where the base is an array subscript and the remaining part is an array section?

I'm not having it right now, but it seems like if the base is an array subscript and the remaining part is an array section, then this map-item will always be contiguous, and will not trigger my code in Codegen. I can still add a test for Sema though.

cchen marked 26 inline comments as done.Jun 9 2020, 2:17 PM
ABataev added inline comments.Jun 10 2020, 8:51 AM
clang/lib/CodeGen/CGOpenMPRuntime.cpp
7046–7049

I'm not sure about the value of this flag. If I do recall it correctly, this value might be used for something different by XL compiler, for example. Maybe use some other value, maybe high bits? It is a kind of service flag, not data mapping attribute, so better to move it to high bits (the bit before OMP_MAP_MEMBER_OF maybe?).

7817–7818

Use range-based loop, if possible.

7820–7822

Do you really need to analyze array subscript expressions here? I though that we should analyze only array sections, no?

7879

Same, try to use range-based loop, if possible.

7884

Same question about array subscript expressions.

8302–8310

Just `generateInfoForComponentList(

L.MapType, L.MapModifiers, L.Components, CurBasePointers,
CurPointers, CurSizes, CurTypes, CurDims, PartialStruct,
IsFirstComponentList, L.IsImplicit,
/*OverlappedElements=*/llvm::None,
L.Components.back().isNonContiguous(), &CurOffsets, &CurCounts, &CurStrides);`
8941–8944

Can we encapsulate these new data into CGOpenMPRuntime::TargetDataInfo?

cchen updated this revision to Diff 269961.Jun 10 2020, 1:32 PM

Fix based on feedback

ABataev added inline comments.Jun 11 2020, 12:12 PM
clang/lib/CodeGen/CGOpenMPRuntime.cpp
7340–7343

I would prefer to pack these 4 params into a single parameter (a struct). Also, can we put Dims parameter into the list of the optional parameters?

7816

Expand auto here to a real type

7832

What if the base is a pointer, not an array?

7842–7849

The code for SizeV must be under the control of the next if:

if (DimSizes.size() < Components.size() - 1) {
 ....
}
7845

Create directly as of CGF.Int64Ty type.

7870

Expand auto here to a real type

ABataev added inline comments.Jun 11 2020, 12:12 PM
clang/lib/CodeGen/CGOpenMPRuntime.cpp
7818

Can we have anything else except for array section here? If not, use just cast. If yes, use continue to simplify complexity:

if (!OASE)
  continue;
...
7872–7875

Can we have anything else except for array section here? If not, use just cast. If yes, use continue to simplify complexity:

if (!OASE)
  continue;
...
7883–7884

Do you really to pass real offsets here? Can we use pointers instead?

cchen marked 4 inline comments as done.Jun 12 2020, 3:48 PM
cchen added inline comments.
clang/lib/CodeGen/CGOpenMPRuntime.cpp
7832

The if (ElementType) condition only push back stride when base is not pointer. I'm now allowing one dimension size to be unknown (pointer as base) and sema has analysis to check if more than one indirection as base. My last codegen test case is for testing pointer as base.

7842–7849

I don't think I understand this one. Why do you remove SizeV in the if condition?

7845

Doing this I'll get assertion error in this exact line if on a 32-bits target.

7883–7884

Do you mean I should set the type of Offset to Expr*?

cchen updated this revision to Diff 270536.Jun 12 2020, 3:54 PM

Fix based on feedback

ABataev added inline comments.Jun 15 2020, 11:53 AM
clang/lib/CodeGen/CGOpenMPRuntime.cpp
7340

Can we encapsulate Dims into StructNonContiguousInfo?

7842–7843

If only CAT or VAT is allowed, then transform this if into:

else {
  assert(VAT&& ...);
7842–7849

This is for SizeV. You don't use it if DimSizes.size() < Components.size() - 1 is false, looks like memory leak.

7845

Hmm, why, can you investigate?

7880–7883

Can we have anything else except for array section here? If not, use just cast. If yes, use continue to simplify complexity:

if (!OASE)
  continue;
...
7883–7884

Currently, you're passing offsets to the runtime. Can we pass pointers instead? I mean, for a[b] you pass b to the runtime, can we pass &a[b] instead?

7922

Avoid expressions with some side effects, like *DI++

cchen updated this revision to Diff 270880.Jun 15 2020, 3:07 PM

Resolve issues

cchen marked 3 inline comments as done.Jun 15 2020, 3:12 PM
cchen added inline comments.
clang/lib/CodeGen/CGOpenMPRuntime.cpp
7845

My comment was not accurate, I've updated it. What I want to convey is that we can only have CAT, VAT, or pointer here, since analysis in Sema has a restriction for it. (SemaOpenMP line 16623)

7880–7883

Not sure about this one, I've added:

if (!OASE)
  continue;
...
7883–7884

Yes, I'm fine either passing index or passing address, though I'm curious why you're recommending passing address.

ABataev added inline comments.Jun 16 2020, 5:53 AM
clang/lib/CodeGen/CGOpenMPRuntime.cpp
7845

It does not relate to the comments thread but I got it. Anyway, try to investigate why the compiler crashes if you try to cr4eate a constant ща ]СПАюШте64Ен] directly.

7883–7884

It is going to simplify the codegen. Currently, to get the offset, you need to dig through all the elements of the array section. If, instead, you use the pointers, you would not need to do this and you can rely on something like CGF.EmitArraySectionLValue(). At least, I hope so.

7914

The check is not required, you already checked that the expression must be array section only.

cchen marked 2 inline comments as done.Jun 16 2020, 9:34 AM
cchen added inline comments.
clang/lib/CodeGen/CGOpenMPRuntime.cpp
7845

I'll investigate it, thanks.

7883–7884

After discussed with my colleagues, I think passing relative offset makes more sense.

For a 1-dim array, storing the offset as a pointer could work, but it seems strange to me to store as a pointer when there are 2+ dimensions with multiple disjoint chunks of memory because the pointer can only point to the offset for the first chunk. That is, a pointer would refer to an absolute location in a single chunk, whereas the offset is relative to the start of any chunk.

For example:

int a[4][4];
#pragma omp target update to(a[1:2][1:2])

This is two disjoint chunks of memory:

XXXX
XOOX
XOOX
XXXX

The offset for the outer dimension could be store as a pointer, since there is only one instance of that dimension:

Dim1: Offset=&a[1]

But, the inner dimension is "instantiated" twice, once for each element in the outer dimension. So, there are really two absolute pointers, depending on which instance (element in the outer dimension) you're talking about:

Dim2: Offset=&a[1][1]
Dim2: Offset=&a[2][1]

We could set the policy that the absolute offset would always be expressed as the offset in the first instance, but then wouldn't we need to refer to that location when computing the offset for all of the other instances? That seems unintuitive to me, and potentially complicates the implementation. The relative offset makes a lot more senes to me - for a starting point, what relative offset is needed for each dimension. The starting point for the outermost dimension does require the base address, but all inner dimensions have a variable starting pointer based on which element in the outer dimensions you're currently looking at.

cchen updated this revision to Diff 271183.Jun 16 2020, 1:33 PM

Fix Int64Ty issue (The bitNum of APInt I used before is 32)

ABataev added inline comments.Jun 16 2020, 2:35 PM
clang/lib/CodeGen/CGOpenMPRuntime.cpp
7931

Use preincrement

8308

No need to add /*OverlappedElements=*/llvm::None here, it is default value.

9051–9053

Better just to have something like this:

if (!IsNonContiguous || Info.Offsets.empty() || Info.NumberOfPtrs == 0)
  return;
...
cchen updated this revision to Diff 271221.Jun 16 2020, 3:06 PM

Fix based on feedback

cchen marked 28 inline comments as done.Jun 17 2020, 12:50 PM

How do you plan to support
#pragma omp target update to (arr[1:2][1:2][0:2], x, b[1:5][0:2])
Are you going to split this into 3 updates since your are using the arg fields.

cchen added a comment.Jun 22 2020, 3:25 PM

How do you plan to support
#pragma omp target update to (arr[1:2][1:2][0:2], x, b[1:5][0:2])
Are you going to split this into 3 updates since your are using the arg fields.

There's only one runtime call for your case. and args will be { descriptor_1, x, descriptor_2 }, where descriptor_1 will be { { 1, 2, 80 }, { 1, 2, 20 }, { 0, 2, 4 } }, descriptor_2 will be { { 1, 5, 16 }, { 0, 2, 4 } }. There's analysis in Sema that detecting if the item is non-contiguous or not and codegen only generate descriptor for non-contiguous item.

cchen updated this revision to Diff 272564.Jun 22 2020, 4:06 PM

Updated test for clarification

cchen added a comment.Jun 22 2020, 4:09 PM

How do you plan to support
#pragma omp target update to (arr[1:2][1:2][0:2], x, b[1:5][0:2])
Are you going to split this into 3 updates since your are using the arg fields.

I have added a test basically base on the case in your comment (CK19 in target_update_codegen.cpp). Thanks.

ABataev added inline comments.Jun 25 2020, 10:08 AM
clang/lib/CodeGen/CGOpenMPRuntime.cpp
7853–7855

No need for braces here

9108

C.getTypeAlignInChars(C.VoidPtrTy)->CGF.getPointerAlign()

10420–10423

Better just to pass Info.Offsets, Info.Counts and Info.Strides as arguments to generateAllInfo() function and do not create local copies at all.

10660–10663

Same, pass the fields as arguments instead.

clang/lib/Serialization/ASTReader.cpp
12531–12532

Still calling an extra constructor here, just .emplace_back(AssociatedExprPr, AssociatedDecl, /*IsNonContiguous=*/false);

12649–12650

Just Components.emplace_back(AssociatedExprPr, AssociatedDecl, IsNonContiguous);

12699

.emplace_back(AssociatedExprPr, AssociatedDecl, IsNonContiguous);

12748–12749

.emplace_back(AssociatedExprPr, AssociatedDecl, /*IsNonContiguous=*/false);

12790–12792

.emplace_back(AssociatedExpr, AssociatedDecl, /*IsNonContiguous*/ false);

12833–12835

.emplace_back(AssociatedExpr, AssociatedDecl, /*IsNonContiguous=*/false));

cchen updated this revision to Diff 273484.Jun 25 2020, 12:36 PM

Fix coding style

ABataev added inline comments.Jun 25 2020, 12:47 PM
clang/lib/CodeGen/CGOpenMPRuntime.cpp
8934–8935

Do you really need to pass Dims here if you have Dims data member in Info parameter? Why you can't use Info.Dims instead?

9054–9082

Maybe worth it to outline it into a separate function to reduce code size and the complexity of this function? And just call this new function here.

clang/lib/Sema/SemaOpenMP.cpp
16766

Better to use integer value as selectors, not boolean.

18664–18666

.emplace_back(SimpleRefExpr, D, /*IsNonContiguous=*/false);

18731–18732

Add a comment for false argument with the name of parameter.

cchen marked an inline comment as done.Jun 25 2020, 1:14 PM
cchen added inline comments.
clang/lib/CodeGen/CGOpenMPRuntime.cpp
8934–8935

I think I haven't added Dims in TargetDataInfo atm, I'll add into it and then use it via Info.

cchen marked an inline comment as done.Jun 25 2020, 1:28 PM
cchen added inline comments.
clang/lib/Sema/SemaOpenMP.cpp
16766

The selector for err_omp_section_length_undefined is a bool value. (true for unknown bound false for not a array type, so always be true here).
Do you mean that I need to create a new kind of diagnosis message here and use integer as selectors?

ABataev added inline comments.Jun 25 2020, 1:41 PM
clang/lib/Sema/SemaOpenMP.cpp
16766

No, it is an integer, starts from 0

cchen updated this revision to Diff 273517.Jun 25 2020, 2:24 PM

Fix based on feedback

ABataev added inline comments.Jun 25 2020, 2:32 PM
clang/lib/CodeGen/CGOpenMPRuntime.cpp
10418–10419

Better to pass Info here directly.

cchen updated this revision to Diff 273529.Jun 25 2020, 2:58 PM

Pass Info directly

ABataev added inline comments.Jun 26 2020, 6:04 AM
clang/lib/Sema/SemaOpenMP.cpp
16807

/*IsNonContiguous=*/false

16827

/*IsNonContiguous=*/false

18612–18614

.emplace_back(SimpleRefExpr, D, /*IsNonContiguous=*/false);

cchen updated this revision to Diff 273747.Jun 26 2020, 8:31 AM

Fix based on feedback

cchen updated this revision to Diff 273778.Jun 26 2020, 10:31 AM

Rebase and resolve conflictions

cchen marked 21 inline comments as done.Jun 29 2020, 1:27 PM
cchen marked an inline comment as done.Jun 30 2020, 9:52 AM
cchen added inline comments.
clang/lib/CodeGen/CGOpenMPRuntime.cpp
7046–7049

Hi @ABataev, is there any place I can find which value has been used for lower bits (like 0x800, 0x1000)?

ABataev added inline comments.
clang/lib/CodeGen/CGOpenMPRuntime.cpp
7046–7049

I rather doubt. You can try to ask @kkwli0

kkwli0 added inline comments.Jun 30 2020, 12:49 PM
clang/lib/CodeGen/CGOpenMPRuntime.cpp
7046–7049

We are using 0x800. I think your current choice should be fine.

cchen added a comment.Jun 30 2020, 3:24 PM

@ABataev , I'm considering emitting an extra dimension for a non-contiguous descriptor to support stride in this patch (stride = 1 in array section is just a special case for computing stride, however, the formula computing stride do not change). Do you think I should do it in this patch?

Computing of stride after support stride in array section:

int arr[5][5][5];
#pragma omp target update to(arr[0:2:2][1:2:1][0:2:2]

D0: { offset = 0, count = 1, stride = 4 }                                           // offset, count, dimension size always be 0, 1, 1 for this extra dimension, stride is the unit size
D1: { offset = 0, count = 2, stride = 4 * 1 * 2 = 8 }                        // stride = unit size * (production of dimension size of D0) * D1.stride = 4 * 1 * 2 = 8
D2: { offset = 0, count = 1, stride = 4 * (1 * 5) * 1 = 20  }             // stride = unit size * (production of dimension size of D0, D1) * D2.stride = 4 * 5 * 1 = 20
D3: { offset = 0, count = 2, stride = 4 * (1 * 5 * 5) * 2 = 200 }      // stride = unit size * (production of dimension size of D0, D1, D2) * D3.stride = 4 * 25 * 2 = 200

For the case in this patch (stride = 1), we can use the same formula for computing stride with extra dimension:

int arr[5][5][5];
#pragma omp target update to(arr[0:2][1:2][0:2]

D0: { offset = 0, count = 1, stride = 4 }                                          // offset, count, dimension size always be 0, 1, 1 for this extra dimension, stride is the unit size
D1: { offset = 0, count = 2, stride = 4 * 1 * 1 = 4 }                        // stride = unit size * (production of dimension size of D0) * D1.stride = 4 * 1 * 1 = 4
D2: { offset = 0, count = 1, stride = 4 * (1 * 5) * 1 = 20  }            // stride = unit size * (production of dimension size of D0, D1) * D2.stride = 4 * 5 * 1 = 20
D3: { offset = 0, count = 2, stride = 4 * (1 * 5 * 5) * 1 = 100 }     // stride = unit size * (production of dimension size of D0, D1, D2) * D3.stride = 4 * 25 * 1 = 100

The extra dimension does not affect the runtime implementation at all since runtime will try to merge inner dimensions if they are contiguous. Take the above case for example (arr[0:2][1:2][0:2]):
The product of count and stride for D0 is 4 which is the same as the stride of D1, therefore, runtime just ignores D0.

@ABataev , I'm considering emitting an extra dimension for a non-contiguous descriptor to support stride in this patch (stride = 1 in array section is just a special case for computing stride, however, the formula computing stride do not change). Do you think I should do it in this patch?

Computing of stride after support stride in array section:

int arr[5][5][5];
#pragma omp target update to(arr[0:2:2][1:2:1][0:2:2]

D0: { offset = 0, count = 1, stride = 4 }                                           // offset, count, dimension size always be 0, 1, 1 for this extra dimension, stride is the unit size
D1: { offset = 0, count = 2, stride = 4 * 1 * 2 = 8 }                        // stride = unit size * (production of dimension size of D0) * D1.stride = 4 * 1 * 2 = 8
D2: { offset = 0, count = 1, stride = 4 * (1 * 5) * 1 = 20  }             // stride = unit size * (production of dimension size of D0, D1) * D2.stride = 4 * 5 * 1 = 20
D3: { offset = 0, count = 2, stride = 4 * (1 * 5 * 5) * 2 = 200 }      // stride = unit size * (production of dimension size of D0, D1, D2) * D3.stride = 4 * 25 * 2 = 200

For the case in this patch (stride = 1), we can use the same formula for computing stride with extra dimension:

int arr[5][5][5];
#pragma omp target update to(arr[0:2][1:2][0:2]

D0: { offset = 0, count = 1, stride = 4 }                                          // offset, count, dimension size always be 0, 1, 1 for this extra dimension, stride is the unit size
D1: { offset = 0, count = 2, stride = 4 * 1 * 1 = 4 }                        // stride = unit size * (production of dimension size of D0) * D1.stride = 4 * 1 * 1 = 4
D2: { offset = 0, count = 1, stride = 4 * (1 * 5) * 1 = 20  }            // stride = unit size * (production of dimension size of D0, D1) * D2.stride = 4 * 5 * 1 = 20
D3: { offset = 0, count = 2, stride = 4 * (1 * 5 * 5) * 1 = 100 }     // stride = unit size * (production of dimension size of D0, D1, D2) * D3.stride = 4 * 25 * 1 = 100

The extra dimension does not affect the runtime implementation at all since runtime will try to merge inner dimensions if they are contiguous. Take the above case for example (arr[0:2][1:2][0:2]):
The product of count and stride for D0 is 4 which is the same as the stride of D1, therefore, runtime just ignores D0.

You can do this patch. But at first, you need to commit the runtime part of the patch that supports it, and the part that introduces stride support.

cchen abandoned this revision.Jul 20 2020, 12:43 PM

Created a new patch with the support for stride: https://reviews.llvm.org/D84192.