Replies: 8 comments 16 replies
-
In my view, I don't like PE (processing element) for warps or blocks, as I feel it should refer to the smallest hardware unit that can execute a work-item: SIMD lanes, CUDA cores, etc. In DPC++,

Nit: a SIMD lane should be analogous to a CUDA core rather than a CUDA thread, IMO. But yes, I agree it is reasonable to call the unit of work done in one SIMD lane one work-item.
-
Okay, I guess we have come to some kind of consensus, and I have updated to the following now:

Let me know if this is reasonable, or if you have strong opinions on anything above.
-
Another thing we should probably discuss is the types these properties return. Most CUDA/HIP properties are returned as
Should I change all DPCPP properties to return
-
Sorry for jumping into this discussion very late.
-
The latest naming looks like this:
-
For the last row in the above table, i.e., I see that
-
So I was just going over how the

I think I made a mistake in my assumptions about Nvidia while recommending the term. Let's assume Pascal and above: it would seem that for FP32 and lower precisions, more than one warp can actually execute concurrently on one SM. In the case of FP32, say, exactly two warps would execute concurrently, given enough registers etc. Is that true? If that is the case, the term
-
Context: In the HWLOC PR #554, an `exec_info` struct is being created to abstract the device properties of the different concrete executors. This discussion is to finalize the naming of some of the member variables of this struct.

First, the standard naming and their equivalents as I understand them:

Now our member variables:
| Current | Old |
| --- | --- |
| `num_compute_units` | `num_cores`, `num_multiprocessors` (`num_sms`), `num_compute_units` |
| `num_pe_per_cu` | `num_threads_per_core`, `num_warps_per_sm`, `max_group_size` |
| `subgroup_size` | `warp_size`, `subgroup_size` |
| `max_work_item_sizes` | `max_threads_per_block`, `max_work_items` |
The concrete executors will have getters which return their own specifically named variables. For example, the CUDA/HIP executor will have a getter for `warp_size` that internally just calls `subgroup_size`.

Current inconsistencies:
- `num_warps_per_sm` is `num_pe_per_cu`, but a `pe` does not exactly correspond to a `warp`, while one `sm` is a `cu`.
- A `subgroup` is a warp, and accordingly `subgroup_size` is `warp_size`. This again conflicts with the point above.

Now the points of discussion:
I guess everyone agrees with:

Open questions:

- `num_warps_per_sm` and `max_work_group_size`?

ping @ginkgo-project/reviewers