Naming for device property abstractions in the Executor base class #694

pratikvn · 2021-01-25T18:11:44Z

pratikvn
Jan 25, 2021
Maintainer

Context: In the HWLOC PR #554, an exec_info struct is being created to abstract the device properties of different concrete executors. This discussion is to finalize on the naming of some of the member variables of this struct.

First the standard naming and their equivalent as I understand it:

CPU	CUDA/HIP	DPCPP	Our Abstraction
Core	Streaming Multiprocessor (SM)	compute_unit	compute_unit
thread	warps	sub_group?	subgroup
-	block	work_group	(processing_unit/processing_element/SMT ?)
SIMD lane ?	thread	work_item	work_item

Now our member variables:

CPU	CUDA/HIP	DPCPP	Our Abstraction (~~Current~~ Old)
`num_cores`	`num_multiprocessors` (num_sms)	`num_compute_units`	`num_compute_units`
`num_threads_per_core`	`num_warps_per_sm`	`max_group_size`	`num_pe_per_cu`
-	`warp_size`	`subgroup_size`	`subgroup_size`
-	`max_threads_per_block`	`max_work_items`	`max_work_item_sizes`

The concrete executors will have getters which return their own specific named variables. For example, CUDA/HIP executor will have a getter for warp_size, that internally just calls subgroup_size.

Current inconsistencies:

num_warps_per_sm is num_pe_per_cu, but pe does not exactly correspond to a warp while one sm is a cu.
subgroup is a warp, and accordingly subgroup_size is warp_size. This again conflicts with the point above.

Now the points of discussion:

I guess everyone agrees with:

that compute_units --> cores is reasonable.
Also that work_items --> threads is reasonable and hence max_work_items --> max_threads_per_block.

Open questions:

Is pe/pu suitable for warps or blocks ?
What is equivalent to num_warps_per_sm and max_work_group_size ?
What is a warp in DPCPP ?

ping @ginkgo-project/reviewers

Slaedr · 2021-01-25T19:26:08Z

Slaedr
Jan 25, 2021

In my view, num_warps_per_sm == num_hyperthreads_per_cpucore == num_smt_per_cu (in the order Cuda, CPU, abstraction).
hwloc's PU is usually one hyperthread context that resides in a CPU core. So it could make sense to also let PU refer to one warp that resides on a SM (even though there is a small difference). So, if you prefer, num_smt_per_cu can be replaced by num_pu_per_cu, which I think is what @tcojean suggested. I like num_smt_per_cu as it makes the meaning clear. 'Processing Unit' is somewhat vague.

I don't like PE (processing element) for warps or blocks, as I feel it should refer to the smallest hardware unit that can execute a work-item. So SIMD lanes, CUDA cores etc.

In DPC++, sycl::info::device::max_work_group_size is the max number of work-items per work-group, not the number of subgroups per work group. (https://www.khronos.org/registry/SYCL/specs/sycl-2020-provisional.pdf#subsubsection.4.6.4.2) So max_group_size does not really fit next to num_warps_per_sm.
Perhaps the warp_size is equivalent to sycl::info::device::native_vector_width_(pod_type). We could add a templated function that converts native_vector_width_(pod_type) to native_vector_width<pod_type>. But still, this complicates things in comparison to CUDA where the warp size is always the same. Perhaps we could just leave warp size undefined for DPC++, at least for now. Sometime in the future, we could template warp_size on the numeric type for all backends.
subgroup_size in SyCL, I guess, is more akin to a possible size of cooperative groups rather than the warp size.

Nit: SIMD lane should be analogous to a CUDA core rather than a CUDA thread IMO. But yes, I agree it's reasonable to call the unit of work done in one SIMD lane as one work item.

0 replies

tcojean · 2021-01-25T20:08:46Z

tcojean
Jan 25, 2021
Maintainer

I prefer processing unit for the full SIMD vector or a full warp, leaving PE for CUDA cores. In this fashion, subgroup size could as well be pu_width, num_pe_per_pu, or anything of that fashion,
I don't think there is currently an equivalent to num_warps_per_sm in DPC++. At least, it's not available through the API but it must exist, this is a hardware design feature and all hardware have this same concept (num SMT/SIMD per core on CPUs, similar with wavefronts on AMD). What gets close are the num_partitions thing, but it's not available on GPUs AFAIK. max_workgroup_size (badly named I think) in DPC++ is equivalent to the CUDA max_threads_per_block logical parallelism expression. max_work_item_sizes is actually maxThreadsDim in CUDA I think. Check the SYCL 2020 provisional device info table posted in Aditya's post and compare to the CUDA device properties, that is quite clear I think.
num_warps in DPC++ would be subgroup_size to me, although it's a vector in DPC++. Indeed at some point we'll need to use the function native_vector_width for different types to know which subgroup size to use, but that's not relevant here, I think.

2 replies

tcojean Jan 25, 2021
Maintainer

Also, a work-item is an expression of logical parallelism which then gets mapped into a corresponding physical unit, i.e. a CUDA core, or a SIMD lane on CPU, or a processing element (I believe) in DPC++/SYCL, again according to the provisional spec.

Slaedr Jan 26, 2021

For your point 3, do you perhaps mean that warp_size (rather than num_warps) would be subgroup_size in DPC++? Because that would make sense:

The SYCL sub-group (sycl::sub_group class) is a representation of a collection of related work-items within a work-group that execute concurrently, and which may make independent forward progress with respect to other sub-groups in the same work-group.

(from the SyCL 2020 glossary)

pratikvn · 2021-01-26T00:55:31Z

pratikvn
Jan 26, 2021
Maintainer Author

Okay, I guess then we have come to some kind of a consensus and I have updated to the following now:

CPU	CUDA/HIP	DPCPP	Our Abstraction (Current)
`num_cores`	`num_multiprocessors` (num_sms)	`num_compute_units`	`num_compute_units`
`num_threads_per_core`	`num_warps_per_sm`	--	`num_pu_per_cu`
--	--	`subgroup_sizes` (array)	`subgroup_sizes` (array)
--	`warp_size` (scalar)	`max_sub_group_size` (scalar)	`max_subgroup_size` (scalar)
--	`max_threads_per_block`	`max_work_group_size`	`max_workgroup_size`
--	`max_threads_per_block_dimension` (x,y,z) (array)	`max_work_item_sizes` (array)	`max_workitem_sizes` (array)

Let me know if this is reasonable, or if you have strong opinions on something above.

5 replies

yhmtsai Jan 26, 2021
Collaborator

note. Hip uses wavefront as cuda's warp
warp_size should be the max_subgroup_size in DPCPP.
subgroup_size should be equal to subwarp_size in our cuda usage.

I take the number of threads in EU, the name is confusing, (at least gen9) as num_warps_per_sm usage.
https://software.intel.com/sites/default/files/managed/c5/9a/The-Compute-Architecture-of-Intel-Processor-Graphics-Gen9-v1d0.pdf
the section 5.3.5
As far as I understand, each eu has 7 threads and each threads contains SIMD-32. Thus, they can run 7 warp(subgroup = 32) in eu in the same time.

pratikvn Jan 26, 2021
Maintainer Author

note. Hip uses wavefront as cuda's warp

Yes that is true. But in Ginkgo we never mention wavefront. We always use warp for HIP as well. For example the functions are called get_num_warps and get_warp_size in HIP as well.

I updated the table above to show which is currently an array and which is a scalar value.

warp_size should be the max_subgroup_size in DPCPP.
subgroup_size should be equal to subwarp_size in our cuda usage.

That is true. I will update to add another variable called max_subgroup_size and set that to warp_size for CUDA/HIP. I will set the subgroup_sizes as invalid for CUDA/HIP.

I take the number of threads in EU, the name is confusing, (at least gen9) as num_warps_per_sm usage.

Can you confirm this for other DPCPP devices ? Is there a related DPCPP variable that stores num_threads_per_eu ?

yhmtsai Jan 26, 2021
Collaborator

Yes, note it for complete table. the table size is not enough for additional column.
I do not find the spec paper of DG1. I think there is no variable in DPCPP, but I will check it.
num_warp_per_sm is not in CUDA/HIP variable of library. we use the function of CUDA example to get the number of cores per sm and convert it to num_warp_per_sm. We use 4 (AFAIK) in AMD device.

pratikvn Jan 26, 2021
Maintainer Author

Yes, you are right about num_warp_per_sm not being in CUDA/HIP either. I guess as you say, if we can somehow get the overall number of subgroups in a compute_unit with DPCPP, then we can calculate it ourselves as we do in CUDA/HIP.

Slaedr Jan 26, 2021

note. Hip uses wavefront as cuda's warp
warp_size should be the max_subgroup_size in DPCPP.
subgroup_size should be equal to subwarp_size in our cuda usage.

I take the number of threads in EU, the name is confusing, (at least gen9) as num_warps_per_sm usage.
https://software.intel.com/sites/default/files/managed/c5/9a/The-Compute-Architecture-of-Intel-Processor-Graphics-Gen9-v1d0.pdf
the section 5.3.5
As far as I understand, each eu has 7 threads and each threads contains SIMD-32. Thus, they can run 7 warp(subgroup = 32) in eu in the same time.

Can an Intel GPU 'thread' read the registers of other threads in an EU? That would help us decide whether a 'core' in Intel is an EU or a SIMD unit. It would be nice to have something that can be treated similar to a SM in NVidia, which we refer to as a core here.

The different warps on an SM don't execute concurrently, but I get the feeling that the different Intel-threads in an EU actually execute concurrently - since you say each has its own SIMD unit. In that case, warps per SM would not quite be the same thing as Intel-threads per EU.

pratikvn · 2021-01-26T09:41:08Z

pratikvn
Jan 26, 2021
Maintainer Author

Another thing we should probably discuss is the types that these properties are returning. Most CUDA/HIP properties are returned as int whereas for DPCPP, we are returning size_type most of the times and sometimes int (for subgroup_sizes vector for example).

Should I change all DPCPP properties to return int as well ? It would not be interface breaking to change this in DPCPP, as it has not been released yet, but it would be interface breaking to do this if we wanted to change the CUDA/HIP return types.

3 replies

Slaedr Jan 26, 2021

Good point. All the DPCPP functions seem to be returning size_t or uint32. Except for bitwise quantities, which we are not dealing with right now, I guess it should always be safe to cast them to int?

pratikvn Jan 26, 2021
Maintainer Author

Yes. I think so. Maybe @tcojean and @yhmtsai can also comment on if they prefer size_type or int is okay ?

yhmtsai Jan 26, 2021
Collaborator

I think int should be enough for these properties.
the maximum of max work group size is ~7e7 and work_item_size is ~ 7e7 x 7e7 x 7e7 from some devices

hartwiganzt · 2021-01-26T12:47:52Z

hartwiganzt
Jan 26, 2021
Maintainer

Sorry for jumping into this discussion very late.

I am not sure, but maybe there is a better variable than num_pu_per_cu?
I understand subgroup_sizes is an array containing integers all smaller that max_subgroup_size. So for example, a set of subwarp sizes in CUDA. But this is an implementation detail, not hardware-given, or? It is more similar to thread block size, or?
AFAIK Kokkos uses team size for what we call subgroup size(s), correct? https://docs.trilinos.org/dev/packages/kokkos/doc/html/classKokkos_1_1TeamPolicy.html
Maybe we should contact HW-close groups whether these naming conventions and mappings are reasonable.

4 replies

yhmtsai Jan 26, 2021
Collaborator

Yes for 2.
Cuda subgroups will be 1, 2, 4, 8, 16, 32.
Gen9 contains 8, 16, 32 SIMD. In logically, if we always has 32 to execute same warp operation, we can also use the operation in 1, 2, 4 size (they are handled in SIMD 8, 16, or 32).
in hardware, they do not really support SIMD 2, 4

pratikvn Jan 26, 2021
Maintainer Author

I am not sure, but maybe there is a better variable than num_pu_per_cu?

Do you also think something like num_smt_per_cu would be better ?

I understand subgroup_sizes is an array containing integers all smaller that max_subgroup_size. So for example, a set of subwarp sizes in CUDA. But this is an implementation detail, not hardware-given, or? It is more similar to thread block size, or?

Yes, essentially, but not all integers from [1,max_subgroup_size] but I think only a subset, powers of two or something like that, I guess. But I think it is maybe hardware defined ? From the doc here: Table 4.19 under info::device::sub_group_sizes , they say it is for a device. To quote them:

Returns a std::vector of size_t containing the set of sub-group sizes supported by the device

AFAIK Kokkos uses team size for what we call subgroup size(s), correct? https://docs.trilinos.org/dev/packages/kokkos/doc/html/classKokkos_1_1TeamPolicy.html

I think the Kokkos definition of team_size might be slightly more general, in the sense that I think they take into account implementation defined execution and scratch memory space constraints and set all the possible team_sizes. But I guess the idea is similar.

Slaedr Jan 26, 2021

According to what @upsj once mentioned, num_vectors_per_cu could be used. Each vector would be a group of work-items guaranteed to execute concurrently on the same CU (SM), and having a hardware-defined optimal size. The different vectors (of work-items) on the CU can execute concurrently or not. This has the advantage of working for threads per EU in Intel too. If we restrict it to mean that only one vector executes at a given instant (as in warps), such that the different vectors execute in SMT fashion, then num_smt_per_cu would be nice, IMO. pu came as an option because hwloc uses it:

The smallest processing element that can be represented by a hwloc object. It may be a single-core processor, a core of a multicore processor, or a single thread in a SMT processor (also sometimes called "Logical processor", not to be confused with "Logical index of a processor"). hwloc's PU acronym stands for Processing Unit.

I guess in case of DPC++ it is more like hardware-given. We can query this from the DPC++ runtime using sycl::info::device::sub_group_sizes. But I think you are right in that it is similar to a set of subwarp sizes in CUDA, though Mike would know better.
The doc says "The work functor is called for each thread of each team such that the team's member threads are guaranteed to be concurrent." So it indeed looks like a subwarp size. They have a team_size_recommended static member in the Team policy which I guess would correspond to the warp size on CUDA.

hartwiganzt Jan 26, 2021
Maintainer

Maybe num_smt_per_cu is a good naming?

pratikvn · 2021-01-26T17:01:10Z

pratikvn
Jan 26, 2021
Maintainer Author

The latest naming looks like this:

CPU	CUDA/HIP	DPCPP	Our Abstraction (Current)
`num_cores`	`num_multiprocessors` (num_sms)	`num_compute_units`	`num_compute_units`
`num_threads_per_core`	`num_warps_per_sm`	--	`num_smt_per_cu`
--	--	`subgroup_sizes` (array)	`subgroup_sizes` (array)
--	`warp_size` (scalar)	`max_sub_group_size` (scalar)	`max_subgroup_size` (scalar)
--	`max_threads_per_block`	`max_work_group_size`	`max_workgroup_size`
--	`max_threads_per_block_dimension` (x,y,z) (array)	`max_work_item_sizes` (array)	`max_workitem_sizes` (array)

0 replies

Slaedr · 2021-01-26T18:27:28Z

Slaedr
Jan 26, 2021

For the last row in the above table, ie., max_workitem_sizes (max_threads_per_block_dimension), what to people think about using array<int,3> rather than vector<int>? These will typically not be called in device code so I doubt it affects performance, but it might be overall clearer and more convenient to use array<int,3>. In case we ever do want to pass this to a device kernel too, it will be more convenient.

I see that max_work_item_dimensions in SyCL need not always be 3, but max_work_item_sizes is returned as a id<3>, so I'm thinking it should be okay to use array<int,3> for this.

0 replies

Slaedr · 2021-01-26T19:15:12Z

Slaedr
Jan 26, 2021

So I was just going over how the num_smt_per_cu is currently computed. It looks like the intention is not to get the maximum number of warps that can reside on one SM at a time (which would be large, I guess 64), but the number of warps that actually execute concurrently on one SM at any given instant. For some reason I was under the assumption that it was the former.

I think I made a mistake in my assumptions about Nvidia while recommending the term. Let's assume Pascal and above. It would seem that for FP32 and lower, more than one warp can actually execute concurrently on one SM. In case of FP32, say, exactly two warps would execute concurrently, given enough registers etc. Is that true? If that is the case, the term num_smt_per_cu really only makes sense for FP64. I now think we should go with one of the other options, num_vectors_per_cu or num_pu_per_cu or something else if anyone has a better term. In that case we could also use it for Intel for the number of SIMD units on one EU, once we find out how to query that. My sincere apologies to @pratikvn , I should have realized that sooner.

2 replies

Slaedr Jan 26, 2021

If I'm right about the meaning of num_warps_per_sm in Pratik's definition, that also means that this quantity should change depending on the numeric type. For example, on Pascal/Volta/Ampere, there are 64 FP32 cores per SM and 32 FP64 cores per SM. So for CUDA kernels with only FP32 computations, there can be at most two warps executing concurrently, while if there is a FP64 computation, at most one warp can execute at any instant. Q: What do we envision using num_warps_per_sm for?

pratikvn Jan 27, 2021
Maintainer Author

Yes, this is a good point. I guess we need to decide what level of information we need. It seems that before compute capability 7.x, the overall number of cores was the one that was relevant. After 7.x and later, NVIDIA, split it up into DP cores, SP cores and INT cores.

But it looks like the function convert_sm_ver_to_cores we use from CUDA only considers SP cores and hence the drop in number of cores as the SM version increases.

It might be useful to template this function on the required type and get the number of cores depending on the actual computation type. But I am not sure, if it is useful to do this.

Naming for device property abstractions in the Executor base class #694

pratikvn Jan 25, 2021 Maintainer

Replies: 8 comments · 16 replies

tcojean Jan 25, 2021 Maintainer

tcojean Jan 25, 2021 Maintainer

pratikvn Jan 26, 2021 Maintainer Author

yhmtsai Jan 26, 2021 Collaborator

pratikvn Jan 26, 2021 Maintainer Author

yhmtsai Jan 26, 2021 Collaborator

pratikvn Jan 26, 2021 Maintainer Author

pratikvn Jan 26, 2021 Maintainer Author

pratikvn Jan 26, 2021 Maintainer Author

yhmtsai Jan 26, 2021 Collaborator

hartwiganzt Jan 26, 2021 Maintainer

yhmtsai Jan 26, 2021 Collaborator

pratikvn Jan 26, 2021 Maintainer Author

hartwiganzt Jan 26, 2021 Maintainer

pratikvn Jan 26, 2021 Maintainer Author

pratikvn Jan 27, 2021 Maintainer Author

pratikvn
Jan 25, 2021
Maintainer

Replies: 8 comments 16 replies

tcojean
Jan 25, 2021
Maintainer

tcojean Jan 25, 2021
Maintainer

pratikvn
Jan 26, 2021
Maintainer Author

yhmtsai Jan 26, 2021
Collaborator

pratikvn Jan 26, 2021
Maintainer Author

yhmtsai Jan 26, 2021
Collaborator

pratikvn Jan 26, 2021
Maintainer Author

pratikvn
Jan 26, 2021
Maintainer Author

pratikvn Jan 26, 2021
Maintainer Author

yhmtsai Jan 26, 2021
Collaborator

hartwiganzt
Jan 26, 2021
Maintainer

yhmtsai Jan 26, 2021
Collaborator

pratikvn Jan 26, 2021
Maintainer Author

hartwiganzt Jan 26, 2021
Maintainer

pratikvn
Jan 26, 2021
Maintainer Author

pratikvn Jan 27, 2021
Maintainer Author