Compute shader: error X3013 no matching 1 parameter function

4 comments, last by fido9dido 2 years, 5 months ago

I'm trying to write a function within a compute shader (HLSL) that calls other threaded functions. Note that all functions perform different operations on the same buffer, so I want it to output the result after the last function in main has run.

RWStructuredBuffer<Data> dataBuffer : register(u0);

[numthreads(1, 1, 1)]
void main(uint3 DTid : SV_DispatchThreadID)
{
    otherfunc(); // this works, obviously
    Func1();     // Error: no matching 1 parameter function
    Func2();
}

[numthreads(32, 32, 32)]
void Func1(uint3 DTid : SV_DispatchThreadID)
{
}

[numthreads(32, 32, 32)]
void Func2(uint3 DTid : SV_DispatchThreadID)
{
}

void otherfunc()
{
}

error X3013: 'Func1': no matching 1 parameter function

Each function may have a different extent in x, y, z. I thought that since SV_DispatchThreadID is determined by numthreads (and the group ID, of course), I could redefine numthreads in each function and call it as shown to get multithreading inside each function, and it would be treated as a function with a default parameter. But it's not working.

How can I make this work?


Uh… sadly, this can't work the way you want.

fido9dido said:
[numthreads(1, 1, 1)]

Notice this means only one thread out of 32 or 64 (depending on HW) will do any work at all. The others remain idle, and you cannot expand this.
Currently, shaders cannot call other shaders at all, so subfunctions are not possible. You can only implement such things on the API side, using multiple dispatches and synchronization between them, which has a cost.
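A minimal sketch of that split, assuming D3D-style compute (the entry point names Func1CS/Func2CS and the Data struct are placeholders, not from your code): compile each pass as its own entry point and dispatch them one after another from the CPU, instead of calling them from main:

```hlsl
// Hypothetical sketch: each pass is its own compute entry point.
// The CPU dispatches Func1CS, inserts a barrier (D3D11 does this
// implicitly for UAV access), then dispatches Func2CS.
struct Data { float value; };          // placeholder payload
RWStructuredBuffer<Data> dataBuffer : register(u0);

[numthreads(8, 8, 4)]                  // 256 threads per group, within API limits
void Func1CS(uint3 DTid : SV_DispatchThreadID)
{
    // first pass over dataBuffer
}

[numthreads(8, 8, 4)]
void Func2CS(uint3 DTid : SV_DispatchThreadID)
{
    // second pass; guaranteed to see Func1CS's results only because
    // the CPU synchronized between the two dispatches
}
```

The ordering guarantee lives entirely on the API side; within one dispatch you cannot wait for "all groups of the previous function" to finish.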

fido9dido said:
[numthreads(32, 32, 32)]

This would be too many threads. It would lead to a workgroup size of 32,768 threads, which is far more than a single CU (former AMD) or SM (NVidia) has. APIs define limits here; usually it's 1024 at most.
Let's make an example with the older GCN architecture from AMD, because I know that best personally. Here one CU has 64 threads which execute a program in lockstep. If you make a larger workgroup of size 256, the CU will execute the same code 4 times sequentially, but as a programmer you can think of them as running in parallel. Also, all 256 threads can access the same block of LDS memory reserved for the workgroup, if you use it.

What's important to know from all this is: a workgroup should have a size of at least 32 (or 64 on AMD GCN), and at most 256. 512 and 1024 work as well, but then fewer workgroups will be available to switch between, which is like hyper-threading on a CPU. Such switching of active workgroups is important to hide memory access latency.

Probably this means you have to subdivide your work into smaller chunks as proposed in the other topic, and each chunk will be processed in arbitrary order, so they need to be truly independent of each other.

fido9dido said:
void otherfunc()

This works, but it's not implemented as a function call like you imagine. It's more a tool to reduce shader code size, e.g. if you need similar math often. All threads will execute it in parallel; if some threads are masked out due to a branch, they cannot execute another function at the same time.
So you could just inline all your functions and the result would be the same (although in practice, inlining the code is usually faster :/ )

I usually recommend the chapter about compute shaders from the OpenGL Superbible. It's very good, also because it quickly explains the building blocks of parallel programming, which is more important than those (confusing) details about HW.
I don't know a good resource for DirectX, but it's basically the same on both sides; just the terminology differs. If you can, read it.

You can describe your problem so we could propose how to implement it.

I didn't mind using a single thread in the main function if it meant that I could call functions the way I wanted to.

JoeJ said:
Currently shaders can not call other shaders at all. So subfunctions are not possible. You can only implement such things on API sides using multiple dispatches and synchronization between them, which has a cost.

Uh, I feared that it wouldn't work.

JoeJ said:

This would be too much threads. It would lead to a workgroup size of 32.768 threads, which is much more than a single CU (former AMD) or SM (NVidia) has. APIs define limits here, usually it's 1024 at most.

When I first read it, I imagined something like [numthreads(1024, 1024, 1024)]. Thanks for the clarification.

JoeJ said:

I usually recommend the chapter about compute shaders from OpenGL Superbible. It's very good, also because it explains building blocks of parallel programming quickly, which is more important than those (confusing) details about HW.

I don't know a good resource for DirectX, but it's basically the same on both sides. Just terminology differs. If you can, read it.

I will have a look at it; as long as it explains compute shaders, it will be the same.

JoeJ said:
You can describe your problem so we could propose how to implement it.

I am trying to implement convex terrain dual contouring with a compute shader, then pass the result to the CPU so I can use it for physics. On the CPU I did it using 32x32x32 chunks with flat array indexing. At first I tried to substitute looping with numthreads, but I guess that won't work. In my previous approach I generated a single chunk: I calculate the density for the whole terrain, then generate the VB, then smooth the terrain and generate normals, then generate the IB.

So now it seems that I have to redesign it as you suggested, in smaller chunks, so I will have to generate the terrain dynamically, something similar to GPU Gems.

fido9dido said:
I am trying to implement convex terrain dual contouring with a compute shader, then pass the result to the CPU so I can use it for physics. On the CPU I did it using 32x32x32 chunks with flat array indexing. At first I tried to substitute looping with numthreads, but I guess that won't work. In my previous approach I generated a single chunk: I calculate the density for the whole terrain, then generate the VB, then smooth the terrain and generate normals, then generate the IB.

That's a good example. Should work well with compute.
Assuming you have a big density volume in VRAM, chunks of 16^3 would fit into LDS easily.
So one workgroup would read an extended block of 17^3 into LDS (or 19^3 if you do one iteration of blur).
Then each thread could check one voxel to decide whether to build a surface patch with its left, top, and front neighbors.
It's maybe a win here (as with the blur) that neighbors can be read from shared LDS instead of VRAM, but I'm not sure. Maybe multiple 3D texture fetches would do just as well due to caching. Then you could use LDS only for the surface output.
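A rough sketch of such a cooperative LDS load, with assumed resource names and an 8x8x8 group (all sizes illustrative, not from the thread): 17^3 floats is about 19.6 KB, which fits the usual 32 KB groupshared limit.

```hlsl
// Hypothetical sketch: each workgroup caches a 17^3 block of the
// density volume in LDS before meshing its 16^3 chunk.
Texture3D<float> densityVolume : register(t0);

groupshared float ldsDensity[17 * 17 * 17];   // 4913 floats, ~19.6 KB

[numthreads(8, 8, 8)]                          // 512 threads per group
void MeshChunkCS(uint3 GTid : SV_GroupThreadID, uint3 Gid : SV_GroupID)
{
    uint flatId = GTid.x + GTid.y * 8 + GTid.z * 64;

    // strided cooperative load: each thread fetches ~10 voxels
    for (uint i = flatId; i < 17 * 17 * 17; i += 512)
    {
        uint3 local = uint3(i % 17, (i / 17) % 17, i / (17 * 17));
        ldsDensity[i] = densityVolume[Gid * 16 + local];
    }
    GroupMemoryBarrierWithGroupSync();  // LDS block is complete now

    // ... each thread now reads its voxel's neighbors from ldsDensity
    //     instead of VRAM to decide whether to emit a surface patch.
}
```

Whether this beats plain 3D texture fetches depends on the cache behavior mentioned above; it's worth profiling both.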

If your output is something like a list of triangles, that's also a typical problem, because you do not know in advance how much memory a block needs for the generated surface.
This can be solved with an atomic counter so independent workgroups don't overwrite each other's output. But if all threads of all workgroups fight over that same counter, it's usually a bottleneck.
To avoid this, workgroups can use atomics on LDS instead: all threads of the workgroup use a local counter in LDS and write to a small local buffer, also in LDS.
Once this buffer is half full, you write this half to VRAM in one batch. To get the destination memory address, you need the global VRAM atomic only once each time such a batch is ready, which is no problem.
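The append-then-flush idea can be sketched like this, assuming a 64-thread group and placeholder names (Vert, Emit, FlushBatch are illustrative; overflow handling and flush scheduling are omitted):

```hlsl
// Hypothetical sketch: threads append results through a cheap LDS
// counter; one global InterlockedAdd per flush reserves the VRAM range.
struct Vert { float3 pos; float3 nrm; };
RWStructuredBuffer<Vert> outVerts      : register(u0);
RWStructuredBuffer<uint> globalCounter : register(u1); // one global counter

groupshared Vert ldsVerts[128];  // small staging buffer in LDS
groupshared uint ldsCount;       // local counter
groupshared uint ldsBase;        // reserved base index in outVerts

void Emit(Vert v)
{
    uint slot;
    InterlockedAdd(ldsCount, 1, slot);  // LDS atomic, no VRAM traffic
    ldsVerts[slot] = v;
}

void FlushBatch(uint flatThreadId)   // call from all 64 threads
{
    GroupMemoryBarrierWithGroupSync();
    if (flatThreadId == 0)           // one global atomic per batch
        InterlockedAdd(globalCounter[0], ldsCount, ldsBase);
    GroupMemoryBarrierWithGroupSync();
    for (uint i = flatThreadId; i < ldsCount; i += 64)
        outVerts[ldsBase + i] = ldsVerts[i];   // batched VRAM write
    GroupMemoryBarrierWithGroupSync();
    if (flatThreadId == 0)
        ldsCount = 0;                // reset for the next batch
    GroupMemoryBarrierWithGroupSync();
}
```

The point is the ratio: many cheap LDS atomics per single global atomic, so workgroups rarely contend on globalCounter.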

So all this is usually about trying multiple options for batch sizes and whether to use LDS or not.

Now let's invent a more complicated example, where some more complex control flow would be necessary.
Say your density is made procedurally from a big tree of SDF primitives, and we could clip it by a frustum per frame.
We want to traverse the tree, and output a list of volume block coordinates, plus a list of overlapping SDF primitives so we could process this later in per block work groups as above.
It's well possible to do all of this with compute, even within a single ‘draw call’.
The draw call would be a command list containing all potential dispatches and memory barriers we might need.
E.g. we add one dispatch for each tree level. But we don't know how many tree nodes will survive the frustum check, so we use indirect dispatches, where a compute shader sets the amount of work to be done by a later dispatch, which is likely processing the next level of the tree, or finally the total number of blocks to voxelize.
In this example we have dynamic allocation implemented using global atomics, which works if we can preallocate big buffers surely large enough for all our stuff.
And we also have dynamic control flow in some limited form, again by having enough prerecorded dispatches in our command list.
It's a common problem that some dispatches then end up doing nothing because no work is needed, but usually that's still much faster than orchestrating the GPU from the CPU each frame individually.
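The "compute shader sets the amount of work" step can be sketched as a tiny argument-writing pass (buffer names and the 64-thread group size are assumptions, not from the thread):

```hlsl
// Hypothetical helper pass: converts the node count produced by the
// previous culling dispatch into thread-group counts that a later
// DispatchIndirect will consume.
RWByteAddressBuffer indirectArgs       : register(u0); // 3 uints: x, y, z
RWStructuredBuffer<uint> survivorCount : register(u1); // written by culling pass

[numthreads(1, 1, 1)]
void WriteIndirectArgsCS()
{
    // round up to whole 64-thread groups; zero groups is a no-op
    // dispatch, which is the "does nothing" case mentioned above
    uint groups = (survivorCount[0] + 63) / 64;
    indirectArgs.Store3(0, uint3(groups, 1, 1));
}
```

A memory barrier between this pass and the DispatchIndirect that consumes indirectArgs is still needed on the API side.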

Thanks, this made it a lot clearer for me

This topic is closed to new replies.
