Read Part 1
Read Part 2
Running the test a couple of times, we get the same execution time across many iterations, which means something is wrong or we're getting cached responses.
And that's the timer function. Let's see if it works; it does:
2023.05.03 02:45:59.711 blog_benchmark_2 (USDJPY,H1) Work Items (1) Iterations(1000)
2023.05.03 02:45:59.711 blog_benchmark_2 (USDJPY,H1) Work Groups (1)
2023.05.03 02:45:59.711 blog_benchmark_2 (USDJPY,H1) Milliseconds (15)
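(For context, the timing is just a tick-count difference around the execution call; a minimal sketch, assuming hypothetical handles ker, offsets, works and a results buffer set up elsewhere, not the exact benchmark code:)

uint started=GetTickCount();        // millisecond tick before launch
CLExecute(ker,1,offsets,works);     // enqueue the kernel (execution is asynchronous)
int results[];
ArrayResize(results,(int)works[0]); // one int per work item
CLBufferRead(buffer,results);       // blocking read forces completion before we stop the clock
Print("Milliseconds (",GetTickCount()-started,")");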
Okay, now, what are we looking for?
Whatever is done in parallel will be close in time, right? So we're looking for that extra delay that is not tiny compared to the overall execution.
So let's start with 2048 items, and we'll increase iterations so that we can notice the times. Let's try 1 million again.
Actually, let's send 1 item and 1 million iterations.
2023.05.03 02:49:57.474 blog_benchmark_2 (USDJPY,H1) Work Items (1) Iterations(1000000)
2023.05.03 02:49:57.474 blog_benchmark_2 (USDJPY,H1) Work Groups (1)
2023.05.03 02:49:57.474 blog_benchmark_2 (USDJPY,H1) Milliseconds (16)
Awesome. 1 item, 100 million iterations:
2023.05.03 02:51:17.223 blog_benchmark_2 (USDJPY,H1) Work Items (1) Iterations(100000000)
2023.05.03 02:51:17.223 blog_benchmark_2 (USDJPY,H1) Work Groups (1)
2023.05.03 02:51:17.223 blog_benchmark_2 (USDJPY,H1) Milliseconds (16)
Okay, 16 ms regardless of iterations, so the loop is clearly being optimized out or served from cache. Changed the kernel again to this:
string kernel="__kernel void memtests(__global int* group_id,"
"int iterations){"
"double sum=(double)0.0;"
"double inc=(double)-2.6;"
"double step=(double)0.01;"
"double radi=(double)0.0;"
"double rstep=(double)0.00001;"
"for(int i=0;i<iterations;i++){"
"sum=((double)tanh(inc))*((double)sin(radi));"
"inc+=step;"
"radi+=rstep;"
"if(inc>2.6&&step>0.0){step=(double)-0.01;}"
"if(inc<-2.6&&step<0.0){step=(double)0.01;}"
"if(radi>6.28&&rstep>0.0){rstep=(double)-0.00001;}"
"if(radi<0.0&&rstep<0.0){rstep=(double)0.00001;}"
"}"
"group_id[get_global_id(0)]=get_group_id(0);}";
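(For context, a minimal sketch of the host side that builds and runs this kernel in MQL5; the handle and variable names ctx, prg, ker, buffer, items, iterations are my own placeholders, not the exact benchmark code:)

int ctx=CLContextCreate(CL_USE_GPU_ONLY);                // OpenCL context on the GPU
int prg=CLProgramCreate(ctx,kernel);                     // compile the kernel source above
int ker=CLKernelCreate(prg,"memtests");                  // handle to the memtests kernel
int items=1;                                             // global work items for this run
int iterations=100000;                                   // loop count passed to the kernel
int buffer=CLBufferCreate(ctx,(uint)(items*sizeof(int)),CL_MEM_READ_WRITE); // one int per item for group_id
CLSetKernelArgMem(ker,0,buffer);                         // __global int* group_id
CLSetKernelArg(ker,1,iterations);                        // int iterations
uint offsets[]={0};
uint works[]={(uint)items};
CLExecute(ker,1,offsets,works);                          // let the runtime pick the group split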
Dropped to 10000 iterations. Generally, don't overdo it, because you may be getting cached responses, and the second you change the calculation, let's say, you may hear your GPU complain.
So, 10000 iterations:
2023.05.03 03:07:20.865 blog_benchmark_2 (USDJPY,H1) Work Items (1) Iterations(10000)
2023.05.03 03:07:20.865 blog_benchmark_2 (USDJPY,H1) Work Groups (1)
2023.05.03 03:07:20.865 blog_benchmark_2 (USDJPY,H1) Milliseconds (31)
Let's increase to 100K.
I'll get 31 ms again, I'm sure, but this time I'll test it after a restart too.
So, 1 item, 100K iterations, no restart of the PC:
2023.05.03 03:11:24.106 blog_benchmark_2 (USDJPY,H1) Work Items (1) Iterations(100000)
2023.05.03 03:11:24.106 blog_benchmark_2 (USDJPY,H1) Work Groups (1)
2023.05.03 03:11:24.106 blog_benchmark_2 (USDJPY,H1) Milliseconds (47)
Okay, and 1 item, 100K iterations, after a restart of the PC:
2023.05.03 03:13:26.448 blog_benchmark_2 (USDJPY,H1) Work Items (1) Iterations(100000)
2023.05.03 03:13:26.448 blog_benchmark_2 (USDJPY,H1) Work Groups (1)
2023.05.03 03:13:26.448 blog_benchmark_2 (USDJPY,H1) Milliseconds (47)
Same time, yay. Okay, we're good with this kernel, FINALLY, and look who decided to show up:
2023.05.03 03:13:26.448 blog_benchmark_2 (USDJPY,H1) Kernel private mem (40)
The private memory size stopped being zero for the first time. Okay, so that's 40 what? 40 bytes per item?
Let's measure the kernel: we count the non-globals, non-locals and non-arguments, so sum, inc, step, radi and rstep.
Yeah, 5 doubles, 5 times 8 bytes, it's 40 bytes. Okay, so that works too. Awesome.
Private memory is super fast, so you don't worry about it. How we measure its limit is a later question.
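(That log line comes from the kernel properties; a minimal sketch of querying it directly, assuming a valid kernel handle ker:)

long priv=CLGetInfoInteger(ker,CL_KERNEL_PRIVATE_MEM_SIZE); // 40 here: 5 doubles x 8 bytes
Print("Kernel private mem (",priv,")");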
Alright, so, can it take 1 million iterations now?
2023.05.03 03:18:41.921 blog_benchmark_2 (USDJPY,H1) Work Items (1) Iterations(1000000)
2023.05.03 03:18:41.921 blog_benchmark_2 (USDJPY,H1) Work Groups (1)
2023.05.03 03:18:41.921 blog_benchmark_2 (USDJPY,H1) Milliseconds (203)
Okay, but is this number something that will allow us to spot the "fold"?
Let's see. Let's start increasing the items. I won't paste the logs, I'll just list them:
Items#   Groups#   ms
1        1         203
100      1         203
256      1         188
512      2         297
1024     4         578
2048     8         1125
4096     1         2235
Hmm, what went down there? We went from 8 groups to 1 group and the screen flashed.
So 1024 is the max items per compute unit then, and 256 is the max group size; that indicates 4x256 = 1024 concurrent processes and 4 groups that can be working in the same memory? (for 1D, assuming we don't do any splitting ourselves)
We must keep in mind that we're letting it decide on the splits in this benchmark, and it chose 1 group, or rather it didn't do anything in parallel here, and the file we output confirms this:
GROUP.ID[4095]=0
The last entry was 0. So it calculated nothing and probably glitched.
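(The check itself is simple: read the group_id buffer back and dump it to a file; a minimal sketch, reusing the hypothetical buffer and items names from the host-side sketch above:)

int res[];
ArrayResize(res,items);
CLBufferRead(buffer,res);                            // pull the group ids back from the device
int fh=FileOpen("groups.txt",FILE_WRITE|FILE_TXT);
for(int i=0;i<items;i++)
   FileWriteString(fh,"GROUP.ID["+IntegerToString(i)+"]="+IntegerToString(res[i])+"\n");
FileClose(fh);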
Sooo, looking at the figures above, I don't think any concurrency is there across groups: if 4 groups (or 2, or 8) executed in parallel, we would see more or less the same time as with 1 group, right? So what's running in parallel is maybe the 256 items.
Hmm. So now let's introduce another parameter into our test and call it benchmark 3: the local size parameter in the execution call.
bool CLExecute(
int kernel,
uint work_dim,
const uint& global_work_offset[],
const uint& global_work_size[],
const uint& local_work_size[]
);
What if I set this to 32? We'll get 8 sub-groups.
These 8 sub-groups will execute at the same time, but if we deploy 10 sub-groups (i.e. 2 groups) we will get a higher execution time?
I think I don't even need to test this, but let's verify.
It is my understanding that I have a device that can calculate 256 items at the same time, and for this reason the integer CL_MAX_WORK_GROUP_SIZE returns this number too; the group size would be 256 because 256 items that run at the same time can share the local memory.
Could it be that simple?
Let's test how the groups split if I set the local_work_size parameter to 32 (matching the warp/wavefront) for various # of items:
uint work_items_per_group[]={32};                    // local size: 32 items per group
long msStarted=GetTickCount();                       // start the timer
CLExecute(ker,1,offsets,works,work_items_per_group); // execute with an explicit local size
I don't think I need to change anything else.
Let's see the number of groups it creates.
(Sidenote: the 4096 run did indeed crash, as I got a context error on the next run, needing a restart.)
Items    Groups
1        none (error, okay, of course)
32       1    (203ms)
64       2    (203ms)
128      4    (203ms)
256      8    (203ms)
512      16   (297ms)
1024     32   (578ms)
2048     64   (1125ms)
4096     1    (2234ms)
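So, with a local size of 32, the runtime creates items/32 groups (e.g. 2048/32 = 64), right up until 4096, where it collapses back to 1 group and misbehaves.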
Okay, so, clearly the
long CLGetInfoInteger(ctx,CL_DEVICE_MAX_WORK_GROUP_SIZE);
which returns 1024 on my machine, refers to the maximum parallel items that can run at the same time.
That means if you set the local size to 1024 it won't run 1024 items in parallel; it will split them into packs of 1024 and it will likely fail and slow down, but if you set it to 1 it will treat every work item as its own group, right?
So, for this test which doesn't need groups, if I set the local size to 1, do I get the fastest execution if the items are dead on 1024?
Nope, I get a glitch again. What if the total items is 256?
Nope, glitch... ermm, multiples of 2 perhaps?
Nnnope, glitch... so what on earth is going on again?
What if I have 8 items and 1 local, so 1 item per group, 8 groups essentially?
No glitch. So there is a limit on the number of groups, but it has nothing to do with this:
CL_DEVICE_MAX_WORK_GROUP_SIZE
And the way to locate the group limit is by looking for the drop-off from the incrementing # of groups to 1, followed by the screen glitching???
But it also tolerated 2048 work items, so I think we're seeing the non-communication between OpenCL and the hardware.
Is it 8 per dimension though, so 8^3 = 512? Or the cube root of 1024 (the max items)? Hmmm,
that's ~10, so 10 groups max? (per dimension, I assume)
How would we know? 320 items with 32 local would probably work.
It did, okay, but 64 groups worked too (2048 items with 32 local)... anyway... I'm demolished; I'll pick it up another time.
I'm attaching the benchmarks…
Also, there is something to take from all these tests, and it's that the value
CLGetInfoInteger(ker,CL_KERNEL_WORK_GROUP_SIZE);
is telling you how many kernel instances can run in parallel on this machine (or per compute unit, but that can be tested with the CPU).
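(A recap sketch of the queries discussed, assuming valid ctx and ker handles; I'm also assuming CL_DEVICE_MAX_COMPUTE_UNITS is available, for comparison:)

long dev_max=CLGetInfoInteger(ctx,CL_DEVICE_MAX_WORK_GROUP_SIZE); // 1024 on this machine
long ker_max=CLGetInfoInteger(ker,CL_KERNEL_WORK_GROUP_SIZE);     // 256 here, the parallel batch
long units=CLGetInfoInteger(ctx,CL_DEVICE_MAX_COMPUTE_UNITS);     // compute units on the device
Print("Device max group ",dev_max," kernel group ",ker_max," compute units ",units);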