Why did Nvidia put both FP32 and FP64 units in the chip?

I think it's about market penetration: to sell as many chips as possible. Without FP64, scientific researchers couldn't even try a demo of scientifically important GPGPU software that uses FP64 (and even games can use some double precision on occasion). Without FP32, game physics and simulations would be very slow, or the GPU would need a nuclear reactor. If there were only FP32, a neural-network simulation would run at half speed, and some FP64 summations wouldn't work at all. Who knows, maybe in the future there will be dedicated FP_raytrace cores that do raytracing ultra fast, so no more painful DX9/DX11/DX12 upgrades and better graphics. Ultimately, I wouldn't say no to an FPGA-based GPU that could convert some of its cores from FP64 to FP32, or into special-function cores, for one application, then convert everything back to FP64 for another, or even merge everything into a single fat core for sequential work such as compiling shaders. That would benefit people who do many different things on one computer; for example, I may need more multiplications than additions, and an FPGA could help here. But for now, money talks, and it says "fixed function": the best income is achieved with a mixture of FP64 and FP32 (and lately FP16) units.

Why not just put in FP64 units that are capable of performing 2xFP32 operations per instruction (like the SIMD instruction sets in CPUs)?

SIMD expects the same operation for multiple data, which is less fun for scalar GPGPU kernels. Also, making 2xFP32 out of one FP64 unit would need more transistors than a pure FP64 unit: more heat, and maybe more latency. More transistors also mean a higher probability of a production defect, so a GPU with 1024 plain FP32 units can be produced with better yield than one with 512 flexible-FP64 units.

Why can't I use all the FP32 and FP64 units at the same time?

Mixed-precision computing can be done in both CUDA and OpenCL, so you can get even faster by using all the cores, but it is only applicable in situations that are not memory-bottlenecked, which is rare and hard to code. Long story short, the throughputs don't simply add; there are diminishing returns that prevent 100% scaling across all the cores, because of the extra cycles needed between calculations of different precision. When the precisions are not mixed in the same kernel, you instead need extra iterations between the blocks, which also prevents 100% scaling. It seems more useful for speeding FP64 up than for slowing FP32 down (but having many FP64 cores should be beneficial for boosting FP32 too; you could test them with something like an n-body kernel, which is not memory-bottlenecked). FP64 also consumes a lot of memory (and cache lines, and local memory); that's why I suggested the n-body algorithm, which reuses some data N times (for N > 64k, for example). My GPU has 1/24 FP64 rate, so I don't trust my computer. You have a Titan? You should try it; maybe it has 50% more power than its advertised GFLOPS value (but the advertised TDP value could be limiting its frequency in that case, and it melts down).

It says "outstanding performance and accuracy", but I couldn't find a physics solver for games using FP32 + FP32 (truncated FP64); maybe it's money talks again. If someone made one, it would be "outstanding performance and meltdown" for gaming (maybe worse than FurMark exploding GPUs). People even use integers (integer dot products) on top of floats here:

For the OpenCL part: AMD Evergreen (the HD 5000 series) is capable of issuing 1 DP FMA + 1 SP (or 1 SF) operation every cycle.

An example of iterative refinement using FP64 + FP32 in the same function:

In the case of CUDA, how is this achieved? Do I just use doubles and floats at the same time in my kernel? Or do I need to pass some kind of flag?

Edit: it's working. I'll test something like an n-body on my R7-240, which has 1/24th or 1/26th of its FP32 rate in FP64, tomorrow.

__kernel void sumGPU(__global float * a, __global float * b)