There is a supposed vendor-independent (but not cross-vendor) way of P2P communication in OpenCL: have all GPUs in the same context & pass buffers to other devices' kernels. 💡
Drivers should automatically handle P2P comm via PCIe/SLI/NVLink/CrossFire/Infinity Fabric.
🧵1/9
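
A minimal sketch of that pattern in host code, assuming ctx was created with clCreateContext over both GPUs and the kernel was built for that shared context (names and error handling are simplified, not taken from the thread):

```c
// One context spanning the GPUs, one queue per GPU. A buffer written
// through GPU 0's queue is handed directly to a kernel enqueued on GPU 1;
// the driver is expected to move the data itself.
#include <CL/cl.h>

void shared_context_p2p(cl_context ctx, cl_command_queue q0, cl_command_queue q1,
                        cl_kernel kernel, const float* host_data, size_t N) {
    // Buffers belong to the context, not to a single device.
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, N*sizeof(float), NULL, NULL);

    // Fill the buffer on GPU 0 ...
    clEnqueueWriteBuffer(q0, buf, CL_TRUE, 0, N*sizeof(float), host_data, 0, NULL, NULL);

    // ... then use it from a kernel on GPU 1; the runtime has to migrate the data.
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf);
    clEnqueueNDRangeKernel(q1, kernel, 1, NULL, &N, NULL, 0, NULL, NULL);
    clFinish(q1);
    clReleaseMemObject(buf);
}
```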

On Nvidia GPUs (2x 2080 Ti, 4x SXM, 8x), performance is half of what it is when buffers are explicitly copied via PCIe+CPU. The extra overhead is probably due to VRAM re-allocation (the migration spec is unclear here). No P2P, no RDMA, no SLI/NVLink. ❌
🧵2/9
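
For reference, the explicit copy via PCIe + CPU that this is compared against looks roughly like the following sketch (buffer and queue names are placeholders):

```c
// Baseline: stage the transfer through host memory over PCIe.
// buf_a lives on GPU A (queue q_a), buf_b on GPU B (queue q_b).
#include <stdlib.h>
#include <CL/cl.h>

void copy_via_host(cl_command_queue q_a, cl_command_queue q_b,
                   cl_mem buf_a, cl_mem buf_b, size_t bytes) {
    void* staging = malloc(bytes); // pinned host memory would be faster
    clEnqueueReadBuffer (q_a, buf_a, CL_TRUE, 0, bytes, staging, 0, NULL, NULL); // GPU A -> CPU
    clEnqueueWriteBuffer(q_b, buf_b, CL_TRUE, 0, bytes, staging, 0, NULL, NULL); // CPU -> GPU B
    free(staging);
}
```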

When running with the CUDA backend of PoCL + P2P cudaMemcpy, performance is 40% faster compared to the PCIe copy over CPU memory. PoCL's P2P backend is >3x faster than Nvidia's own runtime here. This is the perf delta Nvidia are giving up on.
🧵3/9
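
The driver-level P2P path that PoCL's CUDA backend can presumably take looks like the plain CUDA runtime calls below (my guess at the mechanism, not PoCL source):

```c
// Enable peer access once, then copy VRAM-to-VRAM without staging through the host.
#include <cuda_runtime.h>

void cuda_p2p_copy(void* dst_on_gpu1, const void* src_on_gpu0, size_t bytes) {
    int can_access = 0;
    cudaDeviceCanAccessPeer(&can_access, 1, 0);   // can device 1 read device 0's memory?
    if (can_access) {
        cudaSetDevice(1);
        cudaDeviceEnablePeerAccess(0, 0);          // flags must be 0
    }
    // cudaMemcpyPeer falls back to a staged copy internally if P2P is unavailable.
    cudaMemcpyPeer(dst_on_gpu1, 1, src_on_gpu0, 0, bytes);
    cudaDeviceSynchronize();
}
```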

On 8x AMD GPUs, performance with the P2P mode is unchanged compared to explicit copy over PCIe/CPU. No P2P/RDMA through PCIe/Infinity Fabric either. ❌
Explicit P2P copy with the clEnqueueCopyBufferP2PAMD extension is broken and produces a segfault. ❌
🧵4/9
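
The extension entry point has to be looked up at runtime; the signature below mirrors clEnqueueCopyBuffer, which is my reading of cl_amd_copy_buffer_p2p (treat it as an assumption), so the call that segfaulted in this test goes through something like this:

```c
#include <CL/cl.h>

// Assumed prototype, modelled on clEnqueueCopyBuffer.
typedef cl_int (*clEnqueueCopyBufferP2PAMD_fn)(
    cl_command_queue queue, cl_mem src, cl_mem dst,
    size_t src_offset, size_t dst_offset, size_t bytes,
    cl_uint num_wait_events, const cl_event* wait_list, cl_event* event);

cl_int amd_p2p_copy(cl_platform_id platform, cl_command_queue queue,
                    cl_mem src, cl_mem dst, size_t bytes) {
    clEnqueueCopyBufferP2PAMD_fn copy_p2p = (clEnqueueCopyBufferP2PAMD_fn)
        clGetExtensionFunctionAddressForPlatform(platform, "clEnqueueCopyBufferP2PAMD");
    if (!copy_p2p) return CL_INVALID_OPERATION;   // extension not exposed
    return copy_p2p(queue, src, dst, 0, 0, bytes, 0, NULL, NULL);
}
```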

With Intel HD Graphics 4600 + i7-4720HQ, P2P mode is actually faster, by 16% in that particular benchmark. The compiler is smart enough to not copy data in unified CPU/iGPU RAM. ✅
On 2x A770 there is no speedup, so no P2P/RDMA either. ❌
I have no access to test PVC. ❔
🧵5/9
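
The unified-memory case can at least be detected from the host side: CL_DEVICE_HOST_UNIFIED_MEMORY (deprecated since OpenCL 2.0 but still reported) returns CL_TRUE on iGPUs that share RAM with the CPU, where a "migration" can be a no-op. A small check:

```c
#include <CL/cl.h>

cl_bool shares_host_memory(cl_device_id device) {
    cl_bool unified = CL_FALSE;
    clGetDeviceInfo(device, CL_DEVICE_HOST_UNIFIED_MEMORY, sizeof(unified), &unified, NULL);
    return unified;
}
```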

For now, the way I initially did multi-GPU communication, over PCIe to CPU memory & back, is still the best (& only) option. The major vendors have yet to implement/fix P2P communication in their compilers to claim that performance advantage on their platforms. ⚠
🧵6/9

The clEnqueueMigrateMemObjects mechanism in the spec is too vague and should be improved. It is unclear whether memory on the target device is additionally allocated/freed. ⚠
Better would be an explicit data copy between existing buffers on source & target devices.
🧵7/9
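
For reference, the migration call in question; the flags argument only selects migration to the host or marks the contents as undefined (no data transfer), and the allocation behaviour on the target device is exactly what the spec leaves open:

```c
// The buffer is "moved" to the device that owns target_queue. Whether the
// target gets a fresh VRAM allocation and whether the source copy is freed
// is not specified.
#include <CL/cl.h>

void migrate_to(cl_command_queue target_queue, cl_mem buf) {
    clEnqueueMigrateMemObjects(target_queue, 1, &buf,
                               0,            // 0 = migrate contents to the queue's device
                               0, NULL, NULL);
    clFinish(target_queue);
}
```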

Credit and many thanks to Jan Solanti from Tampere University for visiting me at University of Bayreuth and testing this together with me, in his endeavour to implement/optimize PoCL-Remote.
Thanks to @ShmarvDogg for testing the P2P mode on his 2x A770 16GB "bigboi" PC!
🧵8/9

On PVC (4x GPU Max 1100), P2P transfer does not work either. Both the implicit and explicit buffer migration variants cut performance in half compared to when sending buffers over the CPU. ❌
🧵10/9

@ProjectPhysX The NEC Vector Engine can AFAIK do host<-->VE P2P, VE<-->VE P2P, VE<-->Mellanox P2P, and I recall reading something about copying to NVIDIA GPUs.
Not sure all of those are available in OpenCL though, or how good the OpenCL support is.

@freemin7 @ProjectPhysX AFAIK there's no OpenCL support for VEC, "only" SYCL

@ProjectPhysX Sounds like they need a clever OpenCL graphics engineer to patch that ;)

Specifically on the PVC front, that means having the Max 1100s support GPU-direct RDMA over PCIe, but it also means enabling the use of the Xe links.