I'm glad they're open-sourcing this, but I have to say that making 8 GPUs work together is not that big of a deal. Companies like Cirrascale are getting up to 16 GPUs to scale linearly within a single blade.
We have distributed implementations that allow us to use more than 8 GPUs to train a single network. These machines have 8 cards each, but we aren't limited to a single machine (though keeping more of the GPUs connected over PCIe rather than Ethernet helps).
Any particular reason you don't have Infiniband on these for interconnects?
Having this chassis with Infiniband and a local disk would be a dream, if the manufacturing cost comes in right, as we scale up the local datacenter I work in.
That's interesting. But from what I understand, the problem isn't getting many GPUs to train a single network -- the problem is getting them to scale linearly, i.e. getting each additional GPU to speed up training as much as the previous one did. Throwing boxes at neural net training is easy, but people run into serious plateaus once they scale up the GPU count.
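To put a number on "scale linearly": the usual yardstick is speedup divided by GPU count, something like this (the timings below are made up, purely for illustration):

    # Toy scaling-efficiency calculation; the timings are invented for illustration.
    def scaling_efficiency(time_1gpu, time_ngpu, n_gpus):
        """Fraction of ideal linear speedup actually achieved with n_gpus."""
        speedup = time_1gpu / time_ngpu
        return speedup / n_gpus

    # If 1 GPU takes 100 min/epoch and 8 GPUs take 16 min/epoch:
    print(scaling_efficiency(100.0, 16.0, 8))  # ~0.78, well short of linear

Anything much below 1.0 is exactly the plateau I mean: the 8th GPU buys you less than the 2nd one did.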
Raw GPU count is irrelevant if you can't balance it against cost, CPU, shared DRAM, and - possibly most importantly - PCIe and network connectivity in and out of the box. A typical mid-range Intel Xeon has 40 PCIe lanes; a high-end GPU wants 16 of them. You can do the math. Facebook has settled on a balance that works for them and their workloads; it's quite likely that going to 16 GPUs in the chassis resulted in overall worse utilization because of PCIe bandwidth limits, socket and QPI count, etc. What's a big deal is someone having gone and done the work to calculate and experiment their way to a balanced design that's also cost-effective and easy to maintain.
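To spell out the lane math (rough numbers, just to show where the budget goes):

    # Back-of-the-envelope PCIe lane budget; numbers are illustrative only.
    lanes_per_socket = 40   # typical mid-range Xeon
    sockets = 2
    lanes_per_gpu = 16      # a high-end GPU wants a full x16 link

    total_lanes = lanes_per_socket * sockets          # 80
    gpus_at_full_x16 = total_lanes // lanes_per_gpu   # 5

    print(gpus_at_full_x16)
    # Only 5 GPUs across both sockets get dedicated x16 links, and that's
    # before NICs and storage take their share -- so 8 (let alone 16) GPUs
    # per chassis already implies PCIe switches or narrower links.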
Actually, the PCI Express topology is configurable, which is one of the innovations. You can put all 8 GPUs on a single CPU's root complex, or split them 4 per CPU (at the cost of crossing QPI between the two halves).
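For anyone trying to picture the two layouts, here's a rough sketch of the trade-off (my own illustration of the idea, not Facebook's actual wiring):

    # Rough sketch of the two configurable topologies; illustration only.
    single_root = {"CPU0": ["gpu0", "gpu1", "gpu2", "gpu3",
                            "gpu4", "gpu5", "gpu6", "gpu7"],
                   "CPU1": []}
    dual_root = {"CPU0": ["gpu0", "gpu1", "gpu2", "gpu3"],
                 "CPU1": ["gpu4", "gpu5", "gpu6", "gpu7"]}

    def crosses_qpi(topology, gpu_a, gpu_b):
        # Peer-to-peer traffic crosses QPI iff the GPUs sit under different CPUs.
        home = {g: cpu for cpu, gpus in topology.items() for g in gpus}
        return home[gpu_a] != home[gpu_b]

    print(crosses_qpi(single_root, "gpu0", "gpu7"))  # False: same root complex
    print(crosses_qpi(dual_root,   "gpu0", "gpu7"))  # True: hops across QPI

On a live machine, "nvidia-smi topo -m" will show you which GPU pairs share a root complex and which have to cross the socket interconnect.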