
The cost that matters is the cost to complete some task.



So how does the cost compare for running a task 24/7 for a year if I buy Nvidia GPUs vs. rent TPUs?

I would expect Nvidia GPUs run 24/7 to be far cheaper. But that is an unusual use case.

Even more so with ML training, where demand is much more bursty.

But there are a lot of other issues. Buying hurts cash flow compared to renting, which is a big one for small companies or startups that are not well funded.

You are also responsible for upgrades and stuck with old silicon. For example, we just got the TPU 3.0, so the cost of renting will keep decreasing.

But the cost of running what you buy is static. You are not rewarded by improvements, and they are happening quickly with the TPUs, for example.

Just one year later and we have the 3.0.


Let's see, last year I bought a 4x 1080Ti workstation for ~$6k. I agree that I haven't run it 24/7 all year long; the typical usage pattern has been training a model for 2-8 days, then 0-3 days of idle time. The longest it ever sat idle was a week. Let's say a 50% utilization rate for the last 12 months. That would be 180 × 24 = 4,320 hours. In a small startup, such workstations would typically be shared by multiple people, so utilization would be higher. When I interned at a small startup, people were always waiting to launch their experiments, so utilization was actually close to 24/7.

So, how much would 4k hours of 4x 1080Ti compute time cost me to rent? Or, to put it another way, how many hours do I need to use to justify buying vs. renting? Or, how many hours of 4x 1080Ti compute time can I rent for $6k?
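A rough sketch of that arithmetic, assuming a hypothetical ~$3/hour rate for a comparable 4-GPU cloud instance (the rate is a placeholder, not a quoted price):

```python
# Back-of-the-envelope buy-vs-rent break-even sketch.
# The cloud rate is an assumed placeholder, not a quoted price.

purchase_price = 6_000        # 4x 1080Ti workstation, USD
hours_used_per_year = 4_320   # ~50% utilization: 180 days * 24 h
cloud_rate_per_hour = 3.00    # assumed rate for a comparable 4-GPU instance, USD/h

# How many rented hours the purchase price would buy
break_even_hours = purchase_price / cloud_rate_per_hour
print(f"Break-even at {break_even_hours:,.0f} rented hours")

# What renting the actual yearly usage would cost
rent_cost = hours_used_per_year * cloud_rate_per_hour
print(f"Renting {hours_used_per_year:,} h would cost ${rent_cost:,.0f}")
```

This ignores power, maintenance, and resale value on the buy side, and spot/preemptible discounts on the rent side, so it only frames the question rather than answering it.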


The thing is that ML training demand is much more bursty. You need extra capacity at certain times.

But the other issue is that you are stuck with the hardware you bought. Google just put out the TPU 3.0, and renters can use it without losing their whole investment; if you bought your own hardware, you can't.

The other point is that buying hardware is much harder from a cash-flow standpoint.


So you completely ignored my questions, and simply repeated your previous comment word for word?

You provided some generic considerations, which don't apply to my situation, and I'm a fairly typical ML researcher. In fact, I know several startups doing DL research and they all have been buying hardware. Can you give me some examples where doing DL training in the cloud makes financial sense?

Also, you seem to imply that TPU is some kind of a miracle chip, even though we know very little about TPU2 and we know nothing about TPU3. It's actually pretty embarrassing that a general purpose V100 GPU is competitive with TPU2, which is an ASIC made for DL. If Nvidia ever decides to make a pure DL chip it will destroy anything Google can design. I mean, you can't be serious comparing Google to Nvidia when it comes to designing chips, right? That would be like believing that Nvidia could make a competitive search engine.


I do not know your personal use case. But if buying something makes sense then I would do it.

But the future is the cloud. I have fought people on this for well over a decade and thought everyone had gotten it by now. Apparently not.

On the TPU being a miracle chip: it is not clear to me what a "miracle" chip would even be. Plus, things keep improving, so how could you have a static "miracle" chip?

What we do know is that Google is running WaveNet in production in real time. We can hear the results, and they are offering it at a price competitive with traditional TTS.

We can also see that the TPUs are about half the cost of using Nvidia.

Otherwise, that is it. I do not know your faith, but personally I would not consider that a "miracle". Maybe that is just me.

Maybe Lewandowski would, as I have heard he prays to an AI god or something like that.

The big difference is Google is working top-down and Nvidia has to work bottom-up.

So Google wants to roll out the best TTS there is, to the world and at scale.

So they work from the top down to be able to achieve that.

Clearly Google has a goal of creating the singularity, and the silicon is just a piece of the equation.

Whereas I am not really sure what Nvidia's goal is. Silicon is not for silicon's sake, but rather a means to an end.

Plus, with it in production, Google has the data to optimize future versions in a way Nvidia just is not going to be able to.

I am long Nvidia and have owned it for a bit. I was disappointed in how the stock traded after the earnings report. I do think it will do fine as Google does all these incredible AI things; being the alternative will be good for them.

But for the very long term I would not hold it. Google is a hold-forever; Nvidia I would keep an eye on.

BTW, I would say Nvidia is now two generations behind.


Clearly Google has a goal of creating the singularity

Oh wow. Didn't realize you're one of those. I used to be a Kurzweil fan too, when I was 18. That phase usually passes by senior year in college. Well, I guess some people need longer to see reality. I recommend cutting back on pop tech news consumption.

Take it easy, and have a nice day!


One of those? Sorry, I do not know what that means.

BTW, I am old, as in my 50s. You will have to explain in a little more detail what you mean by your post.

I am very curious but just not following.


No. You don't pay Google to run a task. You pay Google to run a TPU instance for some amount of time. The time you spend leaving it on, setting it up, tearing it down, etc., is all still time you pay for. When they have TPU lambda jobs, that will be different, but they don't. From their page:

Virtual machine pricing: In order to connect to a TPU, you must provision a virtual machine (VM), which is billed separately. For details on pricing for VM instances, see Compute Engine pricing.
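In other words, the bill scales with wall-clock instance time, not with the task. A minimal sketch of that billing model, with placeholder hourly rates (assumptions, not Google's published prices):

```python
# Sketch: instance-hour billing vs. "pay per task".
# Hourly rates below are assumed placeholders, not published prices.

tpu_rate_per_hour = 6.50   # assumed TPU hourly rate, USD
vm_rate_per_hour = 0.50    # assumed rate for the required attached VM, USD

training_hours = 10.0      # time the TPU spends doing useful work
overhead_hours = 2.0       # provisioning, data staging, idle time, teardown

billed_hours = training_hours + overhead_hours
hourly_total = tpu_rate_per_hour + vm_rate_per_hour

print(f"Billed {billed_hours:.1f} h at ${hourly_total:.2f}/h = ${billed_hours * hourly_total:.2f}")
print(f"Overhead alone costs ${overhead_hours * hourly_total:.2f}")
```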


You are a bit much. You have some task to complete. You can choose to use AWS with Nvidia chips or you can use Google with their TPUs.

How much it costs you to complete that task is what matters. How it is done is neither here nor there, as long as you get the precision you need.

We can see right now that Google with their TPUs is about half the cost of using Nvidia on AWS.


No, you can't. Under the best-case circumstances, where you have a V100 and a TPU both using TensorFlow for something that's optimized for both, the TPU is about 37% better:

https://blog.riseml.com/comparing-google-tpuv2-against-nvidi...

The 50% is a number you made up, and it isn't based on any real benchmarks. For all other tasks that are not TensorFlow tasks, the V100 is the only one that works.
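For what it's worth, a comparison like that benchmark boils down to cost-normalized throughput. A sketch of the calculation, with illustrative inputs rather than the benchmark's measured figures:

```python
# Cost-normalized throughput: the metric such comparisons reduce to.
# The throughputs and prices below are illustrative assumptions,
# not measurements from the linked benchmark.

def samples_per_dollar(samples_per_sec: float, price_per_hour: float) -> float:
    """Training throughput delivered per dollar of instance time."""
    return samples_per_sec * 3600.0 / price_per_hour

accel_a = samples_per_dollar(samples_per_sec=2_000.0, price_per_hour=6.50)
accel_b = samples_per_dollar(samples_per_sec=1_800.0, price_per_hour=8.00)

print(f"A: {accel_a:,.0f} samples/$   B: {accel_b:,.0f} samples/$")
print(f"A delivers {(accel_a / accel_b - 1) * 100:.0f}% more work per dollar")
```

With these made-up inputs the gap happens to land near the ~37% figure above, but the point is the formula: the hourly price only matters relative to the throughput you actually get.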


We can see the costs, and they are significantly lower for the TPU 2.0. And now Google has the 3.0.

But if we look at WaveNet, we can see Google must be taking much larger margins with the TPU 2.0 than Amazon is with Nvidia.

Rolling out a NN that runs at 16k samples a second and offering it at a price competitive with the old way means Google's TPUs have to be way more efficient than anything from Nvidia.
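To put a number on "real time" at 16 kHz (a back-of-envelope sketch; the per-sample compute cost below is an assumption, not WaveNet's actual figure):

```python
# Back-of-envelope: what real-time 16 kHz autoregressive generation implies.
# The ops-per-sample figure is an illustrative assumption, not WaveNet's real cost.

sample_rate_hz = 16_000
budget_per_sample_s = 1.0 / sample_rate_hz
print(f"Time budget per sample: {budget_per_sample_s * 1e6:.1f} microseconds")  # 62.5 us

ops_per_sample = 1e8  # assumed FLOPs for one forward pass per output sample
sustained_flops = ops_per_sample * sample_rate_hz
print(f"Sustained compute per audio stream: {sustained_flops / 1e12:.1f} TFLOP/s")
```

Each output sample leaves roughly 62.5 µs for a full forward pass, which is why serving it at a competitive price leans so heavily on hardware efficiency.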

It is hard to believe Google pulled it off. But looking at WaveNet suggests that TTS is a solved problem.

It will be how TTS is done for a very long time, and only the NN will improve.

Nvidia honestly needs to get on their game. Google is running at 1,000 mph. Iterating to the TPU 3.0 in just a year was a big surprise.

I suspect capsule networks and dynamic routing, which were invented at Google, drove the TPU 3.0, but I do not know.

I hope Google will share a paper now on the TPU 2.0 and its secrets.

Lucky for Nvidia, they will share and Nvidia can copy. But that just keeps Nvidia behind.


Oh man... You think Google is doing something secret that Nvidia has no idea how to do? I really don't understand your fanboyism. The only reason the TPU can perform more operations per watt is that it is a severely limited processor. There is nothing special about that. They chose to dedicate more silicon to a smaller number of features, while Nvidia made the processor more general. If anything, Google uses the generic TSMC fab, while Nvidia has their own subprocess at TSMC. If Nvidia really, really wanted to make a chip dedicated just to deep learning and nothing else, they could. But that's only useful for Google.

The big difference is that Google works top-down. Google comes up with WaveNet, as an example. That then creates a need to optimize the entire stack to make it possible to offer at scale.

So WaveNet is part of the Google Assistant and things like Duplex. Nvidia silicon is just not going to make that possible, so Google builds the TPUs, as it just would not be possible to do at a reasonable price with Nvidia.

They are doing 16k passes through a NN per second, in real time, while competing against a far less compute-intensive technique.

Now, do not get me wrong, I am long Nvidia and have been for a while. I was a bit disappointed in the hit after an incredible earnings report.

I think they will do well from an investor-marketing standpoint as really the only alternative to Google's hardware.

But they have a fundamental disadvantage. They just do not have the applications like Google does. They do NOT have the data to iterate like Google. It is why they appear to be two generations behind Google.

What is the goal of Nvidia? Is it hardware for hardware's sake? IMO, it should be driven top-down, and I just do not see that happening at Nvidia. Is their goal the singularity?

Whereas Google clearly wants to create the singularity, and the silicon is in support of that. That is a very different calculus compared to Nvidia.

We can see it so strongly this week. The Google keynote was all about AI applications, and the TPUs are there to support them. Take a look at the Duplex video, for example. That is the focus, and the silicon is what makes it possible.

But more success for Google, presentations like Duplex this week, and the buzz across the Internet all help Nvidia if the goal is investing. From an actual technical standpoint, though, it just makes the problem for Nvidia that much clearer.

