
I can't agree with this at all. It is entirely practical to configure bare metal or virtualized systems yourself. There are so many wonderful tools to do whatever you want. It will cost less and is usually more performant, especially when you avoid the biggest cloud players.



It's a golden era for tooling; if we could do it before, it should be even easier now. I think we are in a risk-avoidance rut: "No one ever got fired for picking AWS".

Does that include your salary? How about power, cooling, and the many other wonderful things it takes to run a data center?

It is not a binary choice between AWS/GCP/Azure and running your own data center.

At some level of scale, it makes sense to hire your own platform team and do it all in house. At the other end, it also works for turtle mode / lifestyle / small businesses.

The place for expensive cloud is when you're growing fast and need to spin up infra in a hurry, or when you can't be bothered to configure software outside of your core competency.


At a previous gig, they were moving from on-prem to cloud for accounting purposes.

So have Netflix, Disney+, Intuit, and Capital One not reached that scale?

Netflix has quite a bit of their own hardware at the edge [1].

Disney+ literally just launched and is rapidly growing.

I don't know anything about Intuit or Capital One. Do they make you wear a suit and tie while working on Oracle databases? (I kid.)

[1] https://www.theverge.com/22787426/netflix-cdn-open-connect


Netflix has its CDN at the edge; its core infrastructure is on AWS. Disney+ is technologically an outgrowth of the BamTech acquisition, which has one of the best-regarded streaming stacks.

You wouldn't consider a CDN core infrastructure for a streaming service?

Considering that the hardware Netflix uses for its CDN is physically located in the network providers' data centers, they can't very well use AWS for that.

For 5G networks they can (Wavelength) and maybe do.

First off, the Cloud isn't expensive; cloud misuse and bad architecture are expensive. It's a lack of skill (and to some degree tooling as well) that causes this to happen, not some inherent property of the cloud. Also, unless you are in some sort of business that has suddenly found itself without the need to innovate, or you are building a very commoditized thing, there is no reason to ever go in-house.

Second, the reason people go cloud is not because they can't be bothered to configure things. It's because they have realized that any time spent doing undifferentiated work like that is a complete and utter waste of time and capital.

Third, the cloud's power (and ultimately why it's less expensive) is not its ability to spin up infra in a hurry; the cloud's power is elasticity. Properly architected cloud applications spin resources down as freely as they spin up, without intervention. You don't pay for idle, and you certainly don't have to pay an army of people to maintain things that are now completely automated and realized in code, or at the very least managed for you (without you having to think about it).
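
To make that concrete, here is a rough boto3 sketch (the cluster and service names and the numbers are made up) of what "spin down as freely as spin up" looks like in practice: register an ECS service as a scalable target, attach a target-tracking policy, and the platform adds and removes tasks on its own from then on.

    import boto3

    autoscaling = boto3.client("application-autoscaling")

    # Hypothetical cluster/service names; the ECS service just has to already exist.
    resource_id = "service/my-cluster/my-service"

    autoscaling.register_scalable_target(
        ServiceNamespace="ecs",
        ResourceId=resource_id,
        ScalableDimension="ecs:service:DesiredCount",
        MinCapacity=1,   # floor for quiet periods
        MaxCapacity=20,  # ceiling for traffic spikes
    )

    autoscaling.put_scaling_policy(
        PolicyName="cpu-target-tracking",
        ServiceNamespace="ecs",
        ResourceId=resource_id,
        ScalableDimension="ecs:service:DesiredCount",
        PolicyType="TargetTrackingScaling",
        TargetTrackingScalingPolicyConfiguration={
            # Hold average CPU near 50%; tasks are added *and removed* to keep it there.
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "ECSServiceAverageCPUUtilization"
            },
            "TargetValue": 50.0,
            "ScaleOutCooldown": 60,
            "ScaleInCooldown": 120,
        },
    )

Once a policy like that is in place, nobody pages a human to shrink the fleet at 2am; the scale-in happens by itself.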

Going multi-cloud, and forcing yourself to use the lowest common denominator services that are common across cloud providers is a really great way to miss out on all of those things.


In the context of this article, and I presume the parent comment, “bare metal” refers to a hosted VM (like EC2 instances), not running your own data center.

Bare metal definitionally means not a VM.

It's certainly possible, but it requires a level of discipline and commitment that's usually missing at larger organisations.

Some firms still struggle with the basics, like staff on a 200k salary getting a terribly slow PC, crap monitors, etc.


There are no slow desktop computers in 2022. There is a lot of slow software running on fast computers.

Haha, you ain't seen nothing yet! A major accounting firm still has some devs on ancient MacBooks with HDDs!

> especially when you avoid the biggest cloud players.

I find it hard to find a balance between the OVH model of 'purchase a server for a year with discounts for a 12/24-month commitment' and AWS, where you can purchase a single hour of bare metal compute for cheap.

I would say DigitalOcean fills this gap, but it's not like they've committed to bare metal, and I imagine they'll continue to look into expanding their suite of managed offerings like Apps[0] (which I think is Heroku-esque auto-scaling for containers), since the margins on these have to be way higher than their VPS offerings.

0: https://www.digitalocean.com/products/app-platform


Good lord, I could not more strongly think the opposite. It's a massive resource sink for anything of even modest scale, and in my experience it's literally been 10x the overall cost, with many, many fewer 9s of uptime guaranteed.

I think the truth is somewhere in between (hence the article's proposal of a hybrid approach).

That said, it's not just the budget for hardware to take into consideration, but the often-ignored costs like man-hours, support contracts, etc.

It also ignores security. Many security departments would much prefer to have a dedicated cloud provider take that risk on, rather than a bunch of local sysadmins who will come and go.


Just learn Ansible already, it will take you at most one day.

I'm not sure how learning Ansible (which I already do know) is going to help troubleshoot or actually maintain the deployed services.

The problem is not deployment or orchestration (though that is one of them), the problem is building a team with the necessary expertise to support what could be a gigantic variety of tools and technologies at an expert level.


> the problem is building a team with the necessary expertise to support what could be a gigantic variety of tools and technologies at an expert level.

That is just as valid for cloud solutions, though.


Isn't that true of cloud providers, too? I'll use AWS as an example because that's where I have the most experience. There's a dizzying number of services, some of which have a high degree of overlap. Someone needs to understand them to make an informed choice as to what to use. Every place I've been at using AWS has had to deal with undocumented API issues and responses that the docs say are impossible, but happen. There are performance pitfalls that require in-depth knowledge of the product to avoid, but usually you don't find out until it's too late. Adopting many of the cloud native services isn't just a technology swap (your RDBMS knowledge isn't going to help much with DynamoDB). There's a cottage industry helping people just understand their bills and how to reduce costs. And for non-trivial deployments, you often have to learn orchestration tools like CloudFormation or Terraform.

I'm at a place that uses GCP now and very little of that AWS knowledge I've accumulated is applicable here.

Cloud platforms/services have a lot going for them, but I think they've become complex monsters that don't really save people as much time as they think. If you're just pushing buttons on the EC2 console, then it's faster than provisioning an Ubuntu server with Ansible for sure. But, those toy deployments aren't indicative of the actual effort to manage a production system. At least in my (mostly start-up) experience with AWS going back to 2009.


> Isn't that true of cloud providers, too? I'll use AWS as an example because that's where I have the most experience. There's a dizzying number of services, some of which have a high degree of overlap. Someone needs to understand them to make an informed choice as to what to use. Every place I've been at using AWS has had to deal with undocumented API issues and responses that the docs say are impossible, but happen. There are performance pitfalls that require in-depth knowledge of the product to avoid, but usually you don't find out until it's too late. Adopting many of the cloud native services isn't just a technology swap (your RDBMS knowledge isn't going to help much with DynamoDB). There's a cottage industry helping people just understand their bills and how to reduce costs. And for non-trivial deployments, you often have to learn orchestration tools like CloudFormation or Terraform.

It's true to some extent, but I would typically characterize it as DIY adding additional layers you have to be concerned with. For example, I've seen more problems with SANs in some weeks than I've had total using EBS since the late 2000s (repeat for dodgy network cabling, server firmware causing issues, etc.).

The other thing to consider is what the alternatives are: for example, the reason why some of your RDBMS knowledge isn't relevant to DynamoDB is because it's a NoSQL database – if you wanted a SQL database, RDS is going to be at least an order of magnitude less time than running your own, especially if you are concerned with performance or reliability. If you've determined that NoSQL is a good fit for your application, that means that the relevant comparison is how much it costs to run DynamoDB versus, say, MongoDB.

> I'm at a place that uses GCP now and very little of that AWS knowledge I've accumulated is applicable here.

My experience has been the opposite. There are some differences but an awful lot of the principles transfer well to the other major cloud platforms and a solid foundation will make it easier for you to transition between the major cloud providers more easily than on-premise.

> If you're just pushing buttons on the EC2 console, then it's faster than provisioning an Ubuntu server with Ansible for sure. But, those toy deployments aren't indicative of the actual effort to manage a production system.

It's definitely not trivial, but I certainly have no desire to go back to running a VMware cluster, either. I still have memories of being on the phone with their support team explaining how flaws in the design of their cluster health-check system led to a big outage.


Yeah. I remember years ago working at a shop that had hundreds of bare-metal hypervisors in a bunch of colos. We had a problem where clocks would sometimes skew massively (maybe dodgy hardware?). The thing was, ntpd would be able to correct it sometimes, but not always. It has a cut-off point where it throws its little arms up and says "I dunno!", and the skew would just get larger until someone manually fixed it. So we added a remediation step to the Sensu time check if it exceeded a threshold. Then one time we got blacklisted by the public NTP servers we used in one region, because we had a shitload of servers hitting them directly with ntpd and Sensu checks. So we had to set up our own NTP servers (and monitor them). And we still had occasional skew even though the remediation action would (eventually) force a correction, and these episodes would cause odd failures with authentication or database replication. We eventually ditched ntpd and moved to chrony (which will continually adjust the clock regardless of the drift). But that took research, testing, Puppet code, scheduled deployment, documentation, etc.

The whole episode was boring and stupidly time-consuming and wasn't even some cool thing that "moved the needle forward" for the company. It's just the fucking time on your servers. Now, take any number of stupid little things like this and sprinkle them over every single sprint, and see how the infrastructure/SRE/devops/dogsbody team's promotion cycle works. "Why was the database upgrade delayed this time?"
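
For the curious, the remediation check boils down to a handful of lines. Something like this rough Python sketch (the NTP server name, the thresholds, and the chrony-specific makestep call are illustrative, not what we actually ran), using Nagios/Sensu-style exit codes:

    #!/usr/bin/env python3
    """Rough sketch of a Sensu-style clock-skew check with a remediation step."""
    import subprocess
    import sys

    import ntplib  # third-party: pip install ntplib

    NTP_SERVER = "ntp.internal.example"  # hypothetical internal NTP server
    WARN_SECONDS = 0.5                   # made-up thresholds
    CRIT_SECONDS = 5.0

    def main() -> int:
        offset = ntplib.NTPClient().request(NTP_SERVER, version=3).offset
        if abs(offset) >= CRIT_SECONDS:
            # Past the point where the daemon will slew on its own: force a step.
            # (Assumes chrony is the local time daemon.)
            subprocess.run(["chronyc", "makestep"], check=False)
            print(f"CRITICAL: clock offset {offset:.3f}s, forced a step")
            return 2
        if abs(offset) >= WARN_SECONDS:
            print(f"WARNING: clock offset {offset:.3f}s")
            return 1
        print(f"OK: clock offset {offset:.3f}s")
        return 0

    if __name__ == "__main__":
        sys.exit(main())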

I've never witnessed that kind of hardware problem precisely, but now just think about it: what would happen if the same situation happened on an AWS instance? How would you go about debugging anything and/or fixing it? It's not even sure you could diagnose the problem in the first place, let alone deploy a workaround. You'd have to send dozens of emails to tech support, who of course would say there is no problem because their machines have 999999999999999999% uptime and nothing could be wrong on their side, but hey, they can sell you advanced support/engineering if you've got way too much money to help you find the problem in your code.

Some commenter mentioned the dark days of Oracle/Microsoft/SAP ruling over the server market. But at least those companies had the most basic decency to let you house/access your own hardware and diagnose things yourself if you had the skills. Now in the "cloud" you can just go to hell and suffer all the problems/inconsistencies, or rather your users can suffer, since you as a sysadmin have zero control/monitoring over what's happening. Oh, and bonus point: if users report problems, they will be reproducible 0% of the time, since chances are good your users are connected to a different availability zone. Yeah, it's easy to reach more than five 9's when "ping works from a specific location" counts as uptime for a global service.

So in the end, is it better to have a silly answer to "Why was the database upgrade delayed this time?" Or is it better for the answer to be "I don't know, but upgrading the database cost us 37k€ in compute, 31k€ for storage and 125k€ in egress for the backup of the previous db"? I much prefer silly answers, but maybe that's because I don't have tens of thousands of euros to be short of even if I wanted to :-)


> It's not even sure you could diagnose the problem in the first place, let alone deploy a workaround.

It's easy to detect: ntpdate -q will tell you the drift, your logging would tell you when ntpd gave up because skew was too large.

Correction would depend on why it was happening: you might be able to tell ntpd/chrony to adjust more frequently or to accept larger deltas, but at that point I'd also say that the best path would be pulling that instance from service and replacing it so it's not critical.

> You'd have to send dozens of emails to tech support, who of course would say there is no problem because their machines have 999999999999999999% uptime and nothing could be wrong on their side, but hey, they can sell you advanced support/engineering if you've got way too much money to help you find the problem in your code.

Has that been your experience? Based on mine it would be more like “we confirmed the problem you showed and have contacted the service team”, followed by an apologetic call from your TAM. I've opened plenty of issues over the years but have never had someone insist that something is not a problem because their systems are perfect — they'd basically have to be saying you faked the evidence in your report.


> ntpdate -q will tell you the drift, your logging (...)

Sure, that's if you've got root on the machine. I was not talking about VPS hosting, which we've been doing for a long time, but rather so-called "serverless" services.

> Has that been your experience?

That's been my experience with most managed hosting I've had over the years, though I've had some good experiences too. I've never dealt with Amazon, but I'm assuming that since you can't run diagnostics (since you have no shell), and, like any other business, their first level of customer support is likely inexperienced and reading from a script, you're going to have a bad time if you encounter weird/silent errors from your cloud services.

Some hardware errors are hard enough to diagnose with root and helpful customer support; I can't imagine diagnosing them without those.


While it's true that you can't just shell into a managed host, you can run diagnostics in many cases[1]. I would also say that it's _far_ less common for those services to have hardware issues — part of what they're doing is optimizing to make things like health checks and rebuilds easy, since the places with durable state are very clearly identified and isolated.

I've never seen a hardware problem in those environments that the platform didn't detect first (e.g. you see latency spike on some requests for a minute prior to a new instance coming online — it doesn't say “underlying hardware failure” but clearly something changed on the host, or perhaps a neighbor got very noisy), and that includes things like the various Intel exploits, where you could see everything rotate a week before the public advisory went out. I will say that I've had a few poor run-arounds with first-level support, but I've never had them refuse to escalate or switch to a different tech if you say the first response wasn't satisfactory.

1. e.g. on AWS “Fargate” refers to the managed Firecracker VMs, as opposed to the traditional ECS/EKS versions using servers which you can access, but you can use https://docs.aws.amazon.com/AmazonECS/latest/developerguide/... to run a command inside the container environment.
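
For what it's worth, that same ECS Exec mechanism is also callable from code. A rough boto3 sketch (the cluster/task/container names are made up, ECS Exec has to be enabled on the task, and actually attaching to the returned session requires the session-manager plugin):

    import boto3

    ecs = boto3.client("ecs")

    # Names are made up; ECS Exec must be enabled on the task definition/service,
    # and attaching to the returned session needs the session-manager plugin.
    response = ecs.execute_command(
        cluster="my-cluster",
        task="arn:aws:ecs:us-east-1:123456789012:task/my-cluster/0123456789abcdef0",
        container="app",
        interactive=True,
        command="date -u",  # e.g. eyeball the container's idea of the current time
    )
    print(response["session"]["sessionId"])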


I've seen Lambdas run into that aplenty on AWS. The clock skews between the Lambda and S3, resulting in measurable signature-mismatch errors. There are even settings on the S3 client in JS to sync times to account for the skew.
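
If you suspect that is what's biting you, one low-tech way to measure the skew from inside the function is to compare the local clock with the Date header that S3 returns. A rough Python sketch (the endpoint choice is arbitrary, and SigV4 only tolerates roughly five minutes of skew):

    import urllib.error
    import urllib.request
    from datetime import datetime, timezone
    from email.utils import parsedate_to_datetime

    def server_date(url: str) -> datetime:
        """Return the Date header from an HTTP response (error responses included)."""
        request = urllib.request.Request(url, method="HEAD")
        try:
            with urllib.request.urlopen(request, timeout=5) as response:
                return parsedate_to_datetime(response.headers["Date"])
        except urllib.error.HTTPError as err:
            return parsedate_to_datetime(err.headers["Date"])

    skew = (datetime.now(timezone.utc) - server_date("https://s3.amazonaws.com")).total_seconds()
    print(f"local clock is {skew:+.1f}s relative to S3")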

That's where hybrid comes in, so you don't make yourself completely dependent on big cloud. Of course it's cheaper, because it's mass production.

With hybrid you can use any cloud you want, but when it starts doing things you don't agree on, you can revert back to your own systems.

Use the force, but have alternatives ready.


I've worked on a few SaaS apps that are built on just a couple of Linode/DigitalOcean VMs and bare-metal DBs. Costs on AWS would have been 3-4x more. It isn't difficult to write Chef scripts or build a PostgreSQL cluster with auto-failover.

Unfortunately, finding engineers who can work outside of the big 3 cloud providers is now rare. The lock-in is actually industry-wide, not just from a resourcing standpoint but also if you want to be taken seriously from an investor perspective. And of course with VC money you can just throw a chunk of it on AWS, but if you are bootstrapped it's a different story.

Having said that, eventually there will be a need to move to the cloud once you're about to hit a certain scale or have more compliance and security requirements.


> And of course with VC money you can just throw a chunk of it on AWS

Don't think VCs are much enthused about AWS either: https://a16z.com/2021/05/27/cost-of-cloud-paradox-market-cap...


That explains a lot of the blockchain enthusiasm: get somebody else to worry about paying for the cloud costs.

Enterprises are also doing this to themselves with the usual trend of hiring only for X: if they only want people skilled in cloud native with 5 years of production experience, that is what they will get.

What about training people to get better?

Only if they can get away with putting that as project expenses, in most cases.

Fortunately there are still some that do care about training, without going through such actions.

But they are the minority either way, when training is considered a perk on job offers.


I'd expect anyone who can work well inside of the big 3 cloud providers to be able to work outside of them too. The trade-off is that the developer experience is poorer when you don't use them: there is more rote work to be done to accomplish an equivalent end.
