I didn't even read the article, but I agree with the title already. The number of companies asking for Hadoop knowledge for data barely bigger than the Excel row limit is not trivial... At least in the area where I work, this is the case.
I hate these sorts of titles, because some of us DO work with petabyte scale datasets. I know the article does mention the types of use cases that Hadoop is good for, but the title just comes across as arrogantly dismissive.
And we cheer for you guys. We really do. You push the edge, it's the right thing to do.
Some of us though, from time to time, have these side company projects. Then, after a round of consultants and meetings (and nobody asking us for input), we had this 500MB database, and an architect who was pushing for all the big data keywords to be on his resume, even though all the math and business numbers said we'd only double that in twenty years.
We are talking ZooKeeper ensembles and 20 servers per geographically separate datacenter, Spark, ML, Cassandra, Hadoop, MR, Java everywhere, Logstash, etc. etc. etc. Presentations hundreds of PowerPoint slides long and endless meetings about direction and future.
So the article, and MySQL, have their places. Especially when one can finish the job in two weeks.
Of course, I am not saying that there aren't people who misuse 'big data' technology for 'small data' problems. You could even keep a provocative title like, "Are you sure you need big data?" or, "Your data must be this big to ride" or something else both cheeky and not dismissive.
It is pretentious, but it strikes me as the kind of rule that is handy because those who need to break it will KNOW they need to break it, whereas those who should follow it will be on the fence.
Unfortunately still relevant... I can't count the number of times investors push on a CEO, who in turn starts to worry about scalability and asks the team whether Cassandra isn't a better fit, etc. (when the whole dataset could fit in memory, or on a midsize USB key).
Cassandra is not limited to use cases that need to scale out; it also excels at stuff that needs to be always-on, regardless of data size. Sure, you can do many nice things with data in memory or on a midsize USB key, but when your centralized hardware fails, you have a problem, and in the best case you restore from a backup - and during the downtime your customers go to your competition.
In 2017, with Spark's Catalyst engine and DataFrames data structure (allowing SQLesque operations instead of requiring writing code in map-reduce paradigms), you can have the best of both worlds in terms of big data performance and high usability. Running Spark in a non-distributed manner may sound counterintuitive, but it works well and makes good utilization of all CPU, RAM, and HDD.
Spark is orders of magnitude faster than Hadoop, too.
Spark is not pure MR. Spark has Spark Dataframes and Datasets with SQL-like syntax. You can even write pure SQL and not mess with the Dataframe API at all.
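Roughly like this, as a sketch in pyspark running purely locally - file name and columns are made up, but the point is you never touch map/reduce primitives:

    from pyspark.sql import SparkSession

    # local[*] uses every core on one machine; no YARN or HDFS involved
    spark = (SparkSession.builder
             .master("local[*]")
             .appName("small-data")
             .getOrCreate())

    df = spark.read.csv("events.csv", header=True, inferSchema=True)

    # DataFrame API...
    df.groupBy("user_id").count().show()

    # ...or plain SQL, never touching the Dataframe API at all
    df.createOrReplaceTempView("events")
    spark.sql(
        "SELECT user_id, COUNT(*) AS n FROM events GROUP BY user_id ORDER BY n DESC"
    ).show()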
Presto is nice, but you can't use it for an ETL job. It is great for analysis.
Spark is a fairly generic data processing engine with excellent SQL support, and its DataFrame structure is pretty similar to a pandas DataFrame. It can be really useful even on a single node with a bunch of cores. Spark's MLLib also has distributed training algorithms for most standard models (notably excluding nonlinear SVMs). It also has fantastic streaming support. So Spark is good for a lot more than straight-up map-reduce jobs these days.
I've been using PrestoDB for a few months now, and I'm deeply in love. It's such a well-designed piece of technology. The query execution engine is a tremendous boon to anyone with inconvenient-sized data. And it does most (all?) of ANSI-SQL!
I used to use Spark SQL for this purpose, but I've switched. I now use Spark for when I want to transform data. But when I'm writing ad-hoc data exploration/investigation queries, PrestoDB is my jam; it's what it's designed for. Parquet as the data storage method makes both of these quite workable.
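For context, a typical ad-hoc session looks something like this - a sketch using the presto-python-client package, with host, catalog, and table names as placeholders:

    import prestodb

    # Connect to the coordinator; the Hive catalog here is assumed to point
    # at Parquet files sitting in object storage or HDFS.
    conn = prestodb.dbapi.connect(
        host="presto-coordinator", port=8080,
        user="me", catalog="hive", schema="default",
    )
    cur = conn.cursor()
    cur.execute(
        "SELECT country, COUNT(*) FROM events GROUP BY 1 ORDER BY 2 DESC LIMIT 10"
    )
    for row in cur.fetchall():
        print(row)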
Last time I looked at Presto, it works fine for simple queries (e.g. scanning data and aggregating it into a small result set) but the performance was prone to falling off a cliff as queries got moderately complex - it comes up with a bad query plan or query execution OOMs when data doesn't fit in memory.
Hive and other SQL-on-Hadoop systems tend to do better in that department.
SQL is familiar, but it is not simple. The vocabulary is large, and inconsistent between implementations. It is hard to predict the performance of a complex query, without resorting to rules of thumb. Understanding EXPLAIN ... PLAN requires a fairly deep comp sci background, and familiarity with a variety of data structures that are rarely used directly by programmers.
Contrast that with a system of map-filter-reduce pipelines over an append-only data set, like a classic CouchDB. A reasonable pipeline can be composed by a junior dev, just by repeatedly asking "What do I want this report to summarize? What information do I need to collect or reject, for that summary? How can I transform the shape of the information that is currently in front of me, into the input that I wanted when I planned the high-level end result?" And, if they need help with that last part, then at least they are asking for help with a small subset of the problem, instead of "Something is wrong in this forest of queries, can you take a look at it with me?" Or, "I need to add a column, may I ALTER TABLE?" They can even prototype the whole thing on an array in Javascript, if they are more comfortable there.
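A toy sketch of that pipeline style in Python, over an in-memory list (field names invented):

    from functools import reduce

    orders = [
        {"status": "shipped", "region": "EU", "total": 120.0},
        {"status": "cancelled", "region": "EU", "total": 80.0},
        {"status": "shipped", "region": "US", "total": 45.0},
    ]

    # map: reshape each record into just what the report needs
    rows = map(lambda o: {"region": o["region"], "status": o["status"], "total": o["total"]}, orders)
    # filter: reject what we don't want to summarize
    shipped = filter(lambda r: r["status"] == "shipped", rows)
    # reduce: fold what's left into the summary shape
    summary = reduce(
        lambda acc, r: {**acc, r["region"]: acc.get(r["region"], 0) + r["total"]},
        shipped,
        {},
    )
    print(summary)  # {'EU': 120.0, 'US': 45.0}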
SQL can be a beautiful language that feels very natural once you have had a few years to build up fluency in it. It might make for an excellent shell language. But having spent time prototyping systems in CouchDB (which were admired for their elegance, but rejected due to the relative obscurity of Couch, grrr!), I have to say that my previous bias for querying over transforming was ultimately holding me back, bogging me down in leaky abstractions. We should have started with MR, and then learned SQL only when presented with something that doesn't fit the MR paradigm, or even the graph processing paradigm, which IMO is also simpler than SQL.
As for the original subject, yes, Hadoop is a pig, ideally suited to enterprisey make-work projects. All the way through the book, I kept thinking, "there has got to be a simpler way to set this up."
> Spark is orders of magnitude faster than Hadoop, too.
A comparison between Spark & Hadoop doesn't make much sense though.
Spark is a data-processing engine.
Hadoop these days is a data storage & resource management solution (plus MapReduce v2). Spark often runs on top of Hadoop: Hosted by YARN accessing data from HDFS.
There's a subtle difference between Hadoop (the platform / ecosystem) and Hadoop MapReduce (tasks that run on Hadoop). It's the latter that is being referenced in the comparison.
If your workload can generally run in a non-distributed manner, then the operational overhead of dealing with Spark versus simpler paradigms will be expensive. That has been my first hand experience.
I think there's a middle tier of problems that don't need a distributed cluster but can still benefit from parallelism across, say, 30-40 cores, which you can easily get on a single node. Once you know how to use Spark, I haven't found there's much overhead or difficulty to running it in standalone mode.
I do agree in principle that you're better off using simpler tools like Postgres and Python if you can. But if you're in the middle band of "inconveniently sized" data, the small overhead of running Spark in standalone mode on a workstation might be less than the extra work you do to get the needed parallelism with simpler tools.
More and more I reach for pyspark for medium sized tasks because I like the API and if I need to scale it I can. I don't think this is as black and white as it used to be now that big data framework interfaces are getting as good or better than small data interfaces.
I'm not sure how valid the last point is (my data is 5TB or bigger) when you can create a 2TB (memory) x1 instance https://aws.amazon.com/ec2/instance-types/#memory-optimized . Also, someone else mentioned using Spark's Catalyst engine and DataFrames data structure. In my previous experience with Spark, it would automatically spill data to disk, so working with 5TB of data was feasible.
Yup, 600MB is pretty tiny. I have a netbook from 2011 that I used for crunching numbers for my dissertation. It can handle 600MB CSV files in memory using numpy. The slowest part is reading from disk.
The craziest example I saw was a blog post titled something like "How I used big data to analyse my problem", and she was talking about 1200 rows. Big data that could fit on a floppy disk!
An ordinary PC can accommodate 4TB of data easily. But it is a SPOF, has slow I/O, and the network card is probably slow as well.
I mean, if that were the case then ELK stacks would not be running in cluster mode. There are many 3-node ELK stacks efficiently handling less than 4TB of data where a single node could not. By efficiently I mean not having time for a coffee while waiting for Kibana to load the graphs...
Your point is simply not true. Many users of big data platforms, e.g. Spark, aren't using them because of the volumes of data. It's because they want to do machine/deep learning on a proven and popular platform, with technologies like Caffe, Sparkling Water, and TensorFlow all available on the one platform.
I really like the article and it agrees with what we've been accomplishing at my company. Datasets top out at 2 GB and clients ask for "big data" solutions that don't make sense. My only complaint is that I don't feel comfortable sharing it with my co-workers because of the vulgar image on there... :/
"There is no computation you can write in Hadoop which you cannot write more easily in either SQL, or with a simple Python script that scans your files."
How about base64decode? It is possible in SQL but not pretty. There are other examples as well. I do agree that Hadoop is probably overkill for 90%+ of uses, but there is a certain set of problems which are helped by having a general execution environment.
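For what it's worth, the scanning-script version of that base64 example is only a few lines (assuming a CSV with the encoded payload in the third column):

    import base64
    import csv
    import sys

    # Decode a base64-encoded column while streaming through a CSV --
    # awkward in most SQL dialects, trivial in a scanning script.
    with open(sys.argv[1], newline="") as f:
        for row in csv.reader(f):
            payload = base64.b64decode(row[2])  # assume column 3 holds base64 data
            print(row[0], payload.decode("utf-8", errors="replace"))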
I agreed with the article at the time it was posted. Now I use Spark for various sized data sets. Specifically Zeppelin + Spark is a great combination.
Then again, Spark doesn't really need Hadoop, I see more and more people using it with Kafka and Elasticsearch for all sorts of fun stuff.
And as other commenters pointed out, you get read-only SQL (very powerful SQL) for free. The other day I joined an elasticsearch result with a CSV file in SQL.
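That kind of join ends up looking roughly like this sketch, assuming the elasticsearch-hadoop connector is on the classpath (index name, hosts, and join key are invented):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[*]").getOrCreate()

    # Elasticsearch results via the elasticsearch-hadoop connector (assumed available)
    hits = (spark.read.format("org.elasticsearch.spark.sql")
            .option("es.nodes", "localhost:9200")
            .load("weblogs"))
    hits.createOrReplaceTempView("hits")

    # Plain CSV file on disk
    users = spark.read.csv("users.csv", header=True)
    users.createOrReplaceTempView("users")

    spark.sql("""
        SELECT u.name, COUNT(*) AS requests
        FROM hits h JOIN users u ON h.user_id = u.id
        GROUP BY u.name
    """).show()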
> But my data is more than 5TB! Your life now sucks - you are stuck with Hadoop. You don't have many other choices (big servers with many hard drives might still be in play), and most of your other choices are considerably more expensive.
I'd add that the benefits of avoiding bigdata(tm) can extend further into the 5TB+ space if you're happy to run at a bit of a delay.
I.e. if it's OK for the crunching to take 6 hours to produce intermediary aggregates which can then be crunched in less time, you can avoid Spark and Hadoop for much longer.
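For example, a nightly pre-aggregation pass can be as dumb as chunked pandas (column names invented), and the "real" queries then run over the small output:

    import pandas as pd

    # First pass: stream the raw data in chunks, keep only per-day, per-customer sums.
    partials = []
    for chunk in pd.read_csv("raw_events.csv", chunksize=1_000_000):
        partials.append(chunk.groupby(["day", "customer_id"])["amount"].sum())

    # Second, much cheaper pass: crunch the intermediary aggregates.
    daily = pd.concat(partials).groupby(level=[0, 1]).sum()
    daily.to_csv("daily_totals.csv")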
I have seen so so so many projects get bogged down by the need to use a "big data" stack.
I think my favorite example was a team that spent six months trying to build a system to take in files, parse them, and store them. Files came through at a little less than one per second, each about 100KB, which translated to about 2.5GB a day of data. The data only needed to be stored for a year, and could easily be compressed.
They felt the need to set up a cluster with 1TB of RAM to handle processing the documents, they had 25 Kafka instances, etc. It was just insane.
I just pulled out a Python script, combined it with Postgres, and within an afternoon I had completed the project (albeit not production ready). This is so typical within companies it makes me gag. They were easily spending $100k a month just on infrastructure; my solution cost ~$400 ($1200 with replication).
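The whole ingest side of a solution like that is basically this sketch (psycopg2; table layout and file format invented):

    import gzip
    import json
    import psycopg2

    # Parse each incoming file, compress the raw payload, insert one row.
    conn = psycopg2.connect("dbname=ingest user=ingest")

    def store(path):
        with open(path, "rb") as f:
            raw = f.read()
        doc = json.loads(raw)               # "parse them"
        with conn, conn.cursor() as cur:    # commits the transaction on success
            cur.execute(
                "INSERT INTO documents (doc_id, received_at, body) VALUES (%s, now(), %s)",
                (doc["id"], gzip.compress(raw)),
            )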
The sad part is that convincing management to use my solution was the hardest part. Basically, I had to explain how my system was more robust, faster, cheaper, etc. Even side-by-side comparisons didn't seem to convince them; they just felt the other solution was better somehow... Eventually, I convinced them, after about a month of debates and an endless stream of proof.
I fail to understand the hype behind Kafka; my guess is that 99% of Kafka use cases can be handled by a simple RabbitMQ/Celery setup, etc. But maybe that's not considered cool enough?
Lots of problems can be understood as an immutable log of events. Kafka is a screamingly high performance, resilient, immutable log of events.
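Appending to that log is about this much code with kafka-python (broker address and topic invented):

    import json
    from kafka import KafkaProducer

    # Each business fact is appended to the topic; nothing is ever updated in place.
    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )
    producer.send("orders", {"order_id": 42, "event": "created"})
    producer.send("orders", {"order_id": 42, "event": "paid"})
    producer.flush()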
It is kind of infectious though: once you have systems implemented as Kafka-and-Samza, the easiest way for them to communicate and be communicated with is more Kafka and Samza.
I am fully on board the Kafka hype train. Choo choo.
The "Turning the database inside out" article that's linked a couple of posts up about Kafka and logs in general was the one that made me sit up and take notice and really made me 'get' Kafka and the entries stream oriented processing thing.
It raises interesting questions and I've had fun producing arguments both for and against the approach.
Some very interesting architectural ideas in there. In particular, the observation that if you turn off history pruning in Kafka, you have a consistent distributed log, which can then be piped off into whatever services need to consume it. That's appealing for cases where you want an audit trail, for example.
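The "turn off history pruning" part is just a topic setting - here's a sketch with kafka-python's admin client (names illustrative; retention.ms=-1 is Kafka's "keep forever" setting):

    from kafka.admin import KafkaAdminClient, NewTopic

    # Create a topic whose log is never deleted, so it behaves as a durable,
    # replayable event log that downstream services can consume from scratch.
    admin = KafkaAdminClient(bootstrap_servers="localhost:9092")
    admin.create_topics([
        NewTopic(
            name="audit-log",
            num_partitions=3,
            replication_factor=3,
            topic_configs={"retention.ms": "-1"},
        )
    ])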
Do most systems require that sort of thing? Absolutely not. RabbitMQ is boring tech, in the good sense; it is well understood and does its job, so IMO it's a better default option where it fits.
Agree. I looked at Kafka a while back for an IoT backend. Settled on RabbitMQ and a Ruby worker pool. It handled something like 10x our expected load on one $1k unix surplus node.
Same case. Ingesting 100s or 1000s of IoT "events" per second, pipelining them and reacting. Solved with redis pub/sub and Python subscribers. Had machines responding to system interactions in no time and could prototype a new pipeline in 5 minutes.
The big-data/"cognitive computing" unit of the company would have suggested using Hadoop, Kafka, IBM Watson , AI , neural networks and lots of billable man-hours to match my solution.
They even looked interested in making an acquisition of a company with an event processing engine that did basically the same: pipe stuff through processes.
I recommend reading Event Processing in Action and then just building the same with your favourite pub/sub toolchain.
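The skeleton of that redis pub/sub pipeline is roughly this (channel name and payload invented):

    import json
    import redis

    r = redis.Redis()

    def publish(event):
        # A device or gateway pushes each event onto a channel
        r.publish("iot-events", json.dumps(event))

    def worker():
        # A Python subscriber reacts to events as they arrive
        sub = r.pubsub()
        sub.subscribe("iot-events")
        for msg in sub.listen():
            if msg["type"] != "message":
                continue
            react(json.loads(msg["data"]))

    def react(event):
        print("handling", event)  # whatever this pipeline stage does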
As you said, usually it is enough, but people tend to (over)design and plan for scale, and Kafka does become a better solution above a certain limit. Whether you should touch it before then is another topic. I wouldn't say that it is a rare occasion that you absolutely need Kafka, but a less frequent one, surely. Also, I don't think it is a terrible miss to use it if you need a queue with reasonably high demands on it. You'll mostly waste operations effort and development time. The end result will not differ too much.
Nevertheless, the case above is a very extreme case of over-engineering and premature optimization. To the point where I would qualify it as reckless.
Why does everyone seem to think Kafka is a queue replacement? It's a high-performance immutable log with a guaranteed CAP profile and the possibility to add materialised views on top of the log.
A good number of the cases Kafka is used for where I work are even simpler: they'd be best handled by a very tiny flask (or similar) server backed by a common data store.
Perhaps you haven't spent enough time reading into Kafka? It solves very different problems than the RabbitMQ/Celery "worker queue" pattern you're referring to. That doesn't necessarily mean people don't use Kafka for the wrong reasons.
Another caveat on this. Some people are forced into "big data" solutions not because their data is inherently "big data", but because their applications output data with wanton disregard for its arrangement or structure, making it so only systems designed to manage massive streams can readily handle it.
Many databases that I regularly touch would shrink an order of magnitude or more if someone went through, redid the layout, and scripted a process to perform the migration.
> Not sure of exact numbers, but I think we're doing 4-5m writes per second across one Kafka cluster, and around 1m writes per second against another.
I'm not saying that's the case with your system, but my immediate thought when I see those numbers is: how much of the value from those 4m/s writes could you get with a system that did something like 100 w/s? Either through sampling/statistical methods, reducing "write amplification", or simply looking hard at what is written and how often :-)
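E.g. plain uniform sampling gets you there in a couple of lines (the rate here is illustrative, and whether sampling preserves the value of the stream is obviously workload-dependent):

    import random

    SAMPLE_RATE = 100 / 4_000_000  # target ~100 writes/s out of ~4M events/s

    def maybe_record(event, sink):
        # Keep roughly 1 in 40,000 events and scale counts back up at query time.
        if random.random() < SAMPLE_RATE:
            sink.write(event)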
What makes RabbitMQ/Celery/etc. any simpler than Kafka? I found the experience very similar, and in a JVM-oriented environment Kafka was easier to deploy.
I've come to know Kafka as a very reliable and understandable queue that solves a few things much more nicely than RabbitMQ - for example, the ability to replay all events still retained in the cluster. My primary pain with it is running the ZK cluster, but other than that it's my go-to queue for moderate or larger throughput use cases where loosely ordered, at-least-once delivery is fine. My reason to pick it is the feature set and that it has yet to let me down or munch my data.
I should know better by now, but I'm still surprised by the relentless pursuit of the shiny that some teams get mired in and even more so that their management allows them to do it.
There are an awful lot of problems that can be solved with a simple Postgres instance running on RDS.
Yes! Had a startup client with a 4-person full-time "Big Data" team. They brought me in to help them automate their server infrastructure. As part of the process I had them show me the Hadoop jobs they were running. 100% of the jobs could have been done as rudimentary SQL queries against their DB.
Welcome to the land of engineering, where we will fine-tune an algorithm to save milliseconds, yet waste man-years of engineering time to look like the big kids.
So many engineering choices are based on fashion rather than need.
Gotta get that hot new bag from Prada to show off to your engineering friends! :D
Yep. This is why I'm not sure that I want to do a deeper dive into algorithms. It seems like I could learn to implement all this awesome stuff with the least number of CPU cycles, but it's probably best to learn to communicate these things extremely well just to keep my sanity
To be fair, your employer is not concerned with your resume. They certainly don't pay you a premium for working on things that look boring or harmful on a resume.
If companies were serious about providing realistic advancement opportunities (in salary, in tech, in responsibility) there wouldn't be as much of an impulse to work on "cool" (i.e., marketable) projects.
Many dev jobs can be accomplished by pretty tried and true technology. For a startup and even most mid sized companies, MySQL and a web framework will get you pretty much everything you need. But any kid a year out of school can do that. So those positions are not really well paid, certainly not enough for a family in the Bay Area.
To become a mid-level engineer and make more money, you've got to implement some kind of complicated distributed system. Incorrectly. And then you can "fix" it (or someone else's) to get the next promotion!
This is stylized and not 100% true to reality, but I think it gets at a core truth.
There are many cases where there's a real need to create a solution either without any established framework or where previous attempts have failed. That means you're either a founding engineer or some skunk-works employee where you are given great autonomy but also bear most of the responsibility for the failed project. And, AFAIK, these positions are usually not advertised, you need good connections and luck to arrive at such opportunities.
The alternative is to realize that for certain types of problems, programming has become commoditized and one has to go up the consulting ladder selling architecture expertise, while improving social skills to be able to deal with customers at another level that isn't plain coding.
> Many dev jobs can be accomplished by pretty tried and true technology.
Then they should be.
> To become a mid-level engineer and make more money, you've got to implement some kind of complicated distributed system. Incorrectly. And then you can "fix" it (or someone else's) to get the next promotion!
Yea... that's dishonest, and frankly theft of the company's money.
Do you have any tips on how to convince management they don't have big data? My company's largest database has been in production for 10 years and could fit into the RAM on my dev machine, yet I'm constantly pushing back against "big data" stacks. This post made me laugh because "big data" and Hadoop were mentioned in our standup meeting yesterday morning.
A good definition of "big data" is "data that won't fit on one machine". Corresponding rule of thumb is that you don't need big data tools unless you have big data.
I don't think that's a good definition. "One machine" of data is highly variable; everyone has a different impression of the size of "one machine". Does "fit" mean fit in memory or on disk? Why is "Big Data" automatically superior to sharding with a traditional RDBMS, or a clustered document database?
I usually change "one machine" with "my notebook" and "fit" with "can analyze". It is big data if I cannot analyze it using my notebook. So it depends both on the size of the data (a petabyte is big data), the performance requirements (10GB/s is big data even if I keep 1 minute of data in the system) and also depends on the kind of analysis (doing TSP on a 1000000-node graph is big data, even if it fits my notebook memory).
I also define "small data" as anything that can be analyzed using Excel.
It's the best definition. It makes "big data" the name of the problem you have when your data cannot be worked with in a coherent way, which must be solved by distributed tools.
If you can buy a bigger machine, you can make "big data" bigger, and maybe evade this problem; if you must access it a lot of times, fitting on disk is useless and "big data" just got smaller; etc.
How then does "big data" differ from traditional HPC and mainframe processing? Those fields have been dealing with distributed processing and data storage measured in racks for decades.
I think the simplest answer is that it's often essentially the same thing but approached from a different direction by different people with different marketing terms.
One area which might be a more interesting difference to talk about might be flexibility/stability. A lot of the classic big iron work involved doing the same thing on a large scale for long periods of time whereas it seems like the modern big data crowd might be doing more ad hoc analysis, but I'm not sure that's really enough different to warrant a new term.
Ehhhh....that sounds like you're defining big data as distributed data.
Hadoop and Cassandra lend themselves to distributed nodes, but you can also use them without that. Or you can use solutions that work well with "big data" that aren't as opinionated about it, such as HDF5.
I guess the point is this: if I have 20TB of timeseries data on a single machine, and I have 20GB incoming each day, do I get to say I'm working with "big data" yet?
EDIT: My other complaint with this definition (perspective, really) is that it predisposes you to choose distributed solutions when you really do have "big data", which is not ideal for all workflows.
If you have 20TB of data on a single machine, you're better off with just Postgres 90% of the time. If you predict you're going to have more data than fits on a single machine by the end of the year, then it makes sense to invest in distributed systems.
Definitely stick with HDF5 and Python for what you're doing. Postgres doesn't lend itself well to timeseries joins and queries in the same way that a more timeseries-specific database like KDB+ would. The end result is most likely that you'd be bringing the data from a database into Python anyway, probably caching in HDF5 before using whatever Python libs you want to use. You could alternatively bring your code/logic to the data using Q in KDB+, but there will be a learning curve and you will have to code a lot of functionality yourself that just isn't available in library form. The performance will be a lot better, though.
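The caching pattern described above looks roughly like this with pandas (connection string, table, and columns all invented):

    import pandas as pd
    import sqlalchemy

    # Pull the slice you need from the database once...
    engine = sqlalchemy.create_engine("postgresql:///market")
    df = pd.read_sql(
        "SELECT ts, symbol, price FROM ticks WHERE ts >= '2017-01-01'", engine
    )

    # ...cache it in HDF5; format='table' plus data_columns makes it queryable on disk
    df.to_hdf("ticks_2017.h5", key="ticks", format="table",
              data_columns=["symbol"], mode="w")

    # Later sessions skip the database entirely
    aapl = pd.read_hdf("ticks_2017.h5", key="ticks", where="symbol == 'AAPL'")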
I like that as a rule of thumb; I think the following tend to bucket storage and processing solutions well enough to be a starting point
Small data: fits in memory
Medium data: fits on disk
Big data: fits on multiple disks
I've yet to come up with a rule of thumb for throughput though, and this can never replace the expertise of an experienced, domain knowledgeable engineering team. As always, there are lots of things to balance, including cost, time to implement and the now Vs the near future. Rules of thumb over simplify, but also give you a way to discuss different solutions without coming over as one size fits all.
It depends. You can store 4TB on a single HDD, but reading and processing it can take many hours, so you may want a big data stack to have your task parallelized.
I've ended up using "big data" tools like Spark for only 32GB of (compressed) data before, because those 32GB represented 250 million records that I needed to use to train 50+ different machine learning models in parallel.
For that particular task I used Spark in standalone mode on a single node with 40 cores, so I don't consider it Big Data. But I think it does illustrate that you don't have to have a massive dataset to benefit from some of these tools -- and you don't even need to have a cluster.
I think Spark is a bit unique in the "big data" toolset, though, in that it's far more flexible than most big data tools, far more performant, solves a fairly wide variety of problems (including streaming and ML), and the overhead of setting it up on a single node is very low and yet it can still be useful due to the amount of parallelism it offers. It's also a beast at working with Apache Parquet format.
Same here. Some problems in ML are embarrassingly parallel, like cross validation and some ensemble methods. I would love to see better support for Spark in scikit-learn, and better Python deployment to cluster nodes also.
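The pattern, sketched with pyspark plus scikit-learn in local mode (the grid and dataset are placeholders):

    from pyspark import SparkContext
    from sklearn.datasets import load_digits
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    # Embarrassingly parallel model selection: each core evaluates one candidate;
    # only small scores come back to the driver.
    X, y = load_digits(return_X_y=True)
    sc = SparkContext("local[*]", "cv-grid")

    grid = [{"n_estimators": n, "max_depth": d}
            for n in (50, 100, 200) for d in (4, 8, None)]

    def evaluate(params):
        model = RandomForestClassifier(random_state=0, **params)
        return params, cross_val_score(model, X, y, cv=5).mean()

    results = sc.parallelize(grid, len(grid)).map(evaluate).collect()
    print(max(results, key=lambda r: r[1]))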
It's very hard because they see "big data" as something that important people and important companies do. When you say "We don't have big data!", it translates to "We aren't that important!" This, of course, makes everyone very angry.
Be mindful of people looking to introduce big data without justification. They are playing a game of some sort (maybe just personal resume value, or maybe a larger vie for power), and you are positioning yourself as their opponent when you try to stop the proposal they're pushing. Do not go into this naively.
One thing is to create a standard benchmark for your current solution, eg a dataset and some standard queries, and run it occasionally. When they propose a "better" solution, point them at the benchmark and wish them well. This will achieve the two goals of measuring raw performance and keeping them out of your hair.
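A benchmark like that doesn't need to be fancy - something on the order of this sketch (queries and DSN are placeholders):

    import time
    import psycopg2

    # A fixed set of representative queries, timed against the current solution.
    QUERIES = {
        "daily_totals": "SELECT day, SUM(amount) FROM orders GROUP BY day",
        "top_customers": "SELECT customer_id, COUNT(*) FROM orders "
                         "GROUP BY 1 ORDER BY 2 DESC LIMIT 100",
    }

    conn = psycopg2.connect("dbname=prod_replica")
    for name, sql in QUERIES.items():
        start = time.perf_counter()
        with conn.cursor() as cur:
            cur.execute(sql)
            cur.fetchall()
        print(f"{name}: {time.perf_counter() - start:.2f}s")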
The problem you had was that you were trying to convince management to act in the best interest of the company rather than their own best interest. They would much rather be in charge of a $100k a month project instead of a $1k a month project. Also the big solution required a bunch of engineers working full time which puts them higher up the ladder compared to your 1 engineer working part time solution.
You were saving the company money but hurting their resumes.
Anecdote: At the startup (now acquired) that I work for, we simplified the architecture so much that we spend under $3k a month on infra for tens of millions in revenue a year, while having 99.9% availability for 5+ years now. I don't have any big-name tech to add to my resume. The fact that we ran an e-commerce site with a low budget and few engineers was never recognized by recruiters. They called me based on my experience with "distributed systems" 8 years ago. All recruiters (well, most of them at least) can look for is keywords.
On the other hand, we also had difficulties hiring "good" engineers. People chose either a company with a brand name recognition or one that's working on "exciting" technology.
As engineers, we fail to appreciate that we are there to serve business first and foremost.
As leaders, we fail to put our companies first.
This is an industry wide problem. If past trends of other major industries are any indicator, the whole "meritocracy" of tech industry will disappear within the next decade.
From a recruiting standpoint, I would aim to attract older applicants.
People established in their career don't need buzzword bingo resumes. Stability is important because you can leave the job at the door. Other things are more important, such as paying the mortgage, taking kids to the park on the weekends and not working all hours with a fragile stack.
Unfortunately, my friends in management positions say that they favor youth. Young devs lack experience and foresight, so they run into a lot of problems, but they cover all of that by working 18/7 and bringing sleeping bags to the office.
Not sure where you work - but not anymore... most kids out of college willing to work for places with "management" pay much more attention to the 9-to-5 thing these days.
I remember Joel saying something about "smart AND gets things done" back in '06. But I guess the youth of today are too busy working 18/7 and learning the latest, hippest Javascript meta-framework to read such outdated doggerel. https://www.joelonsoftware.com/2006/10/25/the-guerrilla-guid...
I would say that "buzzword bingo" is even more important for older candidates. The younger candidates will be perceived as more current, so the older ones will need to make clear that they are not "dinosaurs" stuck on 20 year old tech.
There has never been a meritocracy in tech, only the illusion of it. 80%+ of any job is about being well-liked, by peers but especially by the upper echelons. 20% is the upper limit of actual job function expected to be performed. If you're faking it well enough to pass the surface inspection and you're well-liked, your job is very secure.
The issue is that, compared to other industries, it's really hard to find people with that 20% in tech, so business people are forced to let political and image ignoramuses (sometimes ignorant to the point of failing to perform basic hygiene) into the club, and forced to try to stuff them in the attic or some other dark corner of the office where they won't ugly things up too much.
Many developers naively interpret this as a ruling meritocracy. The reality is that the business types resent having to do this, and a horrible technology worker with some image consciousness will easily destroy his better-qualified peers.
I'm familiar with a case of a person who can't even code getting himself elevated to a technical directorship with architectural and high-level direction responsibilities through such tactics. He appears to code by merging branches in the GitHub interface, by making vacuous technical-sounding comments in meetings with executives, by occasionally having someone come "pair" with him and committing the changes they make together at his workstation, etc., but if you locked him alone in a room with a blank text editor and said "Write something that does this", he wouldn't be able to do it. And the executives believe in him wholeheartedly and are currently working to install him as the company's sole technical authority. All significant decisions would have to get his approval, despite his being literally the worst technician in the company. All of his decisions up to this point have been politically motivated, aimed at coalescing power within his group and blocking outsiders who may want to contribute.
He was able to get there because he dresses well, he adopted the executive's interests and chitchats with them about these, he walks around the office telling jokes and smiling at people, and generally being what would be considered personable and diplomatic, whereas the other technical people go to their desks, hunker down, and spend their day trying to get some real work done.
I have seen similar guys as "senior architect". They speak well, are nice, sit in a lot of meetings, make nice Visio charts, use words like "integration pattern" but you never get anything real out of them.
There are so many people out there like this, it's maddening. So many people that don't actually DO anything other than create work for others and create an illusion of 'management'...
Meh - there are so many people making the same accusation you are that it's been rendered useless.
I read countless anecdotes on HN and hear many more in person of people with just the shittiest managers, of people who rarely see "competent" engineering organizations, of people who have "never" seen a competent project manager, that it really is a wonder we have any profitable companies at all.
In reality, if you don't understand the value someone is providing, you should make an effort to understand what they might be doing before making claims like the ones you're making.
I hear what you are saying. Before declaring someone is useless you definitely should make sure to understand what they are doing.
On the other hand, I am pretty convinced that there is a sizeable number of people in companies who create a lot of busywork "managing" things. The project I am on has 3 developers (as far as I can tell) and probably more than 10 business analysts, project managers, architects and other managers putting their name on it. I have tried to understand what they are all doing but from what I can tell there are two managers who actually help the project and the other ones write reports to each other, call a lot of meetings but don't really contribute. They just regurgitate what the few active people are doing.
I'm 32 and work as a PM at a Big Hip Tech Co. in the bay area.
Once I was on a team with 2 QA analysts, 1 Eng Manager, myself as PM, 3 BA's (that I did not want), and 3 developers, and one platform architect. All this plus 1 director overseeing our tiny team. Not to mention the 1-2 BA's I worked with whenever I worked on something that impacted another team.
During my 1:1 with said director, I once lashed out - I hadn't slept well in 4 days and I simply sounded off. I literally said everything that's been said in this thread: everything from why the fuck do we have so many people, give me 5 engineers and fire everyone else, to all you care about is the headcount that reports to you.
Luckily, I was a top performer, and while this tarnished my reputation with this director, I was able to smooth things over over the course of a few months.
This director explained to me that I was no longer at a start up. That this team should be resilient - that anyone should be able to take 2-3 weeks off at a time without interrupting the work. That they didn't want us working pedal to the metal 100% of the time. That it was ok that it was slow, and that I shouldn't be so self-conscious or hard on myself if I wasn't always working my fingers to the bone.
Now, I still thought we had way too much fat. Some of those BA's had no business being on a technical team, even as BA's and we should have traded in the architect and dev manager for an extra QA and developer.
But what that conversation did was bring me back down to earth. So much of what we view as right and wrong is personal preference. While I still disagreed with the amount of waste, it removed the chip on my shoulder and now I simply make sure to join teams that I like.
That's more of a ramble, but gives you some context as to where I was coming from.
As a dev I would be OK if these superfluous people would stay out of the way but in addition to not contributing they call meetings, ask for reports, filter information, schedule reviews and whatever. So they make my life more difficult without adding value.
Yeah, it's definitely true that a lot of people fail to comprehend the more holistic, bigger picture perspective. There's nothing necessarily wrong with a moderate working pace and building in some redundancy, especially since multiple people with overlapping functions can review one another's work, cover for each other, etc.
This, however, doesn't excuse hiring incompetent people based on appearance and likability with blatant disregard for their competence (I recognize that for many non-technical managers, it is difficult or impossible to discern the quality of one's skillset), nor does it excuse stuffing teams with dead weight just because the hiring manager personally likes the people involved. And those practices are indeed rampant.
> As engineers, we fail to appreciate that we are there to serve business first and foremost
The company has zero loyalty to you and will screw you over if it makes business sense to do so. There is absolutely no reason to put the company above your career interests.
This only holds if you believe that deliberately making decisions that will cost the company more than they should is in your career interests.
As said in this topic - it's true that in some places, resume-driven development pays off for the developers. It's not the same everywhere and to me it looks more like a symptom of a dysfunctional structure than par for the course in business.
This means it is in your career interests to increase company revenue and/or reduce costs, because this will make you more attractive to many companies (the ones we would likely want to work for) when you move on.
Exactly. It's in my best interest to make sure the business is doing well and I don't do things that jeopardize that. And no, it doesn't have to come at the cost of taking a hit to my career either. This is not a fine line balance I'm talking about - doing things with the main goal of boosting my career will sooner or later ruin it.
I am not sure how much it is in the interest of a developer to help the business to increase revenue or reduce cost. It's certainly good to come from a successful company but if you haven't worked on the latest cool tech you still don't look good. Even if that decision was 100% the right thing to do.
That's just my personal impression but I believe you have better chances coming from an unsuccessful company with all the right buzzwords than from a successful company with old tech. You will quickly be labelled as "dinosaur".
The real winner is to use cool stuff AND come from a successful/known company.
Can't say I'm 100% sure either. Certainly having nice buzzwords in the CV can be beneficial, but I would guess would be more common in certain corporate structures/areas/sizes.
I have no experience in being labelled a 'dinosaur', but I'm sure there are jobs where being practical and generating actual results will matter. In ideal conditions, these are the jobs which are desirable to work at, so I don't like the idea of optimizing for hotness in itself (at least for my own career decisions).
Given that changing jobs every 1-2 years is common now, I don't think actually trying to act in the interest of the company is a good strategy. By the time anyone really notices that you have, you'll probably already have another job. So perhaps the cure for resume driven development is fixing the constant job hopping problem.
The most (only?) reliable way to get a raise in this industry is to change jobs (or make noise about doing so). Most young companies don't provide a path forward for employees, which is somewhat understandable, b/c most young companies don't have a clue what their own path forward is. But without companies making an effort at solving this, they have pretty much guaranteed they will have very poor retention.
Is there a win-win here if we want to strike out on our own as "consultants"? What if we charge exorbitant sums for simple robust solutions, but provide the execs with needlessly technical descriptions of what we've done so they can show it off? Big budget, sounds impressive, works consistently, and we get paid. Or is that plausible?
As a lead dev, I recognize that if I want the early- or mid-career guys to stick around and keep working on this boring project, I have to throw them a bone from time to time.
Not everything should build their resume, but some of it has to. It's one of my arguments for buy over build.
What about the senior folks? Do they not need any bones because they're given challenging projects routinely, or because they're content doing a job for a check because they've got the spouse, house and kids to deal with?
Great point. Ultimately, you need to create a situation where they will also be winning career-wise.
I recently wrote about this. The TLDR is that 20% time is a great investment and can ultimately save the company a ton of time and money. It gives the engineers some playtime in order to build their CV's and "get their wiggles out". Ultimately, if done right it can protect your production systems from a lot of madness. https://hvops.com/articles/how-to-keep-passion-from-hugging-...
This is terrible advice. A C-level will see you as a rogue engineer. He will trust his managerial layers over a peon. I know this as I spent far too long being that rogue, peon engineer, screaming into the void. I just didn't get it. It's not about the tech used, and it never has been. It's about being enterprisey.
The grandparent is correct. This is 100% a political problem. We have a bad habit of discounting such problems in tech. We shouldn't do that anymore. Life is much better when you cooperate, instead of fighting an uphill battle.
Compact, reasonable solutions are the domain of startups. Bloating the engineering layer beyond any reasonable limit is an inherent cost of growing the company, and we shouldn't try to counteract that. We must operate within the framework we're given.
The political concerns extend beyond the company's own internals. They must appear enterprisey if they expect to be treated enterprisey. Today, enterprises have "data science" departments and blow $100k/mo on useless crap. If they're not doing that, it's a liability for the whole company, not just from the petty territorial perspective of one individual. It doesn't matter that it's possible to accomplish the same thing with one one-hundredth of the monthly expense. The enterprise signaling is worth the cost.
Google, Facebook, Amazon, Apple are definitely not enterprisey, and I don't think it has hurt their profits. I think they tend to be pretty frugal, overall, when spending their own money on hardware and infrastructure.
And I bet if the "enterprisey" company ever tries to compete with a company like Google, Facebook, Amazon, or Apple, they will be destroyed in that market.
Like most things that matter, the value of enterprise signaling is abstract, but it is undervalued at the company's peril. There are real consequences to getting it wrong, even if they're not directly measurable.
>And I bet if the "enterprisey" company ever tries to compete with a company like Google, Facebook, Amazon, or Apple, they will be destroyed in that market.
There are entire bodies of work on the question of when an enterprisey dynamic is better suited than a "disruptive" dynamic, to use Clayton Christensen's term, and vice-versa.
Such questions are not straightforward because in business, the winner is not the best technician. Good technology can give you an advantage if it's used right, but there is much more to business than just having the tech down. Most people cannot value the technology on its merits, and it therefore does not enter into their purchasing decision.
Maybe I'm being naïve, but wouldn't it make sense to run the $400 solution and use the remaining $99,600 to develop something that gets $99,600 worth of work done?
You're not being naïve and if you're thinking in terms of maximizing positive company financial impact, you're correct.
At the scale of most major, non-startup tech companies, however, 99k worth of work is minuscule: it is less than the cost of a single fully-loaded engineer's salary and benefits package.
We can look at the manager based on this and see his choice from two angles, depending on whether we assume he is acting in good or bad faith toward the company:
>Good faith:
"The large team is effectively guaranteed to succeed.
The likelihood that the 400 dollar solution works is an unknown quantity, and since that single engineer made it in the first place, I'd be putting a lot of negotiation power in his hands to ask for some large portion of the savings back as pay, meaning it's less likely we succeed and extremely possible he goes rogue. I'll go with the team."
>Bad faith:
"The company doesn't care about the difference between those numbers, they're the same at our scale. If I can waste ten people's time and net a sexy resume boost out of it for that little cost to the company, I'm probably the best manager they have.
No, you're not going to get to sabotage my next job if you're not going to do any work helping me spin this as somehow being better for my resume than me running a department with 10 people under me.
Actually, I've got an idea about that! I'm sure I can find something either wrong with your solution (or you) that allows me to say I tried for the savings, and after that failed, I went for the department I wanted anyways. I love a good compromise, don't you?"
> I'm sure I can find something either wrong with your solution (or you) that allows me to say I tried for the savings
Spot on. I'm getting flashbacks just reading this!
It's so easy to create FUD around "the $400 solution" that it's laughable.
-----
Upper management will be filled with so many questions:
* "If this is so cheap, why isn't everyone doing it this way? Surely all those important people wouldn't be wasting money, so there's gotta be something we're missing here."
* "What have we been paying 4 guys to do this whole time? Surely they would've figured this out earlier if it could've been done this way. I hope my boss doesn't hear that I've had a completely redundant department this whole time..."
* "If it sounds too good to be true, it probably is. This guy is probably just trying to supplant my trusted middle manager by making him look like a money-waster. I need to tell my secretary to filter my emails better..."
-----
And middle management can easily say:
* "While Bob is able to get the same output right now, he is doing it in a non-scalable way that will have to be rewritten over and over again as we grow. Our way costs more upfront but it will allow us to expand to fulfill EXECS_WILDEST_DREAMS. You don't want to go down the day that you're featured in Fortune Magazine because Bob's data analysis script hammered the database, do you? We should use the solution you wisely previously approved, the solution on which you heard that compelling talk at CIOConf last year. It is much better than being penny wise and pound foolish!"
* "If I pull up Monster.com right now, there are 500 Hadoop candidates in our area. How many 'Bob's Data Processing Toolkit' candidates are there? We would be painting ourselves into a corner, and if Bob ever left us, we would be stranded."
* "I too was amazed by Bob's Data Processing Toolkit, and I enthusiastically tried it. Unfortunately, my best employee Sally pointed out that Bob's Toolkit causes disruptive fits of packet spasm in the switch hardware, threatening our whole network. I asked him to fix this but he says that he doesn't even think that problem is a real thing. Yes, he had the gall to impugn my best employee, Sally! He is clearly in denial about this and too close to see the impact objectively, so I put him on another task. [Under breath: he is also clearly a sexist pig, and we're lucky Sally didn't call HR.]
"It was a valiant effort and I do indeed applaud Bob for his attempts and concern for the company's well-being, and I assure you, Mr. Upper Manager, that we are continuing to analyze his Toolkit's mechanisms in depth and we will apply all savings and optimizations that we can. However, as you know, if it seems too good to be true, it probably is, and it is just not realistic that a Very Important Company like ours could handle all of our Very Important Data for less than half the cost of your car payment each month."
It's interesting that's the connotation that was brought to mind for you. In the context of this thread, Google, Facebook, and Amazon are the definition of enterprise infrastructure. Google laid the ground work for much of the enterprise big data infrastructure. They've also developed buzz worthy software such as kubernetes, and spanner. Facebook and Google both are known for big data machine learning. Amazon brought the cloud to the mainstream, and offer many popular big data services through AWS. In general, there are few companies in the world who operate near the scale, with the reliability of these giants.
I think you're conflating enterprise with old and stuffy, and non-enterprise with bright colors and cutting edge technology. When I think of enterprise I think of software that needs to operate at scale with strict requirements on performance and uptime.
Apple loses a lot of talent to other companies, and has never really been known for having strong technology, so I understand that.
> Apple loses a lot of talent to other companies, and has never really been known for having strong technology, so I understand that.
That statement is just ridiculous.
Apple innovates a lot in the mobile and desktop spaces and on the software side they have pushed a lot of projects forward e.g. WebKit, LLVM. They also run some very large web services e.g. iCloud, Messages which are on par with some of the challenges Google and Facebook have.
100M people have downloaded apps I wrote. I know all about the issues with iCloud.
But as a web service that underpins so much of iOS, it is still on a scale and complexity that rivals anything Google and Facebook have. Apple doesn't get enough credit for actually making this work on a daily basis.
iCloud is impressive and is easily on a scale that most companies will never reach. But Google and Facebook are on an entirely different level. The comparison isn't even close. iCloud isn't a rival. It's more of a distant cousin.
They definitely deserve credit for making it work because even at their scale it's an amazing feat. But there's no comparison to Google or Facebook's scale.
The interesting thing is that Google and Facebook created big-data solutions to solve actual problems they were facing. There are plenty of Google data scientists that reach for R or Pandas well before they write a MapReduce, and if you do need to write a MapReduce (well, Flume/BigQuery now), it's highly recommended to run it on a sampled dataset before extending to the full corpus.
There are some "enterprisey" companies that do the same, but there are also a whole lot of companies that reach for big-data tools because they want to be like Google, ignoring that their problems are actually quite different from the problems Google faces.
Yes, I strongly believe that this has been one of the strongest drivers of tech fads since at least 2010. People want to be like Google, so they copy Google. They don't understand that Google probably would've loved to be able to make the thing work with Oracle vs. spending years developing their own internal systems, but the unique problem space put them at the disadvantage of needing to use a completely custom solution.
Google publishes an academic paper on this and the general public misinterprets it as a recommendation. Soon you see people writing open-source implementations "based on the GoogleThing Paper", and a new tech fad is born. It will consume billions of dollars before it dies in favor of another fad "based on the FacebookThing/TheNextGoogleThing Paper".
Walk up to most business guys and they will jump at the chance to "become more like Google". Try to talk them down from this, and your challenge is to convince them that no, we don't want to be more like one of the most important and influential technology companies in the world, the company that's in the news every day, whose logo he sees every time he looks at his phone, and the company that keeps taking all of the best hires from the universities. Worse, you'll be making that argument because "we're just not as big [read: important] as them". Not a promising position for the reasonable engineer.
This has been a terrible blight on our profession these last several years, but we just have to learn to roll with it. It's only by understanding and accepting the psychology around this that we can formulate effective counterstrategies, or make the best of the situation that's before us.
That's maybe a bad example as those companies all actually make use of the more expensive infrastructure.
A better question is what ROI does generic non-tech enterprise company X get from standing up a huge data science team for simple data management problems.
Have you actually worked for any of those companies? I have.
And I can assure they absolutely fall under the definition of an enterprise. Sure they develop a lot of technology in house but they still have significant amounts of classic, enterprise technologies. Especially in the "business" side.
I am always amazed by why people think of Google, FB, Amazon, etc. as something above the rest of the IT world. As if there are no managers (non-tech) or developers with biased tool preferences, or people working with a particular agenda.
Many times these issues are not in the core product teams. Companies tend to hire the best and be frugal on spending there, as it shows up as COGS in their finance reports.
Issues crop up when it comes to non-core product teams. For example, a business intelligence (BI) team is more prone to overspending time and money on getting huge clusters with "big data" because they perceive their users as needing data in real time.
You just mentioned 4 of the most valuable companies on the planet, who mostly got that way because of an awesome ability to rationally engineer their way out of a corner, and now you expect that rather than being best in class they are merely average????
Because that's what you are doing with your statement above.
The reason they are the Google, FB, etc of the world is _because_ of that unique capability. Do you honestly think there is, say, a hospital, anywhere on this planet that can hold a candle to what they do everyday?
The poster above was simply stating what the normal world looks and acts like.
My advice is to learn to enjoy the ride and let the company worry about its own business. Alienating people just to make a point does not end up at a happy place. I've learned this only after long years of getting beat up before finally deciding to accept the lesson.
>This kind of thinking is what creates so much madness in the programming world. I don't know where you work but it sounds like hell.
At some point, enough developers/managers will begin to take advantage of the system until executives wise up.
It's also possible that this is the CTO/CIO/management's first real position where they can throw around $100k projects out of a multi-million dollar budget and they are simply learning the ropes.
It's also again possible that the company has so much cash devoted to this department that no one cares because they are collecting large paychecks. In which case, you're likely acting against your best interest (long and short term) to not take advantage of it.
> The political concerns extend beyond the company's own internals. They must appear enterprisey if they expect to be treated enterprisey. Today, enterprises have "data science" departments and blow $100k/mo on useless crap. If they're not doing that, it's a liability for the whole company, not just from the petty territorial perspective of one individual. It doesn't matter that it's possible to accomplish the same thing with one one-hundredth of the monthly expense. The enterprise signaling is worth the cost.
This is a great opportunity to sell $400/mo. enterprise solutions for $100k/mo.
It's also potentially a good way to fund high-security implementations. Get the protocols, data formats, and so on that you can't easily change later right ahead of time. Design the rest for easy modification. Rapidly build the product with just enough assurance. Focus on sales, support, and integration contracts. Eventually shift to rewriting the thing a bit at a time for high security/QA, with any reusable components dual-licensed. If enough money comes in, pay a hardware team to put together some secure, RISC-V servers on top of that.
You're presenting a dichotomy: alienate superiors by communicating erratically or without solutions, or be a willing accomplice to incompetence and fraud.
I don't understand your interpretation. The company knows that there are cheaper ways to accomplish the task, but they don't care. They have knowingly decided to spend on a solution that costs more, in the belief that they are extracting some value from the additional cost.
The key is that that value is not strictly technical. You can present a technical solution that is cheaper, but doesn't offer the non-technical value they derive from being a player on the "Big Data" scene. They can say "Yeah, we use our main guy's Perl script" or they can say "Yeah, we use Hadoop".
Is the value in that worth $99k per month? That's a subjective judgment for each company to make based on their specific circumstances.
Performance is really not something you can design from scratch. If you are using Hadoop for a 1GB job, it's likely that you have architectural bottlenecks that will prevent you from scaling to multi-terabyte workloads anyway.
And if you overcomplicate things, you can easily end up in a state where the two guys who each half-understood the Hadoop setup have both left for different startups. Complexity alone does not make things simpler.
Of course, you can provide the non-technical value of ~training~ makework and resume lines for 10 code monkeys and their managers, but that is not really value to the company.
Off the top of my head, the company can:
a) get its engineers to give a talk at a Hadoop conference, resulting in marketing [logo shown prominently around the conference], PR, and recruitment gainz;
b) get articles published about how the company uses cutting edge technology to do new things and all the other CIOs and big shots better listen up, resulting in prestige, PR, and recruitment gainz; (this happened to a client in real life)
c) reasonably field interrogatory questions from other fad followers, whether they are investors, journalists, peers, or whomever. When asked "How is YourCorp using data science and Big Data?", being able to say "We have a team working with that" is much better than having to say "Our guy Bob says that's just a fad, so we don't really 'do that'". This is basically a PR gain, but it means that investors and clients will see the company as cutting edge, instead of as backward philistines who listen to Bob all the time.
I could go on but it's pretty boring.
The point is that business is all about the customer's perception of the company as something to which they want to give money. If the business does not appear to be following the trends, they will be substantially harmed, because people do not want to get involved with an outmoded business. Being perceived as the last to adopt a new technology looks bad.
Reminds me of almost every technology project I've seen in finance! It's all about building complex empires, rather than simple, functional solutions. Sometimes, it's also just about using up the assigned technology budget so that it's not scaled down the following year!
Seems like you could spin up a side project business that has a simple solution to their problem but charges enough to soak up their whole budget. Everybody is happy!
You have to provide enough bits and twiddles to let the dept head hire his favorite people, and then you're right. This is basically what Tableau et al. have done. There are enough buttons that you can have a few "Tableau guys", and you don't have to hire custom grapher guys.
Funny, I had a phone screen today and in the midst of asking about how I saw my future and what I wanted to work in, the recruiter was tiptoeing around a situation where someone at the company had suggested React or something and the devs had pushed back (successfully) on the basis that the site didn't need it. I got the feeling he was trying not to step on my dreams of being a cutting-edge JS frontend trailblazer, but it is really a point in the company's favor (to me) that they were able to resist the urge.
Basically, they are looking for web developers but it seems like they have to filter out all the frontend ninja rockstars.
Why didn't they "need" it? How do they decide what they "need"?
This sounds like typical anti-change pushback, which I have learned can actually be a good thing. However, this anecdote is severely lacking in insight; much like most people's support of, or opposition to, change. Further, like the widespread belief that sentences shouldn't start with conjunctions; much less conjunctive adverbs.
Sure, maybe. But the post doesn't go into the details or even indicate the details existed. I'm not sure there is much insight here sans those details.
At this point, I'm not sure I agree. I am...not a fan of JavaScript, to put it mildly (though ES6 does a lot to suck less). But for my money, nothing out there is better at templating even static HTML than React and JSX. The compositional nature of components is way better than any other templating stack I've ever used and it's pretty easy to bake it into static HTML (Gatsby and `react-html-email` being awesome projects for it).
I'm sure there are declarative, object-oriented (rather than text-oriented) templating engines out there that use an approach like React's. But I would consider using an imperative, text-oriented templating language a yellow, if not red, flag in 2017.
I use Twirl, included in the Play Framework (Scala/Java).
It is functional (well, as functional as React), and templates compile to plain ol' functions, so compatibility and static typing are the same as in the rest of your program.
Obviously, if I needed a SPA or something, it's not what I would use, but again, not everything should be an SPA.
You don't need to use React as a SPA, though, is what I'm saying. (When using React, those component trees, too, compile to 100% plain-ol'-functions.)
Twirl is fine, insofar as it's attached to Play (that's not to impugn you for picking it; my history with Play is colorful and frustrating). I wouldn't raise a flag for that. But not using something in this vein definitely is one, and React is probably the most accessible way to do it for the 90-95% case of developers.
I don't know all the reasons they didn't "need" it, it was just a phone screen with their recruiter. The point was just that they only used a sprinkling of JS in general.
I had a phone screen with a start up in the middle of my last job hunt where they let me know they were in the middle of porting their entire frontend to React + Redux, while rewriting their backend. I was "unable to find time" to meet with them further.
> You were saving the company money but hurting their resumes.
To take it a step further, Management has to own up to their original failure and try to explain to their bosses how they could spend so much time and money unnecessarily.
Another psychological problem here is the perception by people who do not understand the technology that the higher priced solution is better because it costs more.
And a step further: a $100k project has commissions, meetings, and expenses all associated with a $100k project. If people are cheating enough to waste $100k on useless crap, expect them to cheat some of that for themselves. That means hotel rooms, airline tickets, meals, and so on, plus any commission the sales guy shares with you for making it happen. And it does happen...
I would love to agree with you but there is another interpretation: they are clueless.
Suppose management in a firm uncritically contracted the big data revolution meme. Then they believe they are in the position of a city mayor trying to build a bridge and some wiseacre comes along and says they can do it in an afternoon with two pieces of string and a stick of gum. The problem is that the analogy doesn't hold, but they don't know that.
Add to that, if anything goes wrong, they would have a much easier time covering themselves and convincing their superiors they did all they could to prevent it if they built a large, complex solution than if they just went with the quickest, simplest one. In fact, justifying anything going wrong at all is much easier with large, complex undertakings than with small, simple ones.
Plus, people jump on fads; big data is trendy right now, and when you read and hear about something a lot, your mind tends to go to it first.
This. You've heard the story about the one-man engineering shop with the better, faster, cheaper solution competing with IBM (slower, thousands of $$$ more) for business? IBM won every time, because the buyers wanted to be associated with big, impressive projects (and have IBM on their resumes). This is actually a sales pricing tactic that every engineer should learn, so that they don't price themselves out of the market by being too cheap.
I've worked in a similar situation -- a transaction processing system with ~1M tx/day and ~2TB total data. They used 15 Cassandra nodes and a monster Redshift cluster (>64 CPU cores, ~512GB RAM, 30+TB disk) for the OLTP and OLAP queries. I almost can't put Cassandra and Redshift on my resume because when more knowledgeable people ask about my experience with them, that pathetic use case makes me look bad by association.
One bunch of yahoos I ran into was building a Hadoop cluster to do some bullshit to user and PC configuration data before loading it into a big helpdesk system like Remedy or ServiceNow.
They were 6 months in and made no progress other than building 30 servers. I wrote an awk script to process the CSVs and did the rest in Excel in about 30 minutes. I had an intern automate the process with a perl script, which took 3-4 days! :)
The program management was very upset, mostly because they looked like a pack of clowns.
Their solution is too hot, but yours is maybe too cold. Somewhere in between is an approach that is just right. While I too am wary of those who need to use the latest craze on every simple problem, I've become equally wary of the "tried and true" orthodoxy. The Python/Flask-Postgres stack can be great for rapidly prototyping a functioning application, but sometimes this solution is unable to evolve or scale to address irregular needs. It's almost always fine for your very typical web app. It can struggle with more complicated data processing, especially where irregular resource utilization and complex workflow orchestration are concerned. Celery workers can only address those problems to a degree. Home-grown ETL is easy at version 1 and usually a nightmare by version 2. It's a hard problem with lots of wheel reinventing, so it's good that there are some emerging common platforms (particularly things like Airflow).
A full on hadoop stack is rarely warranted, but I can understand the reasoning behind wanting a flexible enough processing capacity to accommodate any anticipated load regardless of frequency.
Yes, but if you try to design for what you anticipate to be the bottlenecks as you scale, you are almost guaranteed to discover the real bottlenecks are in an entirely different part of the architecture than you anticipated.
So there is still a good argument for developing the Minimum Viable Product in whatever technologies are most productive for your developers, and figuring out how to scale as you grow.
Some of this however is 'cookbook coding'. Too many people learn recipes rather than ingredients.
Consider the programmer who goes to a 'big data' class and is taught how to use the stack. They are taught this generally on a 'toy' application, because it would take too long to set up a real one. They are there to learn the stack, so they either ignore the 'when would this be appropriate' slide or give it only lip service. Now they come back to work and they know the recipe for using this big data stack.
The boss gives them their task, they apply their recipe, and voila, they have a solution. All win right?
If it is any consolation, the type of engineering situation you are experiencing does (in my experience at least) eventually correct itself, with the manager being moved out.
Okay - but what if the frequency increases to three files and beyond within a year, the file size doubles, and the expected processing latency needs to be reduced?
Also, compression doesn't go well with ad hoc analysis.
Sounds like maybe a Hadoop setup is not the worst idea to be ready for the future.
I was interviewing for a company that had "big data" once; they were looking for someone with Hadoop experience. It turns out 8GB is big data, and I told them they might want to explore other options, since they could do that in-memory with Spark or Pandas.
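For scale, here is a minimal sketch of what that in-memory route might look like with pandas (the file layout and column names are hypothetical, not from the original story):

    # A few GB of CSVs fits comfortably in RAM on one ordinary machine.
    import glob
    import pandas as pd

    frames = [pd.read_csv(path) for path in glob.glob("events/*.csv")]
    df = pd.concat(frames, ignore_index=True)

    # A typical analysis step: daily event counts per segment.
    df["day"] = pd.to_datetime(df["timestamp"]).dt.date
    print(df.groupby(["day", "segment"]).size().unstack(fill_value=0).head())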
Similar story. It became clear over the course of a (contracting) interview that the cloth-eared PM, who according to LinkedIn had just completed some big data course, was using every imaginable buzzword he could fit on the whiteboard, and he seemed less than impressed that my modelling attempts interfered with his drawing the most ridiculous architectural diagram to power... an executive management dashboard.
I just flat out told him I didn't want anything to do with that kind of project and asked to be shown the exit.
I am going through this right now. We are producing 1 TB per year and have to run a few SQL queries per week. Some people in corporate IT have sold our management on Hadoop, and now the whole project has turned from simply storing the data in a SQL database into this crazy cycle of writing requirements for IT for every little thing. We are not even allowed to write directly into Hadoop but have to go through their badly defined API.
I have run a SQL database on a USB disk with the same data without problems, but some people are just attached to the idea of "big data", so Hadoop it is.
This seems to be a pretty common sentiment here. But here's my question: if your startup doesn't have big data now, should you assume it will not have big data tomorrow?
I work with a startup which currently doesn't have "big data", but perhaps "medium data". And I can perhaps manage without a big data stack in production. But if I look at the company/sales targets, then in the next 6-12 months we will be working with clients that guarantee "big data".
Now, here are my choices -
1. Stick to Python scripts and/or large AWS instances because they work for now. If the sales team closes deals in the next few months after working tirelessly, then even though they sold the client a great solution, in reality we can't scale and we fail.
2. Plan for your startup to succeed; plan according to the company targets. Try to strike a balance for your production stack that isn't huge overkill now but isn't under-planned either.
It's easy to say we shouldn't use a big data stack until we have big data, but it's too late (especially for a startup) to start building a big data stack after you have big data.
Been there. Believe me: stick to Python scripts! Always. And when you finally land that first customer and you have trouble scaling, first scale vertically (buy a better machine) and work days and nights to build a scalable solution. But no sooner.
Why? Because your problem is not technical, it is business related. You have no idea why your startup will fail or why you will need to pivot. Because if you did, it wouldn't be a startup. Or you would have had that client already.
You might need to throw away your solution because it is not solving the right problem. Actually, it is almost certain that it is solving a problem nobody is prepared to pay for. So stick to Python until people start throwing money at you - because you don't have a product-market fit yet. And your fancy Big Data solution will be worth nothing, because it will be so damn impossible to adapt it to new requirements.
I wish I could send this comment back in time to myself... :-/ But since I can't, how about at least you learn from my mistakes and not yours?
With tools like message queues and Docker making it so easy to scale horizontally you don't even have to go vertically.
We just won an industry award at work for a multi billion data point spatial analysis project that was all done with Python scripts + Docker on EC2 and PostgreSQL/PostGIS on RDS. A consultant was working in parallel with Hadoop etc and we kept up just fine. Use what works not what is "best".
I assume searching for [employer] could direct someone here vs searching for knz/me specifically would take you to my HN profile page.
I'm not ashamed of anything I've said on HN but would rather not have people just searching for my employer ending up here (especially since I work in an office that routinely deals with sensitive political and community issues). It's a minor amount of (perceived) anonymity vs stating my name/job title/employer here!
> With tools like message queues and Docker making it so easy to scale horizontally you don't even have to go vertically.
That depends entirely on the workload. It's not always a good idea to move from one sql instance, to a cluster of them. Just buy the better machine that gives you time to make a real scalable solution.
Well, they could start by using something faster than Python. I would tend to use Common Lisp, but Clojure would be the more modern choice.
But yes, scaling up is far easier than scaling out. A box with 72 cores and 1.5TB of DRAM can be had for around $50k these days. I think it would take a startup a while to outgrow that.
Python is plenty fast where it matters. You have heavily optimized numerical and scientific libraries (numpy and scipy) and can easily escape to C if it matters that much to you. But in my experience, bad performance is usually the result of the wrong architecture and algorithms, sometimes even outright bugs, often introduced by "optimization" hacks which only make the code less readable.
This holds for all languages, of course, not only Python. Forget raw speed, it is just the other end of the stick from Hadoop. Believe me, you don't need it. And even when you think you do, you don't. And when you have measured it and you still need it, ok, you can optimize that bottleneck. Everywhere else, choose proper architecture and write maintainable code and your app will leave others in the dust. Because it is never just about the speed anyway.
I agree that it depends on what you're doing, and the speed of the language often doesn't matter -- that has to be the case or Python would never have caught on to begin with.
But you can write code in Common Lisp or Clojure that's just as readable and maintainable (once you learn the language, obviously) as anything you can write in Python, and the development experience is just as good if not better.
Your choice of "big data" vs. python scripts sounds just like the classic trade-off of scope creep vs. good enough.
IMO the answer is almost always "good enough". This has been expressed in countless tropes/principles from many wise people, like KISS (Keep It Simple, Stupid), YAGNI (You Ain't Gonna Need It), "premature optimization is the root of all evil", etc.
If you go the YAGNI route, then when your lack of scale comes back to bite you (a happy problem to have), you'll have hard data about what exactly needs to be scaled, and you'll build a much better system. Otherwise, you'll dig deeper into the premature-optimization rabbit hole of hypotheticals, and in that case, it's turtles all the way down (to use another trope).
Thanks for sharing this story. My experience has been that data scientists and analysts aren't able to use Hadoop/Spark efficiently even in cases where it is warranted. These individuals generally don't like working with Java/Scala and/or haven't spent time understanding the underlying structures used (e.g., RDDs, caching, etc.). As a result, they either don't put their sophisticated modeling or analyses into production, or they hand off their application to other engineers to implement for production-size "big data." This produces all sorts of problems and inefficiencies, not the least of which is that the engineers don't understand the analyses and the data scientists don't understand the implementation.
My (biased, as I work for them) opinion is that something like Pachyderm (http://pachyderm.io/) will ease some of these struggles. The philosophy of those who work on this open source project is that data people should be able to use the tooling and frameworks they like/need and be able to push their analyses to production pipelines without rewriting, lots of friction, or worrying about things like data sharding and parallelism.
For example, in Pachyderm you can create a nice, simple, single-threaded Python/R script that runs nicely on your laptop. You can then put the exact same script into Pachyderm and run it in a distributed way across many workers on a cluster, keeping your code simple and approachable while still allowing people to push things into infrastructure and create value.
I work in data science. Before interviewing at a company, I try to find an engineer and ask them two things:
1. What's your tech stack?
2. Why?
This blatantly disqualifies ~90% of startups which are doing crazy things like using Hadoop for 10gb of data. OTOH, I get really impressed when someone describes effectively using "old" technologies for large amounts of data, or can pinpoint precisely why they use something with reasons other than data size. One good example: "We use Kafka because having a centralized difference log is the best way we've found for several different data stores to read from one source of truth, and we started doing this years ago. If we were starting today, we might use Kinesis on AWS, but the benefits are small compared to the amount of specific infrastructure we've built at this point."
Not quite the same, but years back I was to take on the title of "BizTalk consultant". My company and another consulting company had decided that what our customer needed was a BizTalk server.
When the BizTalk implementation broke down and I first needed to look into how it worked and what it did, I found that it just moved a small XML file to an SFTP server and wrote a log entry. So I replaced the entire setup with 50 lines of C#. Luckily my boss was onboard, arguing that we didn't really have the qualifications to do BizTalk.
The idea had originally been that the customer, a harbour, would in the long run need to do a ton of data exchanges with government organisations and shipping companies. The thing is that they were planning for a future that never really happened.
> I just pulled out a python script, and combine that with Postgres and within an afternoon I had completed the project (albeit not production ready)
There are two problems here - one is that people prototype their architecture using massively over-engineered systems and the second is that a rough prototype makes it way into production.
So, as a Hadoop perf engineer, I deal with both issues - "We have a Kafka stream hooked up to a Storm pipeline and it is always breaking and we can't debug ... what is this shit?" or "Postgres is stuck on BYTEA vacuum loops and you can fix it with Hadoop, right?".
There are significant advantages in prototyping with an easy to debug architecture, until all the business requirements stabilize.
Sometimes the right answer is to use Postgres better (specifically, table inheritance for deletion of old data + indexes instead of delete from with <=), sometimes the right answer is a big data system designed for cold data storage & deep scans.
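To make that Postgres option concrete, here is a rough sketch of the inheritance-partitioning idea (the schema names and exact DDL are my assumptions for illustration, not the commenter's setup): old data is expired by dropping a child table instead of running a slow DELETE.

    # Sketch only: month-partitioned events via table inheritance (pre-PG10 style).
    import psycopg2

    conn = psycopg2.connect("dbname=events")
    cur = conn.cursor()
    cur.execute("""
        CREATE TABLE IF NOT EXISTS events (ts timestamptz NOT NULL, payload text);
        CREATE TABLE IF NOT EXISTS events_2017_06 (
            CHECK (ts >= '2017-06-01' AND ts < '2017-07-01')
        ) INHERITS (events);
        CREATE INDEX IF NOT EXISTS events_2017_06_ts ON events_2017_06 (ts);
    """)
    # Expiring a month is a cheap metadata operation, not DELETE ... WHERE ts <= cutoff.
    cur.execute("DROP TABLE IF EXISTS events_2016_06;")
    conn.commit()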
This all comes back to the "We made a plan, now it is a death march" problem - the iterative failure process of building a prototype and having it actually fail is very important, but most people feel like they'll get fired if their work fails in the real world.
Over-engineering is usually career insulation, and somewhat a defense against an architecture council.
I'm no fan of big data, but this is completely unbelievable. And by that I mean I don't believe you.
The industry standard is ~100-200 debugged lines of code a day. (If your team tracks hours, look this up on a big project; taking into account all the hours one spends, not just coding.)
So your claim, being generous, is this team spent 6 months to not produce the equivalent of 100 lines of code? Even if completely true, this comes off as a humblebrag.
I had a job interview for a Data Science position, in which I was talking about data science stuff I'd done in my current position. I mentioned that I started with R, but it would keep choking and running out of memory, so I moved to Python and largely fixed the issue. To which one of the interviewers asked why I didn't use a Spark cluster.
I was almost too stunned to answer. Their solution was not to just use something more efficient, or even to just rent a single, more powerful EC2 instance or something, but to go to the effort of setting up and running a cluster. The dataset wasn't even that big: a few GB on disk.
There's no a priori reason setting up a cluster has to be more effort than setting up a non-cluster. I mean, a computer is more complicated than a breadboard of custom-wired components, but most people find it easier to solve their problems with a computer.
And Spark is usually much more efficient than R or Python (and gives you a nicer language to work with IMO, though that's very subjective). It's entirely possible a 1-node cluster would have outperformed your non-cluster approach, and while a 1-node cluster seems stupid it's useful if you know you're eventually going to have to scale, because it ensures that you test your scalability assumptions and don't embed any non-distributable steps into your logic.
Hadoop doesn't take six months for the project you described. It sounds like your team was learning Hadoop, not executing on Hadoop. Are you concerned that you fought a utility in favor of a tactical solution that now requires its own independent maintenance lifecycle?
I've been around this loop over the past few months. Something to bear in mind: Hadoop of 2017 is quite different to the Hadoop of 2013.
Our data isn't enormous, by any means - 160G in one particular instance that's being used for proof of concept, it'll add up to 5-20T should it reach production. The catch is that it's 160G in MySQL; it's only 5G or so once it's been boiled down to Parquet files in HDFS. Columnar stores can be a really big win, depending on the shape of your data.
We use Impala for our queries. It's quite good tech; it's much faster at table scans than everything that doesn't describe itself as an in-memory database. That means writing SQL much like you would with Hive, only it runs faster.
I tried out both Citus and Greenplum to give PostgreSQL a fair shot. First problem: PostgreSQL is limited to 1600 columns in a table, and the column limit for a select clause isn't much bigger. We have several times this number of columns in our largest analytic tables. Not the end of the world, you can cobble things together from joins and more special-purpose tables.
Second problem: CitusDB doesn't come OOTB with a column store, and it's far too slow when using a row store. I didn't bother trying to compile the column store extension to use with Citus; the pain ruled itself out. I continued ahead with Greenplum, focusing on columnar storage - row storage is consistently poor.
Third problem: Greenplum is cobbled together from a pile of duct tape, an assembly of scripts and ssh keys to keep the cluster in sync. It does not inspire the same kind of confidence for operational management as HDFS, whether rebalancing the cluster, expanding the cluster, or decommissioning nodes (not supported with Greenplum, AFAICT).
Fourth problem: Impala simply runs faster than the Postgres derivatives, and its lead increases the more data you have. Impala seems to do table scans over twice as fast on identical deployment environments.
Indexes only help when the operation being performed can use the index. As it happens, most analytic queries do full scans, or have predicates that are either not very selective (randomly skipping rows here and there) or are really selective (date bucketing, which maps well to typical Hadoop partitioning strategies). I had some hope that indexes would help for joins; but Greenplum didn't elect to use my indexes, and when I forced their use, it ran slower. The ancient version of Postgres that Greenplum is forked from doesn't help much either, since it can't e.g. use covering indexes to avoid looking back to the table.
If it was my startup, I'd take a risk on something like MonetDB, or look harder at MemSQL, given what I've seen about how data has shrunk with column stores. But from what I've seen and measured, Postgres doesn't really cut it for analytic queries.
If it's 5G boiled down and your production data is likely to be within two magnitudes, why in the world are you not just reading all of it into RAM and operating on it?
Like I said, if it was my choice, I'd use MonetDB - that screams. But it has operational deficits, like experimental replication. I'd also look further into MemSQL than is visible on their website.
The reason I don't load it into memory and operate on it directly is because it's analytics as a service; I don't particularly want to write a SQL parser and execution engine.
We have looked at things like Zeppelin for more interactive data manipulation, using Spark to keep stuff in memory. But building a UI around that is an open-ended rabbit hole.
> PostgreSQL is limited to 1600 columns in a table, and the column limit for a select clause isn't much bigger. We have several times this number of columns in our largest analytic tables.
What's "this kind of work"? :) The company I work for does reconciliation as a service, something that's pervasive in finance. We support customer-defined schemas; in fact that's one of our selling-points. I could start talking about the kinds of things we've done with MySQL to make this perform well - it's not difficult, just a bit unorthodox.
Anyway, the diversity in customer schema leaks out into the Hadoop schema, where we'd much prefer to give customers data using column names they're familiar with, and we also want to give them rows from all their different schemas in a single table (because many schemas have overlap by design). The superset of all schema columns is large, however. The problem can be overcome with more tooling - defining friendly views with explicit column choice - but having the option to implement that (and go to market sooner), vs a requirement to implement that, adds up to a distinct advantage for tech that can support the extra columns.
Something you need to bear in mind is that distributed joins are very expensive; you have a better time designing your schema such that related data can be placed logically close together, whether it's arrays / maps inside rows (for one to many), or very wide rows (denormalizing what might be a star schema).
(I know, in a column store having related data in another column isn't actually close together; but it can be stepped through at the same time, it doesn't need a join to be correlated, it's correlated naturally.)
Yes. Customer analytical record tables are extremely wide. I have a number of enterprise customers who have use cases with tens of thousands of columns. And trust me that every enterprise is moving to having one since they are needed for fast supply of data to decisioning systems like PEGA.
That's why I always find it hilarious when HN goes on about just using PostgreSQL or some other SQL database for everything when they don't understand the use case. They simply don't work in these scenarios.
Why so many columns? I would expect there would be some way to break the data down or transpose it somehow to make it more manageable. At that point are you even really dealing with a database as most people use the term?
Because they are attributes of a customer or a product. And many companies these days know a LOT about you as a customer e.g. everything from your age to how likely are you to purchase product X.
It has to be one table because you need to get attributes of a customer very quickly (single digit milliseconds) in order to respond with the next best action e.g. show this advertisement or route them to this call centre person.
And of course this is a database. It's all of the information about a customer in one place.
I agree with the sentiment of the article, but why does this site ask for notification permissions? It's a blog - if I want notifications about new posts I'll add your RSS feed to my reader.
There's a meta-pattern here. I've seen countless (especially younger) dev teams focus on new and hot tech to the detriment of solving the business problem. I've seen this happen in two, sometimes overlapping, cases:
1. When the devs' agenda is about learning new tech rather than solving business problems. The ways to solve this are to incentivize devs at the business problem level (hard) or find devs who care more about solving business problems instead of learning hot new tech (easier).
2. When the product management function is weak within an org. Product defines the requirements, and makes trade-offs around the solution. A strong PM will recognize when a bazooka is being used to kill a fly, and will push dev to make smarter trade-offs that result in a cheaper, faster, more maintainable solution. This is especially challenging when the dev team cares more about shiny tech than solving business problems.
It's not necessarily devs pushing for hyped technologies that don't fit the business problem.
Half- or nontechnical managers often follow tech hypes as well and may push for the project to use "Big Data technology" simply because it makes them feel more important to lead a project that is part of the hyped topic.
Yes, I've seen this syndrome coming from (pretty senior) management far more than from developers. My suspicion is that it's more about padding the CV than just "feel[ing] more important" - they may well be behaving rationally at the individual scale, they're just responding to perverse incentives in the job marketplace.
Agreed! This falls into the "weak product management" category imho.
FWIW I think most devs could learn how to push back on poor decisions like this made by non-technical managers as it really shouldn't be the manager's responsibility to dictate technology choices for a solution. It should be their responsibility to push the dev team to make tradeoffs in order to achieve a business result.
Devs who are good at understanding the business objectives and pushing the non-technical team to make better decisions are both wonderful to work with and command higher salaries / fees.
I would say that business work and new tech should basically never meet.
Scout the technology ahead of business needs. If you want to look at new tech, you should be doing a proof-of-concept that the business is not depending on. If you are doing business work, then you can either use things you know, or use a completed proof-of-concept to move in a new direction. But you should not mix business needs with that initial proof-of-concept.
For example, the technical problem is "create secure tokens and distribute them in authenticated messages to N clients that have not established trust", whereas the business problem is "we want a user login system".
I realize this whole thread is pushing back against unnecessarily scalable tools, but what if the problem actually scales? Is it that foolish to take a risk on learning and deploying scalable tools vs. simpler ones?
Good point. Maybe it's actually a good risk mitigation. Even if the odds of scaling that far are small, it may be worth it to make sure you're not squashing them further by introducing scaling issues to trigger the moment you have a big break.
Developers have been telling other developers that they need to study and learn new tech to survive.
Developers have also been telling other developers that switching jobs is the best way to maximize their paycheck.
It's debatable whether developers should study this new tech on their own time or not, but there's clearly developers who would prefer to do it on the company's time.
Naturally, some developers will learn new tech while solving a company's problem so they can get paid to learn new tech.
Combining all of these things, it's rather obvious why developers are building solutions to simple problems with new tech.
If you want developers to act differently, you have to change the narrative (or incentives) around new tech, jobs, and salaries.
You can't blame us for doing the things we've been told to do by our peers for at least the last 5 years.
Honestly, this is an outdated article. In 2017, Hadoop is more like an OS/platform/ecosystem with a filesystem (HDFS), a scheduler, applications, etc. Spark, Presto, and Hive now ensure that you no longer have to write map-reduce jobs. I understand the message that 600 MB fits in memory and that command-line tools are fast, yet it's better to just use Spark (which has a local mode with convenient in-memory caching) so that when you have to create a company-wide data handling/processing system you can just "plug in" your code to a Spark/Presto/Hadoop cluster.
Finally, if you are truly looking for speed while maintaining portability, these days I would recommend using Docker containers with external volumes created on tmpfs, providing the speed of an in-memory implementation while being agnostic to both OS and FS.
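For what it's worth, a minimal sketch of the Spark local mode mentioned above (paths and column names are made up; this assumes Spark 2.x with pyspark installed):

    from pyspark.sql import SparkSession

    # Same DataFrame/SQL code whether the master is local[*] or a real cluster.
    spark = (SparkSession.builder
             .master("local[*]")              # use every core on this one box
             .appName("not-actually-big-data")
             .getOrCreate())

    df = spark.read.json("events.json").cache()   # convenient in-memory caching
    df.createOrReplaceTempView("events")
    spark.sql("""
        SELECT user_id, count(*) AS n
        FROM events
        GROUP BY user_id
        ORDER BY n DESC
        LIMIT 10
    """).show()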
Honestly, you also must factor in the price of distributed computing, and take into account the fact that processing power and disk size have increased a lot in that timeframe as well.
Then you don't need a new OS/platform/ecosystem with a filesystem (HDFS), scheduler, and applications to maintain, host, and secure in order to compute things that could be done with local scripts, or even Java/Cython programs, or distributed across a few servers when that can (often) be done easily.
You don't need a new OS/Platform/Ecosystem to run Hadoop/Spark. It is literally just half a dozen Java apps that you run on typical hardware on your regular Linux OS. That is specifically what it was designed for. So rather than write your own Java apps and distribute it yourself why not use a platform that is proven, performant and with Spark will allow you to do things that you could never write yourself.
The OS/Platform/Ecosystem was a reference to the post I was replying to.
To be clearer, I don't think you need such things when you just need one, or maybe a few, servers talking together, once you remove the cost of parallelism (see the article about COST) and all the Hadoop machinery. I have an OS, platform, and ecosystem; it's called Unix, and it works for 95% of my needs.
Frankly, I've seen my current project implemented partly on a single node (to be replicated for later scalability) and partly in multiprocess Python. The Python is maybe 10x speedier and has 10% of the code and run complexity. We tried to run it on a shared cluster of a hundred nodes for giggles; the Python finished running before the other one started launching. That's not even taking setup/maintenance into account.
The COST model is a good one indeed.
So I'm roughly paraphrasing the article (which I point anyone talking about needing big data to, just before asking how big their data exactly is.)
Yup. Hadoop is a DIY database engine ecosystem; the value is in the ecosystem and the choice of plugging in different bits, all tied together around a common core in HDFS and possibly YARN.
But if you start with 600 MB, you have quite a bit of headroom before you grow out of memory on a single server. Normal off-the-shelf servers fit 3-6 TB of RAM, which is roughly four orders of magnitude up from 600 MB. Even AWS EC2 has 2 TB instances.
Anyone have suggestions for tools to help with data analysis on moderate numbers of very large records? We're working with close to a million records per year, which isn't "Big Data", but the records themselves usually contain over a thousand fields. (Is that "Wide Data"?)
Right now the records are stored in JSON documents, and we're working on ways to process those documents for reporting and analysis. There were some previous attempts to store the records "normalized" in SQL, but it quickly became difficult to modify and maintain.
We want to be able to generate things like pivot tables (charts, graphs, etc) using location and/or time data from the records compared with many other fields in the record. We're feeding a little of the data into Microsoft SSRS, but setting it up is time-consuming and the results are slow and ugly.
I've done this in the past with OLAP cubes (SSAS) with much larger numbers of small records. The difference here is there are so many more fields and we have 20-30 new or modified fields each year. It doesn't seem like this should be a unique problem, but I'm having a hard time finding tools that might make it easier and less time-consuming. Maybe I'm just not asking the right questions.
I find it hard to believe that there is no normalization that can be done in a thousand-column record. If you use any decent ORM, you can look at these records and find out what are the real classes/entities and their relations.
But if you really have some crazy single record with thousands of attributes, look at CouchDB. If you know your way around Javascript and don't mind writing map/reduce functions, it should be the easiest way for you to do reporting/analytics.
Or if you just want to avoid a database altogether and know some Python, take a look at Pandas.
A potential "quick" solution could be to identify the most frequent fields, create a postgres table with those as columns (with relevant indexes) and then add the rest to a JSONB fields. This would allow you quick lookup on specific columns, while still giving you access to the rest of your data as part of SQL queries (postgres 9.5+ has a lot of nice JSON operators as well as indexing options).
Besides that, a column store might be what you are looking for if you are more interested in aggregate data on specific columns?
It really depends on what type of querying you're doing.
If the data is wide as you say, but you end up performing a large number of aggregate queries on columns, like SUMming a bunch of values, then it might be worth looking at a columnar store, which structures your data on disk by column rather than by row.
Again, though, I go back to my first point: try to get a good characterization of how the data is typically accessed. A million rows per year isn't much, and you might just get a lot further with some indexing strategies in your regular RDBMS.
Datasets I've worked with have topped out probably at around 500-600 fields - so not as wide but often weighing in the hundreds of GB, and often with hundreds of millions of rows. Spark is our primary tool to handle cleansing, analysis, feature engineering, joins, machine learning, etc. It does quite nicely.
I'm generally for people not overcomplicating their stack with tools beyond their actual needs - but a lot of this stuff has come quite some distance from where it was in 2013, when this article was written, in ease of use, ops, tooling, and maturity. It's simply becoming cheaper and easier to throw even modest amounts of data through a "big data" engine like Spark in many cases than it is to use more traditional tools which might be able to do the job but require more advanced tuning, ops, and possibly infrastructure.
And there are lots of compelling managed solutions out there these days. Amazon's EMR (elastic map reduce) is a popular option, gives you lots of tools to choose from, including Spark. Google Data Proc is similar I believe.
One problem I've noticed is that there are no good "medium data" tools.
Column stores are crazy fast, but there isn't much simple tooling built around things like Parquet or ORC files. It's all gigantic Java projects. Having some tools like grep, cut, sort, uniq, jq, etc. that worked against Parquet files would go a long way toward bridging the gap.
Something like pyspark may be the answer; I think it may be possible to wrap it and build the tools that I want, like:
find logs/ | xargs -P 16 json2parquet --out parquet_logs/
parquet-sql-query parquet_logs/ 'select src,count(*) from conn group by src...'
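In the meantime, here is a rough sketch of a poor man's version of that second command using pyarrow and pandas (the file layout and column names are assumptions):

    import glob
    import pandas as pd
    import pyarrow.parquet as pq

    # Read only the columns we need from each Parquet file, then query in pandas.
    tables = [pq.read_table(path, columns=["src"])
              for path in glob.glob("parquet_logs/*.parquet")]
    df = pd.concat([t.to_pandas() for t in tables], ignore_index=True)

    # Equivalent of: select src, count(*) from conn group by src order by 2 desc
    print(df.groupby("src").size().sort_values(ascending=False).head(20))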
I've been testing https://clickhouse.yandex/. I threw it on a single VM with 4G of RAM and imported billions of flow records into it. Queries rip through the data at tens of millions of records a second.
Edit: another example... I have a few months of SSH honeypot logs in a compressed JSON log file. Reporting on top user/password combos by unique source address took tens of minutes with a jq pipeline. The same thing imported into ClickHouse took a few seconds with something like
select user,password,uniq(src) as sources from ssh group by user,password order by sources desc limit 100
this is a really good point. there's a really uncomfortable space where your production database is just a wee bit bigger than you can comfortably work on with a single machine, and there is a case to be made for using some level of parallelism, but you don't yet need to go full monty on it with a compute cluster and a distributed datastore.
so you start looking at stuff like sharding or vertical scaling and you keep on doing things more or less the way you have been but with steadily degrading performance on every new insert.
I used Elasticsearch to this end in the past. Since everything is indexed in Elastic, you can run arbitrary queries in milliseconds as long as you don't need joins.
Yes. The required infrastructure will look different though, and obviously Elasticsearch doesn't speak SQL (not without some crazy 3rd-party plugin anyway).
> I've been testing https://clickhouse.yandex/. I threw it on a single VM with 4G of ram and imported billions of flow records into it. queries rip through data at tens of millions of records a second.
Did some irrelevant and crazy testing on a 4GB DigitalOcean VM with part of the latest hosts dataset from Rapid7's Project Sonar.
The data is pairs of (ip, certificate thumbprint). With ~1,500,000 entries in ES (~300MB with indices, much wow), a sort by IP occurrences runs at the speed of light: within 3.5 seconds if the data isn't cached and 300-350ms if it is.
For "medium data", my company has found a lot of success using dask [0], which mimics the pandas API, but can scale across multiple cores or machine.
The community around dask is quite active and there's solid documentation to help learn the library. I cannot recommend dask enough for medium data projects for people who want to use python.
They have a great rundown of dask vs. pyspark [1] to help you understand why you'd use it.
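A minimal sketch of what that pandas-like dask API looks like (file and column names invented for illustration):

    import dask.dataframe as dd

    # Lazily spans many CSVs; work is split across local cores (or a cluster).
    df = dd.read_csv("logs/2017-*.csv")
    top = (df.groupby("user_id")["bytes"]
             .sum()
             .nlargest(10)
             .compute())      # nothing runs until compute()
    print(top)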
I've been trying to change all of my Luigi pipeline tasks from using Pandas to Dask, so that I can push a lot more data through. Seems like an easy process so far, and I like the easy implementation of parallel computing.
Pachyderm (github.com/pachyderm/pachyderm) is a great "medium data" tool. I'm biased of course because I'm one of the creators, but we build in data pipelining, data versioning, and you can use whatever languages or libraries you want because we containerize the whole workload.
Your example above would actually work perfectly. You can literally use grep in a distributed fashion in our system. One of our example pipelines uses grep and awk to do log filtering/aggregation or word count.
For medium-size datasets I would argue that you could just dump the data into any SQL database, assuming it lends itself to being stored and accessed in a reasonably predictable fashion.
PostgreSQL, SQL Server, Oracle, or MySQL, properly configured, can handle larger loads than some people assume. Even SQLite is pretty performant; you just need to write some kind of front end to make it usable as a service.
Of course, it depends on the precise nature of your workload.
Yeah, SQLite is crazy fast, but I've found performance falls off a cliff once the DB is larger than RAM. A 4G database on a VM with 8G of RAM runs great.
This is why you'd use something like Impala; SQL is your cut, sort, uniq, etc. That's why those tools don't exist: put the Parquet file into HDFS, turn on short-circuit reads, and use Impala to query it. I see throughput of about 250M records/sec on a 12-CPU box [1]. If your data doesn't fit in memory (FS cache), I/O will probably be your bottleneck.
[1] Technically, 6 2-core guests on a single hypervisor; using guests to test deployment and scaling more easily.
As someone who runs 3x244GB AWS EC2 instances, this stuff is expensive. Before throwing everything into a hi-mem cloud setup, it's worthwhile to pause and think if it's possible/worthwhile to find an alternative approach.
I'm biased in that I've been doing nothing but Hadoop consultancy for almost 10 years now.
I have done maybe a hundred to two hundred consulting projects based on Hadoop (as you can imagine, it's mostly short-term and troubleshooting stuff) and am an ASF committer myself. So take this with a grain of salt.
I've seen the kinds of projects the article and lots of people in here refer to and I agree with lots of what's been said.
I also agree that Hadoop (the whole ecosystem) probably doesn't have a single "best of category" tool that ticks all the checkboxes, but IMO that's totally missing the point.
What it gives you is:
- (Mostly) open source stack (compare that to the Informatica/Talend/IBM stacks that big companies have, not to the Python jobs they don't). Lots of the code is crap (see my Apache JIRA history), but it's there and you can attach a debugger to your running cluster; that's fantastic and very hard to do with e.g. Python (at least I'm not able to find good ways of doing it)
- Proven technology, it works most of the times and if not there's tons of SO answers, mailing lists, books.
- All the boring enterprise stuff: encryption at rest and on the wire, strong authentication, strong authorization, auditing, high availability, failover, etc.
- Integration into Monitoring & alerting tools, distributions figure out the important stats and thresholds for you
- Mostly easy to operate these days even at large scale thanks to Cloudera Manager et al. - no need to manually run/install stuff
- "Coolness factor"
- Costs way less than established tools with e.g. external storage filers or similar specialized hardware
- It's not trivial but also not super hard to find people with Hadoop/Spark experience
- Existing purchasing agreements with Hardware vendors or Cloud providers
- Business/C-level gets and supports it
Edit: Three more I thought of:
- You can buy licenses and support for the tools. Both of which have been discussed multiple times here on HN, a donation doesn't work that well for companies
- You can buy indemnification from license problems from at least Cloudera but I think Hortonworks as well
- If the same stack has been used elsewhere in the company already chances are there are processes in place, someone has already vetted it, done the open source checks etc.
Note what I don't mention: things like performance, absolute money amounts, or even the amount of data (as the whole ecosystem is much more than "big data" now), since we see surprisingly few projects that care about specific SLAs or features. It's all relative.
That said: At least 50% of the projects would still be better served by a simpler solution but that still leaves a whole big chunk of projects where Hadoop makes sense for other than technical or monetary reasons.
But I very much disagree with this part of the post:
> The only benefit to using Hadoop is scaling.
The other point that's commonly missed is that Hadoop is really only useful when both inputs and outputs are too large to fit on one machine. If you have a big-data input but a small-data output (which is very common in a lot of exploratory problems), you can get away with a simple work-queue setup that sends results into a shared RDBMS or filesystem.
At the beginning of my current project, I had a job that involved 35T of input, but the vast majority of records would be ignored, and for each successful one, only a few hundred bytes of output would be generated. Rather than Hadoop, I set up a simple system where a number of worker processes would query Postgres for the next available shard, mark it as in-progress, and then stream it from S3 and process it. When they finished, they'd write a CSV file back to S3. The reduce phase was just 'cat'.
The resulting system took a few hours to build (a few days, including the actual algorithms), and it was much more debuggable than Hadoop would have been. You could inspect exactly where the job was, what shards had errored out, and which were currently running on machines, and download and view intermediate results before the whole computation finished. You could run the workers locally on a MBP if you needed to debug a shard, with no setup needed.
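A rough sketch of that shard-claiming step (the schema and the SKIP LOCKED trick are my assumptions for illustration, not the commenter's actual code; SKIP LOCKED needs Postgres 9.5+):

    import psycopg2

    conn = psycopg2.connect("dbname=jobs")

    def claim_next_shard():
        # Atomically grab one pending shard; concurrent workers skip locked rows.
        with conn, conn.cursor() as cur:
            cur.execute("""
                UPDATE shards
                SET status = 'in_progress', claimed_at = now()
                WHERE id = (
                    SELECT id FROM shards
                    WHERE status = 'pending'
                    ORDER BY id
                    LIMIT 1
                    FOR UPDATE SKIP LOCKED
                )
                RETURNING id, s3_key
            """)
            return cur.fetchone()   # None once the queue is drained

    shard = claim_next_shard()
    if shard:
        shard_id, s3_key = shard
        # ... stream s3_key from S3, process it, write a CSV of results back ...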
When I was at Google, we had a saying that "The only interesting part of MapReduce is the phase that's not in the name: the Shuffle". [That's the phase where the outputs of the Map are sorted, written to the filesystem and eventually network, and delivered to the appropriate Reduce shard.] If you don't need a shuffle phase - either because you have no reducer, your reduce input is small enough to fit on one machine, or your reduce input comes infrequently enough that a single microservice can keep up with all the map tasks - then you don't need a MapReduce-like framework.
Yes, very strong agree. I currently use HDF5 on my home machine that sucks in a fairly large amount of data (it's in the tens of TB, and adds 20GB daily). Before I set up HDF5 my database experience was mostly limited to Postgres (although more recently MySQL). I explored a variety of options including Hadoop and Cassandra but I just didn't really want to have more than one node for this exact reason, and I couldn't see a compelling advantage to either without that sort of workflow. Were I not working specifically with timeseries, I probably would have thrown it into postgres.
There's a particular bias I come across where people who genuinely have big data want to set it up in ways that is not necessarily performant because Hadoop is basically the most recognizable tool. If the data processing I'm doing generated a lot of output data I might consider a different flow, but there just isn't much of a reason: most of the data is inert for long periods of time, the output insights are fairly small and the actual processing has to occur very quickly and with low latency.
> it was much more debuggable than Hadoop would be. You could inspect exactly where the job was, what shards had errored out, and which were currently running on machines, and download & view intermediate results before the whole computation finished. You could run the workers locally on a MBP if you needed to debug a shard, with no setup needed.
To the extent that's true it's an indictment of Hadoop's implementation. Doing all those things in Hadoop ought to be trivial; maybe there are a few tools you'd have to make a one-time effort to learn, but reusing them ought to save you effort over making a custom system every time.
That was certainly the case in the past. We backed away from it ~5 years ago after wasting too much time investigating cases where a daemon had leaked memory, hit the heap limit, and failed in a way which caused work to halt but produced no visible error state other than a stack trace in one of the many log files. Local testing wasn't viable since you needed something like 16-20GB to run stably, and back then we didn't have individual dev systems with enough RAM to run that kind of overhead and still have enough left for the actual work.
I may be mistaken here, but I thought the point of Hadoop wasn't merely to hold more data, but to do larger, distributed, computations on that data. You might have only 10GB of data but need to perform heavy computations on them, requiring a large cluster, with each worker needing to exchange data with other workers periodically.
Yes, this and also parallelizing disk I/O. For example you could fit a 5TB table on a single machine, but if you have an operation that requires doing a full scan (e.g. uniqueness count over arbitrary dates), that will take a very long time on one disk. Yes you could partition into multiple disks, but Hadoop offers a nice generalized solution.
Hope the parent comment gets more visibility. In addition to parallelizing IO and automatic management of failures, Hadoop also provides hooks to implement complex data partitioning schemes - think dynamic range partitoning, compound group partitioning, etc. Unix tools, MPI, scatter-gather are convenient only for embarrassingly parallel jobs.
In a world where you can buy a 24 TB RAM, 8-socket, 192-core, 384-thread box able to hold 20 2.5" devices and a mix of 16 devices picked from NVMe storage, GPUs, or other coprocessors (giving you up to 16 coprocessor hosts, each with 72 x86 cores, giving you 2304 threads and 256 GB of RAM), the odds of you actually having big data (as in "intractable from a single box") are remarkably small.
Except for cloud computing. HDFS is only really useful if you own the hardware; it still works for these workloads, but calling the stack "Hadoop" when you are using Spark as the execution engine doesn't make sense. You are using Spark, which depends on HDFS only for local installs. And if you can be in the cloud, you should.
It is also worth noting that many people use Spark against a database like Cassandra and not HDFS. So it isn't even universal for local installs.
I was a Hadoop evangelist, but its time has passed. It is a foundation, not the tool you use to get work done.
Looking at the comments, one valid reason a company might be using big data without actually having big data is the potential need to scale and grow. Don't forget that many of them are expecting a 10x increase in users.
Don't forget that most of them never get there. Often because they are not able to find & make the product which the market wants/needs. Sometimes partly because engineering is too slow and unadaptive, possibly because they are overcomplicating things unnecessarily.
Imagine my glee when reading this thread after a long day of trying to get a simple answer as to where to deploy a JAR / where data should be landed.
7 hours later, 4 email chains and around 12 people involved, and still no answer, and they've all gone home...
I've been working on a project for an insurance conglomerate for the past 6 months (the total project has been running close to a year) that has very recently hit the rocks.
The client has gotten fed up with our inability to deliver, so they canned the long-term plans of the project in favour of a stop-gap solution in order to fulfil some short-term business goal.
We've acquired a relatively ludicrous amount of hardware (I'm talking $100,000+) for what is essentially the equivalent of SELECT COUNT(*) FROM ALL TABLES, which is then filtered into some slow, unusable BI tool that costs $999+ per user per year.
The real tragedy is they already have geo-diverse SQL Server instances running that would have allowed them to do this in one evening. One year and $1m+ in billables, and we somehow are going to fail to deliver even that.
We fell into this "trap" as well. Whilst working on a marketing automation system, we were integrating a Google Analytics/Piwik clone. Our guesstimates indicated we were going to be storing around 100GB of events per month. We geared up and got working. The team built complex Hadoop jobs, Pig & Sqoop scripts, lots of high-level abstractions to make writing jobs easier, lots of infrastructure, etc. etc. After about 2 months we scrapped the "big data" idea and redid everything in two weeks using PostgreSQL. As most of the queries were by date, partitioning made a huge difference.
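For anyone curious, here is a minimal sketch of the kind of date partitioning that did the heavy lifting (table and column names are invented; on PostgreSQL 10+ this is declarative partitioning, while older versions achieve the same thing with inheritance and CHECK constraints):

    # Illustrative sketch only: monthly range partitions for an events table,
    # driven from Python via psycopg2. Names are made up.
    import psycopg2

    DDL = """
    CREATE TABLE events (
        occurred_at timestamptz NOT NULL,
        visitor_id  bigint,
        payload     jsonb
    ) PARTITION BY RANGE (occurred_at);

    CREATE TABLE events_2017_01 PARTITION OF events
        FOR VALUES FROM ('2017-01-01') TO ('2017-02-01');
    CREATE TABLE events_2017_02 PARTITION OF events
        FOR VALUES FROM ('2017-02-01') TO ('2017-03-01');
    """

    with psycopg2.connect("dbname=analytics") as conn:
        with conn.cursor() as cur:
            cur.execute(DDL)
        # Queries constrained by date now only touch the relevant partitions:
        with conn.cursor() as cur:
            cur.execute("SELECT count(*) FROM events "
                        "WHERE occurred_at >= '2017-01-01' "
                        "AND occurred_at < '2017-02-01'")
            print(cur.fetchone()[0])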
I recall one of the classes was named SimpleJobDescriptor. Near the end it was 500+ lines long. Not so simple after all.
2. SQLite. The maximum database size is 140 terabytes, and SQLite can join different database files together. First I would recommend using bash to cull your data down before import, if possible (https://www.sqlite.org/limits.html).
3. PostgreSQL. Again, I would recommend using bash to cull your data down before import, if possible (https://www.postgresql.org/about/); a sketch of that culling step follows the limits below.
Maximum Database Size: Unlimited
Maximum Table Size: 32 TB
Maximum Row Size: 1.6 TB
Maximum Field Size: 1 GB
Maximum Rows per Table: Unlimited
Maximum Columns per Table: 250-1600, depending on column types
Maximum Indexes per Table: Unlimited
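Here is a rough sketch of that pre-import culling step. The original suggestion was plain bash; this is the same idea in Python, with made-up file and column names:

    # Hypothetical stand-in for the "cull before import" step: keep only the
    # columns and rows you actually need before loading into SQLite/PostgreSQL.
    import csv

    KEEP = ("user_id", "occurred_at", "amount")   # made-up column names

    with open("raw_events.csv", newline="") as src, \
         open("culled_events.csv", "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=KEEP)
        writer.writeheader()
        for row in reader:
            if row["amount"]:                      # drop rows with no amount
                writer.writerow({k: row[k] for k in KEEP})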
It always cracks me up what people consider "big data" to be. I once wrote a few Julia scripts to process terabytes of data from the results of molecular dynamics simulations. Just spin up a few high-memory cloud instances, process for a couple hours, and then shut down. Total cost: about $12. It's amazing what a couple billion CPU cycles per second can do. All without Hadoop, Spark, etc.
Nice article and it still holds despite being a few years old. There's a general lack of understanding or appreciation for this "medium" data world which is actually where most of "big" data lives (i.e. too big for Excel but not PB or many TB).
Interestingly, we (Cloud Dataproc team) have been trying to work in the opposite direction. A few months ago we launched single-node clusters so people can use Spark on one VM instead of creating crazy-huge clusters just for lightweight data processing. No sense in using a fleet of cores for simple stuff. :)
Disclaimer: I work for Google Cloud (and think we need better tools for simple data processing).
Just conjecture, but I imagine that he got tired of the mods accusing him of trolling (in his uniformly high-quality and civil commentary) due to his non-standard political views.
A 2016 work project of mine required crawling 100 million web pages from around 100k different websites each night. There was no existing infrastructure whatsoever. I was told it was a big data project and that internet-scale processing power was required (and it was actually budgeted for).
I built a Python crawler to pull the data using EC2 spot instances, as they are dirt cheap. After just a few days of happy coding, I got to a place where I could reliably (spot instances are not that reliable) pull the data, compress it, and download it to our own data centre for processing (a policy rather than a design decision), all for under $50, with the majority spent on transferring the final compressed output.
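For flavour, the core of that kind of crawler really can be tiny. This is a sketch rather than the actual code, with invented URLs, output format, and retry handling:

    # Worker loop sketch: fetch a batch of URLs, write gzipped output.
    # Fan this out across cheap spot instances; failed URLs get retried later.
    import gzip
    import requests

    def crawl(urls, out_path):
        with gzip.open(out_path, "wt", encoding="utf-8") as out:
            for url in urls:
                try:
                    resp = requests.get(url, timeout=10)
                    resp.raise_for_status()
                    # one record per line: URL, tab, body with newlines escaped
                    out.write(url + "\t" + resp.text.replace("\n", "\\n") + "\n")
                except requests.RequestException:
                    # spot instances and flaky sites: skip and retry next pass
                    continue

    crawl(["https://example.com/"], "batch-0001.txt.gz")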
Big data? 100 million small files per run is a typical small-data project; big data means trillions of files a day, or something like exabyte-scale data volumes.
You're defining the term big data as "an amount of data that seems impressive to me". That's a pretty loose definition, and two people could have wildly different ideas.
I prefer to define big data as "an amount of data that cannot fit on any one computer." Once you start using distributed systems, whether with sharding, NoSQL, or similar tech, you're in big data territory. Going from one computer to two greatly increases the complexity of the system.
Something no one has mentioned directly: startups selling software to Fortune 500 companies generally have to be very open about the architecture of their software, how it all works, etc. Sales are always a "build vs. buy" decision with these companies.
If the architecture diagram looks complex, it makes for an easier sale, i.e. "well, clearly that would take us forever to build/integrate/whatever, so buying seems easier", whereas if it's just a database and an app server, it can be a much harder sale.
Obviously all the usual caveats apply, but in general, demonstrating "technical complexity" to the customer is something a lot of salespeople and CEOs will push for.
The thing that's so frustrating to me--and many others, based on the comments--is how much of this gets pushed to management and product people.
Our team just acquired a product manager for the first time, and for the most part I absolutely adore her. She's really great at pacing things out to a timeline that works for us and at reminding us that a hacked-up solution you throw together in a couple of days isn't really a product. And she's very much pushing against putting those into production, all of which I agree with.
But she doesn't understand the technology or the needs. She's constantly having meetings where she pitches our manager on product ideas that don't even make sense given the client's needs. Many of these are buzzword-based. It makes me seem like the asshole when, every team meeting, I have to be like, "No. That's not what the client asked for. The client asked for a solution to problem X." The client (it doesn't matter whether they're internal or external) often doesn't know what the right solution is. Good product people and managers know that to get to a good place, the key is to get the client to describe the problem instead of the solution.
I'm not a product person. I don't know jack dick about products except for how to make them. But the level of miscommunication I see even in our small organization is astounding.
You get one manager, one product manager, two stakeholders, and one engineer in a meeting together. It's absolutely unbelievable to me what management and product take away from those meetings.
It's so far from the reality of the problem to be solved that it sometimes makes me nuts. Like it has today.
I think that our whole way of doing things and our hierarchy is completely broken and backwards. You want to get something done? Put a software engineer in a room with an operations person who is on the ground. Talk about the problem. Propose a solution.
Take it to the management and let them sort out priorities among themselves. This bullshit about management being the first point of contact virtually guarantees that buzzword-based development is going to happen.
I know that we are supposed to be the sacred cows, and that we need to be protected. Fuck that, I say. Let me interact with the people on the ground whose problems I'm trying to solve. Don't put a non-technical product person who doesn't get any of the details right between me and the end-user. I'll take the time out of my day two days per week to go to the office and have meetings, and I do the rest at home where I can be productive and write code.
Sorry for the rant, but fuck all of this. We have a very broken system, in most cases.
In my experience, there are not many problems in the course of everyday business that cannot be solved with SQLite's CSV import feature and/or its virtual tables.
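For example, something like this gets you a rough Python equivalent of the sqlite3 shell's ".import file.csv table" (file, table, and column names are made up):

    # Load a CSV into SQLite and query it; everything after the load is just SQL.
    import csv
    import sqlite3

    conn = sqlite3.connect("everyday_business.db")
    conn.execute("CREATE TABLE IF NOT EXISTS orders "
                 "(order_id TEXT, customer TEXT, total REAL)")

    with open("orders.csv", newline="") as f:
        rows = ((r["order_id"], r["customer"], float(r["total"]))
                for r in csv.DictReader(f))
        conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()

    for row in conn.execute("SELECT customer, sum(total) FROM orders GROUP BY customer"):
        print(row)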
Seriously though, this also applies to a good chunk of ML "applications". Most of the time, a 'simple' SVM would work perfectly.
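As a sketch of what "simple" means here: a linear SVM over TF-IDF features, run on scikit-learn's bundled 20 newsgroups sample purely for illustration, covers a surprising share of the text-classification problems that get pitched as deep-learning projects:

    # Baseline text classifier: TF-IDF features + linear SVM.
    from sklearn.datasets import fetch_20newsgroups
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC

    train = fetch_20newsgroups(subset="train", categories=["sci.space", "rec.autos"])
    test = fetch_20newsgroups(subset="test", categories=["sci.space", "rec.autos"])

    model = make_pipeline(TfidfVectorizer(), LinearSVC())
    model.fit(train.data, train.target)
    print("accuracy:", model.score(test.data, test.target))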