Spark is still commonly used for data processing and large-scale transform jobs (I see it often for parsing and converting datasets from one format to another).
Given that this survey is from Kaggle, I wouldn't be surprised if there's some selection bias in the results. I suspect most of the respondents primarily work with smaller datasets at work, with limited big data or ML opportunities, and are on Kaggle to get experience in those areas. According to the survey the most common query language is MySQL, which isn't great for actual big data.
Spark is a fairly generic data processing engine with excellent SQL support, and its DataFrame structure is pretty similar to a pandas DataFrame. It can be really useful even on a single node with a bunch of cores. Spark's MLlib also has distributed training algorithms for most standard models (notably excluding nonlinear SVMs), and it has fantastic streaming support. So Spark is good for a lot more than straight-up map-reduce jobs these days.
Just about every large-scale data science team (e.g. Google, Spotify, Airbnb) uses Spark for much of its work. It is by far the de facto standard for working with large datasets, especially since it integrates so well with machine learning tools (H2O) and different languages (Scala, Python, R).
Spark may be a mature solution for truly big data (1 TB and up), queried in a SQL-like fashion. But I constantly see it being misused, even with datasets as small as 5 GB. Maybe the company's valuation reflects this 'growth' and 'adoption'. And data locality is a thing: you can't read terabytes from object storage (over HTTP). The batch-oriented, map-reduce model is also not conducive to many ML algorithms where state needs to be passed around.
Initially that was my experience too, but over the past two years I've found that more and more places are actually switching to Java or Python for Spark data engineering. It might only be my local market, though.
I’m not even sure what we are debating. Spark is for big data work, it isn’t something you would typically use as a general purpose database backing your application. My original point way back is that expecting customers to learn an entirely new paradigm for storing and querying data is a poor decision, and limits your work to niche use cases. Spark has a large niche but it isn’t comparable to MySQL / Postgres or other newish data stores like Dynamo.
Furthermore any competent engineer knows SQL because ORMs are cumbersome and annoying for anything except basic use.
I watched a video recently about how Scala took over the big data world
I think you mean Spark? I use Spark heavily at work and know little Scala (although I do have an engineering team who do work in it sometimes).
I would add Apache Spark to the list. I watched a video recently about how Scala took over the big data world (probably not true) [1], but the presenter made an interesting point about how Spark subsumes a lot of different things (streaming, machine learning, built in support for SQL) and it is good enough at those things even if not the best tool. Not surprisingly, that actually makes it a good candidate for adoption in the enterprise.
So Spark is great, but it isn't the only thing. For example, it does let people use Python and R on the same platform pretty easily, and with the potential for good performance.
However, you really need to know what you are doing to get the best of it (what a surprise, hey!). For example, Databricks likes to show how Dataframes/Datasets give huge performance advantages over the old RDD programming model.
This is true, but you need to understand why in order to make sure you see the same benefits. Basically, there are numerous primitive functions that have been implemented as native operations of the DataFrame classes, and if you use them they perform well. If, however, you use Python UDFs, you won't.
My experience has been that Spark actually doesn't work with real big data without significant babysitting. Its algorithms, be it window functions or joins, can take forever, or never finish, on data that's hundreds of terabytes large (or even tens of TB). You immediately need to worry about things like garbage collection, worker memory, caching, and data distribution. The majority of data engineers out there cannot actually deal with these problems, but Spark plus not-actually-big data lets them think they're good at their jobs when in reality they're not.
I still find myself reaching for Spark and Spark SQL for some tasks, like stratified sampling, which I haven't been able to do properly with pandas. Somehow the Spark DataFrame API feels more intuitive, and I was able to figure out a lot by myself.
I agreed with the article at the time it was posted. Now I use Spark for various sized data sets. Specifically Zeppelin + Spark is a great combination.
Then again, Spark doesn't really need Hadoop, I see more and more people using it with Kafka and Elasticsearch for all sorts of fun stuff.
And as other commenters pointed out, you get read-only SQL (very powerful SQL) for free. The other day I joined an Elasticsearch result with a CSV file in SQL.
I use Spark at work, it's really good when you need to do some large scale analysis with your own custom code. Everything else it does is just ok. It certainly doesn't subsume ML tools.
Spark sits on top of YARN/Mesos, and is used for data processing scalability that pandas can't handle.
Personally, I think two areas often lacking are software development skills and general statistics knowledge. The former is necessary for writing production-quality code, assisting with any sort of data engineering pipeline, writing reliable, reusable code, and creating custom solutions. Unfortunately, the latter is often skimped on (if not skipped entirely) in favor of 'hotter' fields like ML/DL, with the result being a fuzzy understanding across the board. (You'd be amazed at the number of candidates lacking fundamental knowledge about GLMs, basic nonparametric stats, popular distributions, etc.)