
Why in the world would you use Spark for such a tiny data set?



Well I mean, that's a problem with all those solutions.

If your dataset is so small it fits in RAM it's simply ridiculous to reach for those tools, and you don't start seeing real advantages until you have truly massive datasets.

Anecdotally, using Spark on less than 1TB just means wasting your time, but in Spark's defense, that's not what it's for.


Yeah, nobody uses Spark because they want to; they use it because, beyond a certain data size, there's nothing else.

Spark may be a mature solution for truly big data, in a SQL-like fashion, at 1TB and up. But I constantly see it being misused, even with datasets as small as 5GB. Maybe the company's valuation reflects this 'growth' and 'adoption'. And data locality is a thing: you can't read terabytes from object storage (over HTTP). The batch-oriented, map-reduce model is not conducive to many ML algorithms where state needs to be passed around.

Spark has a number of features and constructs that can make it very powerful to work with, even on "small" data sets. Big data isn't just measured by size, it's also measured by computational complexity. 80,000,000 rows is massive if the operation you're performing against it is O(N^2), as an example.
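
As a minimal sketch of what that quadratic case can look like in PySpark (the input path and column names here are made up, not from any real pipeline): an all-pairs self-join that Spark can shuffle across executors even though the input itself is "small" on disk.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("pairwise-demo").getOrCreate()

    # 80M rows is modest on disk, but comparing every row to every other row is O(N^2)
    events = spark.read.parquet("s3://bucket/events")  # hypothetical input

    pairs = (events.alias("a")
             .join(events.alias("b"), F.col("a.id") < F.col("b.id"))   # all unordered pairs
             .withColumn("dist", F.abs(F.col("a.score") - F.col("b.score")))
             .filter(F.col("dist") < 0.01))

    pairs.write.mode("overwrite").parquet("s3://bucket/similar_pairs")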

Spark runs pretty nicely in single-machine mode, and it forces you to structure your logic as a nice clean map-reduce. So I think it's fine to use it for "small" data.
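
For what it's worth, single-machine mode is just a master URL. A minimal sketch (the logs.txt input is hypothetical) showing the map/reduce structure that would run unchanged on a cluster:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .master("local[*]")          # use all local cores, no cluster required
             .appName("small-data-demo")
             .getOrCreate())

    lines = spark.sparkContext.textFile("logs.txt")      # hypothetical input file
    counts = (lines.flatMap(lambda line: line.split())   # map: line -> words
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))     # reduce: count per word

    print(counts.take(10))
    spark.stop()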

I'm really curious to find out in what situations Spark actually works for people. So far, no one in my lab seems to be having a terribly productive time using it. Maybe it's better for simple numerical computations? How large are the datasets you're working with?

Spark is intended to work on big datasets. Its machine learning capability is very limited, and its primary strength is processing huge amounts of data. I think it's unfair to blame it for 'failing' on small datasets.

Because Spark is far, far more than just storing data and doing basic queries. It is also an analytics platform (e.g. machine learning/modelling) and a framework for building complex applications on top of Hadoop.

Interesting, sounds like you have a very specific usecase. I'm mostly dealing with huge datasets and Spark is a lifesaver.

That's what Spark is for. You can do petabyte-scale jobs... with DataFrames.
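
To make that concrete, a rough DataFrame sketch (the bucket, partition column, and field names are invented): the same declarative code works whether the underlying Parquet is gigabytes or petabytes, since Spark plans the scan and aggregation across whatever executors are available.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("df-aggregation").getOrCreate()

    clicks = spark.read.parquet("s3://bucket/clicks")    # hypothetical table, partitioned by dt

    daily = (clicks
             .filter(F.col("dt").between("2023-01-01", "2023-01-31"))  # partition pruning
             .groupBy("dt", "country")
             .agg(F.countDistinct("user_id").alias("uniques"),
                  F.sum("revenue").alias("revenue")))

    daily.write.mode("overwrite").parquet("s3://bucket/daily_clicks")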

Good point. I used Spark and DataFrames at my last job just because the data I needed was conveniently available in that environment. Using a high-memory single server seems better on general principles, until your data won't fit.

Because most data isn't "big" data. If it fits in Excel on a laptop, why bother rolling out a distributed data lake system like Spark, with all the associated ops work?

Thanks for writing this! I'm thinking about using Spark for a little 2M-data-point project that I'm working on, just for the learning experience.

Out of curiosity, what kinks did you find?


I mean for the problem at hand ... not for anything, ever :)

I think usually the argument for Spark and the like proceeds by just making the dataset bigger until whatever you’re comparing becomes absurd (e.g. command line vs Spark). It was the same deal in the map reduce days. Works like shit for 99% of datasets but it’s the best thing ever when there are no other alternatives :)


Spark is still commonly used for data processing and large-scale transform jobs (I see it often for parsing and converting datasets from one format to another).

Given that this survey is from Kaggle, I won't be surprised if there's some selection bias going on with the results. I suspect most of the respondents primarily work with smaller datasets at work, with limited big data or ML opportunities, and hence were on Kaggle to get experience in those areas? According to the survey the most common query language is MySQL, which isn't great for actual big data.


Thanks for posting this. I'm starting to get a feel for when Spark is usable-- you need an underlying indexed data store which lets you fetch small subsets of your data into RDDs (or, your data can be tiny to begin with). We've been trying to use Spark on input sizes which, while smaller than our cluster's available memory, are probably too big for Spark to handle (> 1TB).

If there were something better than Spark for distributed processing, we would be using it. The rest of your comment is a straw man argument, assuming everybody uses it for datasets that fit in the memory of a single node.

We use Spark essentially as a distributed programming framework for data processing: anything you can do with a small dataset on a single server, you can do with a huge dataset on 20 or 2,000 servers, with minimal extra development.
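
Roughly speaking (the paths and master URL below are placeholders), the only thing that changes between the laptop run and the 2,000-server run is where the job is submitted; the transformation code itself is identical:

    from pyspark.sql import SparkSession

    # In practice the master is usually left out of the code and passed to
    # spark-submit; "local[*]" vs. "yarn" or "spark://host:7077" is the only difference.
    spark = (SparkSession.builder
             .master("local[*]")
             .appName("same-code-everywhere")
             .getOrCreate())

    df = spark.read.parquet("s3://bucket/input")          # hypothetical path
    (df.groupBy("key")
       .count()
       .write.mode("overwrite")
       .parquet("s3://bucket/output"))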

I think it's marketing nonsense. Databricks in particular has good "proximity to data" IMO, in that once you've figured out how to use it with your data sources, you can just fire up a web browser and connect to them, and connecting those data sources to your code is easy.

The problem with seeing Spark as your one tool for everything is that it's only true if it's trivial to integrate your code with Spark. Viz tools like Plotly/Bokeh don't integrate well with Databricks' notebook, and Deep Learning tools are not really supported yet unless you're running your own clusters and special libraries to wire things together.

I think Spark is a good workhorse for big data; it can do repetitive things well at large scale, but it's less good when you want to use more niche tools, since most of the data science community is not focused on Spark. PySpark exists and will probably be good enough, but only if your data fits into memory on a single machine anyway.

And if you're not dealing with big data, Spark is overkill. It's usually simpler to just get a box with more RAM.

