
I agreed with the article at the time it was posted. Now I use Spark for data sets of various sizes. Specifically, Zeppelin + Spark is a great combination.

Then again, Spark doesn't really need Hadoop; I see more and more people using it with Kafka and Elasticsearch for all sorts of fun stuff.
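A rough sketch of what the Kafka side looks like, using Structured Streaming's built-in Kafka source. This assumes the spark-sql-kafka package is on the classpath, and the broker and topic names are made up:

    // Hedged sketch: Spark consuming Kafka with no Hadoop cluster involved.
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("kafka-in").getOrCreate()

    val stream = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092") // hypothetical broker
      .option("subscribe", "events")                    // hypothetical topic
      .load()

    // Kafka delivers binary key/value columns; cast before processing.
    stream.selectExpr("CAST(value AS STRING) AS json")
      .writeStream.format("console").start().awaitTermination()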

And as other commenters pointed out, you get read-only SQL (very powerful SQL) for free. The other day I joined an Elasticsearch result with a CSV file in SQL.
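Roughly what that join looks like, as a minimal sketch. It assumes the elasticsearch-hadoop connector is on the classpath, and the index, file, and column names here (events, users.csv, user_id) are made up:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("es-csv-join").getOrCreate()

    // Read an Elasticsearch index as a DataFrame via the connector.
    val events = spark.read
      .format("org.elasticsearch.spark.sql")
      .option("es.nodes", "localhost:9200")
      .load("events")

    // Read a plain CSV file with a header row.
    val users = spark.read.option("header", "true").csv("users.csv")

    // Register both as temp views and join them in plain SQL.
    events.createOrReplaceTempView("events")
    users.createOrReplaceTempView("users")

    spark.sql(
      """SELECT u.name, count(*) AS n
        |FROM events e JOIN users u ON e.user_id = u.user_id
        |GROUP BY u.name""".stripMargin).show()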




Nice, but I can't find any reason to choose Spark over modern distributed SQL databases (CockroachDB, CitusDB, TiDB, etc., or cloud vendor-specific SQL DBs).

Spark is far more testable and composable than SQL! And you even get static type checking. Plus I can read data from anywhere: local FS, S3, RDBMS, JSON, Parquet, CSV... an RDBMS can't compete.
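A rough sketch of both points. The paths, JDBC URL, and User schema are all made up, and the JDBC read assumes the relevant driver is on the classpath:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("sources").getOrCreate()
    import spark.implicits._

    // One API, many sources.
    val fromCsv     = spark.read.option("header", "true").csv("/data/local.csv")
    val fromJson    = spark.read.json("s3a://bucket/events.json")
    val fromParquet = spark.read.parquet("s3a://bucket/table.parquet")
    val fromRdbms   = spark.read.format("jdbc")
      .option("url", "jdbc:postgresql://db:5432/app")
      .option("dbtable", "public.users")
      .load()

    // Static typing: a Dataset[User] is checked at compile time, and each
    // transformation is an ordinary function you can unit test.
    case class User(id: Long, name: String)
    val users = fromRdbms.as[User]
    val names = users.map(_.name) // type-checked field access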

How is Spark similar to the Hadoop way of doing things?

It operates very differently from Hadoop for us. SparkSQL lets third-party apps (e.g. analytics tools) use JDBC/ODBC rather than going through HDFS. And the in-memory model and the ease of caching data from HDFS enable different use cases. We do most work now via SQL.
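A minimal sketch of that caching pattern (the HDFS path and table name are made up):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("cache-demo").getOrCreate()

    spark.read.parquet("hdfs:///warehouse/events")
      .createOrReplaceTempView("events")
    spark.sql("CACHE TABLE events")                 // pin in executor memory
    spark.sql("SELECT count(*) FROM events").show() // answered from the cache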

Combining Spark with Storm, Elasticsearch, etc. also permits a true real-time ingestion and search architecture.


I've been following Zeppelin for a while and have been actively using it for a couple of weeks (using Scala and SparkSQL/Dataframes), and although there are a few rough edges, it has been a godsend for data exploration, analysis, and feature extraction. If you're working with Spark, I highly recommend giving it a try.

Don't use Hadoop even if your data is big. Spark replaced it.

Thanks for the response. I should note that my experience is with Hadoop, not Spark, which is part of why I was interested in your article.

Good point. I used Spark and DataFrames at my last job just because the data I needed was conveniently available in that environment. Using a high-memory single server seems better on general principles, until your data won't fit.

Spark has a place: at large scale. For 100s of GB to a few TB of data, PostgreSQL works very well. At least, it does for my team. I don't want Spark, Kafka, NoSQL, or any other modern fad near my team's data. It's just not appropriate.

In 2017, with Spark's Catalyst engine and the DataFrame data structure (allowing SQL-like operations instead of requiring code written in map-reduce paradigms), you can have the best of both worlds in terms of big-data performance and high usability. Running Spark in a non-distributed manner may sound counterintuitive, but it works well and makes good use of all the CPU, RAM, and disk on the machine.
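A minimal sketch of that single-machine setup. The path, column name, and partition setting are illustrative, and heap size is normally set at JVM launch (e.g. spark-submit --driver-memory 64g) rather than in code:

    import org.apache.spark.sql.SparkSession

    // local[*] runs the executors as threads on every core of the machine,
    // with no cluster manager involved.
    val spark = SparkSession.builder()
      .appName("single-node")
      .master("local[*]")
      .config("spark.sql.shuffle.partitions", "8") // small box, few partitions
      .getOrCreate()
    import spark.implicits._

    // The DataFrame API compiles to the same Catalyst plans as SQL.
    val df = spark.read.parquet("/data/events.parquet")
    df.groupBy("user_id").count().orderBy($"count".desc).show(10)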

Spark is orders of magnitude faster than Hadoop MapReduce, too.


Our experience with Spark has been horrendous. Very unstable. Marginal improvements. Big hassle.

Would strongly advise you to consider Hadoop. We also used Storm and found it to be much more stable.

Databricks makes a lot of noise though.


Depends what you’re doing. For querying large datasets? 100% with you.

For data cleaning, processing, analytics, and ML on decently large datasets? Spark wins out.


Spark is a good replacement for MapReduce. MapReduce != Hadoop.
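To illustrate: the classic MapReduce word count collapses to a few lines of Spark (the input path is made up):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("wordcount").getOrCreate()

    // The equivalent of a full Mapper/Reducer job in Hadoop MapReduce.
    val counts = spark.sparkContext
      .textFile("hdfs:///data/corpus.txt")
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.take(10).foreach(println)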

Spark may be a mature solution for truly big data, queried in a SQL-like fashion, at 1 TB and beyond. But I constantly see it being misused, even with datasets as small as 5 GB. Maybe the company's valuation reflects this 'growth' and 'adoption'. And data locality is a thing: you can't read terabytes from object storage (over HTTP). The batch-oriented, map-reduce model is also not conducive to many ML algorithms where state needs to be passed around.

Spark is still commonly used for data processing and large-scale transform jobs (I often see it used for parsing and converting datasets from one format to another).
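A minimal sketch of such a conversion job, reading raw CSV and writing columnar Parquet (the bucket and paths are made up):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("csv-to-parquet").getOrCreate()

    spark.read.option("header", "true").csv("s3a://bucket/raw/*.csv")
      .write.mode("overwrite").parquet("s3a://bucket/curated/events")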

Given that this survey is from Kaggle, I wouldn't be surprised if there's some selection bias in the results. I suspect most of the respondents primarily work with smaller datasets, with limited big-data or ML opportunities at work, and hence were on Kaggle to get experience in those areas. According to the survey, the most common query language is MySQL, which isn't great for actual big data.


Spark is very much part of Hadoop. It uses Hadoop libraries throughout, including for the core work of reading and writing data.

And the adage that big data can't fit in memory is nonsense these days. We run clusters with hundreds of terabytes of RAM, which is very much big data. It's pretty easy and affordable in the cloud.


Spark is open source and free.

Databricks is their hosted version, just like Aurora is hosted PostgreSQL.

The fact that you don't seem to understand the difference makes me question whether you are qualified to claim it's garbage and that we can simply write SQL. Having worked on many large big-data projects, I can say SQL is often not the right tool for every use case.


Hadoop (and any other Spark data source) via SparkSQL

Spark is pretty fantastic from our perspective. People just think of it as a faster Hadoop MR, but it is so much more. The APIs and the integrations with external systems are so much easier and more intuitive to use.

It really is Hadoop 2.0.


I have also found Spark (and Hadoop before it) a little clunky to prototype and develop on, but when you need to handle very large data sets with good throughput, systems like Spark/Hadoop are great. One problem they had was maintaining infrastructure, and to be honest, when I used MapReduce as a contractor at Google, or AWS Elastic MapReduce as a consultant, I didn’t have to deal much with infrastructure.

Anyway, it makes sense that they backed off from Spark and HDFS, given the size of their datasets.

The original poster mentioned that their data analytics software is written in Haskell. I would like to see a write up on that.

EDIT: I see that they do have two articles on their blog on their use of Haskell.

