Python has sophisticated ML tools and neural net packages. I'm honestly not sure what they're called in Python land, but I see Python in the listings for the top ML libraries all the time. Clojure may also have really great frameworks available by now.
Actually, I use both but my production models are in Clojure.
I often end up implementing minor things myself using lower level abstractions (e.g., Linear Regressions or PCA with whitening using Matrix libraries) and I check the results and/or try new things using scikit-learn.
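To make the "implement it yourself with lower-level abstractions" idea concrete, here is a minimal sketch of ordinary least squares via the normal equations. It is not my production code: it is written in Python with plain lists so it runs anywhere, whereas in practice a matrix library would do the heavy lifting. The helper names (`transpose`, `matmul`, `solve`, `fit_linear_regression`) and the toy data are made up for illustration.

```python
# Ordinary least squares via the normal equations (X^T X) w = X^T y,
# written with plain lists so it runs without any matrix library.
# All helper names and the toy data are hypothetical, for illustration only.

def transpose(m):
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]

def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(a)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x

def fit_linear_regression(xs, ys):
    """Return [intercept, coef_1, ...] for feature rows xs and targets ys."""
    design = [[1.0] + list(row) for row in xs]  # prepend a bias column
    xt = transpose(design)
    xtx = matmul(xt, design)
    xty = [sum(a * b for a, b in zip(col, ys)) for col in xt]
    return solve(xtx, xty)

# Toy data generated from y = 1 + 2*x1 + 3*x2.
X = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
y = [1.0, 3.0, 4.0, 6.0, 8.0]
weights = fit_linear_regression(X, y)
```

The cross-check I mean would then be to fit scikit-learn's `LinearRegression` on the same `X` and `y` and compare its `intercept_` and `coef_` against `weights`.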
So in general, I'd say I do the programming (outputting intermediate CSVs, tests, web service, thread handling, UI, ...) in Clojure(Script), and try other approaches (e.g., other models/parameters/...) in Python.
I'm quite happy with this pipeline, though probably in part because I really love to understand how things work, and nothing pushes you to learn as much as a missing function in your ML library :-)
I'd hardly say MLlib matches scikit-learn in the number of algorithms available! For example, we recently had to resort to a third-party implementation of DBSCAN.
It does have most of the important ones though. Also, the pipeline API feels slightly cleaner than the one in scikit-learn.
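For anyone who hasn't used either pipeline API, the shared idea is roughly this: each stage is fitted on the output of the previous one, and the last stage makes the predictions. The sketch below is a library-free Python illustration of that pattern only. The classes `Standardize`, `ThresholdClassifier`, and `Pipeline` are hypothetical and are not the MLlib or scikit-learn classes.

```python
# Hypothetical, library-free sketch of the pipeline pattern: each stage is
# fitted on the output of the previous one, and the last stage predicts.

class Standardize:
    """Transformer: centers and scales a list of numbers."""
    def fit(self, xs, ys=None):
        self.mean = sum(xs) / len(xs)
        var = sum((x - self.mean) ** 2 for x in xs) / len(xs)
        self.std = var ** 0.5 or 1.0  # fall back to 1.0 on zero variance
        return self

    def transform(self, xs):
        return [(x - self.mean) / self.std for x in xs]

class ThresholdClassifier:
    """Estimator: predicts 1 above the midpoint of the two class means."""
    def fit(self, xs, ys):
        pos = [x for x, y in zip(xs, ys) if y == 1]
        neg = [x for x, y in zip(xs, ys) if y == 0]
        self.cut = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
        return self

    def predict(self, xs):
        return [1 if x > self.cut else 0 for x in xs]

class Pipeline:
    """Chains transformers and a final estimator behind one fit/predict."""
    def __init__(self, transformers, estimator):
        self.transformers = transformers
        self.estimator = estimator

    def fit(self, xs, ys):
        for t in self.transformers:
            xs = t.fit(xs, ys).transform(xs)  # fit, then pass data along
        self.estimator.fit(xs, ys)
        return self

    def predict(self, xs):
        for t in self.transformers:
            xs = t.transform(xs)  # reuse the statistics learned in fit
        return self.estimator.predict(xs)

model = Pipeline([Standardize()], ThresholdClassifier())
model.fit([1.0, 2.0, 8.0, 9.0], [0, 0, 1, 1])
predictions = model.predict([0.0, 10.0])
```

The point of the design is that the scaling statistics learned during `fit` are reused unchanged at prediction time, so training and serving cannot drift apart.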
Our data scientists are learning Scala and Spark (MLlib) as a replacement for Python and R. So sure, maybe Python has long been the "best language for ML", but at one time in the not-so-distant past "MySpace was the best social network".
Hello, people of HN. Let me first say that this post is about promoting an open source project which I've been working on for the past 10 months or so. I'm leaving it here with hopes of getting in touch with other devs who might be interested in machine learning, Scala or both.
Those of you who have stumbled upon ML before will know that Python is the go-to language for data-related things. It has high-quality libraries for analysis, modeling, and visualization. scikit-learn is a notable example and for good reasons; it's well maintained, has a large community, it's performant and it has a really good API (there's a paper about how they designed it: https://arxiv.org/abs/1309.0238).
I had been looking for a Scala equivalent for quite some time and then finally decided to start coding it myself. The main reason is that JVM-based languages are very common for building data pipelines, and having the ability to serve predictive models directly within the pipeline offers several advantages. Here's some data to back up my claims: https://cloud.google.com/solutions/comparing-ml-model-predic... (comparison of serving the model within the pipeline vs. calling a REST API).
The project currently has two main goals: to expose its functionality through an intuitive API (mimicking scikit-learn but using idiomatic Scala features and functional constructs) and to provide performant implementations of common algorithms (here is a limited set of comparisons with scikit-learn implementations: https://github.com/picnicml/doddle-benchmark).
Why use Python though? Once you are rolling with an ML language wouldn't it make sense to use the C++ interfaces of the machine learning/data science projects that make Python interesting at all?
As a researcher in RL & ML in a big industry lab, I would say most of my colleagues are moving to JAX [https://github.com/google/jax], which this article kind of ignores. JAX is XLA-accelerated NumPy; it's cool beyond just machine learning, but it only provides low-level linear algebra abstractions. However, you can put something like Haiku [https://github.com/deepmind/dm-haiku] or Flax [https://github.com/google/flax] on top of it and get what the cool kids are using :)
Simplicity is one of the main traits of Python. It’s conducive to prototyping and getting things done quickly. Libraries like scikit-learn and PyTorch have also helped developers build larger solutions using smaller building blocks without worrying too much about implementations.
Nevertheless, Python suffers from serious performance constraints, and productionizing an ML service that requires real-time analysis is going to take some effort. For such systems, folks usually lean towards a hybrid stack.
Can someone explain why I should pick this over scikit-learn?
I don't have any ML experience. I find ML quite magical :/ and really difficult to get started with if you don't have a PhD in mathematics.
As a side note, a lot of tutorials I've seen on machine learning use Python, and I'm curious as to why. Is it simply the number of libraries that have been developed for ML tasks, or is there something about Python the language that makes it especially suitable (versus, say, Ruby or Haskell)?
100% agree, and there are a number of efforts in the space. mlpack (https://www.github.com/mlpack/mlpack/), Shogun (https://www.shogun-toolbox.org/), and Shark (https://www.shark-ml.org/) are three that have been around for over a decade now. They're a little niche because C++ is not that popular for data science, but they are generally pretty fast (especially mlpack, which focuses on speed).
in my opinion competes with Python when it comes to DS/ML. I find it a lot more comfortable to use if you use Emacs bindings.