Command line tools for Machine learning


https://www.linkedin.com/pulse/command-line-tools-machine-learning-marios-michailidis

Preamble

I thought I should make an article with my favourite tools that can be used from the command line and run with minimal memory requirements. Why consider running an algorithm/tool from the command line? A couple of possible reasons:

  • There is not yet a great wrapper for it, OR the wrapper does not expose all of the tool's available features. In principle you have much more control with the command-line tool.
  • It could run faster.
  • Most of the time it runs with significantly less memory.

Most of the time I used such command-line tools because of the last bullet point, or because the tool only existed in that form. You can imagine that if you wanted to run the algorithm in (for example) Python, you would have to create the Python data object first and then fit the algorithm, which in turn requires additional copies of the data, exhausting the available memory quickly. This is not something to worry about when the tool is used directly.

I will extend this list to also include tools that can be run with minimal memory requirements (“as if” they were runnable from the command line, or that support online learning).

Tools

1. Vowpal Wabbit

This has been by far the most common command-line tool I have used (especially on Kaggle). It is a great tool for parsing really big data (and I have tested it with billions of rows). You can run mostly linear models, but it also allows for some neat and quick implementations of matrix factorization and 1-layer neural networks. Most of the algorithms are trained with some form of SGD.

I particularly like the ability to create namespaces/tags, which can be interpreted as families or groups of features. The great thing about namespaces is that you can create interactions of features on the fly. For example, you may have some features that all belong to “product” attributes, like price, size and sales last month, and you could assign a namespace to them (let's call it ‘A’). Then you could have another namespace (let's call it ‘B’) for “customer” features like height, age and number of children. You could then tell VW to create all possible interactions of A and B (and it is quite fast in doing so!).
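
To make this concrete, here is a minimal sketch of what such a dataset and run could look like (the file names, features and values are hypothetical; the input format and flags are standard VW):

    # train.vw – each line: label |namespace feature[:value] ...
    1 |A price:9.99 size:2 sales_last_month:120 |B age:35 height:1.78 children:2
    -1 |A price:3.49 size:1 sales_last_month:15 |B age:52 height:1.65 children:0

    # logistic regression with all A x B feature interactions created on the fly
    vw train.vw --loss_function logistic -q AB -b 24 -f model.vw

    # score new data with the saved model
    vw test.vw -t -i model.vw -p predictions.txt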

The available command-line parameters can be found here. Installation is quite straightforward for Linux and Mac, and you can find the instructions in the first link at the top. For Windows it is a bit trickier. I myself shamelessly ‘steal’ vw.exe (the executable) from gatapia’s github repo – which, by the way, is a great repo with many good wrappers and tools to help you with competitive data modelling. Another way to install it is by creating a Linux-like environment on your Windows machine, and this tutorial can help you do just that.

You can see the official tutorial and examples to learn how to use it. FastML has a simple demonstration of neural networks through VW, which I found useful in an old Kaggle competition.
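
As a rough illustration of that (again with hypothetical file names), a single-hidden-layer network in VW is just one extra flag on top of the usual invocation:

    # 30 hidden units, several passes over a cached copy of the data
    vw train.vw --loss_function logistic --nn 30 --passes 5 -c -f nn_model.vw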

2. Kaggler

Kaggle master Jeong-Yoon Lee created the Kaggler package for Python. It includes many online learning algorithms that have proven useful in multiple Kaggle contests. It is written in Cython for better speed and memory usage. This is technically not a command-line tool, but it can be used with minimal effort from within Python, and it only requires specifying a file (to be parsed) along with certain hyperparameters for the available algorithms.

There are 2 things to highlight regarding Kaggler. The first is that it includes Tinrtgu's FTRL logistic model, used in the Avazu Kaggle competition to beat the benchmark with less than 1MB of memory. That dataset had many millions of rows, and this benchmark was competitive, quick and memory-efficient. The second is that I have found its 2-hidden-layer NN implementation very efficient, and I have also implemented it inside StackNet.

PS. Kaggler has a Facebook page where you can direct any questions regarding the tool. There are 2 Kaggle grandmasters administrating the page, and it posts very good material.

3. LibFM

LibFM implements factorization machines, where pairwise interactions among features are factorized using latent features. The official paper that explains LibFM can be viewed here. This tool has been extremely useful in recommendation tasks, especially when trying to predict the rating a user will assign to a product/service. This type of algorithm (e.g. collaborative filtering with factorization machines) was among those used to win the infamous Netflix Prize, which changed the course of predictive modelling forever!

I personally found it very useful in this Kaggle competition.

You may find details on how to install it in the official manual, which also covers the required data formats as well as the available command-line parameters. Once again, installation for Linux and Mac should be easier. For Windows, consider stealing libfm.exe from gatapia’s github (again) – just don't make it a habit, ok?
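
For reference, a typical libFM run looks roughly like the following (hypothetical file names; the data files use the usual LIBSVM-style sparse format and the flags are the documented ones):

    # regression task, 8 pairwise latent factors, MCMC inference for 100 iterations
    ./libFM -task r -train train.libfm -test test.libfm -dim '1,1,8' -iter 100 -method mcmc -out predictions.txt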

4. LibFFM

Field-aware factorization machines are a great concept, particularly useful in click-through-rate machine learning problems, and they come with a great multi-core command-line implementation. A nice comparison between linear models, polynomials and typical factorization machine models, alongside LibFFM, may be viewed here.

The basic idea behind this implementation, versus others, is that it treats certain features as “a group”. This is very similar to the namespaces concept described previously for Vowpal Wabbit. Doing so often yields improved results over other typical factorization machine implementations. You can find all the information needed to install it and to learn about its command-line parameters here.
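
A minimal sketch of the data format and a training run (hypothetical file names and indices; the label field:feature:value layout and the flags come from the official documentation):

    # train.ffm – each line: label field:feature:value ...
    1 0:1:1 1:27:1 2:104:0.5
    0 0:3:1 1:9:1 2:77:1.2

    # 4 latent factors, 15 iterations, learning rate 0.2, 4 threads
    ./ffm-train -k 4 -t 15 -r 0.2 -s 4 train.ffm model
    ./ffm-predict test.ffm model predictions.txt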

If for any reason installation is deemed difficult on Windows, I presume by now you know where I am getting at... get ffm-train.exe and ffm-predict.exe from here. It should be noted that LibFFM has been used to win many Kaggle challenges. Two of them are described here and here.

Also, my team used it heavily to achieve 2nd place in a Kaggle competition. My teammate's implementation was also very competitive and easy to use. Another great implementation can be found here. All the details on running it, as well as information about the algorithm, can be found in the pdf.

5. RGF and FastRGF

I have already written an extensive article about this here, therefore I won’t spend more time on it. The only thing I will add is that this is a great algorithm to consider adding to your machine learning arsenal.

6. LIBLINEAR

For fast, regularized linear models that can guarantee convergence without much data preprocessing, there is LIBLINEAR. These implementations of linear models are found in various sklearn models too, and people may question why they would need to use it as a command-line tool. It goes back to the points raised in the beginning: it is more efficient and faster to run from the command line, plus there is a multi-core implementation too. On top of that, it is available in many programming languages, hence it can be integrated with more systems.
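
As a quick sketch (hypothetical file names, data in the LIBSVM sparse format), training and prediction from the command line look like this:

    # -s 0: L2-regularized logistic regression (primal), -c: regularization cost
    ./train -s 0 -c 1 train.txt model.txt
    ./predict test.txt model.txt output.txt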

I have personally found it (among other things) very educational to look at the code and the main paper with regard to optimizers and regularization in linear models.

7. LIBSVM

LIBSVM is LIBLINEAR’s equivalent for support vector machines. While memory usage remains a problem (as with all SVM implementations), prediction-wise it is probably the best tool out there. Nowadays there are even GPU implementations of it that can make training much faster.

Sklearn’s main SVM implementation is based on LIBSVM. Also, the LIBSVM data format is now considered the standard representation of sparse data in most command-line tools.
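
A minimal example (hypothetical file names; the flags are LIBSVM's standard ones):

    # data format: label index:value index:value ...   e.g. "+1 1:0.7 4:1 12:0.3"
    # -t 2: RBF kernel, -c: cost, -g: gamma, -b 1: produce probability estimates
    ./svm-train -t 2 -c 1 -g 0.5 -b 1 train.txt model.txt
    ./svm-predict -b 1 test.txt model.txt output.txt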

8. RankLib

This tool (written in Java) includes many algorithms used in learning-to-rank machine learning problems. It is ideal for optimizing metrics like NDCG or precision at K. Most of the included algorithms are tree-based, and I have found its implementation of LambdaMART very competitive prediction-wise. You may find instructions on how to install it here and the available command-line parameters here.
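
A rough sketch of training LambdaMART with RankLib (hypothetical file names; the data is in the usual LETOR/SVMrank-style format and the options are RankLib's documented ones):

    # -ranker 6 selects LambdaMART, optimizing NDCG@10 during training
    java -jar RankLib.jar -train train.txt -validate vali.txt -ranker 6 -metric2t NDCG@10 -save lambdamart.model

    # score a test set with the saved model
    java -jar RankLib.jar -load lambdamart.model -rank test.txt -score scores.txt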

RankLib (and LambdaMART) was used to win this Kaggle competition.

9. Xgboost

I would presume that most people know this tool and commonly use it through Python or R. However, there have been situations where I was not able to run it through Python – for example because the sparse matrix created exceeded the memory constraints – yet I was able to run it through the command line.

Details on how to run xgboost from the command line may be viewed here. If you encounter trouble compiling it, you can take the executables from here.
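
As a sketch of how the command-line version works (hypothetical file names and parameter values; the CLI reads a configuration file of key = value pairs, as in the official demos):

    # train.conf
    booster = gbtree
    objective = binary:logistic
    eta = 0.1
    max_depth = 6
    num_round = 100
    data = "train.libsvm"
    eval[test] = "test.libsvm"
    model_out = "xgb.model"

    # train, then predict with the saved model
    ./xgboost train.conf
    ./xgboost train.conf task=pred model_in=xgb.model test:data="test.libsvm" name_pred=pred.txt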

10. LightGBM

Similar to xgboost, this is also a well-known tool with good wrappers. There have been situations where I had to run it from the command line. Many of the command-line-specific parameters are here. Once again, you can take the executables from here.
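
The pattern is similar (hypothetical file names; the configuration keys are the documented ones):

    # train.conf
    task = train
    objective = binary
    data = train.libsvm
    valid = test.libsvm
    num_iterations = 100
    learning_rate = 0.1
    output_model = lgb_model.txt

    # training, then prediction with the saved model
    ./lightgbm config=train.conf
    ./lightgbm task=predict data=test.libsvm input_model=lgb_model.txt output_result=pred.txt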

Conclusion

This is a list of tools that I have used from the command line and have found useful in various predictive modelling tasks and competitions. Many of the tools mentioned have been top performers in Kaggle predictive modelling competitions. There are certain advantages to being able to run a tool from the command line, hence this list could potentially be useful, especially when dealing with very BIG data where memory usage can become an issue (given technological constraints). Finally, there will be more discussions about this and other open-source tools on the following Facebook page.

