For the analysis of new problems, we use our in-house developed library. Following modular design principles, a data pipeline suited to the problem at hand can be built and evaluated. A data pipeline consists of readers, transformers, classifiers, regressors, normalizers, parameter optimizers, output generators and other modules.
For each problem, the best solution is chosen from a set of candidate solutions; the optimized solution is then adapted and tested for production. All components run on the Java Virtual Machine and are therefore platform independent. The software can process plain feature vectors as well as sequences / time series data. Both structured and unstructured data can be processed with a corresponding reader.
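The following Java sketch illustrates the modular pipeline idea: a chain of transformers followed by a final model, where every stage can be swapped out independently. The interfaces and values below are illustrative assumptions for this sketch and do not reproduce the library's actual API.

    import java.util.Arrays;
    import java.util.List;

    // Minimal, self-contained illustration of a modular data pipeline:
    // a chain of transformers followed by a final model, each stage replaceable.
    // The interfaces are hypothetical and do not mirror the library's API.
    public class PipelineSketch {

        interface Transformer { double[] transform(double[] input); }
        interface Model { double predict(double[] features); }

        static class Pipeline implements Model {
            private final List<Transformer> stages;
            private final Model model;

            Pipeline(List<Transformer> stages, Model model) {
                this.stages = stages;
                this.model = model;
            }

            public double predict(double[] features) {
                double[] x = features;
                for (Transformer t : stages) x = t.transform(x); // run all transformers in order
                return model.predict(x);                         // final prediction stage
            }
        }

        public static void main(String[] args) {
            // Stage 1: a simple rescaling transformer (divide every feature by 10).
            Transformer rescale = in -> Arrays.stream(in).map(v -> v / 10.0).toArray();
            // Final stage: a trivial linear model with fixed weights.
            Model linear = x -> 0.5 * x[0] + 2.0 * x[1];

            Pipeline pipeline = new Pipeline(List.of(rescale), linear);
            System.out.println(pipeline.predict(new double[]{4.0, 8.0})); // prints 1.8
        }
    }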
The library is released under an open source license and can be downloaded here: https://github.com/eonum/pipeline
Below you can find a selection of pipeline modules.
Classifiers / Regressors
- Random Forests / Decision Trees / CART
- Neural Nets
- Recurrent Neural Nets (Long Short-Term Memory)
- Support Vector Machines
- Linear Regression
- Logistic Regression
- Gradient Boosting
- Nearest Neighbor (see the sketch after this list)
- Ensemble Methods (Bagging / Boosting)
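As a concrete illustration of the simplest entry in this list, here is a minimal, self-contained 1-nearest-neighbor classifier in plain Java; it is a sketch of the technique with made-up data, not the library's implementation.

    public class NearestNeighborSketch {

        // Squared Euclidean distance between two feature vectors.
        static double distance(double[] a, double[] b) {
            double sum = 0.0;
            for (int i = 0; i < a.length; i++) {
                double d = a[i] - b[i];
                sum += d * d;
            }
            return sum;
        }

        // Predict the label of the closest training sample (1-NN).
        static int predict(double[][] trainX, int[] trainY, double[] query) {
            int best = 0;
            double bestDist = Double.MAX_VALUE;
            for (int i = 0; i < trainX.length; i++) {
                double d = distance(trainX[i], query);
                if (d < bestDist) {
                    bestDist = d;
                    best = i;
                }
            }
            return trainY[best];
        }

        public static void main(String[] args) {
            double[][] trainX = {{0, 0}, {0, 1}, {5, 5}, {6, 5}};
            int[] trainY = {0, 0, 1, 1};
            System.out.println(predict(trainX, trainY, new double[]{5.5, 4.8})); // prints 1
        }
    }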
Optimization
- Genetic Algorithms
- Gradient Descent
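A minimal gradient-descent sketch: it minimizes the one-dimensional function f(x) = (x - 3)^2, whose gradient is 2(x - 3); the learning rate and iteration count are arbitrary illustrative choices, not values used by the library.

    public class GradientDescentSketch {

        public static void main(String[] args) {
            // Minimize f(x) = (x - 3)^2 with gradient f'(x) = 2 * (x - 3).
            double x = 0.0;              // starting point
            double learningRate = 0.1;   // step size (illustrative choice)
            for (int i = 0; i < 100; i++) {
                double gradient = 2.0 * (x - 3.0);
                x -= learningRate * gradient;
            }
            System.out.println("x = " + x); // converges towards 3.0
        }
    }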
Clustering / Data Mining
- Self-Organizing Maps / Kohonen Maps
- Gaussian Mixture Models
- K-Means Clustering (see the sketch after this list)
- EM Fuzzy Clustering
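To illustrate the clustering modules, the following is a compact, self-contained k-means (Lloyd's algorithm) sketch with a fixed initialization and invented sample data; it is not the library's implementation.

    import java.util.Arrays;

    public class KMeansSketch {

        // One Lloyd iteration: assign points to the nearest centroid, then recompute centroids.
        static void step(double[][] points, double[][] centroids, int[] assignment) {
            // Assignment step.
            for (int p = 0; p < points.length; p++) {
                double best = Double.MAX_VALUE;
                for (int c = 0; c < centroids.length; c++) {
                    double d = 0.0;
                    for (int j = 0; j < points[p].length; j++) {
                        double diff = points[p][j] - centroids[c][j];
                        d += diff * diff;
                    }
                    if (d < best) { best = d; assignment[p] = c; }
                }
            }
            // Update step: each centroid becomes the mean of its assigned points.
            for (int c = 0; c < centroids.length; c++) {
                double[] sum = new double[centroids[c].length];
                int count = 0;
                for (int p = 0; p < points.length; p++) {
                    if (assignment[p] == c) {
                        for (int j = 0; j < sum.length; j++) sum[j] += points[p][j];
                        count++;
                    }
                }
                if (count > 0) {
                    for (int j = 0; j < sum.length; j++) centroids[c][j] = sum[j] / count;
                }
            }
        }

        public static void main(String[] args) {
            double[][] points = {{1, 1}, {1.5, 2}, {8, 8}, {9, 9}};
            double[][] centroids = {{0, 0}, {10, 10}};   // simple fixed initialization
            int[] assignment = new int[points.length];
            for (int i = 0; i < 10; i++) step(points, centroids, assignment);
            System.out.println(Arrays.toString(assignment));    // [0, 0, 1, 1]
            System.out.println(Arrays.deepToString(centroids)); // the two cluster means
        }
    }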
Transformers
- Principal Component Analysis
- Feature extraction and selection
- Dynamic Time Warping
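Dynamic time warping aligns two sequences of possibly different length. The sketch below computes the standard DTW distance with a full cost matrix and no warping window; the example sequences are invented.

    public class DynamicTimeWarpingSketch {

        // DTW distance between two 1-D sequences using the standard O(n*m) recurrence.
        static double dtw(double[] a, double[] b) {
            int n = a.length, m = b.length;
            double[][] cost = new double[n + 1][m + 1];
            for (double[] row : cost) java.util.Arrays.fill(row, Double.POSITIVE_INFINITY);
            cost[0][0] = 0.0;
            for (int i = 1; i <= n; i++) {
                for (int j = 1; j <= m; j++) {
                    double d = Math.abs(a[i - 1] - b[j - 1]);
                    cost[i][j] = d + Math.min(cost[i - 1][j - 1],
                                     Math.min(cost[i - 1][j], cost[i][j - 1]));
                }
            }
            return cost[n][m];
        }

        public static void main(String[] args) {
            double[] a = {0, 1, 2, 3, 2, 1, 0};
            double[] b = {0, 1, 1, 2, 3, 2, 1, 0};
            System.out.println(dtw(a, b)); // 0.0: the sequences align perfectly
        }
    }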
Validation
- k-fold cross-validation (see the sketch after this list)
- Evaluation metrics (RMSE, AUC, recognition rate, log loss)
- Validation of meta-parameters of entire data pipelines
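As a sketch of how validation could look, the following combines a k-fold split with the RMSE metric for a trivial model that always predicts the training mean; the fold scheme and metric are standard, while the model and data are purely illustrative.

    public class CrossValidationSketch {

        // Root mean squared error between predictions and targets.
        static double rmse(double[] predicted, double[] actual) {
            double sum = 0.0;
            for (int i = 0; i < predicted.length; i++) {
                double d = predicted[i] - actual[i];
                sum += d * d;
            }
            return Math.sqrt(sum / predicted.length);
        }

        public static void main(String[] args) {
            double[] targets = {1.0, 2.0, 3.0, 4.0, 5.0, 6.0};
            int k = 3;
            double totalRmse = 0.0;

            for (int fold = 0; fold < k; fold++) {
                // Indices i with i % k == fold form the test fold, the rest is training data.
                double trainSum = 0.0;
                int trainCount = 0;
                for (int i = 0; i < targets.length; i++) {
                    if (i % k != fold) { trainSum += targets[i]; trainCount++; }
                }
                double mean = trainSum / trainCount; // "model": predict the training mean

                // Evaluate on the held-out fold.
                int testSize = 0;
                for (int i = 0; i < targets.length; i++) if (i % k == fold) testSize++;
                double[] pred = new double[testSize];
                double[] truth = new double[testSize];
                int t = 0;
                for (int i = 0; i < targets.length; i++) {
                    if (i % k == fold) { pred[t] = mean; truth[t] = targets[i]; t++; }
                }
                totalRmse += rmse(pred, truth);
            }
            System.out.println("Average RMSE over " + k + " folds: " + totalRmse / k);
        }
    }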