Pipeline – Open Source machine learning library

For the analysis of new problems, we use our in-house library. Thanks to its modular design, a data pipeline suited to the problem at hand can be built and evaluated. A data pipeline consists of readers, transformers, classifiers, regressors, normalizers, parameter optimizers, output generators and other modules.
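The modular idea can be illustrated with a minimal Java sketch. All names in it (PipelineSketch, Transformer, Regressor, minMax, linear) are hypothetical and chosen for illustration only; they are not taken from the library's actual API. The sketch chains a normalizer in front of a simple linear regressor:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

// Hypothetical sketch: none of these names come from the eonum pipeline library.
final class PipelineSketch {

    /** A transformer maps one feature vector to another. */
    interface Transformer extends UnaryOperator<double[]> {}

    /** A regressor maps a feature vector to a single prediction. */
    interface Regressor {
        double predict(double[] features);
    }

    private final List<Transformer> transformers = new ArrayList<>();
    private final Regressor regressor;

    PipelineSketch(Regressor regressor) {
        this.regressor = regressor;
    }

    /** Appends a transformer; modules are applied in insertion order. */
    PipelineSketch add(Transformer transformer) {
        transformers.add(transformer);
        return this;
    }

    /** Runs the input through every transformer, then through the regressor. */
    double predict(double[] input) {
        double[] x = input.clone();
        for (Transformer t : transformers) {
            x = t.apply(x);
        }
        return regressor.predict(x);
    }

    /** Min-max normalizer with fixed, caller-supplied bounds (assumes max > min). */
    static Transformer minMax(double min, double max) {
        return v -> {
            double[] out = new double[v.length];
            for (int i = 0; i < v.length; i++) {
                out[i] = (v[i] - min) / (max - min);
            }
            return out;
        };
    }

    /** Linear model with fixed weights, standing in for a trained regressor. */
    static Regressor linear(double[] weights, double bias) {
        return features -> {
            double sum = bias;
            for (int i = 0; i < weights.length; i++) {
                sum += weights[i] * features[i];
            }
            return sum;
        };
    }

    public static void main(String[] args) {
        PipelineSketch pipeline =
                new PipelineSketch(linear(new double[] {1.0, 2.0}, 0.5)).add(minMax(0.0, 10.0));
        System.out.println(pipeline.predict(new double[] {5.0, 10.0}));
    }
}
```

Because every module sits behind a small interface, a normalizer or regressor can be swapped without touching the rest of the chain, which is what makes it practical to evaluate several pipeline variants against each other.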

For each problem, the best solution among a set of candidate solutions is chosen. The optimized solution is then adapted and tested for production. All components run on the Java Virtual Machine and are therefore platform independent. The software can process simple vectors as well as sequences and time series. Both structured and unstructured data can be processed with a corresponding reader.
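Choosing the best candidate can be sketched as follows, assuming that "best" means the lowest error on held-out data. The class and method names are hypothetical, the candidates are plain functions instead of full pipelines, and RMSE is used as the selection criterion only as an example:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.DoubleUnaryOperator;

// Hypothetical sketch: not the selection logic of the eonum pipeline library.
final class CandidateSelectionSketch {

    /** Root mean squared error of a candidate on held-out data. */
    static double rmse(DoubleUnaryOperator model, double[] x, double[] y) {
        double sum = 0.0;
        for (int i = 0; i < x.length; i++) {
            double err = model.applyAsDouble(x[i]) - y[i];
            sum += err * err;
        }
        return Math.sqrt(sum / x.length);
    }

    /** Returns the name of the candidate with the lowest held-out RMSE. */
    static String selectBest(Map<String, DoubleUnaryOperator> candidates, double[] x, double[] y) {
        String best = null;
        double bestScore = Double.POSITIVE_INFINITY;
        for (Map.Entry<String, DoubleUnaryOperator> entry : candidates.entrySet()) {
            double score = rmse(entry.getValue(), x, y);
            if (score < bestScore) {
                bestScore = score;
                best = entry.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        double[] x = {1, 2, 3, 4};
        double[] y = {2, 4, 6, 8};
        Map<String, DoubleUnaryOperator> candidates = new LinkedHashMap<>();
        candidates.put("constant", v -> 5.0);
        candidates.put("identity", v -> v);
        candidates.put("double", v -> 2.0 * v);
        System.out.println(selectBest(candidates, x, y)); // prints "double"
    }
}
```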

The library is released under an open source license and can be downloaded here: https://github.com/eonum/pipeline

Below you can find a selection of pipeline modules.

Classifiers / Regressors

  • Random Forests / Decision Trees / CART
  • Neural Nets
  • Recurrent Neural Nets (Long Short Term Memory)
  • Support Vector Machines
  • Linear Regression
  • Logistic Regression
  • Gradient Boosting
  • Nearest Neighbor
  • Ensemble Methods (Bagging / Boosting)

Optimization

  • Genetic Algorithms
  • Gradient Descent

Clustering / Data Mining

  • Self-Organizing Maps / Kohonen Maps
  • Gaussian Mixture Models
  • K-Means Clustering
  • EM Fuzzy Clustering

Transformers

  • Principal Component Analysis
  • Feature extraction and selection
  • Dynamic Time Warping
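To give an idea of what Dynamic Time Warping, listed above, computes on time series, here is a minimal textbook sketch. It is not the library's implementation; the class name DtwSketch is hypothetical, and the absolute difference is assumed as the local cost:

```java
import java.util.Arrays;

// Hypothetical sketch of the classic dynamic-programming formulation of DTW.
final class DtwSketch {

    /** DTW distance between two sequences in O(n * m) time and space. */
    static double distance(double[] a, double[] b) {
        int n = a.length;
        int m = b.length;
        // d[i][j] holds the cheapest alignment cost of a[0..i) with b[0..j).
        double[][] d = new double[n + 1][m + 1];
        for (double[] row : d) {
            Arrays.fill(row, Double.POSITIVE_INFINITY);
        }
        d[0][0] = 0.0;
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                double cost = Math.abs(a[i - 1] - b[j - 1]);
                // Extend the cheapest of: match, step in a only, step in b only.
                d[i][j] = cost + Math.min(d[i - 1][j - 1], Math.min(d[i - 1][j], d[i][j - 1]));
            }
        }
        return d[n][m];
    }

    public static void main(String[] args) {
        // The second series repeats one value, so warping aligns both at zero cost.
        System.out.println(distance(new double[] {1, 2, 3}, new double[] {1, 2, 2, 3}));
    }
}
```

Unlike a point-by-point distance, this measure tolerates sequences of different lengths and local stretching along the time axis.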

Validation

  • K-Fold Cross-Validation
  • Evaluation metrics (RMSE, AUC, Recognition Rate, LogLoss)
  • Validation of meta-parameters of entire data pipelines
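The combination of k-fold cross validation and an evaluation metric such as RMSE can be sketched like this. The names KFoldSketch, Model, Trainer and meanBaseline are hypothetical, and the fold assignment (sample i goes to fold i mod k) is a simplifying assumption; the library may split and score data differently:

```java
// Hypothetical sketch: not the validation API of the eonum pipeline library.
final class KFoldSketch {

    /** A fitted model that maps an input to a prediction. */
    interface Model {
        double predict(double x);
    }

    /** A training procedure that fits a model on training data. */
    interface Trainer {
        Model fit(double[] x, double[] y);
    }

    /** Mean RMSE over k folds; sample i is assigned to fold i % k. */
    static double crossValidate(Trainer trainer, double[] x, double[] y, int k) {
        if (k < 2 || k > x.length) {
            throw new IllegalArgumentException("k must be between 2 and the number of samples");
        }
        double total = 0.0;
        for (int fold = 0; fold < k; fold++) {
            int testSize = 0;
            for (int i = 0; i < x.length; i++) {
                if (i % k == fold) testSize++;
            }
            double[] trainX = new double[x.length - testSize];
            double[] trainY = new double[x.length - testSize];
            double[] testX = new double[testSize];
            double[] testY = new double[testSize];
            int tr = 0;
            int te = 0;
            for (int i = 0; i < x.length; i++) {
                if (i % k == fold) {
                    testX[te] = x[i];
                    testY[te++] = y[i];
                } else {
                    trainX[tr] = x[i];
                    trainY[tr++] = y[i];
                }
            }
            // Fit on the k - 1 training folds, score on the held-out fold.
            Model model = trainer.fit(trainX, trainY);
            double squared = 0.0;
            for (int i = 0; i < testSize; i++) {
                double err = model.predict(testX[i]) - testY[i];
                squared += err * err;
            }
            total += Math.sqrt(squared / testSize);
        }
        return total / k;
    }

    /** Baseline trainer that always predicts the mean of the training targets. */
    static Trainer meanBaseline() {
        return (x, y) -> {
            double sum = 0.0;
            for (double v : y) sum += v;
            double mean = sum / y.length;
            return in -> mean;
        };
    }

    public static void main(String[] args) {
        double[] x = {0, 1, 2, 3};
        double[] y = {0, 2, 0, 2};
        System.out.println(crossValidate(meanBaseline(), x, y, 2));
    }
}
```

To validate a meta-parameter, the same procedure would be run once per candidate value, keeping the value with the best mean score.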