In part 1 of this series, we defined machine learning and made the connection to enterprise architecture. In part 2, we covered the three types of machine learning algorithms. In our third installment, we will explain how to make machine learning algorithms in six steps.
HOW TO CREATE A MACHINE LEARNING ALGORITHM IN 6 STEPS
Although Enterprise Architects do not need to become junior data scientists, those looking to bring measurable change to their enterprises need a general knowledge of trending subjects in order to advise teams on best practices. Below are the steps to creating a machine learning algorithm.
1. Determine your variables
Determine strong variables that you would like to query later, such as log-in frequency, number of distinct users, number of power users, time since the last contact, and the net promoter score from the last feedback. Get creative here and think about your business. For example, if your company creates multimedia content for customers, think of incorporating vital statistics about the content, including word counts and post reach. If your company produces marmalade, include the history of the types of jam available and the average amount purchased in one transaction.
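As a minimal sketch of what such variables might look like in practice, the following Python snippet aggregates raw log-in events into one feature row per customer. The event layout, the names LOGIN_EVENTS and build_features, and the choice to treat the last log-in as the "last contact" are assumptions made for illustration only.

```python
from collections import defaultdict
from datetime import date

# Hypothetical raw events: (customer_id, user_id, login_date).
LOGIN_EVENTS = [
    ("acme", "u1", date(2024, 1, 3)),
    ("acme", "u2", date(2024, 1, 5)),
    ("acme", "u1", date(2024, 1, 20)),
    ("globex", "u9", date(2024, 1, 10)),
]


def build_features(events, today):
    """Aggregate raw login events into one feature row per customer."""
    logins = defaultdict(int)
    users = defaultdict(set)
    last_seen = {}
    for customer_id, user_id, day in events:
        logins[customer_id] += 1
        users[customer_id].add(user_id)
        if customer_id not in last_seen or day > last_seen[customer_id]:
            last_seen[customer_id] = day
    return {
        customer_id: {
            "login_count": logins[customer_id],
            "distinct_users": len(users[customer_id]),
            # Assumption: the most recent login counts as the last contact.
            "days_since_last_contact": (today - last_seen[customer_id]).days,
        }
        for customer_id in logins
    }


features = build_features(LOGIN_EVENTS, today=date(2024, 1, 31))
```

Each customer ends up with a small dictionary of numeric variables that can be stored and queried in the later steps.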
2. Create interfaces between your connected systems that store your data
To extract meaningful value from large sets of data, your enterprise needs several tools and capabilities: analytics, algorithms, and big data processing. Consider the following: a microservice framework, cloud-based servers, platform as a service (PaaS), and containerization. To extract the most meaningful features, you need access to the most up-to-date information from all relevant systems. You also need to establish a common key for each customer across all of these systems. See lessons learned section for more information.
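To illustrate why a common customer key matters, here is a small sketch that merges records from two systems into one row per customer. The system names, field names, and the merge_by_key helper are hypothetical; real interfaces would pull this data from your actual systems.

```python
# Hypothetical exports from two systems; field names are illustrative only.
CRM_RECORDS = [
    {"customer_key": "C-001", "name": "Acme", "net_promoter_score": 9},
    {"customer_key": "C-002", "name": "Globex", "net_promoter_score": 6},
]
SUPPORT_RECORDS = [
    {"customer_key": "C-001", "open_tickets": 2},
    {"customer_key": "C-003", "open_tickets": 5},
]


def merge_by_key(key, *sources):
    """Combine records from several systems into one row per shared key."""
    merged = {}
    for records in sources:
        for record in records:
            row = merged.setdefault(record[key], {})
            row.update(record)
    return merged


customers = merge_by_key("customer_key", CRM_RECORDS, SUPPORT_RECORDS)

# Customers missing from one of the systems reveal gaps in the integration.
incomplete = sorted(
    key
    for key, row in customers.items()
    if "name" not in row or "open_tickets" not in row
)
```

Without a shared key, the two record sets could not be combined reliably, and the incomplete list shows which customers are known to only one system.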
3. Start simple
A simple database management system (e.g. Amazon RDS, PostgreSQL, or MySQL) will suffice for most projects in the beginning. You will need such a database to store your variables, and it should be independent of your production environment so that analysis work cannot interfere with live systems.
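The following sketch shows how little is needed to get started. It uses Python's built-in sqlite3 module with an in-memory database purely as a stand-in for a separate analytics database such as PostgreSQL or MySQL; the table name, columns, and values are assumptions for illustration.

```python
import sqlite3

# An in-memory SQLite database stands in for a separate analytics database.
connection = sqlite3.connect(":memory:")
connection.execute(
    """
    CREATE TABLE customer_features (
        customer_key TEXT PRIMARY KEY,
        login_count INTEGER NOT NULL,
        distinct_users INTEGER NOT NULL,
        net_promoter_score INTEGER
    )
    """
)

# Hypothetical feature rows, one per customer.
rows = [
    ("C-001", 42, 5, 9),
    ("C-002", 3, 1, 6),
    ("C-003", 17, 2, None),  # no feedback received yet
]
connection.executemany("INSERT INTO customer_features VALUES (?, ?, ?, ?)", rows)
connection.commit()

# Query the stored variables later, e.g. customers with few logins.
low_activity = [
    key
    for (key,) in connection.execute(
        "SELECT customer_key FROM customer_features "
        "WHERE login_count < ? ORDER BY customer_key",
        (20,),
    )
]
```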
4. Prepare and transform the data
Many algorithms assume, or at least work better when, the input variables are not strongly correlated with each other, so you will often need to transform your input data. Methods for doing so include Principal Component Analysis (PCA) and, when class labels are available, Linear Discriminant Analysis (LDA). Quadratic Discriminant Analysis (QDA) is closely related to LDA, but it is a classifier rather than a data transformation. Removing dependencies between input variables is a prerequisite for many methods and usually improves the accuracy of your model in later stages.
If you prefer not to transform your data, you can still use Random Forests, which do not require uncorrelated inputs (but they tend to perform better on uncorrelated data).
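To make the idea of removing correlation concrete, here is a sketch of PCA for the special case of exactly two input variables, written without external libraries. In practice you would more likely use an existing PCA implementation that handles any number of variables; the function names and sample values below are assumptions for illustration.

```python
import math


def mean(values):
    return sum(values) / len(values)


def covariance(xs, ys):
    """Sample covariance of two equally long sequences."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)


def pca_2d(xs, ys):
    """Center two variables and rotate them onto their principal axes."""
    mx, my = mean(xs), mean(ys)
    cx = [x - mx for x in xs]
    cy = [y - my for y in ys]
    # Angle of the first principal axis of the 2x2 covariance matrix.
    theta = 0.5 * math.atan2(
        2 * covariance(xs, ys),
        covariance(xs, xs) - covariance(ys, ys),
    )
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    pc1 = [x * cos_t + y * sin_t for x, y in zip(cx, cy)]
    pc2 = [-x * sin_t + y * cos_t for x, y in zip(cx, cy)]
    return pc1, pc2


# Hypothetical, strongly correlated inputs: logins and distinct users.
logins = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]
distinct_users = [1.0, 2.5, 2.9, 4.2, 5.1, 5.8]

pc1, pc2 = pca_2d(logins, distinct_users)
```

The two original variables are clearly correlated, while the covariance of pc1 and pc2 is zero up to floating-point error. The first component carries most of the variance, which is why PCA is also used to reduce the number of inputs.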
5. Choose a suitable machine learning algorithm
Inform yourself about machine learning algorithms and which one suits your challenge. The following commonly used algorithms cover a wide range of data problems:
- Linear Regression
- Logistic Regression
- Decision Tree
- SVM
- Naive Bayes
- KNN
- K-Means
- Random Forest
- Dimensionality Reduction Algorithms
- Gradient Boosting and AdaBoost
There are numerous methods and algorithms to choose from.
A word of advice: Start simple and get more complicated step by step.
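As an example of starting simple, the sketch below implements k-nearest neighbors (KNN) from the list above in a few lines of plain Python. The training data, labels, and feature choice are hypothetical, and the features are left unscaled for brevity; with real data you would normally scale them first so that no single variable dominates the distance.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest neighbors."""
    nearest = sorted(
        zip(train_points, train_labels),
        key=lambda pair: math.dist(pair[0], query),
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical training data: (logins per month, days since last contact).
train_points = [(20, 2), (18, 5), (25, 1), (2, 40), (1, 60), (3, 35)]
train_labels = [
    "high_engagement", "high_engagement", "high_engagement",
    "low_engagement", "low_engagement", "low_engagement",
]

prediction_active = knn_predict(train_points, train_labels, (22, 3))
prediction_inactive = knn_predict(train_points, train_labels, (2, 50))
```

A model this simple gives you a baseline. A more complex algorithm is only worth its extra effort if it clearly beats that baseline.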
6. Train, test, and re-evaluate the models
This step includes dividing the data into three sets for training, testing, and validating. The training set is used to train the initial machine learning model. The testing set is used to evaluate the trained model: How does the model perform on data it has not yet seen? During the testing phase, it is important to calculate accuracy, precision, and recall.
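A three-way split can be sketched as follows. The 60/20/20 proportions, the fixed seed, and the three_way_split helper are assumptions for illustration; the right proportions depend on how much data you have.

```python
import random


def three_way_split(rows, train_share=0.6, test_share=0.2, seed=42):
    """Shuffle rows and split them into training, testing, and validation sets."""
    shuffled = list(rows)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    train_end = round(len(shuffled) * train_share)
    test_end = train_end + round(len(shuffled) * test_share)
    return shuffled[:train_end], shuffled[train_end:test_end], shuffled[test_end:]


customer_ids = list(range(500))  # hypothetical customer identifiers
train_set, test_set, validation_set = three_way_split(customer_ids)
```

Shuffling before splitting avoids accidental ordering effects, for example when the source data is sorted by sign-up date.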
Use a confusion matrix (also called an error matrix), a table layout that visualizes the performance of a supervised learning algorithm by comparing predicted labels with actual labels.
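For a binary problem, the confusion matrix and the three metrics derived from it can be sketched like this. The label lists are made-up example values, and the function names are chosen for illustration.

```python
def confusion_matrix(actual, predicted, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for truth, guess in zip(actual, predicted):
        if guess == positive:
            counts["tp" if truth == positive else "fp"] += 1
        else:
            counts["fn" if truth == positive else "tn"] += 1
    return counts


def metrics(counts):
    """Derive accuracy, precision, and recall from confusion matrix counts."""
    total = sum(counts.values())
    predicted_positive = counts["tp"] + counts["fp"]
    actual_positive = counts["tp"] + counts["fn"]
    return {
        "accuracy": (counts["tp"] + counts["tn"]) / total,
        "precision": counts["tp"] / predicted_positive if predicted_positive else 0.0,
        "recall": counts["tp"] / actual_positive if actual_positive else 0.0,
    }


# Hypothetical labels from a test set: 1 = positive class, 0 = negative class.
actual = [1, 1, 1, 0, 0, 0, 0, 1, 0, 0]
predicted = [1, 0, 0, 0, 1, 0, 0, 1, 0, 0]

matrix = confusion_matrix(actual, predicted)
scores = metrics(matrix)
```

In this example the accuracy is 0.7 while the recall is only 0.5, which shows why accuracy alone can hide how many positive cases a model misses.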
The validation stage is for comparing models: if you have trained different models via different machine learning algorithms, you can pit them against each other by performing the same analysis of accuracy, precision, and recall on the validation set.
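Comparing candidate models on the validation set might look like the sketch below. The two "models" are simple hand-written rules standing in for trained models, the validation rows are invented, and only accuracy is compared to keep the example short; precision and recall would be compared in the same way.

```python
def accuracy(model, rows):
    """Share of rows for which the model's prediction equals the label."""
    hits = sum(1 for features, label in rows if model(features) == label)
    return hits / len(rows)


# Hypothetical validation rows: (feature dictionary, label).
validation_rows = [
    ({"logins": 25, "days_since_contact": 3}, 1),
    ({"logins": 2, "days_since_contact": 45}, 0),
    ({"logins": 12, "days_since_contact": 40}, 0),
    ({"logins": 4, "days_since_contact": 5}, 1),
    ({"logins": 30, "days_since_contact": 1}, 1),
]

# Stand-ins for trained models: each maps a feature dictionary to a label.
candidates = {
    "login_rule": lambda f: 1 if f["logins"] >= 10 else 0,
    "recency_rule": lambda f: 1 if f["days_since_contact"] <= 14 else 0,
}

validation_scores = {
    name: accuracy(model, validation_rows) for name, model in candidates.items()
}
best_model = max(validation_scores, key=validation_scores.get)
```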
To run these accuracy tests, you need sufficient data to analyze. As a rough guideline, a few hundred customers is the minimum.