2.2 What Is Machine Learning?
Machine learning is a method for teaching computers to make and improve predictions or behaviours based on data.
Predicting the value of a house by learning from historical house sales can be done with machine learning. This book focuses on supervised machine learning, which covers all prediction problems where we know the label or outcome of interest (e.g. the past sale prices of houses) and want to learn to predict it for new data. Excluded from supervised learning are, for example, clustering tasks (= unsupervised learning), where we have no label but want to find clusters of similar data points. Also excluded are things like reinforcement learning, where an agent learns to optimise some reward by acting in an environment (e.g. a computer playing Tetris).
The goal of supervised learning is to learn a predictive model that maps features (e.g. house size, location, type of floor, …) to an output (e.g. value of the house). If the output is categorical, the task is called classification; if it is numerical, regression. A machine learning algorithm learns such a mapping from training data, which consists of pairs of input features and a target. The algorithm learns a model by adjusting parameters (like linear weights) or learning structures (like trees), guided by a score or loss function that is minimised. In the house value example, the machine minimises some measure of the difference between the predicted sales price and the actual sales price. A fully trained machine learning model can then be used to make predictions for new instances and be integrated into a product or process.
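The following is a minimal sketch of this idea, assuming scikit-learn is available; the feature names and prices are invented for illustration. A linear regression model adjusts its weights to minimise the squared difference between predicted and actual sale prices, and the trained model can then predict the value of a new house.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Training data: pairs of input features and a target (past sale prices).
# Features per house: [size in m^2, number of rooms, distance to city centre in km]
X_train = np.array([
    [120.0, 4, 5.0],
    [80.0, 3, 2.5],
    [200.0, 6, 10.0],
    [60.0, 2, 1.0],
])
y_train = np.array([300_000, 250_000, 420_000, 200_000])  # past sale prices (illustrative)

# The algorithm adjusts parameters (linear weights) to minimise a loss,
# here the squared difference between predicted and actual prices.
model = LinearRegression()
model.fit(X_train, y_train)

# The trained model predicts the value of a new, unseen house.
new_house = np.array([[100.0, 3, 4.0]])
print(model.predict(new_house))
```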
Estimating house values, recommending products, identifying street signs, counting people on the street, assessing a person’s creditworthiness and detecting fraud: all these examples have in common that they can be, and increasingly are, realised with machine learning. The tasks are different, but the approach is the same: Step 1 is to collect data. The more, the better. The data must contain the outcome you want to predict and additional information from which to predict it. For a street sign detector (“Is there a street sign in the image?”), you would collect street images and label them accordingly (street sign: yes or no). For a loan default predictor, you need historical data from actual loans, the information whether the customers defaulted on their loans, and data that helps you predict, such as the customers’ income, age and so on. For a house value estimator, you would collect data from historical house sales and information about the properties, such as size, location and so on. Step 2: Feed this data into a machine learning algorithm, which produces a sign detector model, a creditworthiness model or a house value estimator. The model can then be used in Step 3: Integrate the model into the product or process, such as a self-driving car, a loan application process or a real estate marketplace website.
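A hedged sketch of these three steps for the loan default example, assuming scikit-learn and pandas; the column names and values are hypothetical.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Step 1: collect historical loan data, including whether each customer defaulted.
loans = pd.DataFrame({
    "income":    [40_000, 85_000, 23_000, 60_000, 31_000, 95_000],
    "age":       [25, 41, 33, 52, 29, 47],
    "loan_size": [10_000, 30_000, 8_000, 20_000, 15_000, 40_000],
    "defaulted": [1, 0, 1, 0, 1, 0],  # the label we want to predict
})
X = loans[["income", "age", "loan_size"]]
y = loans["defaulted"]

# Step 2: feed the data into a machine learning algorithm to produce a model.
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X, y)

# Step 3: integrate the model into a process, e.g. scoring a new loan application.
new_application = pd.DataFrame({"income": [55_000], "age": [36], "loan_size": [12_000]})
print(model.predict(new_application))        # predicted class: default yes or no
print(model.predict_proba(new_application))  # predicted default probability
```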
Machines surpass humans at many tasks, like playing chess (or, more recently, Go) or predicting the weather. Even when the machine is only as good as a human at a task, or slightly worse, there remain big advantages in speed, reproducibility and scale. A machine learning model, once implemented, can do a task much faster than humans, will reliably produce the same results from the same input and can be copied endlessly. Replicating a machine learning model on another machine is fast and cheap; training a second human to do a task can take decades (especially when they are young) and is very costly. A big disadvantage of using machine learning is that insights about the data and the task the machine is solving are hidden in increasingly complex models. You need millions of numbers to describe a deep neural network, and there is no way to understand the model in its entirety. Other models, such as the random forest, consist of hundreds of decision trees that “vote” to make predictions. To fully understand how a decision was made, you would have to look into the votes and structures of each of those hundreds of trees. That just does not work, no matter how clever you are or how good your working memory is. The best-performing models are often blends of multiple models (also called ensembles) that cannot be interpreted, even if each single model could be interpreted on its own. If you focus only on performance, you will automatically get more and more opaque models. Just have a look at interviews with winners on the kaggle.com machine learning competition platform: the winning models were mostly ensembles of models or very complex models such as boosted trees or deep neural networks.
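A small illustration of how quickly such ensembles grow beyond what a person can inspect, assuming scikit-learn; the data here is random and purely illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(0)
X = rng.rand(500, 10)  # 500 instances, 10 features (random, for illustration only)
y = rng.rand(500)      # random target, just to have something to fit

forest = RandomForestRegressor(n_estimators=500, random_state=0).fit(X, y)

# Each prediction combines the outputs of 500 individual trees ...
n_trees = len(forest.estimators_)
# ... and each tree contains many decision nodes of its own.
n_nodes = sum(tree.tree_.node_count for tree in forest.estimators_)
print(n_trees, n_nodes)  # hundreds of trees, tens of thousands of nodes in total
```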