Data Mining Models: Tom’s 10 Data Tips

What is a model? A model is a deliberate simplification of reality. Models can take many forms: a scale replica, a mathematical equation, a spreadsheet, a person, a scene, and many others. In all cases, the model takes only part of reality; that is why it is a simplification. And in all cases, the way in which the complexity of real life is reduced is chosen with an objective: to focus on particular features, at the expense of losing superfluous details.

If you ask my son, Carmen Electra is the ultimate model. She represents an image of women in general, and embodies a particularly attractive one. A model for a wind tunnel may look like the real car, at least on the outside, but it needs no real engine, brakes, tires, etc. The purpose is to focus on aerodynamics, so this model only needs an identical outer shape.

Data mining models reduce complex relationships in data to a simplified representation of its characteristic patterns. This serves one of two purposes. Either to predict or describe mechanics, for example, “what characteristics of the application form are indicative of a future credit card default?”. Or, secondly, to give insight into complex, high-dimensional patterns. An example of the latter is customer segmentation: based on clustering of similar patterns of attributes from the database, groups are defined such as high income/high spending/credit need, low income/credit need, high income/frugal/no credit need, etc.
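The segmentation example can be sketched with a minimal k-means clustering pass. This is an illustrative sketch only; the customer data and the number of segments are made up:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal k-means: assign each point to its nearest centroid,
    then recompute each centroid as the mean of its cluster."""
    random.seed(seed)
    centroids = random.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2
                                      for a, b in zip(p, centroids[c])))
            clusters[i].append(p)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [tuple(sum(col) / len(c) for col in zip(*c)) if c
                     else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

# Hypothetical customers as (income, spending) pairs, in thousands.
customers = [(120, 90), (115, 85), (30, 25), (28, 20), (125, 15), (130, 10)]
centroids, clusters = kmeans(customers, k=3)
```

Each resulting cluster is a candidate segment (e.g. high income/high spending); the analyst's job is then to label and act on them.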

1. A predictive model assumes that the future is like the past

As Yogi Berra said: “Predicting is difficult, especially about the future.” The same applies to data mining. What is commonly known as “predictive modeling” is essentially a classification task.

Based on the (grand) assumption that the future will resemble the past, we classify future events by their similarity to past events. We then ‘predict’ that they will behave as similar events did in the past.
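That idea, taken literally, is a nearest-neighbour classifier: score a new case by the most similar past case. A minimal sketch, with entirely hypothetical applicant data:

```python
def predict_like_the_past(history, new_case):
    """1-nearest-neighbour: find the most similar past case and
    'predict' that the new case behaves the same way."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    features, outcome = min(history, key=lambda rec: dist(rec[0], new_case))
    return outcome

# Hypothetical past applicants: (income, debt_ratio) -> defaulted?
history = [((60, 0.2), False), ((25, 0.8), True),
           ((80, 0.1), False), ((30, 0.7), True)]
print(predict_like_the_past(history, (28, 0.75)))  # → True
```

The “grand assumption” is baked in: the prediction is only as good as the resemblance between tomorrow’s cases and yesterday’s.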

2. Even a ‘purely’ predictive model should always be explained

Predictive models are generally used to provide scores (e.g., churn probability) or decisions (accept yes/no). Regardless, they should always be accompanied by explanations that give insight into the model. This is for two reasons:

  1. Buy-in from business stakeholders, needed to act on the predictions, is of paramount importance and benefits from understanding
  2. Quirks in the data may sometimes arise, and can become obvious from the explanation of the model

3. It’s not about the model, but about the results it generates

Models are developed with a purpose. Too often, data miners fall in love with their own methodology (or algorithms). Nobody cares. The clients who should benefit from using a model are interested in only one thing: “What’s in it for me?”

Therefore, top of mind for a data miner should be, “How do I communicate the benefits of using this model to my client?” This requires patience, persistence and the ability to explain in business terms how the use of the model will affect the results of the company. Practice explaining this to your grandmother, and you will go a long way to becoming effective.

4. How is the ‘success’ of a model measured?

There are really two answers to this question: one important and simple, and one academic and tremendously complex. What counts most is the result in commercial terms. This can range from the response rate of a direct-marketing campaign, to the number of fraudulent claims intercepted, the average sale per lead, the churn rate avoided, and so on.

The academic question is how to determine the improvement a model gives over the best alternative course of business action. This turns out to be an intriguing and widely misunderstood question, and a frontier of scientific study and mathematical theory. The bias-variance decomposition is one such mathematical frontier.
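For squared-error loss, that decomposition has a standard form: the expected prediction error splits into bias, variance, and irreducible noise (here $f$ is the true function, $\hat{f}$ the fitted model, and $\sigma^2$ the noise variance):

```latex
\mathbb{E}\left[(y - \hat{f}(x))^2\right]
  = \underbrace{\left(\mathbb{E}[\hat{f}(x)] - f(x)\right)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\left[\left(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\right)^2\right]}_{\text{variance}}
  + \sigma^2
```

Simpler models tend to trade variance for bias, and vice versa; the business question is which trade-off maximizes the commercial result.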

5. A model predicts only as well as the data it contains

The old “Garbage In, Garbage Out” (GIGO) is trite but, sadly, true. But there is more to this topic. Across a wide range of industries, channels, products, and environments, we have found a common pattern: the input (predictive) variables can be ordered from transactional to demographic, and from transient and volatile to stable.

In general, transactional variables that relate to (recent) activity have the greatest predictive power. Less dynamic variables, such as demographics, tend to be weaker predictors. The downside is that the predictive power of models based on behavioral and transactional variables generally degrades faster over time. Therefore, such models need to be updated or rebuilt more often.

6. Models should be monitored for performance degradation

Once a model is deployed, always monitor its implementation by reviewing its effectiveness. Not doing so is comparable to driving a car with blinders on: reckless.

To monitor the performance of a model over time, check whether the predictions generated by the model match the response patterns observed once it is implemented in real life. Although not rocket science, this can be tricky to achieve in practice.
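A minimal sketch of such a check, assuming you can join each scored case to its realized outcome: compare the mean predicted probability with the observed outcome rate, and flag the model when the gap exceeds a tolerance you set. The numbers below are invented:

```python
def monitor(predicted_probs, actual_outcomes, tolerance=0.05):
    """Compare the model's mean predicted probability with the
    observed outcome rate; the model passes if the gap stays
    within the given tolerance."""
    expected = sum(predicted_probs) / len(predicted_probs)
    observed = sum(actual_outcomes) / len(actual_outcomes)
    return abs(expected - observed) <= tolerance, expected, observed

# Hypothetical month of campaign results (1 = responded).
ok, exp, obs = monitor([0.1, 0.3, 0.2, 0.4], [0, 1, 0, 0])
```

In practice this is run per period (e.g. monthly) and per score band, so a drifting segment shows up before the aggregate numbers do.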

7. Classification accuracy is not a sufficient indicator of model quality

Contrary to common belief, even among data miners, no single classification-accuracy number (R², Gini coefficient, lift, etc.) is sufficient to quantify model quality. The reason has nothing to do with the model itself, but rather with the fact that a model derives its quality from being applied.

The quality of model predictions requires at least two numbers: one to indicate the accuracy of the prediction (commonly the only number provided), and another to reflect its generalization. The latter indicates resilience to changing multivariate distributions: the degree to which the model will hold up as reality slowly changes. It can therefore be measured by the multivariate representativeness of the input variables in the final model.
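One common way to put a number on that second quality is a stability measure such as the Population Stability Index (PSI), which compares the score (or input-variable) distribution at build time against the current one. A rough sketch, with made-up bin fractions:

```python
import math

def psi(expected_frac, actual_frac):
    """Population Stability Index over matching bins:
    sum of (actual - expected) * ln(actual / expected)."""
    return sum((a - e) * math.log(a / e)
               for e, a in zip(expected_frac, actual_frac))

# Hypothetical score distribution at build time vs. today (4 bins).
baseline = [0.25, 0.25, 0.25, 0.25]
current  = [0.30, 0.25, 0.25, 0.20]
drift = psi(baseline, current)
```

A PSI near zero means the scored population still looks like the build sample; values above roughly 0.1-0.25 are commonly read as a warning that the model’s generalization assumption is eroding.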

8. Exploratory models are only as good as the information they provide

There are many reasons why you might want to provide insight into the relationships found in your data. In all cases, the purpose is to reduce a large amount of data, and an exponential number of possible relationships, to digestible proportions. You knowingly ignore details and point out “interesting” and potentially actionable highlights.

The key here, as Einstein already pointed out, is to have a model that is as simple as possible, but no simpler. It should be as simple as possible in order to impose structure on complexity. At the same time, it should not be so simple that the image of reality becomes too distorted.

9. Get a decent model fast, instead of a great one later

In almost all business environments, it is far more important to quickly implement a reasonable model than to work to improve it. This is for three reasons:

  1. A working model is making money; a model under construction is not
  2. Once a model is in place, it has the opportunity to “learn from experience”; the same applies to even a slight improvement: does it work as expected?
  3. The best way to manage model updates is to speed up the update cycle; there is no better practice than actually doing it… 🙂

10. Data mining models: what’s in it for me?

Who needs data mining models? As the world around us becomes increasingly digitized, possible applications abound. And as data mining software has come of age, you no longer need a Ph.D. in statistics to operate such applications.

In almost every case where data can be used to make smart decisions, there is a good chance that models can help. When underwriters were replaced by scorecards (a particular type of data mining model) some 40 years ago, no one could believe that such a simple set of decision rules could be effective. Early adopters have made fortunes since.

Further reading

Some excellent books on data mining:

Dorian Pyle (2003) Business Modeling and Data Mining. ISBN 1-55860-653-X

Dorian Pyle (1999) Data Preparation for Data Mining. ISBN 1-55860-529-0

Michael Berry and Gordon Linoff (2000) Mastering Data Mining. ISBN 0-471-33123-6

Source: Data Mining Models: Tom’s 10 Data Tips