May 18, 2016 at 10:56 am #16801
Ensemble models are implemented in Kamanja with DAGs (Directed Acyclic Graphs).
Combine Models in Parallel
For example, you can have 6 models at stage one, each outputting its own independent score. Another model at stage two can then take as input all the stage one scores plus the input data (or just the most important input fields).
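A minimal Python sketch of that parallel structure is below. It is not Kamanja code; the six toy stage-one models, the field names `x1` and `x2`, and the stage-two weights are all made up for illustration.

```python
from typing import Callable, Dict, List

Record = Dict[str, float]

# Six hypothetical stage-one models. Each one scores the record independently.
STAGE_ONE_MODELS: List[Callable[[Record], float]] = [
    lambda r: r["x1"],
    lambda r: r["x2"],
    lambda r: r["x1"] + r["x2"],
    lambda r: r["x1"] - r["x2"],
    lambda r: r["x1"] * r["x2"],
    lambda r: max(r["x1"], r["x2"]),
]

# Assumption: the stage-two model also sees one "most important" input field.
KEY_FIELDS = ["x1"]

# Assumption: stage two is a simple weighted sum (6 scores + 1 raw field).
STAGE_TWO_WEIGHTS = [0.2, 0.2, 0.1, 0.1, 0.1, 0.2, 0.1]


def run_stage_one(record: Record) -> List[float]:
    """Run all stage-one models; none depends on another's output."""
    return [model(record) for model in STAGE_ONE_MODELS]


def run_stage_two(record: Record, scores: List[float]) -> float:
    """Combine the stage-one scores with selected raw input fields."""
    features = scores + [record[f] for f in KEY_FIELDS]
    return sum(w * x for w, x in zip(STAGE_TWO_WEIGHTS, features))


def score_record(record: Record) -> float:
    """Full two-stage data flow for one record."""
    return run_stage_two(record, run_stage_one(record))
```

In a DAG, the six stage-one nodes have no edges between them, so they can run in parallel; the stage-two node has an edge from each of them and from the input.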
Combine Models in Series or Sequence
Another common data mining implementation that uses DAGs is the residual model, or Gradient Boosting Machines. In R, see gbm. In Salford Systems, see TreeNet.
A simple residual model has two stages. The first stage creates a forecast. The second stage model is trained with the error residual of the first model as its target. In effect, the second model learns to correct the mistakes of the first model, and the final deployment sums the scores of the two models.
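As a rough sketch of the two-stage residual idea in Python: the choice of a constant (mean) model for stage one and a least-squares line for stage two is only an assumption to keep the example small, and the data is made up.

```python
def fit_mean(ys):
    """Stage one (toy): always forecast the mean of the training target."""
    mean = sum(ys) / len(ys)
    return lambda x: mean


def fit_line(xs, ys):
    """Stage two (toy): ordinary least-squares line through (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


# Made-up training data: y = 2x + 1
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

stage_one = fit_mean(ys)

# The residual of stage one is the training target for stage two.
residuals = [y - stage_one(x) for x, y in zip(xs, ys)]
stage_two = fit_line(xs, residuals)


def predict(x):
    """Deployment: sum the scores of the two stages."""
    return stage_one(x) + stage_two(x)
```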
The Gradient Boosting Machines (GBM) algorithm uses not just two steps, as in the residual model example above, but perhaps 500 to 2,000 steps, with all the scores summed together. GBM uses a DIFFERENT random sample of the data (50%, for example) at each step, to help with generalization. At each step, a small CART tree is trained, which may be limited to 6 leaves. This also helps generalization, because you end up with larger sample sizes in the leaves. On the page http://dmg.org/pmml/products.html, if you search for “TreeNet”, you will see that Salford Systems lists this as using the PMML composition.
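To make the loop concrete, here is a simplified Python sketch of boosting on residuals with a fresh 50% sample at each step. It is not the gbm or TreeNet implementation. The simplifications are my assumptions: one input variable, squared error, 2-leaf stumps instead of 6-leaf CART trees, 200 steps, and a shrinkage factor (`learning_rate`) applied to each tree before summing.

```python
import random


def fit_stump(xs, targets):
    """Fit a one-split regression tree (2 leaves) by minimizing squared error."""
    pairs = sorted(zip(xs, targets))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between identical x values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [t for _, t in pairs[:i]]
        right = [t for _, t in pairs[i:]]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((t - left_mean) ** 2 for t in left)
               + sum((t - right_mean) ** 2 for t in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:  # no valid split: fall back to a single leaf
        mean = sum(targets) / len(targets)
        return lambda x: mean
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def fit_gbm(xs, ys, n_steps=200, sample_fraction=0.5, learning_rate=0.1, seed=0):
    """Sum many small trees, each trained on the current residuals."""
    rng = random.Random(seed)
    base = sum(ys) / len(ys)          # step zero: constant forecast
    preds = [base] * len(xs)
    trees = []
    n_sample = max(2, int(sample_fraction * len(xs)))
    for _ in range(n_steps):
        # A different random sample of the rows at every step.
        idx = rng.sample(range(len(xs)), n_sample)
        residuals = [ys[i] - preds[i] for i in idx]
        tree = fit_stump([xs[i] for i in idx], residuals)
        trees.append(tree)
        preds = [p + learning_rate * tree(x) for p, x in zip(preds, xs)]

    def predict(x):
        return base + learning_rate * sum(tree(x) for tree in trees)

    return predict


# Made-up data: y = x^2 on 20 points
xs = [float(i) for i in range(20)]
ys = [x * x for x in xs]
predict = fit_gbm(xs, ys)
```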
Other structures – Mortgage Bond Pricing model use case
There is no need to be limited to those ensemble structures. It comes down to supporting a data flow to solve a problem. On one project, I was forecasting prices for a mortgage bond. A mortgage bond may be a bundle of 100 mortgages. From the bank’s view, a mortgage has a lower value the more the person prepays the mortgage. The less total interest paid over the life of the loan, the lower the income value is to the bank.
The client was servicing 12% of the mortgages in the US, and had rare visibility into the inherent “prepayment risk” of a subset of mortgages in a given bond. An individual investor would not have this visibility. The mortgage servicing side of the business (processing monthly payments) could share data with the bank’s financial trading floor (at least in the 1990s).
The data mining problem was to forecast, each month, the amount of mortgage prepayment that month. This was done repeatedly, for the remaining possible life of the loan, and then mapped back to a net present value. Because refinancing and other behaviors are so dependent on the core interest rate environment (e.g. LIBOR or other indexes), we did a “random walk”, assuming a small change in interest rate from month to month (drawn from a Gaussian distribution). We did 3 such random walks for robustness.
While the behavior of prepaying a mortgage sounds like one problem, the more we dug into our EDA (Exploratory Data Analysis), the more we realized there were a number of fundamentally different behaviors.
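The month-by-month loop can be sketched in Python as below. This is only an illustration, not the model we built: `toy_prepayment` is a made-up stand-in for the trained forecast model, the step size and discounting convention are assumptions, scheduled principal payments are ignored, and the sketch discounts only the forecast prepayments.

```python
import random


def simulate_rate_path(start_rate, n_months, step_sd, rng):
    """Random walk: each month the rate moves by a small Gaussian step."""
    rates = [start_rate]
    for _ in range(n_months - 1):
        rates.append(max(0.0, rates[-1] + rng.gauss(0.0, step_sd)))
    return rates


def toy_prepayment(market_rate, loan_rate, balance):
    """Hypothetical stand-in for the forecast model: people prepay more
    when market rates fall below the rate on their loan."""
    incentive = max(0.0, loan_rate - market_rate)
    return min(balance, balance * 5.0 * incentive)


def net_present_value(cash_flows, annual_rates):
    """Discount monthly cash flows using that month's rate (assumed convention)."""
    value = 0.0
    discount = 1.0
    for cash, rate in zip(cash_flows, annual_rates):
        discount /= 1.0 + rate / 12.0
        value += cash * discount
    return value


def value_over_walks(balance, loan_rate, start_rate, n_months,
                     n_walks=3, step_sd=0.001, seed=0):
    """Repeat the monthly forecast over several random walks; one NPV per walk."""
    rng = random.Random(seed)
    values = []
    for _ in range(n_walks):
        rates = simulate_rate_path(start_rate, n_months, step_sd, rng)
        remaining = balance
        prepayments = []
        for rate in rates:
            paid = toy_prepayment(rate, loan_rate, remaining)
            remaining -= paid
            prepayments.append(paid)
        values.append(net_present_value(prepayments, rates))
    return values


npvs = value_over_walks(balance=100_000.0, loan_rate=0.07,
                        start_rate=0.06, n_months=360)
```

How the per-walk values are combined (average, worst case, and so on) is left open here.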
a) some people never prepay
b) some people prepay a small fixed amount each month
c) some people make one or more extra payments per year
d) some people prepay different amounts, based on different factors, all of which we may not understand
e) when a person refinances a mortgage, for purposes of bond pricing and “prepayment risk” this is considered the largest prepayment – the rest of the loan.
Ensemble Model for Mortgage Bond Pricing
The stage one model was intended to be a low resolution model, accurate at an overall level, to segment the records into 7 different subsets. It was a neural net with 7 targets. Then, at the second stage, I had 7 “specialist models”, each focusing on a more consistent subset of human behavior. Many of the training cases that could cause confusion had been removed from each second stage model’s training data.
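The routing between the two stages can be sketched in Python as follows. Everything concrete here is hypothetical: the field names `prepay_ratio` and `balance`, the distance-based scorer standing in for the neural net, and the toy specialists.

```python
N_SEGMENTS = 7

# Assumption: a toy scorer in place of the stage one neural net.
# It returns 7 scores, one per segment; higher means a better match.
SEGMENT_CENTERS = [i / 10 for i in range(N_SEGMENTS)]


def stage_one_scores(record):
    return [-abs(record["prepay_ratio"] - center) for center in SEGMENT_CENTERS]


def assign_segment(record):
    """Pick the segment with the highest stage one score."""
    scores = stage_one_scores(record)
    return max(range(N_SEGMENTS), key=scores.__getitem__)


def make_specialist(segment):
    """Hypothetical specialist: forecasts prepayment for one segment only."""
    return lambda record: record["balance"] * segment * 0.01


SPECIALISTS = [make_specialist(s) for s in range(N_SEGMENTS)]


def score_record(record):
    """Stage one routes the record; exactly one specialist scores it."""
    segment = assign_segment(record)
    return segment, SPECIALISTS[segment](record)


def split_by_segment(records):
    """Partition training data so each specialist trains on its own subset."""
    subsets = {}
    for record in records:
        subsets.setdefault(assign_segment(record), []).append(record)
    return subsets
```

Unlike the parallel ensemble, where every stage one model scores every record, only one second stage node runs per record in this structure.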
[SIDEBAR: In my humble opinion, another way the 2008 financial crisis could have been minimized would have been to provide better visibility into the components of a mortgage bond. For privacy, address and all identifiable information could be suppressed. Bundling very high risk mortgages, at risk of going into default, as the majority of the contents of a specific bond, without transparency, creates a huge unknown risk. Risky financial instruments can still be on the market, but expectations need to be set. Even if this data were only available to a bond rating agency that shares its risk rating publicly, the added transparency would be a huge improvement. ]
Kamanja DAGs do not have to be related to data mining models. They can simply describe a data flow.
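To illustrate a DAG as a plain data flow, here is a generic Python sketch using the standard library `graphlib` module (Python 3.9+). It does not use or imitate Kamanja's API; the step names and the `run_dag` helper are made up.

```python
from graphlib import TopologicalSorter


def run_dag(steps, dependencies, inputs):
    """Run each step once all of the steps it depends on have finished.

    steps:        name -> function
    dependencies: name -> list of upstream names (passed as arguments, in order)
    inputs:       name -> value for nodes that are raw inputs
    """
    results = dict(inputs)
    for name in TopologicalSorter(dependencies).static_order():
        if name in results:  # raw input, nothing to compute
            continue
        args = [results[dep] for dep in dependencies.get(name, ())]
        results[name] = steps[name](*args)
    return results


# A small data flow with no models in it at all.
steps = {
    "clean": lambda raw: [x for x in raw if x is not None],
    "total": sum,
    "count": len,
    "mean": lambda total, count: total / count,
}
dependencies = {
    "clean": ["raw"],
    "total": ["clean"],
    "count": ["clean"],
    "mean": ["total", "count"],
}
results = run_dag(steps, dependencies, {"raw": [1, None, 2, 3]})
```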