Outliers: Selection vs. Detection
A common method for detecting fraud is to look for outliers in data. It’s a fair approach: even if a detected outlier doesn’t immediately imply fraud, it can be a good candidate for further investigation. Still, how might we go about selecting hyper-parameters (or even the algorithm)? The hard part is that we have very little to go on. Just like with clustering, there’s no label, so it is incredibly tough to argue whether a certain model is appropriate for a use case.

Luckily there’s a small trick that can help. How about we try to find outliers that simply correlate with fraudulent cases? It might be a surprise to find out that scikit-learn has support for this, but it occurs via a slightly unusual pattern.

Setup

I will demonstrate an approach using this dataset from Kaggle. It’s an unbalanced dataset meant for a fraud use case.

import numpy as np
import pandas as pd
import matplotlib.pylab as plt

df = pd.read_csv("creditcard.csv").rename(str.lower, axis=1)
X, y = df.dr...