By Miguel Gonzalez-Fierro, Microsoft.
Fraud detection, which can be addressed using machine learning, is one of the top priorities for banks and financial institutions. According to a report published by Nilson, worldwide losses in card fraud related cases reached 22.8 billion dollars in 2017. The problem is forecast to get worse in the coming years: by 2021, the card fraud bill is expected to reach 32.96 billion dollars.
In this tutorial, we will use the credit card fraud detection dataset from Kaggle to identify fraud cases. We will use a gradient boosted tree as the machine learning algorithm. Finally, we will create a simple API to operationalize (o16n) the model.
We will use the gradient boosting library LightGBM, which has recently become one of the most popular libraries for top participants in Kaggle competitions.
Fraud detection problems are known for being extremely imbalanced. Boosting is one technique that usually works well with this kind of dataset. It iteratively creates weak classifiers (decision trees), weighting the instances to increase the performance. In the first subset, a weak classifier is trained and tested on all the training data; those instances that have bad performance are weighted to appear more in the next
$ sudo apt-get update
$ sudo apt-get install cmake build-essential libboost-all-dev -y
$ conda env create -n fraud -f conda.yaml
$ source activate fraud
(fraud)$ python -m ipykernel install --user --name fraud --display-name "Python (fraud)"
The first step is to load the dataset and analyze it.
Before continuing, you have to run the notebook data_prep.ipynb, which generates the SQLite database.
5 rows × 31 columns
As we can see, the dataset is extremely imbalanced. The minority class accounts for around 0.2% of the examples (a fraction of about 0.002).
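As a minimal sketch of this check, the snippet below counts the examples per class with a SQL query. The table name `transactions` and the in-memory toy rows are assumptions for illustration; `Class` is the label column of the Kaggle dataset (1 = fraud), and the real database is the one generated by data_prep.ipynb.

```python
import sqlite3


def class_balance(conn, table="transactions", label="Class"):
    """Return {label_value: fraction_of_rows} for a table.

    `table` and `label` are assumed names; adjust them to the schema
    produced by data_prep.ipynb.
    """
    rows = conn.execute(
        f'SELECT "{label}", COUNT(*) FROM "{table}" GROUP BY "{label}"'
    ).fetchall()
    total = sum(count for _, count in rows)
    return {value: count / total for value, count in rows}


# Toy in-memory database standing in for the real SQLite file.
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE transactions (Amount REAL, "Class" INTEGER)')
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?)",
    [(10.0, 0)] * 998 + [(250.0, 1)] * 2,
)
balance = class_balance(conn)
print(f"fraud fraction: {balance[1]:.3%}")
```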
The next step is to split the dataset into train and test.
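With such a rare positive class, a purely random split can leave very few fraud cases in the test set. One common option is a stratified split, which keeps the class ratio the same in both parts. The helper below is a hypothetical, standard-library-only sketch of the idea, not necessarily the split used in the original notebook; scikit-learn's train_test_split with the stratify argument does the same job on real data.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=42):
    """Return (train_idx, test_idx), preserving the class ratio in each part."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Keep at least one example of every class in the test set.
        n_test = max(1, round(len(indices) * test_fraction))
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels: 990 legitimate transactions (0) and 10 frauds (1).
labels = [0] * 990 + [1] * 10
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
print(sum(labels[i] for i in test_idx), "frauds in the test set")
```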
Training with LightGBM - Baseline
For this task we use a simple set of parameters to train the model. We just want to create a baseline model, so we are not performing cross-validation or parameter tuning here.
The details of the different parameters of LightGBM can be found in the documentation. Also, the authors provide some advice on how to tune the parameters and prevent overfitting.
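As an illustration only, a baseline configuration might look like the dictionary below. The values are assumptions chosen for this sketch, not the ones used in the original notebook. The keys are standard LightGBM parameters; scale_pos_weight is one of the options LightGBM offers for imbalanced binary problems, set here to the ratio of negative to positive examples.

```python
def baseline_params(n_negative, n_positive):
    """Build an illustrative LightGBM parameter dict for a binary problem."""
    return {
        "objective": "binary",      # binary classification (fraud / not fraud)
        "metric": "auc",            # evaluation metric reported during training
        "learning_rate": 0.1,       # shrinkage applied to each new tree
        "num_leaves": 31,           # maximum leaves per tree (LightGBM default)
        "min_child_samples": 20,    # minimum samples per leaf (LightGBM default)
        "reg_lambda": 1.0,          # L2 regularization, a guard against overfitting
        # Up-weight the rare positive class by the negative/positive ratio.
        "scale_pos_weight": n_negative / n_positive,
    }


# Toy counts; in practice they come from the training labels.
params = baseline_params(n_negative=9980, n_positive=20)
print(params)
```

A dictionary like this would be passed as the first argument to lightgbm.train, together with a lightgbm.Dataset built from the training split.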
Once we have the trained model, we can obtain some metrics.
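On a dataset this imbalanced, accuracy says little, so metrics built from the confusion matrix are more informative. The sketch below computes precision, recall and F1 from true and predicted labels using only the standard library; the function name and toy data are illustrative, and the original notebook may report a different set of metrics.

```python
def classification_metrics(y_true, y_pred):
    """Compute confusion-matrix counts plus precision, recall and F1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn,
            "precision": precision, "recall": recall, "f1": f1}


# Toy example: 4 frauds, 3 of them caught, plus 1 false alarm.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```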
In business terms, if the system classifies a fair transaction as fraud (false positive), the bank will investigate the issue probably using human intervention. According to a 2015 report from Javelin Strategy, 15% of all cardholders have had at least one transaction incorrectly declined in the previous year, representing an annual decline amount of almost $118 billion. Nearly 4 in 10 declined cardholders report that they abandoned their card after being falsely declined.
However, if a fraudulent transaction is not detected, effectively meaning that the classifier predicts that a transaction is fair when it is really fraudulent (false negative), then the bank is losing money and the bad guy is getting away with it.
A common way to apply business rules to these predictions is to control the threshold, or operating point, of the prediction. This can be done by changing the threshold value in binarize_prediction(y_prob, threshold=0.5). It is common to loop from 0.1 to 0.9 and evaluate the different business outcomes.
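A threshold loop of this kind can be sketched as follows. The body of binarize_prediction here is an assumed implementation that matches the signature quoted in the text, and the toy labels and probabilities are made up for illustration.

```python
def binarize_prediction(y_prob, threshold=0.5):
    """Turn probabilities into 0/1 labels (assumed implementation)."""
    return [1 if p > threshold else 0 for p in y_prob]


def sweep_thresholds(y_true, y_prob):
    """Count false positives and false negatives for thresholds 0.1 ... 0.9."""
    results = []
    for step in range(1, 10):
        threshold = step / 10
        y_pred = binarize_prediction(y_prob, threshold)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
        results.append((threshold, fp, fn))
    return results


# Toy labels and model scores.
y_true = [0, 0, 0, 0, 1, 1]
y_prob = [0.05, 0.15, 0.35, 0.65, 0.45, 0.95]
for threshold, fp, fn in sweep_thresholds(y_true, y_prob):
    print(f"threshold={threshold:.1f}  false positives={fp}  false negatives={fn}")
```

Lower thresholds catch more fraud at the cost of more false alarms, and higher thresholds do the opposite; which trade-off is acceptable is the business decision.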
O16N with Flask and Websockets
The next step is to operationalize (o16n) the machine learning model. To do so, we are going to use Flask to create a RESTful API. The input of the API is a transaction (defined by its features), and the output is the model prediction.
Additionally, we designed a websocket service to visualize fraudulent transactions on a map. The system works in real time using the library flask-socketio.
To start the API, execute the following inside the conda environment:
(fraud)$ python api.py
First, we make sure that the API is on. It answers with a short message:
The fraud police is watching you
Now, we are going to select one value and predict the output.
Fraudulent transaction visualization
Now that we know that the main endpoint of the API works, we will try the /predict_map endpoint. It creates a real-time visualization system for fraudulent transactions using websockets.
A websocket is a protocol intended for real-time communications developed for the HTML5 specification. It creates a persistent, low latency connection that can support transactions initiated by either the client or server. In this post you can find a detailed explanation of websockets and other related technologies.
When a request reaches /predict_map, the machine learning model evaluates the transaction details and makes a prediction. If the transaction is classified as fraudulent, the server sends a signal using socketio.emit('map_update', location). This signal just contains a dictionary, called location, with a simulated name and location of where the fraudulent transaction occurred. The signal is handled in frauddetection.js, whose websocket part receives a variable newLocation containing the location information, which is saved in a global array called mapLocations. This variable contains all the fraudulent locations that have appeared since the session started. Then there is a clearing process so that amCharts can draw the new information on the map, and finally the array is stored in map.dataProvider.images, which actually refreshes the map with the new point. The variable map is set earlier in the code and is the amCharts object responsible for defining the map.
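The server-side part of this flow can be sketched without a running server. In the snippet below, the emit argument stands in for socketio.emit, and the function name, the threshold, and the keys of the location dictionary are hypothetical; only the 'map_update' event name and the idea of a location dictionary come from the text.

```python
def notify_if_fraud(fraud_probability, location, emit, threshold=0.5):
    """Send a 'map_update' signal when a transaction is classified as fraud.

    `emit` is any callable with the shape emit(event_name, payload); in the
    Flask server it would be socketio.emit.
    """
    is_fraud = fraud_probability > threshold
    if is_fraud:
        emit("map_update", location)
    return is_fraud


# A fake emitter that records signals, so the sketch runs without Flask-SocketIO.
sent = []


def fake_emit(event, payload):
    sent.append((event, payload))


# Hypothetical payload: a simulated name and coordinates.
location = {"title": "Madrid", "latitude": 40.4168, "longitude": -3.7038}
notify_if_fraud(0.93, location, fake_emit)   # classified as fraud: signal sent
notify_if_fraud(0.02, location, fake_emit)   # classified as fair: nothing sent
print(sent)
```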
To make a query to the visualization endpoint:
Now you can go to the map URL (locally it would be http://localhost:5000/map) and see how the map is refreshed with a new fraudulent location every time you execute the previous cell. You should see a map like the following one:
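A minimal sketch of such a query, using only the standard library, is shown below. The base URL follows the local address used in this post; the JSON payload format, the use of POST, and the truncated feature set are assumptions, so check api.py for the exact format the endpoint expects. The helper names are hypothetical.

```python
import json
import urllib.request


def build_request(features, base_url="http://localhost:5000",
                  endpoint="/predict_map"):
    """Build a POST request carrying one transaction as JSON."""
    body = json.dumps(features).encode("utf-8")
    return urllib.request.Request(
        base_url + endpoint,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=5):
    """Send the request; requires the API to be running (python api.py)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status, response.read().decode("utf-8")


# Hypothetical, truncated feature set for one transaction.
transaction = {"Time": 406.0, "V1": -2.31, "V2": 1.95, "Amount": 0.0}
request = build_request(transaction)
print(request.get_method(), request.full_url)
```

Calling send(request) while the API is up submits the transaction to the endpoint.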
Once we have the API, we can test its scalability and response time.
Here you can find a simple load test to evaluate the performance of your API. Please bear in mind that, in this case, there is no request overhead due to the different locations of client and server, since the client and server are the same computer.
The total response time for 10 requests is around 300 ms, which averages to about 30 ms per request.
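A load test of this kind can be sketched with asyncio. In the snippet below, fake_request is a stand-in coroutine that only sleeps, so the example runs without a server; in a real test it would be replaced by an asynchronous HTTP call to the API (for instance with aiohttp). All names are illustrative.

```python
import asyncio
import time


async def fake_request(latency_s=0.01):
    """Stand-in for one HTTP call to the API; it only sleeps."""
    await asyncio.sleep(latency_s)
    return 200


async def load_test(request_fn, n_requests=10):
    """Fire n_requests concurrently; return status codes and elapsed seconds."""
    start = time.perf_counter()
    statuses = await asyncio.gather(*(request_fn() for _ in range(n_requests)))
    elapsed = time.perf_counter() - start
    return list(statuses), elapsed


statuses, elapsed = asyncio.run(load_test(fake_request, n_requests=10))
print(f"{len(statuses)} requests in {elapsed * 1000:.0f} ms "
      f"({elapsed * 1000 / len(statuses):.1f} ms per request on average)")
```

Because the requests run concurrently, dividing the total time by the number of requests gives an average that reflects throughput rather than the latency of any single call.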
Enterprise grade reference architecture for fraud detection
In this tutorial we have seen how to create a baseline fraud detection model. However, for a big company this is not enough.
In the next figure we can see a reference architecture for fraud detection, which should be adapted to each customer's specifics. All services are based on Azure.
Original. Reposted with permission.