Hah! One of the goals I made for 2020 was to write once a month. As you can see, I failed quite badly. Oh well… no excuse… I was just lazy, even though I was usually at home due to the pandemic.

Anyway, another goal I made was to take at least one specialised certification in cybersecurity or data science. I didn’t take up any courses in the end because I wasn’t sure how deep I would need to go for it to be useful in my daily work.

But I have to start somewhere, right?

So I decided to just work on a random project that I could find online… walk through it and learn about some of the pitfalls and strategies that can be used to solve certain issues in a data science project. I Googled around and found out about kaggle.com. Then I picked a random dataset (with lots of comments/notebooks, of course) to get myself started.

Introduction

The aim of this project is to build a classifier so that we can detect credit card frauds. The dataset is obtained from https://www.kaggle.com/mlg-ulb/creditcardfraud.

I’ve learned from my workplace that the first step of a data science project is to explore and understand the dataset. The purpose is to identify any data quality issues and figure out what we need to clean or preprocess. Garbage in, garbage out, as they always say.

Data Exploration

The dataset contains 284,807 rows of credit card transactions and 31 columns. Apart from “Time”, “Class” and “Amount”, the rest of the features are the product of Principal Component Analysis (PCA). I’m not sure what these features represent, but based on what I understand we just have to trust that they were processed accurately and correctly.
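For reference, loading the dataset is just a pandas read_csv (I’m assuming here that the Kaggle CSV is saved locally as creditcard.csv, the filename used on the dataset page):

import pandas as pd

# load the Kaggle credit card fraud dataset
df = pd.read_csv("creditcard.csv")

print(df.shape)    # (284807, 31)
print(df.columns)  # Time, V1...V28, Amount, Class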

Checking for Missing Values

I used the following cell to check for any null values in the dataset. There are no missing values, which is good news (but I will have to learn how to handle missing values eventually… that shall come later).

df.isnull().values.any()

Describe the Features

The Amount column refers to the transactional amount for each credit card transaction.
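The summary below was presumably produced with pandas’ describe(), something along these lines (and the Time summary further down the same way):

df["Amount"].describe()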

count    284807.000000
mean         88.349619
std         250.120109
min           0.000000
25%           5.600000
50%          22.000000
75%          77.165000
max       25691.160000
Name: Amount, dtype: float64

The Time column refers to the seconds elapsed between each transaction and the first transaction in the dataset.

count    284807.000000
mean      94813.859575
std       47488.145955
min           0.000000
25%       54201.500000
50%       84692.000000
75%      139320.500000
max      172792.000000
Name: Time, dtype: float64

The Class column is a label where 0 means a valid transaction and 1 means a fraudulent transaction. We can run a quick value count on this column to check the number of transactions in each class.

pd.value_counts(df["Class"]).plot.bar()

The result shows that the dataset is highly imbalanced: there are 284,315 valid transactions and only 492 fraudulent transactions.


An imbalanced dataset can cause problems if we use it directly to train the model: since the vast majority of the transactions belong to a single class, the model can look accurate while being heavily biased towards that class, and it will not generalise well to the fraudulent transactions we actually care about. We will need to deal with this later… but let’s first continue to explore the data.

Checking the Distribution of the “Known” Features

I plotted histograms to observe the distribution of these features.

df["Amount"].sort_values().plot.hist(bins=100)

This shows the following distribution:

Here’s the distribution for Time:

df["Time"].sort_values().plot.hist(bins=100)

This shows that the Amount feature is highly skewed (most transactions are small amounts, with a long tail of large ones), while the Time feature is much more evenly distributed.

Checking the Distribution of “Fraud” and “Valid” Transactions

It seems like the fraudulent transaction amounts are spread out slightly differently compared to the valid transactions. However, we will not know whether there is any correlation between the amount and the class of the transaction until we take a look at the correlation matrix later.

df[df["Class"] == 1]["Amount"].sort_values().plot.hist(bins=15)

df[df["Class"] == 0]["Amount"].sort_values().plot.hist(bins=15)

Dealing with Imbalanced Dataset

As mentioned just now, we can’t use this dataset directly to train our model because the labelled classes are imbalanced. There are more than 280k valid transactions and only around 500 fraudulent ones. This will very likely bias the model towards the majority class when we use it for prediction.

I read about the 7 techniques to handle imbalanced data in this article, and it seems like the most practical option here is to resample the training set. We can choose to reduce the size of the abundant class (under-sampling). However, this only works if the quantity of data is sufficient in the first place. I’m not sure whether 500 samples for each class is a good size… it seems to depend on the kind of features provided as well. Another thing that I think might cause a problem is that we lose a lot of information when we reduce ~284k valid transactions to only 492 samples. Not too sure how much our model will be impacted. Hmm…

The other way is to over-sample the “fraudulent” transactions, but I have difficulties understanding how over-sampling works, since the extra “fraudulent” samples would not be “real” data. I will leave this for a future session if I get the chance to balance the dataset again.
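For my own future reference (not something I’ve tried in this post), over-sampling is often done with SMOTE from the imbalanced-learn package, which generates synthetic minority samples instead of duplicating the real ones. A minimal sketch, assuming imblearn is installed:

from imblearn.over_sampling import SMOTE

X = df.drop("Class", axis=1)
y = df["Class"]

# SMOTE synthesises new "fraud" samples until both classes have the same number of rows
X_resampled, y_resampled = SMOTE().fit_resample(X, y)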

What I did was to take all 492 samples where Class == 1 and randomly select 492 samples where Class == 0. To select the valid samples randomly, we shuffle the dataset first.

resamples = df.sample(frac=1)  # shuffle the whole dataset

fraud = resamples.loc[resamples["Class"] == 1]             # take all 492 fraudulent transactions
non_fraud = resamples.loc[resamples["Class"] == 0][:492]   # take the first 492 non-fraud rows (random, thanks to the shuffle)

new_df = pd.concat([fraud, non_fraud])

pd.value_counts(new_df["Class"]).plot.bar()

Balanced Dataset

Now that I have a balanced dataset… I can move on to process the columns.

Scaling the “known” Features - Amount and Time

All the other features have already gone through PCA except for Amount and Time. This means that these two features are not on the same scale as the rest of the features… which might cause problems for the model.

I used StandardScaler from the sklearn package.

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

new_df["scaled_amount"] = scaler.fit_transform(new_df["Amount"].values.reshape(-1, 1))
new_df["scaled_time"] = scaler.fit_transform(new_df["Time"].values.reshape(-1, 1))

scaled_df = new_df.drop(["Time", "Amount"], axis=1)
scaled_df.head(10)

Scaled time and amount columns

I read from some notebooks that StandardScaler is sensitive to outliers. Another scaler that we can use is RobustScaler, but I think I will explore that after I’m done with the first round of model building.
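Just to note down what the RobustScaler version might look like (it centres on the median and scales by the interquartile range, so outliers affect it much less). A sketch only; I haven’t run this:

from sklearn.preprocessing import RobustScaler

robust = RobustScaler()

# same reshaping as above, but median/IQR-based scaling instead of mean/std
new_df["scaled_amount"] = robust.fit_transform(new_df["Amount"].values.reshape(-1, 1))
new_df["scaled_time"] = robust.fit_transform(new_df["Time"].values.reshape(-1, 1))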

Correlation Matrix

After balancing the dataset and scaling the two features, this is what we can get from the correlation matrix heatmap.

In the beginning, before we rebalanced and scaled the features, this was what we had. We were unable to see any correlation between the features.
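The heatmap images aren’t reproduced here, but they can be generated along these lines (assuming seaborn and matplotlib are available):

import seaborn as sns
import matplotlib.pyplot as plt

# correlation matrix of the balanced, scaled data
plt.figure(figsize=(10, 8))
sns.heatmap(scaled_df.corr(), cmap="coolwarm_r")
plt.show()

# for comparison: the original, imbalanced data, where no correlation is visible
plt.figure(figsize=(10, 8))
sns.heatmap(df.corr(), cmap="coolwarm_r")
plt.show()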

Training and Validating the Model

As a beginner, I’m not too sure what else I can do other than training the model… and trying to understand whether my dataset is sufficient to train it.

Since we are building a classifier, I decided to start with regression. I was trying to figure out the difference between Logistic Regression and Linear Regression. It seems like Logistic Regression is used when the target variable is binary, while Linear Regression is used when the target variable is continuous. I’ve returned most of this to my stats classes in NUS :)

I think there are some pitfalls in using logistic regression… but I don’t know enough yet to Google the issues so I decided to just train first, ask later.

from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

X = scaled_df.drop("Class", axis=1)
y = scaled_df["Class"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

X_train = X_train.values
X_test = X_test.values
y_train = y_train.values
y_test = y_test.values

This is the number of samples in each set:

print("Number of training set samples: {0}".format(len(X_train)))
print("Number of testing set samples: {0}".format(len(X_test)))

Number of training set samples: 787
Number of testing set samples: 197

After splitting the dataset into training and testing sets, I trained the logistic regression classifier on the training set.

classifier = LogisticRegression()
classifier.fit(X_train, y_train)

Validation

Validation is a necessary step to check that our model actually performs well, and the obvious way is to run the classifier on the test set that we prepared beforehand.

According to this, however, validation for tuning the model should only be executed on the training data (for example with cross-validation). The test set needs to be kept completely separate from the training set, so that we can continue to improve and tune the model later and still get an unbiased estimate at the end.
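Incidentally, cross_val_score was imported earlier but never used. Cross-validating on the training set only would look something like this (a sketch, not what I actually ran):

# 5-fold cross-validation on the training set; the test set stays untouched
scores = cross_val_score(LogisticRegression(), X_train, y_train, cv=5, scoring="f1")
print(scores.mean())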

What I did next was to use the classifier to predict on the test set, as shown below.

y_pred = classifier.predict(X_test)

Confusion Matrix

First of all, we can use confusion matrix to see how the model performs on our test set.
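The matrix itself can be computed with sklearn’s confusion_matrix (the heatmap I plotted isn’t reproduced here, but the raw numbers can be printed like this):

from sklearn.metrics import confusion_matrix

# rows are the true classes, columns are the predicted classes
print(confusion_matrix(y_test, y_pred))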

A confusion matrix is used to visualise the performance of the model. In this case, 94 transactions labelled as fraud were successfully predicted as fraud, but 7 transactions that are actually fraudulent were not predicted as fraud. 95 of the valid transactions were labelled correctly, while 1 valid transaction was predicted to be fraudulent.

To improve our model’s performance, we will need to decrease the false negatives (the 7 fraudulent transactions predicted as valid) and the false positives (the 1 valid transaction predicted as fraud).

F1-Score

There are a few metrics that can be used to measure the performance of a model. Here are the ones that are generally used:

From https://towardsdatascience.com/cross-validation-430d9a5fee22

Accuracy = (TP + TN) /(TP + TN + FP + FN)
Precision = (TP) / (TP + FP)
Recall = (TP) / (TP + FN)
F1 Score = (2 x Precision x Recall) / (Precision + Recall)

Precision is the proportion of data points that our model labelled as relevant which really are relevant (that’s why false positives are in the denominator). Recall is the proportion of truly relevant data points that our model managed to find (that’s why false negatives are in the denominator).

We cannot use just one of these metrics (precision or recall) to decide whether the model is performing well. High precision with low recall means that the points the model labels as relevant are mostly correct, but it has missed a lot of points that are actually relevant. Low precision with high recall means that the model finds most of the relevant points, but it also wrongly labels a lot of irrelevant points as relevant.

Hence, the F1 score combines the two metrics so that both are taken into consideration when judging whether the model is performing well.

This is the F1 score on the test set. The higher the F1 score the better it is.

from sklearn.metrics import f1_score

y_pred = classifier.predict(X_test)
score = f1_score(y_test, y_pred)  # f1_score expects (y_true, y_pred) in that order

print(score)

0.9595959595959597
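As a side note, sklearn’s classification_report prints precision, recall and F1 for both classes in one go; it might be handy for the next round:

from sklearn.metrics import classification_report

print(classification_report(y_test, y_pred))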

Conclusion

The score seems to be good but I suspect there are some underlying problems that I have not encountered.

Alright, this is the first dataset I have tried… at least I now understand how I should approach a problem in a data-science manner and what kind of things I need to look out for when validating the model.

I will need more practice to identify the gaps and other pitfalls that I might have overlooked this time round.

Till then…