Anomaly Detection on E-Commerce Data, Part 1

by Jongwon Lee
#ML #AD #Anomaly Detection
1. Pre-processing & EDA
2. Hypothesis Testing & Insight
1. Pre-processing & EDA
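The two tables are assumed to be loaded with pandas before inspection; a minimal sketch below, where the file names are placeholders rather than the actual paths used in this post.

import pandas as pd

# File names are assumptions; substitute the actual data files.
data_transaction = pd.read_csv('transaction_data.csv')
data_customer = pd.read_csv('customer_data.csv')
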
data_transaction.info()

Data columns (total 10 columns):
 #   Column                            Non-Null Count  Dtype 
---  ------                            --------------  ----- 
 0   customerEmail                     623 non-null    object
 1   transactionId                     623 non-null    object
 2   orderId                           623 non-null    object
 3   paymentMethodId                   623 non-null    object
 4   paymentMethodRegistrationFailure  623 non-null    int64 
 5   paymentMethodType                 623 non-null    object
 6   paymentMethodProvider             623 non-null    object
 7   transactionAmount                 623 non-null    int64 
 8   transactionFailed                 623 non-null    int64 
 9   orderState                        623 non-null    object
dtypes: int64(3), object(7)

data_customer.info()

Data columns (total 9 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   customerEmail           168 non-null    object
 1   customerPhone           168 non-null    object
 2   customerDevice          168 non-null    object
 3   customerIPAddress       168 non-null    object
 4   customerBillingAddress  168 non-null    object
 5   No_Transactions         168 non-null    int64 
 6   No_Orders               168 non-null    int64 
 7   No_Payments             168 non-null    int64 
 8   Fraud                   168 non-null    bool  
dtypes: bool(1), int64(3), object(5)
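
As an extra sanity check (not in the original post), the two tables can be related through customerEmail; the transaction table has 623 rows against 168 customer rows, so each email is expected to cover several transactions.

# Unique emails per table; a gap between row count and unique count
# means the same email appears across many transactions.
print(data_transaction['customerEmail'].nunique(), len(data_transaction))
print(data_customer['customerEmail'].nunique(), len(data_customer))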

- Check for missing values using "missingno". Since the info() output above shows no nulls in either table, roughly 10% of the transaction cells are masked at random purely to demonstrate the plot.

import matplotlib.pyplot as plt
import missingno as msno
import numpy as np

# The raw transaction table has no nulls, so randomly mask ~10% of the cells
# to produce a frame with missing values for demonstration purposes.
data_nan = (
    data_transaction
    .copy()
    .mask(np.random.random(data_transaction.shape) < 0.1)
)

# Visualize the missingness pattern as a matrix plot.
msno.matrix(df=data_nan, color=(0.1, 0.6, 0.8), figsize=(10, 6))
plt.title('Transaction data when NA exists', fontsize=20)
plt.show()
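
The matrix plot gives a visual impression only; if a numeric summary is also wanted, the per-column missing ratio can be computed directly (a small addition of mine, not part of the original post).

# Fraction of masked values per column in the perturbed frame.
missing_ratio = data_nan.isna().mean().sort_values(ascending=False)
print(missing_ratio)

# missingno can also show column completeness as a bar chart.
msno.bar(data_nan, figsize=(10, 6), color=(0.1, 0.6, 0.8))
plt.show()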
2. Hypothesis Testing & Insight

(1) We want to see whether some customers create accounts with duplicate information (email, in this case).
To visualize the fraud transaction count per customerEmail, merge the transaction table with the customer table:
(
    data_transaction
    # Attach customer-level columns (including the Fraud label) to each transaction.
    .merge(
        data_customer,
        on='customerEmail',
        how='left'
    )
    # Keep only transactions belonging to customers flagged as fraudulent.
    .loc[lambda x: x['Fraud'] == True]
    # Count fraud transactions per email and plot them.
    ['customerEmail'].value_counts()
    .plot(kind='barh', figsize=(10, 10))
)
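
To complement the bar chart with numbers, the same merge can be used to count how many transactions each email accounts for, split by the Fraud label (an extra check added here, not part of the original analysis; merged and tx_per_email are helper names introduced for illustration).

# Merge once and count transactions per email, split by the Fraud label.
merged = data_transaction.merge(data_customer, on='customerEmail', how='left')
tx_per_email = (
    merged
    .groupby(['Fraud', 'customerEmail'])
    .size()
    .rename('n_transactions')
    .reset_index()
    .sort_values('n_transactions', ascending=False)
)
print(tx_per_email.head(10))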
(2) We want to see whether owners of uncommon email domains tend to have fraudulent transactions:
# Extract the domain part of each customer's email address.
data_customer['email_domain'] = data_customer['customerEmail'].str.split('@').str[1]

(
    data_customer
    # Count Fraud / non-Fraud customers per email domain.
    .groupby('email_domain')
    ['Fraud'].value_counts()
    .unstack()
    # Sort by the number of fraudulent customers, descending.
    .sort_values(by=True, ascending=False)
    # Keep only domains with at least one fraudulent customer.
    .loc[lambda x: x[True].notnull()]
)

Fraud                   False  True
email_domain
gmail.com                16.0   15.0
yahoo.com                23.0    6.0
hotmail.com              17.0    6.0
oconnor.com               NaN    1.0
patrick-decker.com        NaN    1.0
randall-pacheco.biz       NaN    1.0
rasmussen-alvarado.com    NaN    1.0
rivera-parker.info        NaN    1.0
rogers.com                NaN    1.0
saunders-rhodes.com       NaN    1.0
spears.biz                NaN    1.0
...

Normally, an email address should be unique per account, but in this dataset it is not.
Also, the majority of accounts with "made-up" domain names have transactions labeled as Fraud.
Thus, we will use email_domain as a new feature.
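
One way the domain could be turned into a model-ready feature, sketched under my own assumptions (the actual encoding is left for Part 2), is a simple flag for common public providers:

# Assumed set of "common" public providers; everything else is treated as a
# potentially made-up domain. This is an illustrative choice, not the final encoding.
common_domains = {'gmail.com', 'yahoo.com', 'hotmail.com'}
data_customer['is_common_domain'] = data_customer['email_domain'].isin(common_domains).astype(int)

# Quick sanity check: fraud rate within each group of the new flag.
print(data_customer.groupby('is_common_domain')['Fraud'].mean())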

Part 2 will cover Feature Selection and Modeling.