## Data Split for the "default of credit card clients" dataset

We first load the dataset from the .xls file (downloaded from https://archive.ics.uci.edu/ml/machine-learning-databases/00350/).

In [10]:
import pandas as pd
import numpy as np

# pandas provides a convenient reader for .xls files

In [11]:
data_df=pd.read_excel('default.xls')
data_df.dropna(inplace=True)

# the shape is (30001, 25): 30000 instances plus one row that holds the feature names,
# and 25 columns (the ID, 23 features, and the target)
# Note that data_df is a pandas DataFrame
print(data_df.shape)

# now we convert the data to a numpy array
data_np = data_df.to_numpy()

# the first row holds the names of the features
print(data_np[0])

# the feature-name row and the ID column are irrelevant to training/testing, so we remove them
data_np = data_np[1:,1:]

# let us randomly permute the dataset
np.random.seed(0) # fix the random seed for ease of marking
data_np = np.random.permutation(data_np)

(30001, 25)
['ID' 'LIMIT_BAL' 'SEX' 'EDUCATION' 'MARRIAGE' 'AGE' 'PAY_0' 'PAY_2'
 'PAY_3' 'PAY_4' 'PAY_5' 'PAY_6' 'BILL_AMT1' 'BILL_AMT2' 'BILL_AMT3'
 'BILL_AMT4' 'BILL_AMT5' 'BILL_AMT6' 'PAY_AMT1' 'PAY_AMT2' 'PAY_AMT3'
 'PAY_AMT4' 'PAY_AMT5' 'PAY_AMT6' 'default payment next month']
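
The shape (30001, 25) is one row larger than the 30000 clients because the row of feature names was read in as data. As a minimal sketch of another way to handle this, the first row can be promoted to column labels inside pandas. The frame `toy_df`, its placeholder column labels, and its values are made up for illustration and are not taken from the dataset:

```python
import pandas as pd

# Toy frame that mimics the situation above: the first data row holds the feature names.
# The column labels and values here are illustrative assumptions.
toy_df = pd.DataFrame(
    [["ID", "LIMIT_BAL", "default payment next month"],
     [1, 20000, 1],
     [2, 120000, 0]],
    columns=["Unnamed: 0", "X1", "Y"],
)

# Promote the first row to column labels and drop it from the data.
clean_df = toy_df.iloc[1:].reset_index(drop=True)
clean_df.columns = toy_df.iloc[0].tolist()

print(clean_df.shape)
print(list(clean_df.columns))
```

Passing `header=1` to `pd.read_excel` may achieve the same thing at load time, assuming the feature names sit on the second row of the sheet.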


Now we check whether the data is balanced with respect to the target value, i.e., default payment next month:

In [12]:
target = data_np[:,23]
print(target.sum())

6636
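
Summing the target column counts the positive samples because the target is coded as 0/1. A small sketch of the same check that also reports the positive ratio, using a hypothetical label vector `toy_target` rather than the real data:

```python
import numpy as np

# Hypothetical labels for illustration: 1 = default, 0 = no default.
toy_target = np.array([0, 1, 0, 0, 1, 0, 0, 0, 0, 0])

n_pos = int(toy_target.sum())         # number of positive samples
pos_ratio = n_pos / toy_target.size   # fraction of positive samples

print(n_pos, pos_ratio)
```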


Here we see that only 6636 out of 30000 clients (around 22%) default next month. We therefore split the data carefully so that this ratio is maintained across the training and test data.

Next we split the dataset into a training set (20000 instances) and a test set (10000 instances). The training set therefore needs (6636*2)/3=4424 positive samples and 20000-4424=15576 negative samples:

In [13]:
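train_np = []
test_np = []
pos_count = 0  # positives placed in the training set so far
neg_count = 0  # negatives placed in the training set so far
for sample in data_np:
    # cap both classes so the ratio holds regardless of the sample order
    if sample[23]==1 and pos_count<4424:
        train_np.append(sample)
        pos_count+=1
    elif sample[23]!=1 and neg_count<15576:
        train_np.append(sample)
        neg_count+=1
    else:
        # the training quota for this class is full
        test_np.append(sample)

train_np = np.array(train_np)
test_np = np.array(test_np)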
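
The same kind of quota-based split can also be written with index arrays instead of a Python loop. The following is an alternative sketch, not the method used above: the helper `stratified_split` and the array `toy` are hypothetical names, and the toy data (9 positives among 30 rows) is made up so that the 2:1 split is easy to check by hand.

```python
import numpy as np

def stratified_split(data, label_col, n_train_pos, n_train_neg):
    """Split rows so the training set gets fixed numbers of positives and negatives."""
    labels = data[:, label_col]
    pos_idx = np.flatnonzero(labels == 1)   # row indices of positive samples
    neg_idx = np.flatnonzero(labels != 1)   # row indices of negative samples
    # Take the first rows of each class (in the current row order) for training.
    train_idx = np.concatenate([pos_idx[:n_train_pos], neg_idx[:n_train_neg]])
    train_mask = np.zeros(len(data), dtype=bool)
    train_mask[train_idx] = True
    return data[train_mask], data[~train_mask]

# Toy data: one random feature column and one 0/1 label column.
rng = np.random.default_rng(0)
toy_labels = np.array([1] * 9 + [0] * 21)
toy = np.column_stack([rng.normal(size=30), toy_labels])
toy = rng.permutation(toy)  # shuffle the rows

toy_train, toy_test = stratified_split(toy, label_col=1, n_train_pos=6, n_train_neg=14)
print(toy_train.shape, toy_train[:, 1].sum())
print(toy_test.shape, toy_test[:, 1].sum())
```

Because the rows are selected with a boolean mask, both parts keep the order of the shuffled array.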

Then we verify that the split is correct:

In [15]:
print(train_np.shape)
print(train_np[:,23].sum())
print(test_np.shape)
print(test_np[:,23].sum())

(20000, 24)
4424
(10000, 24)
2212
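
As a quick arithmetic check on the counts printed above, the positive ratio can be compared across the training set, the test set, and the full dataset. The variable names are illustrative; the numbers are the ones reported in the output:

```python
# Counts taken from the output above.
train_pos, train_n = 4424, 20000
test_pos, test_n = 2212, 10000

train_ratio = train_pos / train_n
test_ratio = test_pos / test_n
overall_ratio = (train_pos + test_pos) / (train_n + test_n)

print(train_ratio, test_ratio, overall_ratio)
```

All three ratios equal 0.2212, so the split preserves the class balance of the full dataset.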


Finally, we save the data split using pickle:

In [16]:
import pickle
with open("train_test_split.pkl", "wb") as fh:
    data = (train_np, test_np)
    pickle.dump(data, fh)
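
To use the split later, the tuple can be read back with `pickle.load`. A minimal round-trip sketch, assuming small stand-in arrays and a temporary file (`toy_train`, `toy_test`, and `toy_split.pkl` are hypothetical and stand in for the real arrays and file):

```python
import os
import pickle
import tempfile

import numpy as np

# Stand-in arrays for illustration only.
toy_train = np.arange(6).reshape(3, 2)
toy_test = np.arange(4).reshape(2, 2)

# Write the pair to a temporary file, mirroring the save step above.
path = os.path.join(tempfile.mkdtemp(), "toy_split.pkl")
with open(path, "wb") as fh:
    pickle.dump((toy_train, toy_test), fh)

# Read it back; the tuple unpacks in the same order it was saved.
with open(path, "rb") as fh:
    loaded_train, loaded_test = pickle.load(fh)

print(loaded_train.shape, loaded_test.shape)
```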