Example usage

This document demonstrates how to use all of the functions in the dsci_310_group_11 package.

We’ll first load the example wine data set to demonstrate the functions of our package.

import pandas as pd
df = pd.read_csv("data/winequality-red.csv")
df
      fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  free sulfur dioxide  total sulfur dioxide  density    pH  sulphates  alcohol  quality
   0            7.4             0.700         0.00             1.9      0.076                 11.0                  34.0  0.99780  3.51       0.56      9.4        5
   1            7.8             0.880         0.00             2.6      0.098                 25.0                  67.0  0.99680  3.20       0.68      9.8        5
   2            7.8             0.760         0.04             2.3      0.092                 15.0                  54.0  0.99700  3.26       0.65      9.8        5
   3           11.2             0.280         0.56             1.9      0.075                 17.0                  60.0  0.99800  3.16       0.58      9.8        6
   4            7.4             0.700         0.00             1.9      0.076                 11.0                  34.0  0.99780  3.51       0.56      9.4        5
 ...            ...               ...          ...             ...        ...                  ...                   ...      ...   ...        ...      ...      ...
1594            6.2             0.600         0.08             2.0      0.090                 32.0                  44.0  0.99490  3.45       0.58     10.5        5
1595            5.9             0.550         0.10             2.2      0.062                 39.0                  51.0  0.99512  3.52       0.76     11.2        6
1596            6.3             0.510         0.13             2.3      0.076                 29.0                  40.0  0.99574  3.42       0.75     11.0        6
1597            5.9             0.645         0.12             2.0      0.075                 32.0                  44.0  0.99547  3.57       0.71     10.2        5
1598            6.0             0.310         0.47             3.6      0.067                 18.0                  42.0  0.99549  3.39       0.66     11.0        6

1599 rows × 12 columns

Preprocess the data

We can use the preprocessor() function to preprocess the wine data. This function not only adds a new “target” variable to the dataframe df, but also performs a 70:30 train/test split. The first argument is the wine data set, and the second argument determines which part of the split is returned: 0 returns the training data and 1 returns the testing data.

from dsci_310_group_11_pkg.preprocess import preprocessor

training_data = preprocessor(df, 0)
testing_data = preprocessor(df, 1)

training_data
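
For readers curious about what a function like this might do internally, here is a minimal sketch. It is not the package's implementation: the helper name split_with_target, the quality threshold of 6, and the random seed are assumptions made for illustration. Only the 70:30 ratio and the 0/1 selector come from the description above.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def split_with_target(data: pd.DataFrame, which: int, seed: int = 1234) -> pd.DataFrame:
    """Hypothetical stand-in for preprocessor(); not the package's code."""
    out = data.copy()
    # Assumption: a wine is labelled 1 when its quality is at least 6.
    out["target"] = (out["quality"] >= 6).astype(int)
    # A fixed seed makes the two calls return complementary parts of one split.
    train, test = train_test_split(out, test_size=0.3, random_state=seed)
    return train if which == 0 else test


# Tiny made-up frame, used only to show the resulting sizes.
toy = pd.DataFrame({
    "alcohol": [9.4, 9.8, 10.5, 11.2, 11.0, 10.2, 9.5, 12.0, 10.0, 9.9],
    "quality": [5, 5, 5, 6, 6, 5, 7, 6, 5, 4],
})
toy_train = split_with_target(toy, 0)
toy_test = split_with_target(toy, 1)
print(len(toy_train), len(toy_test))
```

With a fixed seed, calling the helper once with 0 and once with 1 yields two disjoint subsets that together cover the original rows.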

Creating Pipeline

Once we have the training data set, we can create a pipeline object (a predictive model) using the pipe_build() function. Before calling it, we need to split the training data into x (the model features, passed as x_train) and y (the target variable, passed as y_train).

x = training_data.drop(columns = ["target"])
y = training_data["target"]

pipe_build() takes three arguments:

  • The first argument is a string that selects the model type. It can be “dummy”, “lr”, “svm”, “dtc”, or “bayes”.

    • “dummy” stands for a dummy model

    • “lr” stands for a logistic regression model

    • “svm” stands for a support vector machine model

    • “dtc” stands for a decision tree model

    • “bayes” stands for a Gaussian Naive Bayes model

  • The second argument is the x_train data (the features)

  • The third argument is the y_train data (the target)

Let’s create a logistic regression model as an example:

from dsci_310_group_11_pkg.pipeline import pipe_build

logistic_model = pipe_build("lr", x, y)
logistic_model
Pipeline(steps=[('standardscaler', StandardScaler()),
                ('logisticregression',
                 LogisticRegression(C=0.01, class_weight='balanced',
                                    random_state=1234))])
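
The printed pipeline can be reproduced with plain scikit-learn, which may help when reading the output. The following is a sketch under assumptions, not the body of pipe_build(): the hyperparameters are copied from the output above, the data is synthetic, and the pipeline is fitted here only because the later examples call .score() directly on the objects that pipe_build() returns.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hyperparameters copied from the pipeline printed above.
sketch_pipe = make_pipeline(
    StandardScaler(),
    LogisticRegression(C=0.01, class_weight="balanced", random_state=1234),
)

# Synthetic stand-in data so the snippet runs without the wine file.
rng = np.random.default_rng(0)
toy_x = rng.normal(size=(40, 3))
toy_y = (toy_x[:, 0] > 0).astype(int)
sketch_pipe.fit(toy_x, toy_y)

# make_pipeline names each step after its lower-cased class name.
print([name for name, _ in sketch_pipe.steps])
```

The step names produced by make_pipeline match the 'standardscaler' and 'logisticregression' labels in the output above.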

Hyperparameter Tuning

We can also use hp_optimizer() to conduct hyperparameter tuning. Its input arguments are the same as those of pipe_build(). However, instead of returning a model object with default parameters, it returns a dataframe of the mean training and cross-validation scores for each hyperparameter value tried. The following code uses the logistic regression model as an example.

from dsci_310_group_11_pkg.optimize import hp_optimizer

logistic_model_hp = hp_optimizer("lr", x, y)
logistic_model_hp
             C  mean_train_scores  mean_cv_scores
0       0.0001           0.544906        0.545127
1       0.0010           0.924262        0.920480
2       0.0100           0.989499        0.987492
3       0.1000           1.000000        1.000000
4       1.0000           1.000000        1.000000
5      10.0000           1.000000        1.000000
6     100.0000           1.000000        1.000000
7    1000.0000           1.000000        1.000000
8   10000.0000           1.000000        1.000000
9  100000.0000           1.000000        1.000000
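
A table of this shape can be produced with scikit-learn's cross_validate. The sketch below is one plausible way to do it and is not taken from the package: the 5-fold setting, the synthetic data, and the variable names are assumptions. The grid of C values and the column names mirror the output above.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data so the snippet runs without the wine file.
toy_x, toy_y = make_classification(n_samples=200, n_features=5, random_state=0)

rows = []
for c in np.logspace(-4, 5, 10):  # 0.0001, 0.001, ..., 100000
    pipe = make_pipeline(
        StandardScaler(),
        LogisticRegression(C=c, class_weight="balanced", random_state=1234),
    )
    # return_train_score=True adds the training scores next to the CV scores.
    scores = cross_validate(pipe, toy_x, toy_y, cv=5, return_train_score=True)
    rows.append({
        "C": c,
        "mean_train_scores": scores["train_score"].mean(),
        "mean_cv_scores": scores["test_score"].mean(),
    })

sweep = pd.DataFrame(rows)
print(sweep)
```

Comparing the two score columns row by row is a quick way to spot overfitting: a training score far above the cross-validation score suggests the model is too flexible for that value of C.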

Correlation Table

This package contains various graphing functions. The correlation_table() function takes a dataframe and returns a table of the pairwise correlations between its variables.

from dsci_310_group_11_pkg.grapher import correlation_table
ctb = correlation_table(df)
ctb
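
The output of correlation_table() is not shown here. As a rough idea of the numbers such a table contains, the sketch below computes pairwise Pearson correlations with pandas on three columns from five of the rows displayed at the top of this page. Whether the package uses DataFrame.corr() internally, or styles the result, is an assumption.

```python
import pandas as pd

# Rows 0, 1, 1594, 1595 and 1596 of the table shown at the top of the page.
toy = pd.DataFrame({
    "alcohol": [9.4, 9.8, 10.5, 11.2, 11.0],
    "pH": [3.51, 3.20, 3.45, 3.52, 3.42],
    "quality": [5, 5, 5, 6, 6],
})

corr = toy.corr()  # Pearson correlation is the pandas default
print(corr.round(2))
```

The result is a square, symmetric table with ones on the diagonal, since every variable is perfectly correlated with itself.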

Bar Chart

The bar_chart() function displays a simple bar chart of how often each value of the quality variable occurs. It takes a dataframe object as input and returns a bar chart.

from dsci_310_group_11_pkg.grapher import bar_chart

bar = bar_chart(df)
bar
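
The chart itself is not reproduced here. The counts that a chart like this displays can be computed with pandas alone, as in the sketch below, which uses the quality values of the ten rows shown at the top of this page. The plotting library that bar_chart() uses is not stated, so the sketch stops at the counts.

```python
import pandas as pd

# Quality values of rows 0-4 and 1594-1598 from the table at the top of the page.
quality = pd.Series([5, 5, 5, 6, 5, 5, 6, 6, 5, 6], name="quality")

# One count per distinct quality score, ordered by score.
counts = quality.value_counts().sort_index()
print(counts)
```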

Model Metrics Graph

The class_report() function takes a model and the corresponding data, and displays a heatmap of different metrics.

  • The first input is the model

  • The second input is the features data

  • The third input is the label data

The following example uses the training data.

from dsci_310_group_11_pkg.grapher import class_report

class_report(logistic_model, x, y)
[Figure: heatmap of the metrics reported by class_report() for logistic_model on the training data]
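
The numbers behind a heatmap like this can be obtained from scikit-learn's classification_report. The sketch below is an assumption about the kind of table being plotted, not the package's code, and the labels are made up for illustration.

```python
import pandas as pd
from sklearn.metrics import classification_report

# Made-up true and predicted labels, for illustration only.
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 0, 1, 1, 1, 0, 1, 0]

# output_dict=True returns nested dictionaries instead of a formatted string.
report = pd.DataFrame(classification_report(y_true, y_pred, output_dict=True)).T
print(report.round(2))
```

Each row is a class or an average, and each column is a metric (precision, recall, f1-score, support), which is a natural layout for a heatmap.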

Decision Tree Visualization

The vis_tree() function displays a visual example of a decision tree for conceptual purposes. The max_depth parameter is limited to 3 so that the visualization stays interpretable. It takes the following inputs:

  1. X_train - a dataframe object containing prediction features.

  2. y_train - a series object containing target variables.

from dsci_310_group_11_pkg.grapher import vis_tree

vis_tree(x, y)
[Figure: decision tree visualization produced by vis_tree(), with max_depth limited to 3]
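
To see how limiting the depth keeps a tree readable, the sketch below fits a depth-3 tree with scikit-learn and prints it as text. The synthetic data, the feature names, and the use of export_text in place of a graphical plot are assumptions; the package's own plotting code is not shown in this document.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data so the snippet runs without the wine file.
rng = np.random.default_rng(0)
toy_x = rng.normal(size=(100, 3))
toy_y = (toy_x[:, 0] + toy_x[:, 1] > 0).astype(int)

# max_depth=3 caps the tree at three levels of splits.
tree = DecisionTreeClassifier(max_depth=3, random_state=1234).fit(toy_x, toy_y)
print(export_text(tree, feature_names=["f0", "f1", "f2"]))
```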

Logistic Coefficient Visualization

show_coefficients() visualizes the coefficients of a logistic regression model. It takes the following inputs and returns a dataframe that contains the coefficient for each feature:

  1. pipe - a pipeline object containing scikit-learn model transformers, and a scikit-learn model.

  2. X_train - a dataframe object containing prediction features.

from dsci_310_group_11_pkg.grapher import show_coefficients

show_coefficients(logistic_model, x)
                features  coefficients
11               quality      1.505752
10               alcohol      0.320946
 9             sulphates      0.168452
 8                    pH      0.038052
 0         fixed acidity      0.037339
 2           citric acid      0.024162
 3        residual sugar      0.023796
 5   free sulfur dioxide      0.003958
 4             chlorides     -0.081225
 7               density     -0.106589
 1      volatile acidity     -0.189994
 6  total sulfur dioxide     -0.198800
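
A table like the one above can be assembled directly from a fitted scikit-learn pipeline. The following sketch assumes the final step is named 'logisticregression', as in the pipeline printed earlier, and uses synthetic data. It illustrates the idea and is not the package's implementation.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: "a" pushes towards class 1, "b" towards class 0.
rng = np.random.default_rng(0)
toy_x = pd.DataFrame(rng.normal(size=(60, 3)), columns=["a", "b", "c"])
toy_y = (toy_x["a"] - toy_x["b"] > 0).astype(int)

pipe = make_pipeline(StandardScaler(), LogisticRegression()).fit(toy_x, toy_y)

# coef_ has shape (1, n_features) for a binary problem, so take row 0.
coefs = pd.DataFrame({
    "features": toy_x.columns,
    "coefficients": pipe.named_steps["logisticregression"].coef_[0],
}).sort_values("coefficients", ascending=False)
print(coefs)
```

In the table above, quality has by far the largest coefficient. If the target is derived from quality, which this page does not state, keeping quality among the features would leak the label into the model and could explain the perfect scores in the hyperparameter tuning table.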

Prediction Visualization

show_correct() summarizes the predictions of the chosen predictive model as a table counting correct and incorrect predictions. It takes the following inputs:

  1. pipe - a pipeline object containing scikit-learn model transformers, and a scikit-learn model.

  2. x - a dataframe object containing prediction features.

  3. y - a series object containing target variables.

The following example uses the training data.

from dsci_310_group_11_pkg.grapher import show_correct

show_correct(logistic_model, x, y)
correct
True     1112
False       7
Name: count, dtype: int64
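
The counts above amount to comparing predictions with the true labels element by element. The sketch below shows that comparison with pandas on made-up labels; the variable names are illustrative and not part of the package.

```python
import pandas as pd

# Made-up true and predicted labels, for illustration only.
y_true = pd.Series([1, 0, 1, 1, 0, 1])
y_pred = pd.Series([1, 0, 0, 1, 0, 1])

# True where the prediction matches the label, False otherwise.
correct = (y_true == y_pred).rename("correct")
print(correct.value_counts())
```

Dividing the number of True entries by the total gives the accuracy, so the output above corresponds to 1112 correct predictions out of 1119 training rows.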

Model Comparison Visualization

compare_scores() takes one input, ‘lst’, and returns a bar chart comparing the accuracy scores in the ‘lst’ list, one per ML model, with the highest score highlighted. The following example compares the dummy, logistic regression, support vector machine, decision tree, and Gaussian Naive Bayes models.

from dsci_310_group_11_pkg.grapher import compare_scores

base = pipe_build('dummy', x, y)
lr = pipe_build('lr', x, y)
svm = pipe_build('svm', x, y)
dtc = pipe_build('dtc', x, y)
nb = pipe_build('bayes', x, y)

# Score the models
basescore = base.score(x, y)
lrscore = lr.score(x, y)
svcscore = svm.score(x, y)
dtscore = dtc.score(x, y)
nbscore = nb.score(x, y)

cscores = [basescore, lrscore, svcscore, dtscore, nbscore]

compare_scores(cscores)
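
Because compare_scores() receives only a list of numbers, it can help to keep each score paired with its model name when deciding which model performed best. The sketch below does this with a dictionary, using two plain scikit-learn models on synthetic data. The names are illustrative, and, as in the example above, the scores are training accuracies rather than held-out scores.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, perfectly separable data: class 1 when the feature is at least 10.
toy_x = np.arange(20, dtype=float).reshape(-1, 1)
toy_y = (toy_x[:, 0] >= 10).astype(int)

models = {
    "dummy": DummyClassifier().fit(toy_x, toy_y),
    "lr": make_pipeline(StandardScaler(), LogisticRegression()).fit(toy_x, toy_y),
}

# Accuracy of each model on the data it was trained on.
scores = {name: model.score(toy_x, toy_y) for name, model in models.items()}
best = max(scores, key=scores.get)
print(scores, best)
```

Scoring on a held-out set, such as the testing data returned by preprocessor(df, 1), would give a less optimistic comparison.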