Example usage

In this document, we demonstrate how to use all of the functions in the dsci_310_group_11 package.

We’ll first load the example wine data set to demonstrate the functions of our package.

import pandas as pd
df = pd.read_csv("data/winequality-red.csv")
df

	fixed acidity	volatile acidity	citric acid	residual sugar	chlorides	free sulfur dioxide	total sulfur dioxide	density	pH	sulphates	alcohol	quality
0	7.4	0.700	0.00	1.9	0.076	11.0	34.0	0.99780	3.51	0.56	9.4	5
1	7.8	0.880	0.00	2.6	0.098	25.0	67.0	0.99680	3.20	0.68	9.8	5
2	7.8	0.760	0.04	2.3	0.092	15.0	54.0	0.99700	3.26	0.65	9.8	5
3	11.2	0.280	0.56	1.9	0.075	17.0	60.0	0.99800	3.16	0.58	9.8	6
4	7.4	0.700	0.00	1.9	0.076	11.0	34.0	0.99780	3.51	0.56	9.4	5
...	...	...	...	...	...	...	...	...	...	...	...	...
1594	6.2	0.600	0.08	2.0	0.090	32.0	44.0	0.99490	3.45	0.58	10.5	5
1595	5.9	0.550	0.10	2.2	0.062	39.0	51.0	0.99512	3.52	0.76	11.2	6
1596	6.3	0.510	0.13	2.3	0.076	29.0	40.0	0.99574	3.42	0.75	11.0	6
1597	5.9	0.645	0.12	2.0	0.075	32.0	44.0	0.99547	3.57	0.71	10.2	5
1598	6.0	0.310	0.47	3.6	0.067	18.0	42.0	0.99549	3.39	0.66	11.0	6

1599 rows × 12 columns

Preprocess the data

We can use the preprocessor() function to preprocess the wine data. This function creates a new “target” variable in the dataframe df and also performs a 70:30 split of the data into training and testing sets. The first argument is the wine data set, and the second argument determines which split is returned: 0 returns the training data and 1 returns the testing data.

from dsci_310_group_11_pkg.preprocess import preprocessor

training_data = preprocessor(df, 0)
testing_data = preprocessor(df, 1)

training_data
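
For readers who want a rough picture of what a function like this does, the following is a minimal sketch using pandas and scikit-learn. It is not the package's implementation: the name sketch_preprocessor, the quality threshold of 6, and the random seed are all assumptions made for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def sketch_preprocessor(df, which, threshold=6, seed=123):
    """Hypothetical stand-in for preprocessor(); not the package's code."""
    if which not in (0, 1):
        raise ValueError("which must be 0 (training) or 1 (testing)")
    out = df.copy()
    # Assumed rule: label a wine 1 when quality >= threshold, else 0.
    out["target"] = (out["quality"] >= threshold).astype(int)
    # 70:30 split; the fixed seed makes both calls use the same partition.
    train, test = train_test_split(out, test_size=0.3, random_state=seed)
    return train if which == 0 else test


# Small made-up data set, used only to exercise the sketch.
toy = pd.DataFrame({
    "alcohol": [9.4, 9.8, 10.5, 11.2, 11.0, 10.2, 9.5, 12.0, 10.0, 9.9],
    "quality": [5, 5, 5, 6, 6, 5, 7, 8, 4, 6],
})
toy_train = sketch_preprocessor(toy, 0)
toy_test = sketch_preprocessor(toy, 1)
print(len(toy_train), len(toy_test))
```

Fixing the seed matters in a design like this one: because the training and testing sets come from two separate calls, a varying seed could let the same row appear in both.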

Creating Pipeline

Once we have the training data set, we can create a pipeline object (a predictive model) using the pipe_build() function. Before calling it, we first need to separate the training data into x (the model features) and y (the target variable); these serve as the x_train and y_train arguments.

x = training_data.drop(columns=["target"])
y = training_data["target"]

pipe_build() takes three arguments:

The first argument is a string that specifies the model type. It can be “dummy”, “lr”, “svm”, “dtc”, or “bayes”.
- “dummy” stands for a dummy model
- “lr” stands for a logistic regression model
- “svm” stands for a support vector machine model
- “dtc” stands for a decision tree model
- “bayes” stands for a Gaussian Naive Bayes model
The second argument is the x_train data.
The third argument is the y_train data.

Let’s create a logistic regression model as an example:

from dsci_310_group_11_pkg.pipeline import pipe_build

logistic_model = pipe_build("lr", x, y)
logistic_model

Pipeline(steps=[('standardscaler', StandardScaler()),
                ('logisticregression',
                 LogisticRegression(C=0.01, class_weight='balanced',
                                    random_state=1234))])
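
The step names in the output above match what scikit-learn's make_pipeline generates, so a builder of this kind can be sketched roughly as follows. This is an illustration only: sketch_pipe_build and the MODELS table are hypothetical names, the logistic regression settings are copied from the printed output, and the settings for the other four models (and the use of a scaler for all of them) are assumptions.

```python
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Hypothetical lookup table from model-type string to an estimator factory.
MODELS = {
    "dummy": lambda: DummyClassifier(),
    "lr": lambda: LogisticRegression(
        C=0.01, class_weight="balanced", random_state=1234
    ),
    "svm": lambda: SVC(),  # assumed defaults
    "dtc": lambda: DecisionTreeClassifier(random_state=1234),  # assumed
    "bayes": lambda: GaussianNB(),
}


def sketch_pipe_build(model, x_train, y_train):
    """Hypothetical stand-in for pipe_build(); returns a fitted pipeline."""
    if model not in MODELS:
        raise ValueError(f"unknown model type: {model!r}")
    pipe = make_pipeline(StandardScaler(), MODELS[model]())
    pipe.fit(x_train, y_train)
    return pipe


# Made-up, trivially separable data to exercise the sketch.
toy_x = [[0.1, 1.0], [0.2, 0.9], [0.9, 0.2], [1.0, 0.1], [0.15, 0.95], [0.95, 0.05]]
toy_y = [0, 0, 1, 1, 0, 1]
toy_pipe = sketch_pipe_build("lr", toy_x, toy_y)
print([name for name, _ in toy_pipe.steps])
```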

Hyperparameter Tuning

We can also use hp_optimizer() to conduct hyperparameter tuning. Its input arguments are the same as those of pipe_build(). However, instead of returning a model object with default parameters, this function returns a dataframe of the mean training and cross-validation scores for each hyperparameter value tried. The following code uses the logistic regression model as an example.

from dsci_310_group_11_pkg.optimize import hp_optimizer

logistic_model_hp = hp_optimizer("lr", x, y)
logistic_model_hp

	C	mean_train_scores	mean_cv_scores
0	0.0001	0.544906	0.545127
1	0.0010	0.924262	0.920480
2	0.0100	0.989499	0.987492
3	0.1000	1.000000	1.000000
4	1.0000	1.000000	1.000000
5	10.0000	1.000000	1.000000
6	100.0000	1.000000	1.000000
7	1000.0000	1.000000	1.000000
8	10000.0000	1.000000	1.000000
9	100000.0000	1.000000	1.000000
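
A table of this shape can be produced with scikit-learn's cross_validate. The sketch below is one possible way to do it, not the package's code: sketch_lr_sweep is a hypothetical name, the grid of C values is inferred from the table above, and the 5-fold setting and max_iter value are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def sketch_lr_sweep(x_train, y_train, cv=5):
    """Hypothetical sweep over C for a scaled logistic regression."""
    rows = []
    for c in 10.0 ** np.arange(-4, 6):  # 0.0001, 0.001, ..., 100000
        pipe = make_pipeline(
            StandardScaler(),
            LogisticRegression(
                C=c, class_weight="balanced", random_state=1234, max_iter=1000
            ),
        )
        scores = cross_validate(
            pipe, x_train, y_train, cv=cv, return_train_score=True
        )
        rows.append({
            "C": c,
            "mean_train_scores": scores["train_score"].mean(),
            "mean_cv_scores": scores["test_score"].mean(),
        })
    return pd.DataFrame(rows)


# Synthetic data, used only to exercise the sketch.
toy_x, toy_y = make_classification(n_samples=60, random_state=0)
toy_results = sketch_lr_sweep(toy_x, toy_y)
print(toy_results)
```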

Correlation Table

This package contains various graphing functions. The correlation_table() function takes a dataframe and returns a correlation table for its variables.

from dsci_310_group_11_pkg.grapher import correlation_table

ctb = correlation_table(df)
ctb
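
How correlation_table() computes and renders its result is not shown here. As a point of comparison, pandas can produce a plain correlation matrix with DataFrame.corr(), which uses Pearson correlation by default. The small data frame below is made up for illustration.

```python
import pandas as pd

# Made-up columns: b rises with a, c falls as a rises.
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.0, 6.0, 8.0],
    "c": [4.0, 3.0, 2.0, 1.0],
})
toy_corr = toy.corr()  # pairwise Pearson correlation of the columns
print(toy_corr.round(2))
```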

Bar Chart

The bar_chart() function displays a simple bar chart of the counts of each value of the quality variable. It takes a dataframe object as input and returns a bar chart.

from dsci_310_group_11_pkg.grapher import bar_chart

bar = bar_chart(df)
bar
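
The numbers behind such a chart are simply the per-value counts of the quality column. A minimal pandas sketch of that tabulation, on made-up data and without any plotting, might look like this:

```python
import pandas as pd

# Made-up quality scores, used only for illustration.
toy = pd.DataFrame({"quality": [5, 5, 5, 6, 5, 6, 7]})

# Count how often each quality value occurs, ordered by quality.
quality_counts = toy["quality"].value_counts().sort_index()
print(quality_counts)
```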

Model Metrics Graph

The class_report() function takes a model and the corresponding data, and displays a heatmap of different metrics.

The first input is the model
The second input is the feature data
The third input is the label data

The following uses the training data as an example.

from dsci_310_group_11_pkg.grapher import class_report

class_report(logistic_model, x, y)

[Figure: heatmap of model metrics displayed by class_report()]
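
The exact metrics shown in the heatmap are not listed in this document. One common source for such a table is scikit-learn's classification_report, so the sketch below assumes that is what is being visualized and builds the underlying table only, without the plot. The labels and predictions are made up.

```python
import pandas as pd
from sklearn.metrics import classification_report

# Made-up true labels and predictions, used only for illustration.
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]

report = classification_report(y_true, y_pred, output_dict=True)
# "accuracy" is a single number rather than a row of metrics; drop it
# so that every remaining entry has the same four fields.
report.pop("accuracy", None)
metrics_table = pd.DataFrame(report).T
print(metrics_table.round(2))
```

A table like this, with one row per class and one column per metric, is the kind of input a heatmap function can colour directly.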

Decision Tree Visualization

The vis_tree() function displays a visual example of a decision tree for conceptual purposes. The tree's max_depth is limited to 3 so that the visualization remains interpretable. It takes the following inputs:

X_train - a dataframe object containing the prediction features.
y_train - a series object containing the target variable.

from dsci_310_group_11_pkg.grapher import vis_tree

vis_tree(x,y)

[Figure: decision tree visualization displayed by vis_tree()]
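
To see the effect of the depth limit without a plotting library, scikit-learn's export_text prints a fitted tree as indented text. The sketch below uses synthetic data and hypothetical feature names; it illustrates a depth-3 tree in general and is not the code behind vis_tree().

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic data, used only to exercise the sketch.
toy_x, toy_y = make_classification(
    n_samples=200, n_features=4, n_informative=3, n_redundant=0, random_state=0
)

# Capping max_depth at 3 keeps the tree small enough to read.
toy_tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(toy_x, toy_y)
tree_text = export_text(toy_tree, feature_names=["f0", "f1", "f2", "f3"])
print(tree_text)
```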

Logistic Coefficient Visualization

show_coefficients() visualizes the coefficients of a logistic regression model. It takes the following inputs and returns a dataframe that contains the coefficient for each feature:

pipe - a pipeline object containing scikit-learn model transformers, and a scikit-learn model.
X_train - a dataframe object containing prediction features.

from dsci_310_group_11_pkg.grapher import show_coefficients

show_coefficients(logistic_model, x)

	features	coefficients
11	quality	1.505752
10	alcohol	0.320946
9	sulphates	0.168452
8	pH	0.038052
0	fixed acidity	0.037339
2	citric acid	0.024162
3	residual sugar	0.023796
5	free sulfur dioxide	0.003958
4	chlorides	-0.081225
7	density	-0.106589
1	volatile acidity	-0.189994
6	total sulfur dioxide	-0.198800
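
A table like the one above can be assembled from the fitted model's coef_ attribute. The following is a minimal sketch under stated assumptions: sketch_show_coefficients is a hypothetical name, the pipeline step is assumed to be called "logisticregression" (the name make_pipeline generates), and the data is made up.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def sketch_show_coefficients(pipe, x_train):
    """Hypothetical stand-in: one row per feature, largest coefficient first."""
    model = pipe.named_steps["logisticregression"]
    table = pd.DataFrame({
        "features": x_train.columns,
        "coefficients": model.coef_[0],  # binary problem: a single row
    })
    return table.sort_values("coefficients", ascending=False)


# Made-up data: alcohol rises with the target, chlorides falls.
toy_x = pd.DataFrame({
    "alcohol": [9.0, 9.5, 10.0, 11.0, 12.0, 12.5],
    "chlorides": [0.10, 0.09, 0.09, 0.07, 0.06, 0.05],
})
toy_y = pd.Series([0, 0, 0, 1, 1, 1])
toy_pipe = make_pipeline(StandardScaler(), LogisticRegression()).fit(toy_x, toy_y)
toy_table = sketch_show_coefficients(toy_pipe, toy_x)
print(toy_table)
```

Because the features are standardized before fitting, the coefficient magnitudes are on a comparable scale across features.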

Prediction Visualization

show_correct() summarizes in a table how many of the chosen model's predictions are correct. It takes the following inputs:

pipe - a pipeline object containing scikit-learn model transformers, and a scikit-learn model.
x - a dataframe object containing the prediction features.
y - a series object containing the target variable.

The following uses the training data as an example.

from dsci_310_group_11_pkg.grapher import show_correct

show_correct(logistic_model, x, y)

correct
True     1112
False       7
Name: count, dtype: int64
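
The counts above can be reproduced in spirit by comparing predictions with the true labels and tallying the result. The sketch below is an illustration on made-up data with a trivial model; sketch_show_correct is a hypothetical name, not the package's function.

```python
import numpy as np
import pandas as pd
from sklearn.dummy import DummyClassifier


def sketch_show_correct(pipe, x, y):
    """Hypothetical stand-in: tally correct and incorrect predictions."""
    correct = pd.Series(pipe.predict(x) == np.asarray(y), name="correct")
    return correct.value_counts()


# Made-up data; the dummy model always predicts the majority class (0).
toy_x = [[0], [1], [2], [3]]
toy_y = [0, 0, 0, 1]
toy_model = DummyClassifier(strategy="most_frequent").fit(toy_x, toy_y)
toy_counts = sketch_show_correct(toy_model, toy_x, toy_y)
print(toy_counts)
```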

Model Comparison Visualization

compare_scores() takes one input, ‘lst’, and returns a bar chart comparing the accuracy scores of the ML models in the ‘lst’ list. The highlighted bar in the chart marks the highest score. The following example compares all five model types, scored on the training data.

from dsci_310_group_11_pkg.grapher import compare_scores

base = pipe_build('dummy', x, y)
lr = pipe_build('lr', x, y)
svm = pipe_build('svm', x, y)
dtc = pipe_build('dtc', x, y)
nb = pipe_build('bayes', x, y)

# Score the models
basescore = base.score(x, y)
lrscore = lr.score(x, y)
svcscore = svm.score(x, y)
dtscore = dtc.score(x, y)
nbscore = nb.score(x, y)

cscores = [basescore, lrscore, svcscore, dtscore, nbscore]

compare_scores(cscores)
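
The highlighting logic amounts to finding the position of the largest score. A plotting-free sketch of that step is shown below; sketch_highlight_best is a hypothetical helper, and the scores are made-up numbers rather than results from the models above.

```python
def sketch_highlight_best(labels, scores):
    """Hypothetical helper: flag the entry with the highest score."""
    if len(labels) != len(scores) or not scores:
        raise ValueError("labels and scores must be non-empty and equal in length")
    # Index of the first maximum; ties go to the earliest entry.
    best = max(range(len(scores)), key=scores.__getitem__)
    return [
        (label, score, i == best)
        for i, (label, score) in enumerate(zip(labels, scores))
    ]


toy_labels = ["dummy", "lr", "svm", "dtc", "bayes"]
toy_scores = [0.53, 0.91, 0.88, 0.95, 0.80]  # made-up, not real results
toy_bars = sketch_highlight_best(toy_labels, toy_scores)
for label, score, is_best in toy_bars:
    marker = "  <- highest" if is_best else ""
    print(f"{label:>5}  {score:.2f}{marker}")
```

Keeping the labels in the same order as the scores matters here, since the list passed to compare_scores() carries only the scores themselves.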