Hi there, okay this opening might be a little cheesy, but here we go. I always wanted to summarize my study, work and life in a way, so that when I look back I can see some mistakes I’ve already made, and achievements I’ve built.

As you can see from the homepage title, it’s mainly focusing on data, a.k.a., my job related subjects. Like how Computer Science world was generous about coding sharing, I am going to use this platform to share how I process data, visualize it, and finally derive business insights from it. I’m still a learner, but I will share the path I’ve walked during past 4 years. This sharing platform will also help me to boost my memories and step further on my data journey. Please interact with me by subscribing to this site or generously leaving a comment down below (wow, I do sound like a YouTuber).

I guess every introduction is more likely ended up with some personal fact. I have one cat, Mr. White, adopted in 2014, Ohio, now we’re here in cozy California. Mr. White is doing fine, I’m as well. I like to focus on my personal life, which means bedroom makeover is most likely my weekend activity. I do dance, but usually from Jun-Dec, it’s been like this for three years. I would always find myself being occupied by life for the first half of the year, and the second half, suddenly I’m all free. But when I do dance, I take it seriously, maybe sometimes too serious, I dance 15 hours per week. You’ve might heard of urban dance, if not, it’s likely to be in your college dance team competition. Like a normal urban dancer, I joined a dance team in my school UC Davis.

As Volunteer Life, oh, where should I start with. I volunteered in OSU hospital when I was still in pre-med program back then. During my time in Ohio, I was also the president of Columbus Tzuching Affiliation Club in OSU, which is a volunteer club and visit nursing home and support local disaster relieve. Mr. White was adopted when I was volunteer in an animal shelter. It’s always reminds me the truth of living is giving, and by ‘giving’ you will find your ‘receiving’.

Funnel and Subscription Retention Analysis

1. Funnel Analysis

Chose which metrics:

  • Click through rate (Growth):
    • (i.e. how many customer clicked on it after seeing the ad)
      • # clicked / # shown
    • Goal: bring customer to the site.
    • Pros:
      • able to identify demand from users
      • allow estimate to test and improve ad features to maximize CTR (since it’s only take account user behavior after seeing ad)
    • Cons:
      • No information of the 2nd layer and 3rd layer behavior, thus does not know what user does once they come to the site
      • Cost of bringing user to site is not considered
      • User conversion remains unsolved, even with high CTR
      • (might be click-bait ads)
  • Conversion Rate (Growth and Product)
    • (i.e. how many customer converted after seeing the ad)
    • # converted / # shown
    • Pros:
      • the final goal of an ad is to get customer, this metric is represents it
    • Cons:
      • No info about cost
      • Highly depend on product: Ad could be great, fair product, low conversion
      • Conversion might happen after a long time, even after visit multiple times(it’s hard to link conversion to click)
  • Profit, revenue from conversions (Business / Product)
    • Not useful for growth, used by large companies (main goal: being profitable)
    • total_revenue – clicked * avg_cost_per_click
    • Pros:
      • Business driven
    • Cons:
      • Profit might happen after a long time(hard to link profit to click)
      • Only consider users after they clicked on ad
      • No information about top funnel (people see ad did not click)

a. Data:

Two format of data is available for funnel analysis.

  1. each row represents a user visited website
    • column: user_id, layer1, layer2, layer3, date
    • Value: layer1, layer2, layer3 is binary
      • if one user was only able to get to layer 2, then it’s 1, 1, 0
  2. each row represents a campaign or ad
    • column: ad, layer1 , layer2, layer3, date
    • Value: layer1, layer2, layer3 is continuous variable indicates total amount of user able to get to that layer on that day

b. Steps:

  • data preparation: merge / logic check
    • note: always check with the flow of funnel
    • i.e. layer1 > layer2 > layer3
  • plot a total count of visit vs date
    • for each campaign / ads/ device/ feature you are interested in
    • check for sudden drop, it’s usually a bug, or error in data collection
      • if count after sudden drop went back to normal:
        • remove those drops using count > 0
      • if count after sudden drop do not went back to normal:
        • bug need to fix
# some useful code

# 1. plot date vs layer, hue = interested_feature
data = data.query('layer1 > 0')
g=sns.lineplot(x="date", y="layer1",hue="interested_feature", data=data, legend="")
g.xaxis.set_major_formatter(mdates.DateFormatter('%b %d'))

# 2. plot date vs multiple layers, hue = interested_feature
#funnel as a time series
funnel = data.groupby(['date', 'interested_feature']).apply(
                    lambda x: pandas.Series({
                            'to_layer2' : x['layer2'].mean()/x['layer1'].mean(),
                            'to_layer3': x['layer3'].mean()/x['layer2'].mean(),
                            'to_layer4': x['layer4'].mean()/x['layer3'].mean()
# Make the dataset long for the plot
funnel = pandas.melt(funnel,id_vars=['date', 'interested_feature'])
# Plot it
# the row will be to_layer2, 3, 4
g = sns.FacetGrid(funnel, hue="interested_feature", row="variable", aspect=4, sharey=False)
g.map(sns.lineplot, "date", "value")
g.set_axis_labels("", "Conversion Rate")
plt.legend(loc='upper right')

  • draw business insights:
    • Note: sometimes difference might come from the other feature
      • i.e. when grouping, group by 2 features to investigate where the difference come from.
      • Remember, a feature could always be a proxy for the other
test = stats.ttest_ind(data.loc[data['interested_feautre'] == 'level1']['layer3'],
                       data.loc[data['interested_feautre'] != 'level1']['layer3'], equal_var=False)
  • Some useful index:
    • CTR: clicked number / shown number
    • Profit: Total Revenue – Clicked * Average cost per click
    • Count for layer interested
  • Some useful model:
    • RF for business insights
    • Time series for future prediction

c. General rule:

  • More mobile user than desktop
    • Growing mobile should be priority for company
  •  If one device has lower conversion rate:
    • Product
      • Product team look into the difference among different version and see if UI/UX can be improved
    • Marketing
      • Different platform may attract different users
      • Maybe a proxy for user distribution
      • If a platform has higher conversion rate but with low amount of user:
        • Maybe it’s only attracting power users, who is likely to convert
      • Note:
        • this will be more reasonable
        • if one platform has lower absolute amount of users
      • Solution:
        • Collect more data about users,
        • find if user distribution is the same across the devices
          • Same: product problem
          •  Not same: not product problem
  •  Sudden drop of Conversion Rate:
    • Bug / Bad product changes
  •  Expected Bouncer from home page to search page: 0.5
    • If rate is not reasonable for Search page to payment page:
    • Look into the characteristics of search queries
    • See if highly searched item is not available in our site
    • Understand lack of supply compared to user demand.

2. Subscription Retention Problem

  • Data:
    • Pick a interested billing cycling
    • Collect user information subscribed in that same billing cycling
    • Column: user_id, sign_up_date, service_rate, ….., billing cycles
    • Value: billing cycles is continuous, indicates how many billing cycle that user stays, a.k.a., how long that user subscribed to our service.
  • Terms:
    • Retention Rate for a billing cycle :
      • How many people is still subscript to service at the end of that billing cycle
    • Life Time Value (LTV)
      • Can be defined as one year average customer value
        • i.e. for data less than one year and model is really good,
        • Predict monthly retention
        • Sum it up for 12 month, get the yearly average customer value
        • (Note: if it’s subscription retention problem, need to add one month fee after sum)
  • Insights:
    • Build a table with interested feature + retention rate + count + revenue
      • Retention rate: percentage of current active user
      •  Count: total amount of  user
      • Revenue: average of customer revenue (avg(price * month for each customer))
  • Plot with y= i or ii or iii, x = interested feature, hue = different pricing
    • Focus on Revenue plot:
      • This will give general idea of LTV for each level
      • See which level is doing the best, and pay attention to abnormal levels
    • Combine with retention rate plot:
      • Is a particular level is doing well because of a high retention rate?
  • Goal:
    • Shift users towards the optimal price point ,
    • Thus increase LTV, revenue gain, more click on ads, grow faster
  • Solution
    • UI design:
      • Change the UI so that, for each level, the optimal price (highest LTV) is more prominent
    • Make the best price has most attractive features:
      • Look into how users from each level are using the product
      • Figure out the tool features they like the most,
        • So that best LTV price has the most attractive tool for each level
    • Coupon:Targeted / regional coupons
    • Example:
      • Levels with low count but high LTV and retention_rate:
      • Big opportunity, try to increase those customer
      • (discount, campaign)

Random Forest — Business Insights

Draw Business Insights from RF

1. Var Imp:

  • Look at the rank of important variables, if the top one are the least actionable variable, meaning that it’s impossible for company to change that variable, delete it and re-build RF
  • check whether the top variable are continuous or categorical variable
    • continuous variables tend to show up at the top in RF variable importance plots.
    • If a categorical variables shows up at the top, it usually means it’s really important

2. PDP plot:

  • For categorical features with multiple levels:
    • Always remember there is a base level that was not plotted here
    • If all level are high positive, that means all those levels have high values compare to the base level, which means the base level has lowest outcome value
  • For binary features:
    • The plot is usually straight forward
  • For continuous:
    • Check the trend, and make a division
    • i.e. people with more than 70k income (feature) tends to have higher success rate (outcome value)

3. Build a simple DT to check 2 or 3 important segments

import graphviz
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import export_graphviz
from graphviz import Source
tree = DecisionTreeClassifier( max_depth=2,class_weight={0:1, 1:10}, min_impurity_decrease = 0.001)
tree.fit(train.drop(['outcome'], axis=1), train['outcome'])
#visualize it
export_graphviz(tree, out_file="tree_conversion.dot", feature_names=train.drop(['outcome'], axis=1).columns, proportion=True, rotate=True)
with open("decision tree.dot") as f:
    dot_graph = f.read()
s = Source.from_file("decision tree.dot")
  • Each nodes has 4 values:
    • The tree split
    • Gini index:
      • Represent impurity of the node, 0.5 the worst
      • The average weighted Gini Impurity decreases as we move down the tree
      • 0 means perfect classification, best possible value
      • It’s the probability that randomly chosen sample in a node would be incorrectly labeled if it was labeled by the distribution of samples in the node.
    • Samples:
      • Proportion of events in that node, the higher the better
      • It means that node is very important capture many people
    • Value:
      • Proportion of class 0 and class 1 event
  • If a variable is in the tree throughout all levels,
    • it probably have information about the other features as well
    • Note: it’s often that one variable is way more important than the others, this happens because it’s highly correlated with the other variables, try to get to the bottom to those relationship between the most important var and the others. Or try to remove that feature and see which variable starts to matter.
    • Plot important feature vs outcome, to investigate pattern of how outcome was influenced by that feature

Random Forest — Method and Application (Python)

Advantage of RF:

  1. Only little time is needed for optimization (the default param are good enough)
  2. Strong with outliers, correlated variables
  3. For continuous variables, it’s able to segmentize it


  1. Create a bootstrapped dataset (Sample with replacement)
  2. Create a decision tree using the bootstrapped dataset
    But only use a random subset of variables at each split
    i.e. in each split, randomly consider a subset left-over variables
    that are not selected by the previous split
  3. Repeat above step to have 100 tree
  4. Prediction:
    classify a observation to a class that has the most vote by 100 tree result
  5. How to validate? OOB out-of-bag
    i. Run obs in OOB to see if trees classify it right or not, majority vote wins
    ii. The proportion of OOB samples that were incorrectly classified is the OOB error
  6. Now since we know how to validate RF, We can use this to choose number of variable to consider in step 2. Test using different number of variables, and compare OOB error
    Note: typically start by using sqrt(num of var) and try few above or below that value.

Step for RF Application:

  • a. Build Random Forest:
    1. create dummy for categorical var
    2. split into test and train
    3. train model
    4. get accuracy and cofusion matrix
    5. checking overfitting
      • (by comparing OOB error rate bt test and train)
    6. if OOB error and test error are similar, not overfitting
    7. (for Fraud) build ROC and look for possible cut-off point
      • max(1-class1 error(FP) – class0 error(FN))
  • b. Plot variable importance for insights of each variable
    1. Plot PDP (Partial Dependence Plots)
    2. for insights of each levels for each variable

a. Build Random Forest:

# useful packages
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from numpy.cĀ·ore.umath_tests import inner1d
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
#dummy variables for the categorical ones
data_dummy = pd.get_dummies(data, drop_first=True)
#split into train and test to avoid overfitting
train, test = train_test_split(data_dummy , test_size = 0.34)
#build the model
rf = RandomForestClassifier(n_estimators=100, max_features=3, oob_score=True)
rf.fit(train.drop('outcome_column', axis=1), train['outcome_column'])
#let's print OOB accuracy and confusion matrix
print("OOB accuracy is", rf.oob_score_, "\n", 
"OOB Confusion Matrix", "\n",
pd.DataFrame(confusion_matrix(train['converted'], rf.oob_decision_function_[:,1].round(), labels=[0, 1])))
#and let's print test accuracy and confusion matrix
"Test accuracy is", rf.score(test.drop('outcome_column', axis=1),
"Test Set Confusion Matrix", 
                              rf.predict(test.drop('outcome_column', axis=1)),
                              labels=[0, 1]))

Since Accuracy for Test and Train are similar, we are not over-fitting

Question 1 : If response variable is continuous, how to define accuracy?
Answer: Change accuracy standard

If the prediction is within 25% of the actual value, we say it’s predicting right
I.e. if a given person salary is 100K,
we consider the model to predict correctly if the prediction is within 25K.
We can look at this as a sort of accuracy when the label is continuous.

accuracy_25pct =  ((rf.predict(test.drop('outcome_column',
print("We are within 25% of the actual outcome in ",
      accuracy_25pct.round(2)*100, "% of the cases", sep="")
Question 2: How to know the model is actually learning things
Answer: (if so, insights generated from RF will be fairly reliable, and for sure directionally true)
print(np.percentile(train['outcome_column'], np.arange(0, 100, 10))) 

b. Variable Importance (check each variable)

  1. Var Imp is useful to determine whether rebuild is needed.
    is the most important var the least actionable? (i.e. total page visited by user)
    • if so, drop that variable from data and refit RF
  2. continuous variable tends to be very important in RF, if categorical varaible stands on the top of important list, means it’s a really important variable
  3. If two variable is likely correlated, check with pearson correlation. RF not going to pick same information twice, thus robust to correlated variable, that’s why it’s so popular.
# Var Imp
# rf is the random forest model we previously built
feat_importances = pd.Series(rf.feature_importances_,
                             index=train.drop('outcome_column', axis=1).columns)

# PDP for column1, which has 3 levels ('level1', 'level2', 'level3')
from pdpbox import pdp, info_plots
pdp_iso = pdp.pdp_isolate( model=rf, 
                          dataset=train.drop(['outcome_column'], axis=1),      
                          model_features=list(train.drop(['outcome_column'], axis=1)), 
                          feature=['level1', 'level2', 'level3'], 
pdp_dataset = pd.Series(pdp_iso.pdp, index=pdp_iso.display_columns)
pdp_dataset.sort_values(ascending=False).plot(kind='bar', title='column1')

# Pearson Corr
from scipy.stats import pearsonr
print("Correlation between A and B is:", 
      round(pearsonr(data.A, data.B)[0],2))

Linear Model in Python

Basic Work Flow:

  1. check distribution for each variable for assumptions
  2. check for outliers and any transformation need to perform (log?)
  3. fit a linear regression model
  4. Analyze model fitting (R square)
  5. predict (un-transform the result if transformation is performed)
# useful packages
import pandas as pd
import json
import statistics
import urllib
import numpy as np
import scipy
import matplotlib.pyplot as plt
# assume features are log-trandformed
X = data["feature1", "feature2"]
X = sm.add_constant(X)
y = data["outcome"]

model = sm.OLS(y, X)
result = model.fit()

print("R Squared is:", round(result.rsquared,3))

Xpred = np.log(range(8, 11))
Xpred = sm.add_constant(Xpred)
print("Predicted results for 10 is :",
  round(np.exp(result.predict(Xpred))[2], 3))

EDA and Feature Engineering

# useful packages
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import rcParams
import warnings

Data Preparation

Before landing a model for optimization or recommendation model, we need to make sure our data is in “ready-to-go” status. Here, I summarized some ways to clean data for future reference.

  1. Descriptive Stat
  2. Query and Merge
  3. Group and Plot Data
  4. fill na, replace and assign values
  5. Data Transformation on Column (Log, datetime)
  6. Check Logic
  7. Import Json, parse url and Export File, API

1. Descriptive Stat

data.describe(include = "all")

this will provide basic distribution of each variable. We can use this to observe some extreme or unreasonable values, and process it before further investigations.

# some useful command to get the general overview of a given dataset
# first 5 rows of a dataset
# dimension of a dataset
# return the first observation on "source column"
# change column name using index
data.columns.values[index] = "new_col_0_name"
# change column name using name 
data.rename(columns = {'old_name':'new_name'}, inplace = True)

2.Query and Merge

data[data["column"] > criteria_value]
data.loc[data.column > criteria_value]
data.query("column == 'criteria_value'")
data.query("column != \"unwanted_value\"").drop(["unwanted_column"], axis = 1)
# merge data1 with data2 
data=pandas.merge(left=data1, right=data2, how='left',
                 left_on = "column 1 in data 1", right_on = "column 2 in data 2")
# check for duplication
print(len(data1) == len(data))

3.Group and Plot Data

# get ... (algorithm) of xyz for each abc
data.groupby("abc column")["xyz column"].algorithm
# plot histogram 
# plot trendline

# get count for each level for one categorical column
# get the mean of outcome for each level for one categorical column

# get the mean of outcome for one continuous column
#group data by id and date, C is continuous variable, D is str
data = data.groupby[("id", "Date")].apply(lambda x: pandas.Series({
                            "max": x["C"].max(),
                            "sum": x["C"].sum(),
                            "mean": x["C"].mean(),
                            "count": x["C"].count(),
                            "cumsum": x["C"].cumsum(),
                            "to_numeric": pd.to_numeric(x["C"]),
                            "pct_C_is_1": np.average(x["C"] == 1),
                            "avg_log": np.average(np.log(x["C"]))
                            "previous_value": x["C"].shift(1),
                            "pct_of_changed": x["C"]/x["C"].shift(1)-1,
                            "num_of_items":  len(x[D].split(',')),
# check the top largest level count

# check the count for each level
# size() and value_counts only difference are the latter one will return in descending order

# create a new column as count for each levels for one categorical column
# this might be useful as feature enginnering

# use agg to compute more
data.groupby(["categorical_col1, categorical_col2"])["outcome"].agg({'mean', 'count'}).reset_index()

Note: agg applies function to each group, apply applies function to each column in group

4.fill na, replace and assign values

# find na per column

# fill na for data1 using data2 
# exist non_overlapping row and col indices
# exact the same row and col indices

If few columns have same amount to NA, it means they are likely to be linked. Check with logic if this makes sense. If so, replace NA to other value.

# Column A and B have same amount of NA, theoretically it is linked with Column C == 1 or not
(data.A.isnull() == (data.C == 1).value_counts)
(data.B.isnull() == (data.C == 1).value_counts)
# if the results are both True and same count value, then theory is correct
data.A.fillna(-1, inplace = True)
data.B.fillna(-1, inplace = True)
# if column C is binary, we can delete it, since it's no longer carry information
# o/w, we kept it 
del data['C']

# binary response column for value is na or not,  in one column
# fill na for each column with specific value depend on whether that value is na or not
for i in range(4: 8):
    data.iloc[:,i] = np.where(data.iloc[:,i].isna(), 0, 1)

# check for no more NA

Just like we investigated NA, we should also pay attention to zeros in column, for example is is_employed ==1, yearly_salary is not likely to be 0. Check how many cases is like that, and dig deeper to figure out.

print(data.query('is_employed == 1 & yearly_salary == 0').shape[0])
# replace value using dictionary
# here we are replacing a categorical column with continuous value
# useful for feature engineering
codes = {"level1": 1, "level2": 2, "level3": 3, "level4": 4}
# replace vs map, map is quicker
#for map, to avoid the situation where there is no valid key you can pass na_action='ignore'
data["replace_column"] = data["replace_column"].map(codes)
data["replace_column"] = data["replace_column"].replace(codes)

# assign value in column2 based on value in column1
data.loc[data.column1 == criteria_value, "column2"] = "new_value"
# data_cp[data_cp.age == 25]["source"] = "Ads" does not work

# assign value based on index
data["column"][row_num] = "new_value"

# assign value for entire column
data["new_column"] = new_value

5.Data Transformation on Column (Log, datetime, str)

a. Log transformation

Note: if original data have negative value, log transformation might not be ideal.

After log, going from 6th to 5th weight less than from 2nd to 1st

data["column"] = np.log(data["column"])

b. Datetime Transformation

data['timestamp'] = pd.to_datetime(data['timestamp'], unit = 's')
# extract hour
data['hour']=data['timestamp'].apply(lambda x: x.hour)
sns.countplot(x='hour', data=data)
# Extract week of the year and day of the week from date
data['weekday']  = pd.to_datetime(data['date']).dt.strftime('%A')
data['weekyear'] = pd.to_datetime(data['date']).dt.strftime('%U').astype(int)
# or
data['weekyear'] = pandas.to_datetime(data.date).dt.week

c. str transformation

# if we want to combine column A and B, both column has str format
new_col = data.A + " and " + data.B

# count items separate by comma
data['count'] = purchase['column'].apply(lambda x: len(x.split(',')))

6.Check Logic

  • check if id is unique
  • check for negative values
  • for funnel analysis, check direction of funnel
  • for records generated by consecutive date, check if it’s actually consecutive
# 1. check if id is unique
# if True, then is unique

# 2. check for negative values for column 1
print("There are", len(data.query('column1<0')),
      "events with negative values")

# 3. for funnel analysis, always check with the direction of funnel
# order down in funnel : layer1, layer2, layer3
print((data['layer1'] >= data['layer2']).unique())
print((data['layer2'] >= data['layer3']).unique())
# or
print("There are", len(data.query('layer1<layer2 | layer2<layer3')),
      "events where the funnel doesn't make any sense")

# 4. check if observation collected from each id are from consecutive days
consecutive_days = data.groupby('id')['date'].transform(lambda x: pandas.to_datetime(x).diff())
# if consecutive, return: array([1], dtype='timedelta64[D]')

7.Import and Export Data (Json, URL, API )

  • Import Data (Json, URL)

For Json, check the structure first then build a df row by row based on each entry

#Import Json:
import json
with open('code/city_searches.json', 'r') as f:
    data_json = json.load(f)

#extract all the values from the json
data_js = pd.DataFrame(columns=["col1", "col2", "col3"])

for i in range(len(data_json)):
    data_js.loc[i, "col1"]   = data_json[i]["section_name"][indices] 
    data_js.loc[i, "col2"]   = data_json[i]["section_name"][indices] 
    data_js.loc[i, "col3"]   = data_json[i]["section_name"][indices] 

Fro URL, use following standard code to convert it to data frame.

data = pd.read_csv("location_of_url_file", delimiter = "", header = None, names = ['url'])
data_wide = pd.DataFrame()
for i in range(len(data)):
    params_extracted = pd.DataFrame([parse.parse_qs(parse.urlparse(data.url[i]).query)])
    data_wide = data_wide.append(params_extracted, sort = True)
data_wide = data_wide.reset_index()
  • Create and Export Data
# create dataframe
data = pd.DataFrame({'column1': list_1, 'column2': list_2})

# export to csv
export_csv = data.to_csv("location", index = None, header = True)
# remember to add '.csv' at the end of the path
  • API
# API useful packages
import requests
import requests_cache
requests_cache.install_cache('github_cache', backend='sqlite', expire_after=180)

# different API has different format, using for loop to get value
new_list = ()
for i in range(len(list)):
     API_value = .......(...key = '****'...).....

More Groupby: