Building Process Performance Data Models for Target Setting and Mid-Course Corrective Actions in a CMMI Maturity Level 5 Process
Published by Vimal Octavius PJ
Tags: PPM, Regression, Matrix Plot, Box Plot, Scatter Plot, Control Chart, I-MR Chart
My experience in building data models for a project's target setting, prediction, and mid-course corrective actions. These steps were followed to obtain a CMMI Maturity Level 5 certification.
Summarizing the steps involved in building Process Performance Models (PPMs):
Before we begin, it's good practice to create a document describing the data preparation guidelines and the steps involved, so the teams are prepared. These may include:
1. Identify Y and study variation:
Let's say our project Y is Customer Satisfaction in a ticketing process. An I-MR control chart can be used to track the process variability based on the samples taken over a period of time.
import pandas as pd
df = pd.read_csv('/content/drive/MyDrive/Datasets/PPM-case-study-v1.csv')
df.head()
| | Week_Num | Customer_Satisfaction | Resolved_First_Call | Average_Speed_To_Answer | Experience_0-2Yrs_Percent |
|---|---|---|---|---|---|
| 0 | 9 | 0.963415 | 0.814433 | 0.971769 | 0.666667 |
| 1 | 10 | 0.986486 | 0.848485 | 0.966065 | 0.710000 |
| 2 | 11 | 0.943182 | 0.878049 | 0.985775 | 0.500000 |
| 3 | 12 | 0.964706 | 0.835294 | 0.974433 | 0.600000 |
| 4 | 13 | 0.967391 | 0.817073 | 0.950240 | 0.650000 |
df.describe()
| | Week_Num | Customer_Satisfaction | Resolved_First_Call | Average_Speed_To_Answer | Experience_0-2Yrs_Percent |
|---|---|---|---|---|---|
| count | 27.000000 | 27.000000 | 27.000000 | 27.000000 | 27.000000 |
| mean | 22.000000 | 0.943501 | 0.762812 | 0.934111 | 0.555626 |
| std | 7.937254 | 0.027951 | 0.062703 | 0.026929 | 0.142547 |
| min | 9.000000 | 0.886667 | 0.602041 | 0.877095 | 0.250000 |
| 25% | 15.500000 | 0.920998 | 0.718288 | 0.912943 | 0.439286 |
| 50% | 22.000000 | 0.951807 | 0.771739 | 0.936737 | 0.600000 |
| 75% | 28.500000 | 0.963415 | 0.807143 | 0.952319 | 0.673333 |
| max | 35.000000 | 0.988372 | 0.878049 | 0.985775 | 0.750000 |
df['Diff'] = abs(df['Customer_Satisfaction'].diff()) # absolute difference between each value and the preceding one (the moving range)
df.head()
| | Week_Num | Customer_Satisfaction | Resolved_First_Call | Average_Speed_To_Answer | Experience_0-2Yrs_Percent | Diff |
|---|---|---|---|---|---|---|
| 0 | 9 | 0.963415 | 0.814433 | 0.971769 | 0.666667 | NaN |
| 1 | 10 | 0.986486 | 0.848485 | 0.966065 | 0.710000 | 0.023072 |
| 2 | 11 | 0.943182 | 0.878049 | 0.985775 | 0.500000 | 0.043305 |
| 3 | 12 | 0.964706 | 0.835294 | 0.974433 | 0.600000 | 0.021524 |
| 4 | 13 | 0.967391 | 0.817073 | 0.950240 | 0.650000 | 0.002685 |
df.shape
(27, 6)
from statsmodels.graphics.gofplots import qqplot
from matplotlib import pyplot as plt
%matplotlib inline
data = df['Customer_Satisfaction']
qqplot(data, line='s')
plt.show()
from scipy import stats
stats.anderson(data,dist='norm')
AndersonResult(statistic=0.39960035339271727, critical_values=array([0.517, 0.589, 0.707, 0.824, 0.98 ]), significance_level=array([15. , 10. , 5. , 2.5, 1. ]))
The statistic returned, 0.399, is less than the critical value of 0.707 at the 5% significance level. Hence, we fail to reject the null hypothesis that the data is normally distributed.
This is confirmed by the D'Agostino-Pearson normality test below, since the p-value is greater than 0.05.
stats.normaltest(data)
NormaltestResult(statistic=1.2484339335029948, pvalue=0.5356807201523828)
The table of control chart constants gives the following values for a subgroup size of n = 2:
# Constant for n = 2
E2 = 2.66
D3 = 0
D4 = 3.267
mr = df['Diff']
mr_bar = mr.mean()
mr_bar
0.022679839896153835
i = df['Customer_Satisfaction']
i_bar = i.mean()
#I chart control limits
uclx = i_bar + E2 * mr_bar
lclx = i_bar - E2 * mr_bar
import matplotlib.pyplot as plt
%matplotlib inline
fig, ax = plt.subplots()
ax.plot(i,color = 'blue')
ax.axhline(i_bar,color = 'black',linestyle = '--')
ax.axhline(lclx, color = 'red',linestyle = '--')
ax.axhline(uclx, color = 'red',linestyle = '--')
ax.set_xlabel('Observation')
ax.set_ylabel('Customer Satisfaction')
ax.set_title('Control chart I-MR I')
plt.show()
# MR chart control Limits
uclr = D4 * mr_bar
lclr = D3 * mr_bar
import matplotlib.pyplot as plt
%matplotlib inline
fig, ax = plt.subplots()
ax.plot(mr,color = 'blue')
ax.axhline(mr_bar,color = 'green',linestyle = '--')
ax.axhline(lclr, color = 'red',linestyle = '--')
ax.axhline(uclr, color = 'red',linestyle = '--')
ax.set_xlabel('Observation')
ax.set_ylabel('Range')
ax.set_title('Control Chart I-MR MR')
plt.show()
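Once the limits are drawn, any point falling outside them signals special-cause variation that should be investigated before the baseline is treated as stable. A minimal sketch of such a check, reusing the limits computed above (this filtering step is an illustration, not part of the original workflow):
# Flag observations outside the I-chart limits and moving ranges above the MR-chart upper limit
out_of_control_i = df[(df['Customer_Satisfaction'] > uclx) | (df['Customer_Satisfaction'] < lclx)]
out_of_control_mr = df[df['Diff'] > uclr]
print(out_of_control_i[['Week_Num', 'Customer_Satisfaction']])
print(out_of_control_mr[['Week_Num', 'Diff']])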
2. List down the Xs which might influence Y significantly:
Select Xs based on people, process, and product factors. The Xs should be tweakable and controllable in nature.
Xs | Category | Data Type |
---|---|---|
First Call Resolution | Process | Continuous |
Project Experience | People | Continuous |
% Reopened Tickets | Process | Continuous |
3. Check the homogeneity of the Xs; a box plot helps visualize the points to look out for:
fig, ax = plt.subplots()
ax.boxplot([df['Resolved_First_Call'],df['Average_Speed_To_Answer']])
ax.set_xticklabels(['Resolved_First_Call','Average_Speed_To_Answer'])
ax.set_ylabel('Value') # the box plots show the Xs themselves, so a generic y-axis label is used
ax.set_title('Visualizing Homogeneity of Xs')
plt.show()
4. Model building and verification:
# Alternative: scikit-learn's LinearRegression could also fit this model (not used here)
#from sklearn import linear_model
#regr = linear_model.LinearRegression()
#regr.fit(X,Y)
import seaborn
df_X_Y = df.drop(columns = ['Week_Num','Diff'], axis=1)
seaborn.pairplot(df_X_Y,kind="reg")
<seaborn.axisgrid.PairGrid at 0x7f6d0d142b90>
Remove any Xs that show no apparent linear relationship with Y.
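A numeric complement to the pair plot is a simple correlation matrix of Y and the candidate Xs; this is a quick sketch using the same df_X_Y frame built above:
# Pearson correlations between Customer_Satisfaction and the candidate Xs (and among the Xs themselves)
print(df_X_Y.corr().round(3))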
We see the Xs are highly correlated with each other, which could lead to multicollinearity. Let's drop Average Speed to Answer. (There are other methods to handle multicollinearity; for the sake of simplicity and illustration, we drop one highly correlated X.)
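One of those other methods is to quantify multicollinearity with the variance inflation factor (VIF); as a common rule of thumb, a VIF above roughly 5-10 flags a problematic X. A hedged sketch using statsmodels (the candidate column list is assumed from this dataset, and the const row can be ignored):
# Variance Inflation Factor for each candidate X; higher values indicate stronger collinearity
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
X_candidates = sm.add_constant(df[['Resolved_First_Call', 'Average_Speed_To_Answer', 'Experience_0-2Yrs_Percent']])
vif = pd.Series([variance_inflation_factor(X_candidates.values, i) for i in range(X_candidates.shape[1])],
                index=X_candidates.columns)
print(vif)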
x = df[['Resolved_First_Call','Experience_0-2Yrs_Percent']]
Y = df['Customer_Satisfaction']
import statsmodels.api as sm
X = sm.add_constant(x) #Our model needs an intercept so we add a column of 1s:
print(X)
    const  Resolved_First_Call  Experience_0-2Yrs_Percent
0     1.0             0.814433                   0.666667
1     1.0             0.848485                   0.710000
2     1.0             0.878049                   0.500000
3     1.0             0.835294                   0.600000
4     1.0             0.817073                   0.650000
5     1.0             0.818182                   0.690000
6     1.0             0.814286                   0.660000
7     1.0             0.783784                   0.720000
8     1.0             0.797297                   0.680000
9     1.0             0.795181                   0.650000
10    1.0             0.800000                   0.690000
11    1.0             0.783019                   0.680000
12    1.0             0.770492                   0.666667
13    1.0             0.771739                   0.428571
14    1.0             0.777778                   0.640000
15    1.0             0.765957                   0.750000
16    1.0             0.760870                   0.500000
17    1.0             0.709302                   0.450000
18    1.0             0.744898                   0.460000
19    1.0             0.727273                   0.490000
20    1.0             0.746835                   0.590000
21    1.0             0.675676                   0.400000
22    1.0             0.674419                   0.390000
23    1.0             0.682353                   0.390000
24    1.0             0.703297                   0.300000
25    1.0             0.697917                   0.250000
26    1.0             0.602041                   0.400000
ols_model = sm.OLS(Y,X)
print(ols_model)
<statsmodels.regression.linear_model.OLS object at 0x7f6d0c95a6d0>
ols_results = ols_model.fit()
print(ols_results)
<statsmodels.regression.linear_model.RegressionResultsWrapper object at 0x7f6d0c8eb210>
print(ols_results.summary())
                             OLS Regression Results
=================================================================================
Dep. Variable:     Customer_Satisfaction   R-squared:                       0.869
Model:                               OLS   Adj. R-squared:                  0.858
Method:                    Least Squares   F-statistic:                     79.42
Date:                   Sun, 27 Feb 2022   Prob (F-statistic):           2.62e-11
Time:                           02:36:42   Log-Likelihood:                 86.198
No. Observations:                     27   AIC:                            -166.4
Df Residuals:                         24   BIC:                            -162.5
Df Model:                              2
Covariance Type:               nonrobust
=============================================================================================
                                coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------------------
const                         0.8186      0.029     28.478      0.000       0.759       0.878
Resolved_First_Call           0.0399      0.047      0.856      0.400      -0.056       0.136
Experience_0-2Yrs_Percent     0.1699      0.021      8.279      0.000       0.128       0.212
==============================================================================
Omnibus:                        1.694   Durbin-Watson:                   1.814
Prob(Omnibus):                  0.429   Jarque-Bera (JB):                1.429
Skew:                          -0.413   Prob(JB):                        0.489
Kurtosis:                       2.233   Cond. No.                         38.1
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Ideally we would refine the model further to bring the condition number below 20, to avoid multicollinearity inflating the variance of the regression coefficients.
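The condition number shown in the summary (38.1 here) can also be read directly from the fitted results object, which makes it easy to track across model iterations; a minimal sketch:
# Condition number of the design matrix; values well above ~20-30 hint at collinear or poorly scaled Xs
print(ols_results.condition_number)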
Let's take a look at the Test set:
df_test = pd.read_csv('/content/drive/MyDrive/Datasets/PPM-case-study-v1-test.csv')
df_test
| | Week_Num | Customer_Satisfaction | Resolved_First_Call | Average_Speed_To_Answer | Experience_0-2Yrs_Percent |
|---|---|---|---|---|---|
| 0 | 36 | 0.92 | 0.68 | 0.92 | 0.45 |
| 1 | 37 | 0.88 | 0.66 | 0.90 | 0.38 |
| 2 | 38 | 0.92 | 0.69 | 0.92 | 0.40 |
| 3 | 39 | 0.90 | 0.71 | 0.89 | 0.31 |
| 4 | 40 | 0.87 | 0.68 | 0.88 | 0.25 |
| 5 | 41 | 0.92 | 0.61 | 0.88 | 0.42 |
Let's try to predict the value of Y (Customer Satisfaction) for the test data set above:
x_test_set = df_test[['Resolved_First_Call','Experience_0-2Yrs_Percent']]
x_test = sm.add_constant(x_test_set)
ypred = ols_results.predict(x_test)
print(ypred)
0    0.922247
1    0.909555
2    0.914151
3    0.899659
4    0.888267
5    0.914353
dtype: float64
df_predicted_column = pd.DataFrame({'Predicted_Customer_Satisfaction': ypred})
df_predicted_column
| | Predicted_Customer_Satisfaction |
|---|---|
| 0 | 0.922247 |
| 1 | 0.909555 |
| 2 | 0.914151 |
| 3 | 0.899659 |
| 4 | 0.888267 |
| 5 | 0.914353 |
We can now compare the predicted values against the actuals from the test data set. If the Mean Absolute Error (MAE) is greater than a pre-determined acceptable level (for example, +/-3%), then the model needs to be re-evaluated.
If the MAE is acceptable, we may proceed with PPM usage and mid-course corrective actions as required.
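A minimal sketch of that comparison, with the +/-3% figure used purely as the illustrative tolerance mentioned above:
# Mean Absolute Error of predicted vs actual Customer Satisfaction on the test set
mae = (df_test['Customer_Satisfaction'] - ypred).abs().mean()
print(f'MAE: {mae:.4f}')
if mae > 0.03:  # illustrative +/-3% tolerance; use the project's agreed precision in practice
    print('MAE exceeds the acceptable level - re-evaluate the model')
else:
    print('MAE is within the acceptable level - proceed with PPM usage')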
Using the formula returned by the regression equation, different scenarios can be created. Each scenario would have a projected value of Y (Customer Satisfaction) with its corresponding values of the Xs (Resolved on First Call, and the percentage of people with 0-2 years of experience in the project working those tickets).
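A sketch of such a scenario table, with hypothetical target levels of the Xs plugged into the fitted equation (the numbers below are illustrative, not taken from the project):
# Hypothetical what-if scenarios: planned levels of the Xs and the projected Customer Satisfaction
scenarios = pd.DataFrame({
    'Resolved_First_Call': [0.75, 0.80, 0.85],        # illustrative first-call-resolution targets
    'Experience_0-2Yrs_Percent': [0.60, 0.50, 0.40],  # illustrative staffing mixes
})
scenarios_X = sm.add_constant(scenarios, has_constant='add')
scenarios['Projected_Customer_Satisfaction'] = ols_results.predict(scenarios_X)
print(scenarios)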