Temporal Factors
Factors such as the day of the week and the month may also be indicators of absenteeism. For instance, employees might prefer to schedule medical examinations on Fridays, when the workload is lower and the weekend is close. In this section, we will analyze how the Day of the week and Month of absence columns relate to the employees' absenteeism.
Let's begin with an analysis of the number of entries for each day of the week and each month:
# count entries per day of the week and month
plt.figure(figsize=(12, 5))
ax = sns.countplot(data=preprocessed_data, \
                   x='Day of the week', \
                   order=["Monday", "Tuesday", \
                          "Wednesday", "Thursday", "Friday"])
ax.set_title("Number of absences per day of the week")
plt.savefig('figs/dow_counts.png', format='png', dpi=300)
plt.figure(figsize=(12, 5))
ax = sns.countplot(data=preprocessed_data, \
                   x='Month of absence', \
                   order=["January", "February", "March", \
                          "April", "May", "June", "July", \
                          "August", "September", "October", \
                          "November", "December", "Unknown"])
ax.set_title("Number of absences per month")
plt.savefig('figs/month_counts.png', format='png', dpi=300)
The output will be as follows:
Figure 2.50: Number of absences per day of the week
The number of absences per month can be visualized as follows:
Figure 2.51: Number of absences per month
From the preceding plots, we can't really see a substantial difference between the different days of the week or months. It seems that fewer absences occur on Thursday, while the month with the most absences is March, but it is hard to say that the difference is significant.
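One way to put a number on "hard to say" is a chi-square goodness-of-fit test on the counts per day. The following is only a sketch under stated assumptions: toy_data is a synthetic stand-in for preprocessed_data, and the null hypothesis of equal counts on every weekday is our own simplification (it ignores holidays and unequal numbers of working days).

```python
import numpy as np
import pandas as pd
from scipy.stats import chisquare

# Synthetic stand-in for preprocessed_data (an assumption, not the real dataset)
rng = np.random.default_rng(0)
days = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday"]
toy_data = pd.DataFrame({"Day of the week": rng.choice(days, size=700)})

# Observed number of entries per day, in calendar order
observed = (toy_data["Day of the week"]
            .value_counts()
            .reindex(days, fill_value=0))

# Null hypothesis: absences are spread uniformly across the five days.
# chisquare() assumes equal expected frequencies when f_exp is not given.
chi2_res = chisquare(observed.to_numpy())

print(observed)
print(f"chi2={chi2_res.statistic:.3f}, pvalue={chi2_res.pvalue:.3f}")
```

A large p-value here would be consistent with the visual impression that the counts do not differ much between days. The same pattern applies to the Month of absence column.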
Now, let's focus on the distribution of absence hours among the days of the week and the months of the year. This analysis will be performed in the following exercise.
Exercise 2.06: Investigating Absence Hours, Based on the Day of the Week and the Month of the Year
In this exercise, you will look at how the absence hours are distributed across the days of the week and the months of the year. Execute the code from the previous section and exercises before attempting this exercise. Now, follow these steps:
- Consider the distribution of absence hours among the days of the week and months of the year:
# analyze average distribution of absence hours
plt.figure(figsize=(12,5))
sns.violinplot(x="Day of the week", \
               y="Absenteeism time in hours",\
               data=preprocessed_data, \
               order=["Monday", "Tuesday", \
                      "Wednesday", "Thursday", "Friday"])
plt.savefig('figs/exercise_206_dow_hours.png', \
            format='png', dpi=300)
plt.figure(figsize=(12,5))
sns.violinplot(x="Month of absence", \
               y="Absenteeism time in hours",\
               data=preprocessed_data, \
               order=["January", "February", \
                      "March", "April", "May", "June", "July",\
                      "August", "September", "October", \
                      "November", "December", "Unknown"])
plt.savefig('figs/exercise_206_month_hours.png', \
            format='png', dpi=300)
The output will be as follows:
Figure 2.52: Average absent hours during the week
The violin plot for the average absent hours over the year can be visualized as follows:
Figure 2.53: Average absent hours over the year
- Compute the mean and standard deviation of the absences based on the day of the week:
"""
compute mean and standard deviation of absence hours per day of the week
"""
dows = ["Monday", "Tuesday", "Wednesday", \
        "Thursday", "Friday"]
for dow in dows:
    # select the absence hours recorded on the current day
    mask = preprocessed_data["Day of the week"] == dow
    hours = preprocessed_data["Absenteeism time in hours"][mask]
    mean = hours.mean()
    stddev = hours.std()
    print(f"Day of the week: {dow:10s} | Mean : {mean:.03f} "
          f"| Stddev: {stddev:.03f}")
The output will be as follows:
Figure 2.54: Mean and standard deviation of absent hours per day of the week
- Similarly, compute the mean and standard deviation based on the month, as follows:
"""
compute mean and standard deviation of absence hours per month
"""
months = ["January", "February", "March", "April", "May", \
          "June", "July", "August", "September", "October", \
          "November", "December"]
for month in months:
    # select the absence hours recorded in the current month
    mask = preprocessed_data["Month of absence"] == month
    hours = preprocessed_data["Absenteeism time in hours"][mask]
    mean = hours.mean()
    stddev = hours.std()
    print(f"Month: {month:10s} | Mean : {mean:8.03f} "
          f"| Stddev: {stddev:8.03f}")
The output will be as follows:
Figure 2.55: Mean and standard deviation of absent hours per month
- Observe that the average duration of the absences is slightly shorter on Thursday (4.424 hours), while absences during July have the longest average duration (10.955 hours). To determine whether these values differ significantly from those of the remaining days and months, run a two-sample t-test for each case with the following code snippet:
# perform statistical test for avg duration difference
from scipy.stats import ttest_ind  # not needed if already imported earlier

thursday_mask = preprocessed_data["Day of the week"] == "Thursday"
july_mask = preprocessed_data["Month of absence"] == "July"
# split the absence hours into the group of interest and the rest
thursday_data = preprocessed_data\
    ["Absenteeism time in hours"][thursday_mask]
no_thursday_data = preprocessed_data\
    ["Absenteeism time in hours"][~thursday_mask]
july_data = preprocessed_data\
    ["Absenteeism time in hours"][july_mask]
no_july_data = preprocessed_data\
    ["Absenteeism time in hours"][~july_mask]
thursday_res = ttest_ind(thursday_data, no_thursday_data)
july_res = ttest_ind(july_data, no_july_data)
print(f"Thursday test result: statistic={thursday_res[0]:.3f}, "
      f"pvalue={thursday_res[1]:.3f}")
print(f"July test result: statistic={july_res[0]:.3f}, "
      f"pvalue={july_res[1]:.3f}")
The output will be as follows:
Thursday test result: statistic=-2.307, pvalue=0.021
July test result: statistic=2.605, pvalue=0.009
- Summarize and visualize the data as follows:
preprocessed_data.head().T
preprocessed_data["Service time"].hist()
The output will be as follows:
Figure 2.56: Statistics of data
- Visualize the plot as follows:
Figure 2.57: Histogram for preprocessed data
Note
To access the source code for this specific section, please refer to https://packt.live/2AIFO1X.
You can also run this example online at https://packt.live/37y5omt. You must execute the entire Notebook in order to get the desired result.
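The per-day and per-month loops in the steps above can also be written as a single grouped aggregation. The following is a minimal sketch, assuming a tiny hand-made DataFrame (toy_data, with hypothetical values) that has the same column names as preprocessed_data:

```python
import pandas as pd

# Toy stand-in for preprocessed_data (hypothetical values, for illustration only)
toy_data = pd.DataFrame({
    "Day of the week": ["Monday", "Monday", "Tuesday",
                        "Tuesday", "Thursday", "Thursday"],
    "Absenteeism time in hours": [8, 4, 2, 6, 1, 3],
})

dows = ["Monday", "Tuesday", "Thursday"]

# One pass over the data: mean and sample standard deviation per day,
# reordered into calendar order instead of alphabetical order
summary = (toy_data
           .groupby("Day of the week")["Absenteeism time in hours"]
           .agg(["mean", "std"])
           .reindex(dows))

print(summary.round(3))
```

Like Series.std() in the loops, the grouped std uses the sample standard deviation (ddof=1), so both approaches should report the same numbers on the same data.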
Since the p-values from both statistical tests are below the significance level of 0.05, we can conclude the following:
- There is a statistically significant difference between Thursdays and the other days of the week. Absences on Thursday have a shorter duration, on average.
- Absences during July are, on average, longer than absences in the other months. In this case too, we can reject the null hypothesis of no difference.
From the analysis we've performed in this exercise, we can conclude that our initial observations about the difference in absenteeism during the month of July and on Thursdays are correct. Of course, we cannot claim that the day or the month causes these differences; we can only state that certain trends exist in the data.
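By default, ttest_ind assumes that both groups have equal variances, and absence hours are typically right-skewed. As a possible robustness check, the same comparison can be repeated with Welch's t-test and with the rank-based Mann-Whitney U test. The sketch below uses synthetic, exponentially distributed samples (an assumption made for illustration, not the real dataset); the variable names are hypothetical.

```python
import numpy as np
from scipy.stats import mannwhitneyu, ttest_ind

rng = np.random.default_rng(42)
# Synthetic, right-skewed "absence hours" (assumed shapes, not the real data)
thursday_hours = rng.exponential(scale=4.5, size=120)
other_hours = rng.exponential(scale=7.5, size=600)

# Student's t-test (the default) assumes equal variances in both groups
student = ttest_ind(thursday_hours, other_hours)
# Welch's variant drops the equal-variance assumption
welch = ttest_ind(thursday_hours, other_hours, equal_var=False)
# Mann-Whitney U is rank-based and does not assume normality
mwu = mannwhitneyu(thursday_hours, other_hours, alternative="two-sided")

for name, res in [("Student", student), ("Welch", welch),
                  ("Mann-Whitney", mwu)]:
    print(f"{name:13s} statistic={res.statistic:.3f}, "
          f"pvalue={res.pvalue:.4f}")
```

If all three tests agree on the real data, the conclusion is less likely to be an artifact of the equal-variance or normality assumptions.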
Activity 2.01: Analyzing the Service Time and Son Columns
In this activity, you will extend the analysis of the absenteeism dataset by exploring the impact of two additional columns: Service time and Son.
This activity is based on the techniques that have been presented in this chapter—that is, distribution analysis, hypothesis testing, and conditional probability estimation.
The following steps will help you complete this activity:
- Import the data and the necessary libraries:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import kstest
%matplotlib inline
- Analyze the distribution of the Service time column by creating a kernel density estimation plot (use the seaborn.kdeplot() function). Perform a hypothesis test for normality (that is, a Kolmogorov-Smirnov test with the scipy.stats.kstest() function). The KDE plot will be as follows:
Figure 2.58: KDE plot for service time
- Create a violin plot of the Service time column and the Reason for absence column. Draw a conclusion about the observed relationship.
The output will be as follows:
Figure 2.59: Violin plot for the Service time column
- Create a correlation plot between the Service time and Absenteeism time in hours columns, similar to the one in Figure 2.47. The output will be as follows:
Figure 2.60: Correlation plot for service time
- Analyze the distributions of Absenteeism time in hours for employees with a different number of children (the Son column).
The output will be as follows:
Figure 2.61: Distribution of absent time for employees with a different number of children
Note
The solution to the activity can be found on page 494.
From this analysis, we can infer that the number of absence hours for employees with a greater number of children lies in the range of 10-15 hours. Employees with fewer than three children appear to be absent for a wider range of 1-20 hours. Specifically, employees with no children still have absence hours within the range of 10-15 hours, owing to other reasons, which opens up a new area of analysis. In contrast, employees with one child are absent for an average of only 5 hours, while employees with two children have an average of 15-25 absent hours, which could be analyzed further.
Thus, we have successfully drawn measurable conclusions to help us understand employee behavior in an organization to tackle unregulated absenteeism and take necessary measures to ensure the optimal utilization of human resources.
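The normality check from the activity can be sketched as follows. This is not the book's solution: it is a minimal example on synthetic data, where a gamma-distributed sample (an assumption) stands in for the Service time column. Note that kstest compares against a standard normal distribution by default, so the sample is standardized first.

```python
import numpy as np
from scipy.stats import kstest

rng = np.random.default_rng(1)
# Synthetic stand-in for the "Service time" column (assumed values, in years)
service_time = rng.gamma(shape=4.0, scale=3.0, size=500)

# Standardize the sample, because kstest(..., "norm") tests against N(0, 1).
# Estimating mean and std from the same sample makes the p-value approximate.
standardized = ((service_time - service_time.mean())
                / service_time.std(ddof=1))

ks_res = kstest(standardized, "norm")
print(f"KS statistic={ks_res.statistic:.3f}, pvalue={ks_res.pvalue:.4f}")
```

A p-value below the chosen significance level would suggest rejecting the hypothesis that the column is normally distributed. With the real data, replace service_time with preprocessed_data["Service time"].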