The Data Analysis Workshop

Temporal Factors

Temporal factors, such as the day of the week and the month, may also be indicators of absenteeism. For instance, employees might prefer to schedule their medical examinations on Friday, when the workload is lower and the weekend is closer. In this section, we will analyze how the Day of the week and Month of absence columns relate to the employees' absenteeism.

Let's begin with an analysis of the number of entries for each day of the week and each month:

# count entries per day of the week and month
plt.figure(figsize=(12, 5))
ax = sns.countplot(data=preprocessed_data, \
                   x='Day of the week', \
                   order=["Monday", "Tuesday", \
                          "Wednesday", "Thursday", "Friday"])
ax.set_title("Number of absences per day of the week")
plt.savefig('figs/dow_counts.png', format='png', dpi=300)
plt.figure(figsize=(12, 5))
ax = sns.countplot(data=preprocessed_data, \
                   x='Month of absence', \
                   order=["January", "February", "March", \
                          "April", "May", "June", "July", \
                          "August", "September", "October", \
                          "November", "December", "Unknown"])
ax.set_title("Number of absences per month")
plt.savefig('figs/month_counts.png', format='png', dpi=300)

The output will be as follows:

Figure 2.50: Number of absences per day of the week

The number of absences per month can be visualized as follows:

Figure 2.51: Number of absences per month

From the preceding plots, we can't really see a substantial difference between the different days of the week or months. It seems that fewer absences occur on Thursday, while the month with the most absences is March, but it is hard to say that the difference is significant.
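One way to put a number on "hard to say" is a chi-square goodness-of-fit test, which asks whether the observed counts are compatible with absences being spread evenly across the days. The source does not use this test, so the following is only a minimal sketch. The toy DataFrame and its counts are invented stand-ins for preprocessed_data and do not reproduce the figures above.

```python
import pandas as pd
from scipy.stats import chisquare

# Hypothetical stand-in for preprocessed_data; these counts are invented
# for illustration and are NOT the values from the absenteeism dataset.
toy = pd.DataFrame({
    "Day of the week": ["Monday"] * 30 + ["Tuesday"] * 28
                       + ["Wednesday"] * 29 + ["Thursday"] * 24
                       + ["Friday"] * 27
})

dows = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday"]

# count the entries per day, keeping the weekday order
counts = toy["Day of the week"].value_counts().reindex(dows, fill_value=0)

# with no expected frequencies given, chisquare tests against
# equal counts in every category
stat, pvalue = chisquare(counts.to_numpy())

print(counts)
print(f"chi-square statistic={stat:.3f}, pvalue={pvalue:.3f}")
```

A large p-value here would mean the counts give no evidence against a uniform spread over the days. The same pattern could be applied to the Month of absence column on the real data.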

Now, let's focus on the distribution of absence hours among the days of the week and the months of the year. This analysis will be performed in the following exercise.

Exercise 2.06: Investigating Absence Hours, Based on the Day of the Week and the Month of the Year

In this exercise, you will look at how the number of absence hours varies across the days of the week and the months of the year. Execute the code from the previous sections and exercises before attempting this exercise. Now, follow these steps:

Consider the distribution of absence hours among the days of the week and months of the year:
# analyze average distribution of absence hours
plt.figure(figsize=(12,5))
sns.violinplot(x="Day of the week", \
               y="Absenteeism time in hours",\
               data=preprocessed_data, \
               order=["Monday", "Tuesday", \
                      "Wednesday", "Thursday", "Friday"])
plt.savefig('figs/exercise_206_dow_hours.png', \
            format='png', dpi=300)
plt.figure(figsize=(12,5))
sns.violinplot(x="Month of absence", \
               y="Absenteeism time in hours",\
               data=preprocessed_data, \
               order=["January", "February", \
                      "March", "April", "May", "June", "July",\
                      "August", "September", "October", \
                      "November", "December", "Unknown"])
plt.savefig('figs/exercise_206_month_hours.png', \
            format='png', dpi=300)
The output will be as follows:

Figure 2.52: Average absent hours during the week
The violin plot for the average absent hours over the year can be visualized as follows:

Figure 2.53: Average absent hours over the year
Compute the mean and standard deviation of the absences based on the day of the week:
"""
compute mean and standard deviation of absence hours per day of the week
"""
dows = ["Monday", "Tuesday", "Wednesday", \
        "Thursday", "Friday"]
for dow in dows:
    mask = preprocessed_data["Day of the week"] == dow
    hours = preprocessed_data["Absenteeism time in hours"][mask]
    mean = hours.mean()
    stddev = hours.std()
    print(f"Day of the week: {dow:10s} | Mean : {mean:.03f} \
| Stddev: {stddev:.03f}")
The output will be as follows:

Figure 2.54: Mean and standard deviation of absent hours per day of the week
Similarly, compute the mean and standard deviation based on the month, as follows:
"""
compute mean and standard deviation of absence hours per month
"""
months = ["January", "February", "March", "April", "May", \
          "June", "July", "August", "September", "October", \
          "November", "December"]
for month in months:
    mask = preprocessed_data["Month of absence"] == month
    hours = preprocessed_data["Absenteeism time in hours"][mask]
    mean = hours.mean()
    stddev = hours.std()
    print(f"Month: {month:10s} | Mean : {mean:8.03f} \
| Stddev: {stddev:8.03f}")
The output will be as follows:

Figure 2.55: Mean and standard deviation of absent hours per month
Observe that the average duration of the absences is shortest on Thursday (4.424 hours), while absences during July have the longest average duration (10.955 hours). To determine whether these differences are statistically significant (that is, whether Thursday and July differ significantly from the rest of the days and months, respectively), use the following code snippet:
# perform statistical test for avg duration difference
# (ttest_ind comes from scipy.stats; re-importing is harmless)
from scipy.stats import ttest_ind
thursday_mask = preprocessed_data\
                ["Day of the week"] == "Thursday"
july_mask = preprocessed_data\
            ["Month of absence"] == "July"
thursday_data = preprocessed_data\
                ["Absenteeism time in hours"][thursday_mask]
no_thursday_data = preprocessed_data\
                   ["Absenteeism time in hours"][~thursday_mask]
july_data = preprocessed_data\
            ["Absenteeism time in hours"][july_mask]
no_july_data = preprocessed_data\
               ["Absenteeism time in hours"][~july_mask]
thursday_res = ttest_ind(thursday_data, no_thursday_data)
july_res = ttest_ind(july_data, no_july_data)
print(f"Thursday test result: statistic={thursday_res[0]:.3f}, \
pvalue={thursday_res[1]:.3f}")
print(f"July test result: statistic={july_res[0]:.3f}, \
pvalue={july_res[1]:.3f}")
The output will be as follows:
Thursday test result: statistic=-2.307, pvalue=0.021
July test result: statistic=2.605, pvalue=0.009
Display the first rows of the data (transposed for readability) and plot a histogram of the Service time column as follows:
preprocessed_data.head().T
preprocessed_data["Service time"].hist()
The output will be as follows:

Figure 2.56: Statistics of data
The histogram will be as follows:

Figure 2.57: Histogram for preprocessed data

Note

To access the source code for this specific section, please refer to https://packt.live/2AIFO1X.

You can also run this example online at https://packt.live/37y5omt. You must execute the entire Notebook in order to get the desired result.

Since the p-values from both statistical tests are below the significance level of 0.05, we can conclude the following:

There is a statistically significant difference between Thursdays and other days of the week. Absences on Thursday have a shorter duration, on average.
Absences during July have the longest average duration of the year. In this case too, we can reject the null hypothesis of no difference.

From the analysis we've performed in this exercise, we can conclude that our initial observations about the difference in absenteeism during the month of July and on Thursdays are correct. Of course, we cannot claim that the day or the month causes the difference; we can only state that these trends exist in the data.
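The per-group loop and the t-test from this exercise can also be written more compactly. The sketch below is not the book's code and runs on synthetic data: the toy DataFrame, its random values, and the names summary and welch_res are assumptions made for illustration. It uses groupby for the mean and standard deviation, and it passes equal_var=False to ttest_ind, which switches to Welch's variant of the test. That is a different choice from the default (equal-variance) test used above and may be preferable when the two groups have visibly different spreads.

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind

# Synthetic stand-in for preprocessed_data (values are random, not real).
rng = np.random.default_rng(0)
dows = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday"]
toy = pd.DataFrame({
    "Day of the week": rng.choice(dows, size=200),
    "Absenteeism time in hours": rng.exponential(scale=7.0, size=200),
})

# count, mean and standard deviation per day, in weekday order
summary = (toy.groupby("Day of the week")["Absenteeism time in hours"]
              .agg(["count", "mean", "std"])
              .reindex(dows))
print(summary.round(3))

# Welch's t-test: Thursday against all other days,
# without assuming equal variances in the two groups
thursday_mask = toy["Day of the week"] == "Thursday"
hours = toy["Absenteeism time in hours"]
welch_res = ttest_ind(hours[thursday_mask], hours[~thursday_mask],
                      equal_var=False)
print(f"Welch test: statistic={welch_res.statistic:.3f}, "
      f"pvalue={welch_res.pvalue:.3f}")
```

On the real data, replacing toy with preprocessed_data should give the same means and standard deviations as the loop in the exercise, while the Welch p-values may differ slightly from the ones reported above.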

Activity 2.01: Analyzing the Service Time and Son Columns

In this activity, you will extend the analysis of the absenteeism dataset by exploring the impact of two additional columns: Service time and Son.

This activity is based on the techniques that have been presented in this chapter—that is, distribution analysis, hypothesis testing, and conditional probability estimation.

The following steps will help you complete this activity:

Import the data and the necessary libraries:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
Analyze the distribution of the Service time column by creating a kernel density estimation plot (use the seaborn.kdeplot() function). Perform a hypothesis test for normality (that is, a Kolmogorov-Smirnov test with the scipy.stats.kstest() function). The KDE plot will be as follows:

Figure 2.58: KDE plot for service time
Create a violin plot of the Service time column and the Reason for absence column. Draw a conclusion about the observed relationship.
The output will be as follows:

Figure 2.59: Violin plot for the Service time column
Create a correlation plot between the Service time and Absenteeism time in hours columns, similar to the one in Figure 2.47. The output will be as follows:

Figure 2.60: Correlation plot for service time
Analyze the distributions of Absenteeism time in hours for employees with a different number of children (the Son column).
The output will be as follows:

Figure 2.61: Distribution of absent time for employees with a different number of children

Note

The solution to the activity can be found on page 494.

From this analysis, we can infer that the number of absence hours for employees with a greater number of children lies in the range of 10-15 hours. Employees with fewer than three children appear to be absent for anywhere between 1 and 20 hours. Specifically, employees with no children still have absence hours in the range of 10-15 hours, which must be due to other reasons and opens up a new area of analysis. In contrast, employees with one child are absent for only 5 hours on average. Employees with two children have an average of 15-25 absence hours, which could be analyzed further.

Thus, we have drawn measurable conclusions that help us understand employee behavior in an organization. Such conclusions are a first step toward tackling unregulated absenteeism and making better use of human resources.
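As a starting point for the normality test in Activity 2.01, note that scipy.stats.kstest() compares the sample against a standard normal distribution (mean 0, standard deviation 1) unless told otherwise. The following is a minimal sketch on synthetic data, not the book's solution: the service_time array is a randomly generated stand-in for the Service time column, and fitting the mean and standard deviation from the same sample is an assumption made here for illustration.

```python
import numpy as np
from scipy.stats import kstest

# Hypothetical stand-in for the "Service time" column (random values).
rng = np.random.default_rng(42)
service_time = rng.gamma(shape=9.0, scale=1.4, size=500)

# Without parameters, the sample is compared against N(0, 1),
# which is almost never the intended reference distribution.
raw_res = kstest(service_time, "norm")

# Compare against a normal distribution with the sample's own
# mean and standard deviation instead.
mean = service_time.mean()
std = service_time.std(ddof=1)
ks_res = kstest(service_time, "norm", args=(mean, std))

print(f"vs N(0, 1):       statistic={raw_res.statistic:.3f}")
print(f"vs N(mean, std):  statistic={ks_res.statistic:.3f}, "
      f"pvalue={ks_res.pvalue:.3f}")
```

Because the parameters are estimated from the same sample, the resulting p-value tends to be conservative, so it is best read alongside the KDE plot rather than on its own.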