
Initial Analysis of the Reason for Absence
Let's start with a simple analysis of the Reason for absence column. We will try to address questions such as, what is the most common reason for absence? Does being a drinker or smoker have some effect on the causes? Does the distance to work have some effect on the reasons? And so on. Starting with these types of questions is often important when performing data analysis, as this is a good way to obtain confidence and understanding of the data.
The first thing we are interested in is the overall distribution of the absence reasons in the data—that is, how many entries we have for a specific reason for absence in our dataset. We can easily address this question by using the countplot() function from the seaborn package:
# get the number of entries for each reason for absence
plt.figure(figsize=(10, 5))
ax = sns.countplot(data=preprocessed_data, x="Reason for absence")
ax.set_ylabel("Number of entries per reason of absence")
plt.savefig('figs/absence_reasons_distribution.png', \
format='png', dpi=300)
The output will be as follows:

Figure 2.6: Number of entries for all reasons for absence
Note that we also used the Disease column as the hue parameter. This helps us to distinguish between disease-related reasons (listed in the ICD encoding) and those that aren't. Following Figure 2.3, we can assert that the most frequent reasons for absence are related to medical consultations (23), dental consultations (28), and physiotherapy (27). On the other hand, the most frequent reasons for absence encoded in the ICD encoding are related to diseases of the musculoskeletal system and connective tissue (13) and injury, poisoning, and certain other consequences of external causes (19).
In order to perform a more accurate and in-depth analysis of the data, we will investigate the impact of the various features on the Reason for absence and Absenteeism in hours columns in the following sections. First, we will analyze the data on social drinkers and smokers in the next section.