A Study of Relationship to Absentees and Score Using Machine Learning Method: A Case Study of Linear Regression Analysis

Student absenteeism from the classroom is an international problem and is not confined to Indian students. This research focuses on the relationship between students' class absences and their scores, and was carried out using linear regression analysis, a widely used machine learning method. Descriptive statistics, Student's t-test, Pearson correlation, and regression models were used in the statistical analysis. According to the results, there is a statistically significant association between absences and score (t = -4.06075, p < 0.05). The study also found that class absences are negatively correlated with score (r = -0.6088). A regression model was built to investigate the impact of class absences on student score. This study will benefit both the college administration and the students by raising awareness of the disadvantages of not attending classes.
Keywords: Linear Regression Model, Machine Learning, Descriptive Statistical Analysis, Statistical Analysis


I. Introduction
In today's technologically advanced world, a major difficulty in primary and middle schools, intermediate colleges, postgraduate colleges, and universities is the rising tendency of student absenteeism. India is currently facing this problem as well. Absenteeism is described as the habit of failing to show up for class or an event without a valid justification, and the term absentee is used to characterize someone who does this regularly. Various studies and informal observation suggest that students who do not attend classes receive worse grades, while students with a higher attendance rate than their peers receive higher grades. A student's grade is thus closely associated with his or her attendance.
Various studies have shown that successful students have a better attendance record as well as a higher grade. According to Ahmat and Zahari's research, there is a negative relationship between absenteeism and grades, implying that more absences go with a lower grade [1]. Another study looks into the link between students' attendance in class and their total marks in a variety of computer science programmes [2].
Zhang and Wang's research, using a linear regression model, reveals a positive correlation between the variables of interest [3]. On the basis of test score and classroom attendance data over two years, regression analysis is introduced, and correlation curves between test score and classroom attendance are plotted [3]. Another regression model revealed a robust link between semester GPA and attendance percentage, as well as overall GPA [4]. Similarly, Narula and Nagar's research demonstrates a strong link between attendance and grades using machine learning methods [5]. Research at a Finnish university examines the relationship between attendance and performance using a clustering method [6]. A further study uses administrative panel data from California to examine how absenteeism affects student outcomes and to estimate the pandemic's impact [7].
One study of medical undergraduate students demonstrated that performance in professional assessment tests has a negative link with absenteeism and a positive correlation with a high attendance percentage [8]. Taken together, this evidence indicates a substantial positive relationship between attendance and grades, much of it established with machine learning based regression analysis.
In this study, the assumptions are stated as hypotheses, a suitable statistical model is used to test them, and relationships are proposed to show how the outcome changes as each factor changes.
In this research, we performed several statistical tests to draw conclusions about the relationships, and we also used a regression model to fit the least squares line that shows how the grade varies with each factor. The Pearson correlation coefficient is a measure of how closely two variables are linearly related, if they are related at all. This research provides a machine learning based linear regression model that describes how a student's score changes as a result of absenteeism.

II. Machine Learning
Machine learning (ML) is a subfield of artificial intelligence and one of its applications. Machine learning gives systems the ability to learn and improve automatically without being explicitly programmed. It focuses on the development of computer programs that access data and use it to learn for themselves. The machine learning process starts with data and experimental observations, such as class labels for training and testing data, examples, direct experience, instruction, and patterns in the data. With the help of this training data, machine learning algorithms make better decisions in the future based on the examples they have seen. The main purpose of machine learning methods is to allow computers to learn automatically, without human intervention or assistance, and to adjust their actions accordingly. In this work, we apply two approaches to data analysis: machine learning based linear regression and statistical methods.

III. Linear Regression Analysis
Linear regression models the relationship between an independent variable (d) and a dependent variable (s). The equation of linear regression is given below:
$s = \beta_1 d + \beta_0 + \varepsilon$
where $\beta_1$ and $\beta_0$ are the unknown parameters, known respectively as the gradient or slope of the line and the intercept on the vertical axis, and $\varepsilon$ is a normal random variable with a mean of zero and an unknown standard deviation $\sigma$. Note that this model is proposed for the entire population of students enrolled in this course, not just those enrolled this semester and especially not just those in the sample. The parameters $\beta_1$, $\beta_0$, and $\sigma$ all relate to this large population. The $\beta_1$ and $\beta_0$ parameters are also referred to as the regression model parameters. These parameters can be learned using the least squares approach, artificial neural networks, evolutionary algorithms, and other applicable learning approaches. In particular, least squares estimation and the gradient descent algorithm can be used to learn the parameters $\beta_1$ and $\beta_0$.
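As a minimal sketch of the two learning approaches named above, the following Python code fits the model both with the closed-form least squares formulas and with batch gradient descent on the mean squared error. The function names, the learning rate, the step count, and the five-point dataset are illustrative assumptions and are not taken from this study.

```python
def fit_least_squares(d, s):
    """Closed-form least squares estimates (beta0, beta1) for s = beta1*d + beta0."""
    n = len(d)
    d_bar = sum(d) / n
    s_bar = sum(s) / n
    ss_dd = sum((di - d_bar) ** 2 for di in d)
    ss_ds = sum((di - d_bar) * (si - s_bar) for di, si in zip(d, s))
    beta1 = ss_ds / ss_dd
    beta0 = s_bar - beta1 * d_bar
    return beta0, beta1


def fit_gradient_descent(d, s, learning_rate=0.05, steps=5000):
    """Minimise the mean squared error by batch gradient descent."""
    n = len(d)
    beta0, beta1 = 0.0, 0.0
    for _ in range(steps):
        errors = [beta0 + beta1 * di - si for di, si in zip(d, s)]
        grad0 = 2 * sum(errors) / n                              # d(MSE)/d(beta0)
        grad1 = 2 * sum(e * di for e, di in zip(errors, d)) / n  # d(MSE)/d(beta1)
        beta0 -= learning_rate * grad0
        beta1 -= learning_rate * grad1
    return beta0, beta1


# Hypothetical toy data (not the study's dataset): an exact line s = 90 - 4d.
toy_d = [0, 1, 2, 3, 4]
toy_s = [90, 86, 82, 78, 74]

ls_beta0, ls_beta1 = fit_least_squares(toy_d, toy_s)
gd_beta0, gd_beta1 = fit_gradient_descent(toy_d, toy_s)
print("least squares:   ", ls_beta0, ls_beta1)
print("gradient descent:", round(gd_beta0, 4), round(gd_beta1, 4))
```

On data this small the two methods should agree closely. The closed form is exact, while gradient descent depends on the assumed learning rate and number of steps.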

IV. Experimental Results and Discussion
In general, educators believe that class attendance has a considerable impact on course achievement, all other conditions being equal. To evaluate the relationship between attendance and performance, a multi-part basic computer science course at a large university was chosen, and the course instructors agreed to keep an accurate record of attendance throughout the semester. In this study, the data come from final-year BCA students of Dr. Rammanohar Lohia Avadh University. At the end of the semester, 30 students were selected at random out of 60, giving a sample size of 30. Two measurements were taken for each student in the sample:
1. The number of days (d) the student was unable to attend class.
2. End semester score (s).
Table 1 shows the dataset of 30 students and Figure 1 depicts a scatter plot of the 30 students' data.

Table 1 (rows 11 to 30): absences d and end semester score s
Student   d   s
11        1   85
12        2   76
13        1   64
14        3   40
15        5   66
16        4   89
17        1   97
18        1   99
19        0   90
20        0   97
21        2   92
22        0   89
23        2   69
24        1   86
25        3   79
26        1   79
27        0   90
28        6   65
29        1   80
30        2   98

Table 2 shows the calculations needed for $\hat{\beta}_1$ and $\hat{\beta}_0$. In Table 2 we have added up the numbers in each column; these sums are used to generate the trendline, or predicted line. The column sums in Table 2 give the following quantities:
$SS_{dd} = \sum d^2 - \frac{1}{n}\left(\sum d\right)^2$, $SS_{ds} = \sum ds - \frac{1}{n}\left(\sum d\right)\left(\sum s\right)$, $SS_{ss} = \sum s^2 - \frac{1}{n}\left(\sum s\right)^2$
We start by determining the least squares regression line, or predicted line, which is the line that best fits the data. Its slope and y-intercept are given below:
$\hat{\beta}_1 = SS_{ds} / SS_{dd}$, $\hat{\beta}_0 = \bar{s} - \hat{\beta}_1 \bar{d}$
The least squares regression line for this data has the form $\hat{s} = \hat{\beta}_1 d + \hat{\beta}_0$, with slope $\hat{\beta}_1 \approx -4.54$ when rounded to two decimal places. Figure 2 also shows a decreasing trend, indicating that students with more absences perform worse on the final exam on average. The sum of the squared errors (SSE), which measures this line's goodness of fit to the scatter plot, is:

$SSE = \sum (s - \hat{s})^2 = 5373.068$
This is a large number and is not particularly useful in and of itself, but we used it to calculate a crucial statistic, the sample standard deviation of errors: $s_\varepsilon = \sqrt{SSE/(n-2)} = \sqrt{5373.068/28} \approx 13.85$. The correlation coefficient r is: $r = SS_{ds} / \sqrt{SS_{dd}\, SS_{ss}} = -0.6088$, where $SS_{ds}$, $SS_{dd}$, and $SS_{ss}$ are the sums of cross products and of squared deviations from the means. There is a moderately negative correlation between the two variables, that is, a negative relationship between absences and student scores: as a student's absences increase, the score tends to decrease. Scores and absences are the two key variables in this study. A hypothesis is developed for the relationship between absenteeism and student scores, as given below:

Null Hypothesis $H_0$: Absences do not affect scores.

Alternative Hypothesis $H_a$: Absences affect scores.
We have stated two hypotheses, and we now use a t-test to decide whether the null or the alternative hypothesis is supported. The t-test is performed to see whether there is any relationship between absences and student score; in simple linear regression, t-tests are used to test hypotheses about the regression coefficients. The two-sided hypothesis that the true slope, $\beta_1$, equals some constant value, $\beta_{1,0}$, is tested using a statistic based on the t distribution. We carry out the test at the commonly used 5% level of significance. The hypothesis test statements are written as follows: $H_0: \beta_1 = 0$ versus $H_a: \beta_1 \neq 0$. The critical value is read from "Critical Values of t" with $n - 2 = 28$ degrees of freedom, which fixes the rejection region. The value of the standardized test statistic is: $t = \hat{\beta}_1 / (s_\varepsilon / \sqrt{SS_{dd}}) = -4.06075$, where $s_\varepsilon = \sqrt{SSE/(n-2)}$ and $SS_{dd} = \sum (d - \bar{d})^2$. This value lies in the rejection region, so we reject $H_0$ in favor of $H_a$. At the 5% level of significance, the data support the conclusion that $\beta_1$ is negative, implying that as the number of absences grows, the average final exam score declines. As previously stated, the value $\hat{\beta}_1 = -4.54227$ is a point estimate of how much one additional absence affects the average final exam score. The average falls by around 4.54227 points for each additional absence.
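The slope test described above can be sketched in Python as follows. The function name and the toy dataset are hypothetical. The critical value 2.048 assumes a two-sided test at the 5% level with 28 degrees of freedom, which the text does not state explicitly. The second half of the sketch uses only the slope and t value reported in the text to back out the implied standard error of the slope and an approximate 95% interval.

```python
import math


def slope_t_statistic(d, s, beta1_null=0.0):
    """Return (beta1_hat, standard_error, t) for the slope of s = beta1*d + beta0."""
    n = len(d)
    d_bar = sum(d) / n
    s_bar = sum(s) / n
    ss_dd = sum((di - d_bar) ** 2 for di in d)
    ss_ds = sum((di - d_bar) * (si - s_bar) for di, si in zip(d, s))
    beta1 = ss_ds / ss_dd
    beta0 = s_bar - beta1 * d_bar
    sse = sum((si - (beta0 + beta1 * di)) ** 2 for di, si in zip(d, s))
    s_eps = math.sqrt(sse / (n - 2))      # sample standard deviation of errors
    std_err = s_eps / math.sqrt(ss_dd)    # standard error of the slope
    return beta1, std_err, (beta1 - beta1_null) / std_err


# Hypothetical toy data (not the study's dataset).
toy_d = [0, 1, 2, 3, 4]
toy_s = [90, 85, 83, 77, 75]
toy_beta1, toy_se, toy_t = slope_t_statistic(toy_d, toy_s)

# Values reported in the text: slope -4.54227 and t = -4.06075 with n = 30.
reported_beta1 = -4.54227
reported_t = -4.06075
implied_se = reported_beta1 / reported_t   # standard error implied by the two values
T_CRIT_28 = 2.048                          # assumed: two-sided 5% critical value, 28 df
reject_null = abs(reported_t) > T_CRIT_28
implied_ci = (reported_beta1 - T_CRIT_28 * implied_se,
              reported_beta1 + T_CRIT_28 * implied_se)

print("toy t statistic:", round(toy_t, 3))
print("implied standard error:", round(implied_se, 4))
print("reject H0:", reject_null)
print("implied 95% interval for the slope:", tuple(round(x, 2) for x in implied_ci))
```

Under these assumptions the implied interval lies entirely below zero, which is consistent with rejecting the null hypothesis.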
The frequency of absences is visualized as a histogram in Figure 3.

Figure 3: Histogram of absences with their frequency
We can observe from this histogram that the majority of students have between zero and three absences. For $\beta_1$, we can expand the point estimate to a confidence interval. From "Critical Values of t" with $n - 2 = 28$ degrees of freedom, $t_{0.025} = 2.048$ at the 95 percent confidence level. Based on our sample data, the 95 percent confidence interval for $\beta_1$ is: $\hat{\beta}_1 \pm t_{0.025}\, s_\varepsilon / \sqrt{SS_{dd}}$, where $s_\varepsilon = \sqrt{SSE/(n-2)}$ and $SS_{dd} = \sum (d - \bar{d})^2$. We are 95 percent confident that, among all students who have ever taken this course, the drop in average final test score for each extra class missed lies within this interval. Now consider the sub-population of all students who had exactly five absences. We may estimate the average final test score for those students using the least squares regression equation: $\hat{s} = \hat{\beta}_1 (5) + \hat{\beta}_0$.
This is also our best estimate of a student's final exam grade if he or she is absent five times. The average final test score for all students with five absences has a 95% confidence interval of: $\hat{s}_p \pm t_{0.025}\, s_\varepsilon \sqrt{1/n + (d_p - \bar{d})^2 / SS_{dd}}$, with $d_p = 5$. According to this confidence interval, the true mean final test score for all students who miss class exactly five times over the semester is expected to be between 59.92 and 75.14. If a student misses exactly five classes during the semester, his or her final exam score is predicted, with 95 percent confidence, to be in the interval $\hat{s}_p \pm t_{0.025}\, s_\varepsilon \sqrt{1 + 1/n + (d_p - \bar{d})^2 / SS_{dd}}$. This prediction interval indicates that this student's final exam score will most likely fall somewhere between 38.16 and 96.90. Unlike the 95 percent confidence interval for the average score of all students with five absences, which provided useful information, this interval is so wide that it reveals almost nothing about the final exam score of any particular student. The extra summand 1 under the square root sign in the prediction interval has a dramatic effect in this case.
Finally, the coefficient of determination, $r^2$, estimates the fraction of the variability in students' final exam scores that is explained by the linear relationship between that score and the number of absences. Since we have already calculated $r$, we can readily deduce: $r^2 = (-0.6088)^2 \approx 0.37$. As a result, the regression model explains about 37 percent of the variability in the score data, which indicates a moderate fit. Although there is a clear link between attendance and final test performance, and although we can estimate the average score of students who miss a specific number of classes with reasonable accuracy, the number of absences accounts for less than half of the total variability in exam scores in the sample. This is hardly surprising, given that student exam performance is influenced by a variety of factors other than attendance.
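The two intervals differ only by the extra 1 under the square root, and a short Python sketch makes its effect concrete. The function name and the toy dataset are hypothetical; the critical values 3.182 (3 degrees of freedom) and 2.048 (28 degrees of freedom) are the standard tabulated two-sided 5% values. The second half is a consistency check that uses only numbers reported in the text. It assumes both reported intervals are centred on the same fitted value, from which an approximate fitted score at five absences and an approximate intercept can be backed out.

```python
import math


def mean_and_prediction_intervals(d, s, d_p, t_crit):
    """Return (s_hat, ci_half_width, pi_half_width) at d = d_p."""
    n = len(d)
    d_bar = sum(d) / n
    s_bar = sum(s) / n
    ss_dd = sum((di - d_bar) ** 2 for di in d)
    ss_ds = sum((di - d_bar) * (si - s_bar) for di, si in zip(d, s))
    beta1 = ss_ds / ss_dd
    beta0 = s_bar - beta1 * d_bar
    sse = sum((si - (beta0 + beta1 * di)) ** 2 for di, si in zip(d, s))
    s_eps = math.sqrt(sse / (n - 2))
    spread = 1 / n + (d_p - d_bar) ** 2 / ss_dd
    s_hat = beta0 + beta1 * d_p
    ci_half = t_crit * s_eps * math.sqrt(spread)      # interval for the mean score
    pi_half = t_crit * s_eps * math.sqrt(1 + spread)  # interval for one student
    return s_hat, ci_half, pi_half


# Hypothetical toy data (not the study's dataset); 3 degrees of freedom.
toy_d = [0, 1, 2, 3, 4]
toy_s = [90, 85, 83, 77, 75]
toy_hat, toy_ci, toy_pi = mean_and_prediction_intervals(toy_d, toy_s, d_p=2, t_crit=3.182)

# Consistency check using only numbers reported in the text (n = 30, 28 df).
T_CRIT_28 = 2.048
s_eps_reported = math.sqrt(5373.068 / 28)
ci_half_reported = (75.14 - 59.92) / 2
pi_half_reported = (96.90 - 38.16) / 2
# pi_half^2 = ci_half^2 + (t * s_eps)^2 follows from the two formulas.
pi_half_implied = math.sqrt(ci_half_reported ** 2 + (T_CRIT_28 * s_eps_reported) ** 2)
s_hat_5 = (59.92 + 75.14) / 2                 # midpoint of the reported interval
implied_intercept = s_hat_5 + 4.54227 * 5     # from s_hat = beta0 + beta1 * 5
r_squared = (-0.6088) ** 2

print("toy half-widths (mean, single student):", round(toy_ci, 3), round(toy_pi, 3))
print("reported vs implied prediction half-width:",
      round(pi_half_reported, 2), round(pi_half_implied, 2))
print("implied fitted score at d = 5:", round(s_hat_5, 2))
print("implied intercept:", round(implied_intercept, 2))
print("r squared:", round(r_squared, 4))
```

The implied prediction half-width agrees with the reported one to within rounding, which suggests that the reported SSE, confidence interval, and prediction interval are mutually consistent.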
A residual plot is a graph in which the residuals are displayed on the vertical axis and the independent variable is displayed on the horizontal axis. A linear regression model is appropriate for the data if the points in a residual plot are randomly scattered around the horizontal axis; otherwise, a nonlinear model is better suited. Table 3 shows the output of the linear regression model and the residuals. Figure 4 shows that the residual plot has no systematic pattern: some residuals are positive, while others are negative. This random pattern implies that the data are well fitted by the proposed linear model.
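As a rough illustration of how the residuals behind such a plot are obtained, the sketch below computes them for a hypothetical toy dataset, not the study's data. The function name is an assumption. Plotting is left out; any plotting library could draw these values against d.

```python
def residuals(d, s):
    """Residuals s_i - s_hat_i from the least squares line fitted to (d, s)."""
    n = len(d)
    d_bar = sum(d) / n
    s_bar = sum(s) / n
    ss_dd = sum((di - d_bar) ** 2 for di in d)
    ss_ds = sum((di - d_bar) * (si - s_bar) for di, si in zip(d, s))
    beta1 = ss_ds / ss_dd
    beta0 = s_bar - beta1 * d_bar
    return [si - (beta0 + beta1 * di) for di, si in zip(d, s)]


# Hypothetical toy data (not the study's dataset).
toy_d = [0, 1, 2, 3, 4]
toy_s = [90, 85, 83, 77, 75]
toy_res = residuals(toy_d, toy_s)

# With an intercept in the model, least squares residuals always sum to zero,
# so the informative feature is whether their signs and sizes show a pattern.
sign_changes = sum(1 for a, b in zip(toy_res, toy_res[1:]) if a * b < 0)

print("residuals:", [round(r, 2) for r in toy_res])
print("sum of residuals:", round(sum(toy_res), 9))
print("sign changes:", sign_changes)
```

Residuals that alternate in sign with no trend in size, as in this toy case, are the kind of haphazard pattern that supports a linear model.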

V. Conclusion
The findings of this study revealed that absence has a significant impact on academic achievement. The research was carried out in depth, with extensive data visualization and statistical modelling. The findings revealed a moderately negative relationship between the number of absences and the final score (r = -0.6088). This study also found a significant difference in mean final exam scores between students who missed 22% or less of their classes and students who missed more than 23% of their classes. The key finding was that if a student misses one class, the final test score is projected to drop by 4.54 points on average. It is believed that the findings of this study will help colleges and universities plan for students to graduate on time. Furthermore, this study has the potential to raise student awareness of the impact of missing classes on academic performance.