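If we let $N_i$ be the actual number of observations of type $i$, then the test statistic used is
$$Q = \sum_{i=1}^{m} \frac{(N_i - n p_i^0)^2}{n p_i^0},$$
where $n$ is the sample size and $p_i^0$ is the probability of type $i$ under the null hypothesis.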
Karl Pearson found that if the null hypothesis is true, then as the sample size becomes very large, the distribution of $Q$ is approximately the chi-square distribution with $m-1$ degrees of freedom. After choosing a significance level $\alpha$, let $c$ be the $1-\alpha$ quantile of the chi-square distribution with $m-1$ degrees of freedom. If $Q \geq c$, then the null hypothesis should be rejected. However, before the null hypothesis is completely rejected, it is necessary to be certain that there is no other reasonable alternative distribution that better fits the observed data [2].
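The decision rule above can be illustrated with a minimal sketch. The counts and hypothesized probabilities below are hypothetical and are not taken from the student-athlete data, and the function names `pearson_q` and `chi2_quantile_2df` are my own. The quantile helper only covers the case of 2 degrees of freedom (three categories), where the chi-square distribution function has the closed form $1 - e^{-x/2}$.

```python
import math


def pearson_q(observed, probs):
    """Pearson statistic: sum of (observed - expected)^2 / expected over all categories."""
    if len(observed) != len(probs):
        raise ValueError("observed and probs must have the same length")
    n = sum(observed)  # total sample size
    return sum((o - n * p) ** 2 / (n * p) for o, p in zip(observed, probs))


def chi2_quantile_2df(alpha):
    """1 - alpha quantile of the chi-square distribution with exactly 2 degrees of freedom.

    With 2 degrees of freedom the CDF is 1 - exp(-x / 2), so the quantile
    can be solved for directly. Other degrees of freedom need a numerical routine.
    """
    return -2.0 * math.log(alpha)


# Hypothetical example: m = 3 categories assumed equally likely under the null hypothesis.
observed = [18, 22, 20]
probs = [1 / 3, 1 / 3, 1 / 3]
alpha = 0.05  # assumed significance level

q = pearson_q(observed, probs)
c = chi2_quantile_2df(alpha)
reject = q >= c

print(f"Q = {q:.4f}, c = {c:.4f}, reject null hypothesis: {reject}")
```

For more than three categories, the quantile would normally come from a chi-square table or a statistics library rather than this closed form.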
CHAPTER 4
PRELIMINARY ANALYSIS
Upon receiving the data set, a brief analysis was done with a primary focus on comparing the student-athlete’s Stetson University grade point average versus the percentage of tuition covered by athletic scholarships. Rather than examining all student-athletes at Stetson University over the past seven years, I decided to take a small sample of the data set. I chose to look specifically at baseball.
LINEAR REGRESSION
The data set included multiple variables that were determined to have a possible influence on the student-athlete’s Stetson University grade point average. I decided to see if there was a correlation between the percentage of the cost of attending Stetson University that was covered by athletic scholarships and the grade point average. Therefore, I pulled the baseball players from the data set and plotted their scholarship percentages against their grade point averages using Microsoft Excel. See Figure 13 for this plot.
Figure 13. Baseball Percentage vs. GPA
Is there a model that could predict a baseball player’s grade point average from the percentage of the cost of schooling covered by his scholarship? I tried a few different models to see whether there was any correlation between these two variables, including several polynomial regression models; however, there was no significant difference between them. Thus, only the result of the linear model is shown in Figure 14.
Figure 14. Baseball Regression Models
Looking at the linear regression equation and plot in Figure 14, the model appears to fit the data rather poorly. A correlation coefficient of 0.4352 does not indicate a strong correlation between the percentage of tuition covered by athletic scholarship and Stetson grade point average; typically, a correlation coefficient of 0.7 or higher is considered strong. However, this alone is not enough to discard the model.
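For readers who want to reproduce this kind of fit outside of Excel, the following is a minimal sketch of a least-squares line and the correlation coefficient. The (percentage, GPA) pairs are hypothetical placeholders, not the baseball data, and `fit_line` is a name I introduce here for illustration.

```python
import math


def fit_line(xs, ys):
    """Return (slope, intercept, r) for the least-squares line y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squares and cross-products about the means.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)  # Pearson correlation coefficient
    return slope, intercept, r


# Hypothetical data: scholarship percentage and GPA for seven players.
percent = [0, 10, 25, 40, 50, 75, 100]
gpa = [2.4, 3.1, 2.6, 3.3, 2.8, 3.0, 3.4]

slope, intercept, r = fit_line(percent, gpa)
print(f"GPA = {intercept:.4f} + {slope:.4f} * percent, r = {r:.4f}")
```

Note that Excel's trendline reports $R^2$, which for a straight-line fit is the square of the `r` returned here.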
RESIDUAL ANALYSIS
Before discarding the model, we will look at the corresponding residual plot. Here are those residual values plotted against the corresponding x-values.
Figure 15. Baseball Residual Plots
Since the residuals are randomly dispersed in the plot above, with no apparent pattern, the linear model may not be as poor a fit as first suggested. However, there may be a better model than the linear and polynomial equations tried so far. It is also possible that these two variables are simply not strongly correlated. This will be examined further next semester.
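The residuals behind a plot like Figure 15 are the observed values minus the fitted values. The sketch below shows one way to compute them, again with hypothetical data rather than the baseball sample; the helper names are my own. A useful sanity check is that the residuals of a least-squares line with an intercept sum to zero up to rounding.

```python
def least_squares(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def residuals(xs, ys, slope, intercept):
    """Observed y minus fitted y for each point."""
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]


# Hypothetical data: scholarship percentage and GPA.
percent = [0, 10, 25, 40, 50, 75, 100]
gpa = [2.4, 3.1, 2.6, 3.3, 2.8, 3.0, 3.4]

slope, intercept = least_squares(percent, gpa)
res = residuals(percent, gpa, slope, intercept)

# Each (x, residual) pair is one point on the residual plot.
for x, e in zip(percent, res):
    print(f"x = {x:5.1f}  residual = {e:+.4f}")
```

If the plotted residuals showed a curve or a funnel shape instead of a random scatter, that would point toward a different model form.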
GOODNESS-OF-FIT TEST
With the baseball sample taken from the complete data set of all student-athletes, I wanted to apply the goodness-of-fit test to see if the underlying distribution was approximately normal. Since I focused on the percentage of tuition covered by athletic scholarships and the grade point average in the regression models, I decided to see if the baseball players’ grade point averages were normally distributed.
I started by binning the data into three categories of observed grade point average:
Bin 1: GPA is in the interval [1.5, 2.25)
Bin 2: GPA is in the interval [2.25, 3.0)
Bin 3: GPA is greater than or equal to 3.0
The frequency of each bin is listed below in Table 4.
Bin | Frequency
1 | 25
2 | 32
3 | 30
Total | 87
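To show how the bin frequencies in Table 4 would enter the goodness-of-fit statistic, here is a minimal sketch. Only the observed counts (25, 32, 30) come from the table. The normal mean and standard deviation are hypothetical placeholders, the significance level of 0.05 is assumed, and I also assume that Bin 1 absorbs all probability below 2.25 so that the three bin probabilities sum to 1. The parameters are treated as specified in advance, so the statistic is compared with the chi-square distribution with $m-1 = 2$ degrees of freedom.

```python
import math


def normal_cdf(x, mu, sigma):
    """Cumulative distribution function of a normal distribution with mean mu and s.d. sigma."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


observed = [25, 32, 30]   # bin frequencies from Table 4
cuts = [2.25, 3.0]        # boundaries between Bin 1 / Bin 2 and Bin 2 / Bin 3

# HYPOTHETICAL parameters of the null distribution; not estimated from the thesis data.
mu, sigma = 2.7, 0.6

# Bin probabilities under the null hypothesis.
# Assumption: Bin 1 covers everything below 2.25 and Bin 3 everything from 3.0 up.
f_low, f_high = (normal_cdf(cut, mu, sigma) for cut in cuts)
probs = [f_low, f_high - f_low, 1.0 - f_high]

n = sum(observed)
expected = [n * p for p in probs]

# Pearson statistic over the m = 3 bins.
q = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

alpha = 0.05                  # assumed significance level
c = -2.0 * math.log(alpha)    # 1 - alpha chi-square quantile, valid for 2 degrees of freedom only

print("expected counts:", [round(e, 2) for e in expected])
print(f"Q = {q:.4f}, c = {c:.4f}, reject normality: {q >= c}")
```

The outcome depends entirely on the placeholder values of `mu` and `sigma`, so the printed decision says nothing about the actual baseball data.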