In this final Block, we reach a crucial point in our teaching, an aspect of research often poorly handled by our students in their final projects. Statistical inference, as mentioned at the very beginning of this Module, is a set of methods that allow drawing conclusions about a population from a sample of observed data. Inference is based on several concepts that we have already covered in the previous two Blocks (probability theory, sampling distributions) and some others that we will see in this Block.
Note on the organization of the Block.
The teachings of this Block are closely interrelated, which is why we chose to present the Block as a whole, containing six Teachings; the logic remains the same as in the previous teachings.
The Block is nevertheless divided into teachings to make it easier to follow and understand the content of this final subject.
We will begin with an in-depth study of estimation and sampling, highlighting concepts like sampling fluctuation, exhaustive and non-exhaustive samples, as well as the distinctions between independent and paired samples. We will also examine the random variable, covering methods of point estimation and confidence interval estimation. We will introduce the binomial distribution as a fundamental model for analyzing discrete variables.
Hypothesis testing will be another key component of this course, where we will explore concepts such as the standard error of the mean and the central limit theorem, which are essential for understanding one-tailed and two-tailed hypotheses. Tools such as contingency tables and the contingency coefficient will also be presented to assess relationships between categorical variables. We will cover the concepts of significance level, theoretical frequencies, and observed frequencies, which are essential for testing the validity of hypotheses.
The analysis of variance (ANOVA) will be discussed in detail, focusing on the F-statistic, effect size, intergroup and intragroup differences, as well as the development of the table of variance sources. This segment of the course will help understand how variations within the data can be attributed to different factors.
The section on correlation and linear regression will introduce tools for examining relationships between quantitative variables. We will study the correlation coefficient, mathematical regression models, and learn how to interpret scatter plots (point clouds), as well as fitting curves and regression lines. The concepts of independent variable and dependent variable, as well as the coefficient of determination, will also be covered to understand the strength and nature of linear relationships between variables.
Finally, we will conclude with non-parametric tests, where methods such as the Mann-Whitney test, Wilcoxon test, Kruskal-Wallis test, and the Spearman test will be explored. These tests offer robust alternatives to parametric tests when the underlying assumptions are not met.
Session Objectives
In this session, we will focus on the key concepts of statistical inference. The objectives of this session are defined to help you master the essential notions that will be explored throughout this course.
Understand Estimation and Sampling:
Grasp estimation methods and different types of sampling, including concepts of sampling fluctuation, exhaustive and non-exhaustive samples.
Differentiate Types of Samples:
Learn to distinguish between independent and paired samples, and understand their importance in statistical analysis;
Master Estimation Methods:
Acquire skills in point estimation and confidence interval estimation, and understand the use of the binomial law for discrete variable analysis;
Explore Hypothesis Testing:
Get introduced to the concepts of standard error of the mean, central limit theorem, and learn how to formulate and test one-tailed and two-tailed hypotheses;
Use Contingency Tables:
Understand how to analyze relationships between categorical variables using contingency tables, the coefficient of contingency, and the notions of theoretical and observed frequencies;
Analyze Variance:
Study the F-statistic, effect size, between-group and within-group differences, and learn how to construct and interpret a table of variance sources;
Understand Correlation and Linear Regression:
Familiarize yourself with correlation, regression, and learn how to use the correlation coefficient, construct scatter plots and trend lines, and interpret the regression line;
Explore Non-Parametric Tests:
Discover Mann-Whitney, Wilcoxon, Kruskal-Wallis, and Spearman tests as alternatives to parametric tests in specific contexts.
Concepts and Themes to Cover During the Block
For this Block, we will cover the following concepts (the title of each lesson is given before each group of concepts):
Estimation & Sampling: sampling fluctuation, estimation, exhaustive sample, non-exhaustive sample, independent samples, paired samples, random variable, point estimation, confidence interval estimation, the binomial law.
Hypothesis Testing: standard error of the mean, central limit theorem, one-tailed hypothesis, two-tailed hypothesis.
The \(\chi^2\) Test: contingency table, coefficient of contingency, significance level, theoretical frequency, observed frequency.
Analysis of Variance: F-statistic, effect size, between-group difference, within-group difference, table of variance sources.
Correlation & Linear Regression: correlation, regression, correlation coefficient, mathematical model, scatter plot (scatter diagram), trend line (in the case of a time series), regression line (least squares line), independent variable, dependent variable, incidence, linear relationship, coefficient of determination.
Non-Parametric Tests: the Mann-Whitney test, the Wilcoxon test, the Kruskal-Wallis test, the Spearman test.
Block Presentation
Introduction to Statistical Inference
Inference is used to understand and/or make decisions regarding a given phenomenon. The set of rules and techniques used to make inferences is collectively known as statistical tests. The purpose of statistical tests is to verify the validity of a previously established hypothesis. Statistical inference can also be introduced by considering it as a process of drawing general conclusions (about the population) from the somewhat imperfect, imprecise measurement of information derived from it (thus from the sample or samples). Hypothesis tests cannot prove the truth of a hypothesis but can confirm its falsity.
Inference error in statistics occurs when conclusions drawn from a sample are incorrectly generalized to the entire population. There are two main types of inference errors:
Type I Error (Alpha Error)
This error occurs when a null hypothesis is rejected when it is actually true. In other words, one incorrectly concludes that there is an effect or a difference when there is none. The significance level (alpha) is the probability of making this error.
Type II Error (or Beta Error)
This error occurs when a null hypothesis is not rejected when it is actually false. This means that an existing effect or difference is missed. The statistical power (1 - beta) is the probability of not making this error.
These errors are inevitable in statistics because they are related to the inherent uncertainty of sampling. The goal is to minimize these errors as much as possible by choosing an appropriate significance level and using sufficiently large samples.
In another context, verifying a hypothesis refers to confronting it with a null hypothesis. We speak of acceptance or rejection of the null hypothesis depending on whether the observed differences between samples, or between a sample and its population, exceed a given critical threshold.
Estimation
The sample is a way to apprehend the population, as we do not have direct access to it [keeping in mind that even when access is possible, it often involves high costs, a lot of time, and a risk of producing invalid data]. Extracting several samples (of size \(n\)) does not solve the problem, as the results obtained will vary from one sample to another, which is referred to as sampling fluctuation.
In this session, we will address two types of estimation: point estimation and confidence interval estimation. It goes without saying that we have limited the content of this session to information relevant to research in the humanities and social sciences. Other categories and types of estimation are covered in more advanced manuals.
Definition III.1.1: Sampling Fluctuation
Sampling fluctuation refers to the variability in results that can occur when repeating the sampling process multiple times on the same population. In other words, if multiple samples are taken from a given population, the statistics calculated for each sample may vary from one sample to another.
This fluctuation is due to the fact that each sample may contain different individuals and, consequently, may provide slightly different estimates of the population parameters. This is why the results of a study based on a sample are generally accompanied by a margin of error or a confidence interval to account for this variability.
Example: If the average height of a group of 100 students randomly selected from all students at a university is measured, this average could slightly vary if the measurement is repeated with another group of 100 students. This variability is what is called sampling fluctuation.
To obtain information about the population from the sample, we proceed with what we call estimation. A sample is non-exhaustive if the selection of the \(n\) individuals constituting the sample is done with replacement; otherwise, it is called exhaustive.
A relevant sample is one that is representative of the population from which it was drawn, faithfully reproducing the categories of interest in the study and being randomly selected.
Definition III.1.2: Estimation
Estimation is the process by which sample data is used to infer or predict the value of an unknown population parameter.
Estimation can take the form of a point estimate, which provides a single estimated value (such as the sample mean), or an interval estimate, which gives a range of plausible values for the parameter, often with a certain level of confidence.
Independent Samples, Paired Samples. Samples are said to be independent when they consist of different individuals. Samples are paired when individuals are associated in pairs (for example, when the same individuals are measured at two different times).
Example 2: Suppose we want to compare the effectiveness of two different advertising campaigns for promoting a technological product.
Let's assume a company wants to compare the impact of two distinct advertising campaigns on the sales of a new smartphone. It could run the first campaign for a group of users in city A and the second campaign for another group of users in city B.
These two groups are considered independent because the users in the two different cities are different individuals, and each group is exposed to a different campaign.
Example 3: We want to analyze the perception of the same advertising campaign before and after a modification
Suppose now that a company launches a social media advertising campaign to promote a streaming service. After collecting initial feedback, the company decides to modify the campaign's message. To evaluate the effect of this modification, they measure the opinions of the same users before and after the change.
The samples are paired because the same individuals are surveyed at two different times (before and after the modification), and the responses are associated for each user.
1.1. Unbiased Point Estimation
Let \(X_{n}\) be the random variable associated with a sample of size \(n\).
Point estimation is the method of providing a single value, called the point estimator, to estimate an unknown parameter of a population. This estimation is based on data obtained from a representative sample of that population.
Mathematically, \(X_{n}\) is said to be an unbiased estimator (without bias) of a parameter \(\theta\) if \(E(X_{n}) = \theta\) (if not, the estimator is biased). If, in addition, \(\displaystyle \lim_{n \to \infty} V(X_{n}) = 0\), the estimator is said to be convergent.
Unbiased Point Estimation of a Mean and a Variance
Context. Let \(x\) be a quantitative characteristic of a population \(P\), with mean \(\mu\) and variance \(\sigma^2\), which we want to estimate.
Notation. Consider a sample of size \(n\), whose values are {\(x_{1}, x_{2}, \ldots, x_{n}\)} and whose associated random variables are {\(X_{1}, X_{2}, \ldots, X_{n}\)}.
The two random variables can be defined as follows:
\(\bar{X}\): taking values as the means of samples of size \(n\): $$\bar{X} = \frac {1}{n} \sum_{i=1}^{n} X_{i}$$
\(\sigma_{e}^{2}\): Sample Variance taking values as the variances of the samples: $$\sigma_{e}^{2} = \frac{1}{n} \sum_{i=1}^{n} (X_{i} - \bar{X}) ^2$$
By definition, the rules allowing unbiased estimation of the mean and the variance are as follows:
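In standard form, these rules are:
$$ E(\bar{X}) = \mu \quad \Longrightarrow \quad \hat{\mu} = \bar{x} $$
$$ E(\sigma_{e}^{2}) = \frac{n-1}{n}\, \sigma^{2} \quad \Longrightarrow \quad \hat{\sigma}^{2} = s^{2} = \frac{n}{n-1}\, \sigma_{e}^{2} = \frac{1}{n-1} \sum_{i=1}^{n} \left(X_{i} - \bar{X}\right)^{2} $$
In other words, the sample mean is an unbiased estimator of \(\mu\), while the sample variance \(\sigma_{e}^{2}\) must be multiplied by \(\frac{n}{n-1}\) to give an unbiased estimator of \(\sigma^{2}\).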
Context. Let there be a characteristic (x) for which we want to estimate the proportion from a population \(P\).
Notation. Consider the random variable \(F\), whose values are {\(f_{1}, f_{2}, \ldots, f_{n}\)} and whose associated random variables are {\(F_{1}, F_{2}, \ldots, F_{n}\)}, in a sample of size \(n\) drawn with replacement.
Assuming an individual is drawn from this population, we want to estimate the probability \(p\) that they possess this characteristic:
$$ E(F) = p , \qquad V(F) = \frac {p (1-p)} {n} $$
1.2. Confidence Interval Estimation
When estimating a mean, variance, or proportion unbiasedly, we would like to know with what degree of certainty we are confident in our estimation.
Definition III.1.3: The Confidence Interval
A confidence interval is a range of values calculated from sample data that, with a certain level of confidence, is used to estimate an unknown population parameter.
Unlike point estimation, which provides a single value, the confidence interval provides a range of possible values for the parameter and indicates the precision of this estimation.
We choose a number \(\alpha \in \, ]0, 1[\) that helps us determine an interval \(]a, b[\) such that we have a probability \(\alpha\) of being wrong in stating that \(p\) belongs to this interval (we refer to \(\alpha\) as the risk, and to \((1-\alpha)\) as the confidence level).
Thus, to determine a confidence interval, we need to introduce a random variable for which we know the probability distribution.
Estimating a Mean with a Confidence Interval
When it comes to estimating the mean, two possibilities are to be considered:
Case of a Gaussian Population (\(\sigma ~\) known)
A population is said to be Gaussian with \(\sigma\) known when \(X\) follows a normal distribution; in that case:
$$ \bar{X} \sim \mathcal{N}\!\left(\mu, \frac{\sigma}{\sqrt{n}}\right) $$
Therefore, for a risk \(\alpha\), we obtain the reduced deviation \(z_\alpha\) from Table 3 (The Standard Normal Distribution) to determine the confidence interval:
\(-z_{\alpha} \lt \frac{\bar{x}-\mu}{\frac{\sigma}{\sqrt{n}}} \lt z_{\alpha}\) \( \longrightarrow \)\(\mu \in \left ] \bar{x} - z_{\alpha} \frac{\sigma} {\sqrt{n}} , \bar{x} + z_{\alpha} \frac{\sigma} {\sqrt{n}} \right [ \)
Case of a Gaussian Population (\(\sigma ~\) unknown)
A population is said to be Gaussian with \(\sigma\) unknown when \(X\) follows a normal distribution but \(\sigma\) must be estimated by the sample standard deviation \(S\); in that case the statistic
$$ T = \frac{\bar{X} - \mu}{\frac{S}{\sqrt{n}}} $$
follows the Student \(t\) distribution with \(V = n-1\) degrees of freedom.
Therefore, for a risk \(\alpha\), we obtain \(t_\alpha\) from Table 4 (The Student t Distribution) to determine the confidence interval:
\(\mu \in \left ] \bar{x} - t_{\alpha} \frac{s} {\sqrt{n}} , \bar{x} + t_{\alpha} \frac{s} {\sqrt{n}} \right [ \)
Note: Case of a Large Sample
When \(n \geq 30 \), the confidence interval is noted as follows:
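$$ \mu \in \left ] \bar{x} - z_{\alpha} \frac{s}{\sqrt{n}} ~,~ \bar{x} + z_{\alpha} \frac{s}{\sqrt{n}} \right [ $$
that is, the sample standard deviation \(s\) replaces \(\sigma\) and the reduced deviation \(z_{\alpha}\) of the normal distribution replaces the Student \(t\) value.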
Estimation of a Variance with a Confidence Interval
To estimate the variance \(\sigma^2\) of a population from a sample of size \(n\) with sample variance \(s^2\), the confidence interval for \(\sigma^2\) at the confidence level \((1-\alpha)\) is obtained differently in two cases. Case 1: \(n \leq 31\)
The confidence interval is given as follows:
$$ \sigma^{2} \in \left ] \frac{(n-1) s^{2}}{b} ~,~ \frac{(n-1) s^{2}}{a} \right [ $$
where \(a\) and \(b\) are read from the \(\chi^2\) table with \(n-1\) degrees of freedom, such that \(P(\chi^{2} < a) = \frac{\alpha}{2}\) and \(P(\chi^{2} > b) = \frac{\alpha}{2}\).
Case 2: \(n \geq 31\)
The confidence interval is given as follows:
$$ \sigma^{2} \in \left ] \frac {2 (n-1) s^{2}} {(\sqrt{2n-3} + z_{\alpha})^2} ~~ , ~~ \frac {2 (n-1) s^{2}} {(\sqrt{2n-3} - z_{\alpha})^2} \right [ $$
Example
A graduate student wants to estimate the average daily time spent by students on social media for their thesis. To do this, they conduct a survey of 30 students. The results show that the average usage time is 2.5 hours per day and the standard deviation is 0.8 hours.
Calculation of the Confidence Interval
The student wants to calculate a confidence interval at 95% for the average daily time spent on social media by all students.
Step 1. Summary of Data:
Sample mean (\(\bar{x}\)) : 2.5 hours ;
Sample standard deviation (\(s\)) : 0.8 hours ;
Sample size (\(n\)) : 30 ;
Confidence level : 95%.
Step 2. Calculate the \(t\)-Score:
Since the sample size is small (\(n \leq 30\)), we use the Student's \(t\) distribution (see Table 4 in the Statistical Appendix) rather than the normal distribution. For a 95% confidence level and 29 degrees of freedom \( (n - 1)\), the corresponding \(t\)-score from the distribution table is: \(2.045\).
Step 3. Calculate the Confidence Interval:
The formula for the confidence interval is:
$$ \bar{x} \pm t \cdot \frac{s}{\sqrt{n}} $$
Substituting the terms into the formula with our data, we get:
Margin of error = \(t \cdot \frac{s}{\sqrt{n}} = 2.045 \cdot \frac{0.8}{\sqrt{30}}\)
With:
$$ \sqrt{30} \approx 5.477 $$
$$ \frac{0.8}{5.477} \approx 0.146 $$
Margin of error = \(2.045 \cdot 0.146 \approx 0.299\)
The confidence interval is:
$$ \bar{x} \pm \text{Margin of Error} = 2.5 \pm 0.299 $$
Step 4. Calculate the Interval Bounds:
Lower bound: \(2.5 - 0.299 = 2.201\)
Upper bound: \(2.5 + 0.299 = 2.799\).
Interpretation
The 95% confidence interval for the average daily time spent by students on social media is [2.201; 2.799] hours. This means that we are 95% confident that the true average daily time spent on social media by all students lies within this range.
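As a cross-check, the same interval can be computed in Python; this is a minimal sketch, assuming SciPy is installed, and it uses only the summary statistics given above.

```python
from math import sqrt
from scipy import stats

# Summary statistics from the example
n = 30        # sample size
x_bar = 2.5   # sample mean (hours per day)
s = 0.8       # sample standard deviation (hours)

# 95% confidence interval based on the Student t distribution with n - 1 df
sem = s / sqrt(n)  # standard error of the mean
low, high = stats.t.interval(0.95, n - 1, loc=x_bar, scale=sem)
print(round(low, 3), round(high, 3))  # approximately 2.201 and 2.799
```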
Hypothesis Testing
In this session, as well as in the following ones, we will work on hypothesis tests. Therefore, we have chosen to combine two sessions into this lesson: an introduction to hypothesis testing (which, due to its importance in human and social sciences, deserved a complete session of instruction) and the Student's t-test (which, in its simplest form, did not require an entire session).
We will begin our session with an introduction to hypothesis testing, and the subsequent lessons will be dedicated to this part of data analysis.
2.1. Hypothesis Testing
A hypothesis test is a statistical procedure used to assess the validity of a hypothesis about a population based on a sample of data. The process involves the following steps:
Definition III.2.1: Hypothesis Testing
A hypothesis test is a technique used to make decisions or draw conclusions from sample data. It is particularly useful when one wants to evaluate the validity of a claim or hypothesis about an entire population based solely on information obtained from a sample of that population.
A hypothesis test has its own procedural vocabulary, which we will explain in the following sections. It mainly involves the following phases:
Formulation of Hypotheses
Null Hypothesis \((H_0)\) :
This is the baseline hypothesis that assumes there is no effect or difference. For example, \(H_0\) might state that the mean of a population is equal to a specific value;
Alternative Hypothesis \((H_1)\) :
This is the hypothesis that the test aims to prove. It suggests a difference or effect, such as the mean being different from this specific value.
Choosing the Significance Level (\(\alpha\))
This is the probability of rejecting the null hypothesis when it is true. A commonly used significance level is \(0.05\), which means accepting a \(5\%\) risk of making a Type I error (incorrectly rejecting \(H_0\)).
Calculating the Test Statistic
A test statistic is calculated from the sample data. This statistic follows a specific theoretical distribution (e.g., normal, t, F, or \(\chi^2\)) under the null hypothesis.
Determining the \(p\)-Value
The \(p\)-value is the probability of obtaining a test statistic as extreme or more extreme than the observed value, assuming that \(H_0\) is true. If this \(p\)-value is less than or equal to \(\alpha\), \(H_0\) is rejected.
Decision Making
If \(p \leq \alpha\) : The null hypothesis is rejected, suggesting that the data provide sufficient evidence to support the alternative hypothesis;
If \(p > \alpha\) : The null hypothesis is not rejected, meaning that the data do not provide sufficient evidence against \(H_0\).
Example
An advertising agency claims that its latest online campaign increases the average reach of brand social media posts by \(20\%\). To verify this claim, a hypothesis test can be conducted by analyzing the reach of posts before and after the campaign for a sample of posts. If the test reveals that the observed increase is significantly different from \(20\%\), one might reject the null hypothesis (\(H_0\): average increase = \(20\%\)) in favor of the alternative hypothesis (\(H_1\): average increase \(\neq 20\%\)).
2.2. The \(t\)-Test
The \(t\)-test is used to determine whether two samples are statistically different [either from a single population or from two different populations]. We owe this test to Gosset, whose work involved making inferences from small samples.
Definition III.2.2: The \(t\)-Test
The \(t\)-test is a statistical test used to determine if the difference between the means of two groups is significant. It is often used when the data is sampled from small populations and assumes that the data follow a normal distribution.
Gosset's work involved constructing a population that follows a normal distribution and then calculating its mean \(\mu\). Gosset used sampling with replacement (as seen in Block 2: Session 2.3.) and extracted around a hundred small, equal-sized samples. For each sample, its mean \(M_i\) was calculated and compared with the known mean of the population by estimating the difference \((M_{i} - \mu)\).
Gosset's reasoning was as follows: since the samples are drawn from the same population, the expected difference between a sample mean and the population mean is \(0\). However, because of sampling error, the observed difference is generally not zero; he therefore calculated the standard error of the mean, defined by the following formula:
$$ S_m = \frac {s}{\sqrt{N}} $$
The calculation proposed by Gosset describes the distance between the sample mean and the population mean relative to the standard error of the mean.
Thus, Gosset constructed a new distribution, the \(t\) distribution, using a very large number of small samples, resulting in a unimodal distribution.
The \(t\)-test is used for small samples \(n \leq 30\); as we saw earlier, the standard error of the mean tends to decrease as the sample size increases. When a sample size is \(n \geq 30\), the distribution of the mean tends to resemble the normal distribution.
There are three types of \(t\) tests, which we will explain in the following sections:
2.2.1. \(t\)-Test for One Sample
Compares the mean of a single sample to a hypothesized or theoretical value. For example, it can be used to check if the mean of test scores in a sample is different from a known or expected average value.
\[
t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}
\]
\(\bar{x}\) = Sample mean
\(\mu_0\) = Hypothesized value
\(s\) = Sample standard deviation
\(n\) = Sample size
Example
A company wants to determine if the average customer satisfaction after a communication campaign is different from \(80\%\). It collected responses from \(25\) customers, with an average response of \(82\%\) and a standard deviation of \(5\%\).
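Substituting the sample values (\(\bar{x} = 82\), \(\mu_0 = 80\), \(s = 5\), \(n = 25\)) into the formula gives:
$$ t = \frac{82 - 80}{5 / \sqrt{25}} = \frac{2}{1} = 2 $$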
The result \(t = 2\) indicates how many standard deviations the sample mean is away from the hypothetical value. To determine if this difference is significant, we need to compare this value to a critical value of the \(t\)-test based on the chosen significance level (e.g., 0.05) and the degrees of freedom (here, \(n - 1 = 24\)).
Critical Value at 0.05 for 24 Degrees of Freedom:
Here is an excerpt from the critical values table (Table 4 from the Statistical Appendix)
Table of Critical Values

| Degrees of Freedom (\(V\)) | Critical Value \( t_{0.05} \) |
|---|---|
| 20 | 2.086 |
| 25 | 2.060 |
| 30 | 2.042 |
| 40 | 2.021 |
| 60 | 2.000 |
For 24 degrees of freedom, the critical value of \( t \) at a significance level of 0.05 (two-tailed) is approximately 2.064.
Comparison:
Calculated \( t \) value: 2
Critical value for 24 degrees of freedom at 0.05: 2.064
The calculated \( t \) value (2) is less than the critical value (2.064). Therefore, we do not reject the null hypothesis.
Conclusion:
Although the average response after the campaign is higher than the hypothetical value of \(80\%\), the difference is not large enough to be considered significant at the 0.05 significance level. This means we cannot conclude with certainty that the campaign had a significant impact on customer satisfaction compared to the hypothetical value.
Note:
The procedure remains the same for other tests to come (independent or paired samples).
2.2.2. \(t\) Test for Two Independent Samples
Compares the means of two independent groups. For example, it can be used to compare the average scores of two groups of students who have followed different teaching methods.
The formula is:
\[
t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}
\]
\(\bar{x}_1\) and \(\bar{x}_2\) = Means of the two samples
\(s_1^2\) and \(s_2^2\) = Variances of the two samples
\(n_1\) and \(n_2\) = Sizes of the two samples
Example
A teacher wants to compare the effectiveness of two teaching methods on exam results. They have two groups of students: Group A, who used method 1, and Group B, who used method 2.
The results are as follows:
Group A: Mean (\(\bar{x}_1\)) = 75, Variance (\(s_1^2\)) = 16, Size (\(n_1\)) = 30
Group B: Mean (\(\bar{x}_2\)) = 70, Variance (\(s_2^2\)) = 25, Size (\(n_2\)) = 35
Steps:
Calculate the \(t\) statistic
Using the formula:
\[
t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}
\]
Substituting the values:
\[
t = \frac{75 - 70}{\sqrt{\frac{16}{30} + \frac{25}{35}}}
\]
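Carrying out the computation step by step:
$$ t = \frac{5}{\sqrt{0.533 + 0.714}} = \frac{5}{\sqrt{1.248}} \approx \frac{5}{1.117} \approx 4.48 $$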
Decision Rule:
For a two-tailed test at \( \alpha = 0.05 \) with \( df = 61 \), the critical value of \( t \) is \( t_{crit} \approx \pm 2.000 \).
Since \( t \approx 4.48 \) is greater than \( t_{crit} = 2.000 \), we reject the null hypothesis \( H_0 \) and conclude that there is a significant difference between the means of the two groups at the \( \alpha = 0.05 \) significance level.
2.2.3. \(t\)-Test for Two Paired Samples
Compares the means of two related or paired groups, such as measurements before and after a treatment within the same group of subjects. The formula is:
\[
t = \frac{\bar{d}}{s_d / \sqrt{n}}
\]
\(\bar{d}\) = Mean of the differences between pairs
\(s_d\) = Standard deviation of the differences
\(n\) = Number of pairs
Example
Suppose we want to assess the impact of communication training on employees' skills. We measure their communication skill level before and after the training on a scale from 0 to 10.
The scores before and after the training for 8 employees are as follows:
Employee 1: Before = 4, After = 7
Employee 2: Before = 5, After = 8
Employee 3: Before = 6, After = 8
Employee 4: Before = 5, After = 7
Employee 5: Before = 7, After = 9
Employee 6: Before = 6, After = 8
Employee 7: Before = 5, After = 7
Employee 8: Before = 6, After = 9
Calculate the differences between pairs (After − Before):
Employee 1: 7 - 4 = 3
Employee 2: 8 - 5 = 3
Employee 3: 8 - 6 = 2
Employee 4: 7 - 5 = 2
Employee 5: 9 - 7 = 2
Employee 6: 8 - 6 = 2
Employee 7: 7 - 5 = 2
Employee 8: 9 - 6 = 3
Calculate the mean of the differences (\(\bar{d}\)):
Sum of the differences: \(3 + 3 + 2 + 2 + 2 + 2 + 2 + 3 = 19\)
Number of pairs (\(n\)) = 8
Mean of the differences:
\[
\bar{d} = \frac{19}{8} = 2.375
\]
Calculate the standard deviation of the differences (\(s_d\)):
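Completing the calculation with the differences listed above (\(\sum (d_i - \bar{d})^2 = 1.875\)):
$$ s_d = \sqrt{\frac{\sum_{i=1}^{n} (d_i - \bar{d})^2}{n-1}} = \sqrt{\frac{1.875}{7}} \approx 0.518 $$
Calculate the \(t\) statistic:
$$ t = \frac{\bar{d}}{s_d / \sqrt{n}} = \frac{2.375}{0.518 / \sqrt{8}} \approx 12.98 $$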
For a two-tailed test at \( \alpha = 0.05 \) with \( df = n - 1 = 7 \), the critical value of \( t \) is \( t_{crit} \approx \pm 2.365 \) (according to the Student's t-distribution table).
Since \( t \approx 12.98 \) is greater than \( t_{crit} = 2.365 \), we reject the null hypothesis \( H_0 \) and conclude that the communication training has a significant impact on employees' skills at the \( \alpha = 0.05 \) significance level.
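For readers who prefer to verify such calculations with software, here is a minimal Python sketch, assuming SciPy is installed, that reproduces the paired test on the scores from the example.

```python
from scipy import stats

# Communication-skill scores of the 8 employees (from the example above)
before = [4, 5, 6, 5, 7, 6, 5, 6]
after = [7, 8, 8, 7, 9, 8, 7, 9]

# Paired (related-samples) t-test
t_stat, p_value = stats.ttest_rel(after, before)
print(round(t_stat, 2))  # approximately 12.98
print(p_value < 0.05)    # True: significant at the 5% level
```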
The \(\chi^2\) Test
3.1. Introduction and Reminders
The \(\chi^2\) test is a goodness-of-fit test aimed at comparing a theoretical distribution to an experimental distribution.
Principle
In a population with \(N\) observations, we define \(s\) events \([E_{1}, E_{2}, \ldots, E_{s}]\). The theoretical model assigns probabilities \([p_{1}, p_{2}, \ldots, p_{s}]\) to these events, while the sample provides the observed frequencies \([F_{o_1}, F_{o_2}, \ldots, F_{o_s}]\).
A goodness-of-fit test involves comparing the observations to the theoretical model. To do this, we calculate the theoretical frequencies \(F_{t_i} = n \cdot p_{i}\) (where \(n\) is the sample size), the principle being that the total frequency is the same for the theoretical and the observed distributions.
Before determining the relationship between variables, the researcher must demonstrate that one variable depends on the other. To do this, they must determine the nature of the variables involved.
For social science researchers, it is primarily about determining the influence of segmentation variables (factual or socioeconomic variables) on dependent variables. Hypothesis tests help confirm whether the observed relationship between two variables is significant or if it is due to chance.
One question that beginner researchers often ask is how to choose the variables to study. A large part of the information that helps the researcher choose the association between variables comes from previous stages of the research. However, the researcher still needs to perform other types of associations based on their field investigation.
In social sciences, there are generally three types of dependency relationships: causality, concomitance (co-occurrence), and interdependence.
Causality
We speak of a causal link between two variables (\(x\) and \(y\)) if a change in one (\(x\), for example) causes a change in the other (\(y\)). \(x\) is the independent variable and \(y\) is the dependent variable. For example, if an online advertising campaign (\(x\)) leads to an increase in website visits (\(y\)), we could say there is a causal link between these two variables.
The presence of a significant link between two variables alone does not determine causality. To conclude such a link, one must create a situation where only the variation of the independent variable causes the variation of the dependent variable. For example, to prove that increasing the frequency of social media posts (\(x\)) causes an increase in user interactions (\(y\)), one would need to eliminate the influence of other factors like content quality or current trends on social media. Such an experiment is often impossible to conduct in social sciences.
Concomitance
Concomitance refers to the situation where two variables \(x\) and \(y\) vary together. This variation may sometimes be due to external factors. In such cases, a causal link between the two variables cannot be established. For example, a simultaneous increase in smartphone sales (\(x\)) and social media usage (\(y\)) could be due to a general trend towards greater digital connectivity, without a direct causal link between the two.
Interdependence
Two variables are said to be interdependent if they influence each other. For example, the quality of content shared on a social network (\(x\)) and the number of shares or likes (\(y\)) are interdependent: quality content leads to more shares and likes, while popular content attracts more quality authors.
This presentation of the types of relationships between variables refers to some concepts we covered in the previous session (Hypothesis Testing). We will now review and supplement these concepts with additional ones that will help us fully understand the \(\chi^2\) test. We will turn to the various types of relationships in hypothesis testing.
Types of Relationships in Hypothesis Testing
In hypothesis testing, there are generally three types of relationships between variables:
Null Relationship
No relationship is expected between the variables. For example, in a study on the impact of web interface colors (\(x\)) on user satisfaction (\(y\)), it may be found that there is no statistically significant relationship between the interface color and perceived satisfaction, which would indicate a null relationship. This could suggest that other factors (such as usability or loading speed) are more crucial for satisfaction.
Almost Total Relationship
Every change in an independent variable leads to a direct change in the dependent variable. For example, in an online advertising campaign, an increase in the advertising budget (\(x\)) may lead to a direct increase in the number of clicks (\(y\)) on the ads. If the hypothesis test shows a very strong relationship between these two variables, it is called an almost total relationship. This could imply that the budget is a predominant factor in attracting user attention.
Relative Relationship
The change in the independent variable leads to a limited change in the dependent variable. For example, a study might show that an increase in the number of posts by a company on social media (\(x\)) results in a slight increase in engagement rate (\(y\)), such as likes and shares. Here, the relationship is relative, as other factors, such as content quality or posting time, can also affect the engagement rate.
These definitions allow us to revisit the meaning of the test that we covered in the previous session. We will provide a more detailed explanation of the concept of statistical testing and other related concepts.
Note: This detour is made with the intention of clarifying the vocabulary related to statistical tests, which is why we have not divided this part of the session into subsections.
Test and Risk
A hypothesis test relies on the logic that a choice must be made between various possible and competing hypotheses, without having sufficient information regarding this choice, which involves a risk in the hypothetical decision we will make following the test.
We talk about a two-tailed test when two hypotheses \(H_0\) (Null Hypothesis) and its rival \(H_1\) (Alternative Hypothesis) coexist and are competing. For example, in a study on the impact of a new user interface (\(x\)) on user satisfaction (\(y\)), a two-tailed test might examine the null hypothesis that the interface has no significant effect on satisfaction (\(H_0\)) against the alternative hypothesis that it does have an effect, whether positive or negative (\(H_1\)).
We talk about a one-tailed test when the alternative hypothesis \(H_1\) posits a change in one direction only compared to the null hypothesis. For example, in a digital marketing campaign, one might test the null hypothesis \(H_0\) that adding a "Buy Now" button does not increase the conversion rate (\(y\)) compared to a version without the button. The one-tailed test would examine the alternative hypothesis \(H_1\) that adding this button does indeed increase the conversion rate.
Since information is incomplete, any decision-making involves risks. The concept of risk is fundamental to hypothesis testing in statistical inference:
If we decide that \(H_0\) is false, the risk of being wrong is denoted \(\alpha\) and is called the Type I error. Example: In an online advertising campaign, if we incorrectly reject the null hypothesis \(H_0\) which states that adding advertising videos has no effect on sales, we might wrongly conclude that the videos have a positive effect, while in reality, they have no significant impact.
If we decide that \(H_0\) is true, the risk of being wrong is denoted \(\beta\) and is called the Type II error. Example: In an analysis of click-through rates on a banner ad, if we incorrectly accept the null hypothesis \(H_0\) which states that the banner does not improve the click-through rate, we overlook the alternative hypothesis \(H_1\) that it does indeed increase clicks, and miss an opportunity to improve advertising performance.
The mathematical theory of hypothesis testing as we know it today is the result of the work of J. Neyman and E.S. Pearson (1928 and 1933), who clarified the main concepts of hypothesis testing, which we will define below.
3.2. The Two Families of Hypothesis Tests
There are two main orientations of statistical tests: parametric tests and non-parametric tests.
A parametric test is a hypothesis test based on the idea of a parametric form of distributions related to the underlying populations.
For example, the Student's t-test is a parametric test used to compare the means of two independent groups when the data follow a normal distribution. In information and communication sciences, this test could be used to compare the average user satisfaction scores between two versions of a mobile application.
A non-parametric test is a hypothesis test for which specifying the parametric form of the distribution is not required. In a non-parametric test, the observations must be independent; the selection of one individual in a sample should not influence the choices of other individuals.
For example, the Wilcoxon test is a non-parametric test used to compare two paired groups when the data do not necessarily follow a normal distribution. In information and communication sciences, this test could be applied to compare user satisfaction scores before and after a website redesign, without assuming that these scores follow a normal distribution.
3.3. The Independence Test
An independence test is a hypothesis test aimed at determining whether two variables are independent or not.
The principle of independence tests is to compare the empirical (observed) distribution with the theoretical distribution using a statistical indicator.
The main independence tests most used for quantitative variables are: the Spearman and Kendall coefficients. For categorical qualitative variables, the most commonly used is the Chi-square test, \(\chi^2\).
The following sections will explain the \(\chi^2\) test. We will start with the Contingency Table.
Definition III.3.1: The \(\chi^2\) Independence Test
The \(\chi^2\) independence test (pronounced Chi-square) aims to determine the independence of two categorical qualitative variables from the same sample. Decision-making is done using a contingency table.
The \(\chi^2\) independence test is the result of the work of statisticians and mathematicians K. Pearson, G.U. Yule, and R.A. Fisher. It is Fisher who contributed to the development of the concept of degrees of freedom.
3.4. The Contingency Table (Joint Distribution)
To study the relationship between two variables, a joint distribution table (also called a contingency table or a two-way table) must be established. The contingency table is derived from the compilation of raw data collected from the survey field.
A contingency table is a table in which different characteristics (attributes) of the population (or sample) are cross-classified. The purpose of a contingency table is to study and discover relationships (if any) between the considered attributes.
The use of the term contingency is attributed to the British mathematician Karl Pearson (1857-1936), for whom contingency is a measure of the total deviation from independence; the stronger the measure of contingency, the stronger the quantity of association or correlation between the attributes (Pearson, 1904).
Rules for Creating and Presenting a Contingency Table
Let \((T)\) be a two-way table, and \(x\) and \(y\) two categorical qualitative variables with \(\alpha\) and \(b\) categories, respectively. A contingency table is represented as illustrated in the figure below:
| | \(Y_1\) | \(Y_2\) | ...... | \(Y_b\) | Total |
|---|---|---|---|---|---|
| \(X_1\) | \(n_{11}\) | \(n_{12}\) | .... | \(n_{1b}\) | \(n_{1.}\) |
| \(X_2\) | \(n_{21}\) | \(n_{22}\) | .... | \(n_{2b}\) | \(n_{2.}\) |
| ..... | ..... | ..... | ..... | ..... | ..... |
| \(X_\alpha\) | \(n_{\alpha 1}\) | \(n_{\alpha 2}\) | .... | \(n_{\alpha b}\) | \(n_{\alpha .}\) |
| Total | \(n_{.1}\) | \(n_{.2}\) | .... | \(n_{.b}\) | \(n_{..}\) |

(Rows: categories of variable X; columns: categories of variable Y.)

Table III.3.1: Contingency Table.
Note:
\(n_{ij}\): represents the observed frequency for category i of variable X and category j of variable Y.
\(n_{i.}\): represents the sum of observed frequencies for category i of variable X.
\(n_{.j}\): represents the sum of observed frequencies for category j of variable Y.
\(n_{..}\): indicates the total number of observations.
Remark:
A contingency table can be multidimensional (with more than two variables); the elements of the table will then be denoted by \(n_{ijk}\) and represent the observed frequency for category \(i\) of variable \(X\), category \(j\) of variable \(Y\), and category \(k\) of variable \(Z\).
Contingency Table, An Example
The following table shows the relationship between blog reading frequency (frequent, occasional, never) and trust level in the media (high, medium, low) within a sample of 200 people.
| | High (A) | Medium (B) | Low (C) | Total |
|---|---|---|---|---|
| Frequent (X) | 40 | 30 | 10 | 80 |
| Occasional (Y) | 30 | 50 | 20 | 100 |
| Never (Z) | 10 | 5 | 5 | 20 |
| Total | 80 | 85 | 35 | 200 |
Note:
This table cross-tabulates two qualitative variables: blog reading frequency (row variable) and trust level in the media (column variable). The title of this table is: Relationship Between Blog Reading Frequency and Trust Level in the Media Among a Sample of 200 People (full and precise titles are preferred).
The table shows the distribution of responses in each category. For example, out of the 80 people with a high level of trust in the media, 40 frequently read blogs.
Conditional Frequency
The conditional frequency measures the proportion of a modality of one variable relative to a specific modality of the other variable. It can be calculated by row, column, or total, depending on the analysis context.
Row Conditional Frequency: Among the 80 people who frequently read blogs, 50% (40/80) have a high level of trust in the media.
Column Conditional Frequency: Among the 85 people with a medium level of trust in the media, 35.29% (30/85) frequently read blogs.
Theoretical Frequency
The theoretical frequency allows testing the hypothesis of independence between the two variables. If the variables were independent, the theoretical frequency for each cell could be calculated using the following formula:
$$ f_{t_{ij}} = \frac {t_i \times t_j}{n} $$
Where:
\(t_i\): Total of the row corresponding to modality \(i\).
\(t_j\): Total of the column corresponding to modality \(j\).
\(n\): Total number of observations.
This formula is used to calculate theoretical frequencies in any distribution, assuming the variables are independent. By comparing the theoretical frequencies \(f_t\) with the observed frequencies \(f_o\), one can assess whether the variables are actually independent or not.
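Applied to the blog-reading example above, the theoretical frequency of the cell (Frequent, High) would be:
$$ f_{t} = \frac{80 \times 80}{200} = 32 $$
whereas the observed frequency is 40; the size of such deviations is precisely what the \(\chi^2\) test evaluates.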
3.5. The Steps of the Chi-Square Test of Independence
The chi-square test, also known as the test of independence, is a hypothesis test used to determine whether two qualitative variables are independent of each other.
3.5.1. Formulating the Hypotheses
To perform the independence test, we need to formulate two hypotheses: a null hypothesis and an alternative hypothesis.
The null hypothesis \((H_0)\) states that there is no association between the two variables considered; it is accepted by default;
The alternative hypothesis \((H_1)\) asserts that there is a dependency between the variables studied.
Example: A researcher in Information and Communication Sciences wants to determine the relationship between preferred media type (Television, Internet, Radio) and education level (Bachelor's, Master's, PhD) among communication students. The researcher formulates the following hypotheses:
\(H_0\) : The preferred media type is independent of the students' education level.
\(H_1\) : The preferred media type depends on the students' education level.
Note: The results of the hypothesis test help choose between the two hypotheses (i.e., \(H_0\) or \(H_1\)) and interpret the relationship between the two considered variables.
To follow and understand the example, we have included a contingency table representing the researcher's data.
| Media Type | Bachelor's | Master's | PhD | Observed Total | Theoretical Total |
|---|---|---|---|---|---|
| Television | 14.4 / 15 | 19.2 / 20 | 9.6 / 10 | 45 | 43.2 |
| Internet | 24.0 / 25 | 28.8 / 30 | 19.2 / 20 | 75 | 72.0 |
| Radio | 9.6 / 10 | 14.4 / 15 | 19.2 / 20 | 45 | 43.2 |
| Total | 50 | 65 | 50 | 165 | |
Table III.3.2: Table showing observed frequencies and theoretical frequencies for the variables Preferred Media Type and Education Level.
Note:
In each Education Level cell, the theoretical frequency is given first and the observed frequency second (theoretical / observed).
3.5.2. Choosing the Significance Level (Alpha Threshold)
A hypothesis test is imperfect regardless of its sophistication because it relies on probabilities. As mentioned earlier, two types of errors can occur in a hypothesis test: Type I Error and Type II Error .
Example: If the researcher chooses a significance level of \( \alpha = 0.05 \), this means that we accept a 5% probability of rejecting the null hypothesis \( H_0\) when it is true (Type I Error). In other words, we are 95% confident that we do not incorrectly reject \(H_0\) if it is correct.
3.5.3. Checking the Application Conditions
The sample must be random;
At least one of the two variables is qualitative: if one of the variables is quantitative, its values are treated as categories of a qualitative variable;
Construct the contingency table and then calculate the theoretical frequencies using the formula seen earlier;
The sample size must be equal to or greater than 30;
Each of the theoretical frequencies must be greater than or equal to 5 (\(f_{t_{ij}} \geq5 \));
Each individual must belong to one and only one category of each variable [one row and one column of the contingency table].
Example: Suppose the researcher has a sample of 165 communication students with the variables Preferred Media Type (Television, Internet, Radio) and Education Level (Bachelor's, Master's, PhD).
We construct a contingency table and calculate the theoretical frequencies for each cell, ensuring that all theoretical frequencies are above 5 and that each student is classified into a single category for each variable.
3.5.4. Calculating the Chi-Square Value and Comparing with Critical Values
The Chi-Square Distribution
To determine the existence (or absence) of a relationship between the variables, we need to compare the observed frequencies with the theoretical frequencies assuming that the variables are independent.
The \(\chi^2\) statistic is expressed as follows:
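$$ \chi^{2} = \sum_{i} \sum_{j} \frac{\left( f_{o_{ij}} - f_{t_{ij}} \right)^{2}}{f_{t_{ij}}} $$
where \(f_{o_{ij}}\) and \(f_{t_{ij}}\) are, respectively, the observed and theoretical frequencies of cell \((i, j)\) of the contingency table.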
If it is zero, this means that the two variables are independent (because the theoretical frequencies are equal to the observed frequencies);
The higher the \(\chi^2\), the greater the probability that the two variables are dependent. A high \(\chi^2\) means that the deviation between the theoretical frequencies and the observed frequencies is high. Conversely, a low \(\chi^2\) means that the likelihood of the two variables being independent is high and that the deviation is due to sampling error.
In our hypothetical example, suppose we have calculated the chi-square value and obtained \(\chi^{2} = 10.2\). We have a significance level of 0.05 and degrees of freedom \(d = (3-1) \times (3-1) = 4\). We look up the critical value in the chi-square table for \(d = 4\) and \( \alpha = 0.05\), which is approximately 9.488. Since \(\chi^{2} = 10.2\) is greater than 9.488, we reject \(H_0\) and conclude that there is a significant dependence between the preferred media type and the education level among communication students.
The \(\chi^2\) test depends on two parameters: \(\alpha\), the significance level, and \(d\) (the number of degrees of freedom), which corresponds to the number of cells of the contingency table that can vary freely once the row and column totals are fixed, and which is given by the relation:
$$ d = (\text{number of modalities of } x - 1) \times (\text{number of modalities of } y - 1) $$
We determine the critical value using the \(\chi^2\) table (Table 5); the critical value depends on the significance level and the number of degrees of freedom;
Statement of the decision rule: we reject \(H_0\) if the calculated \(\chi^{2}\) is greater than the critical \(\chi^{2}\);
Conclusion:
Decision-making and interpretation of the result based on the context.
In our case, we have a calculated value of \(\chi^{2} = 10.2\) and a critical value of 9.488, so we reject \(H_0\). This means that there is sufficient evidence to assert that the preferred media type is dependent on the education level among communication students, which could influence media choices for targeted communication campaigns.
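As an illustration of how such a test is run in practice, here is a minimal Python sketch, assuming SciPy is installed, applied to the blog-reading / media-trust table presented earlier; the function computes the theoretical frequencies, the \(\chi^2\) statistic, the degrees of freedom, and the \(p\)-value.

```python
from scipy.stats import chi2_contingency

# Observed frequencies: rows = reading frequency (Frequent, Occasional, Never),
# columns = trust level in the media (High, Medium, Low)
observed = [
    [40, 30, 10],
    [30, 50, 20],
    [10,  5,  5],
]

chi2, p_value, dof, expected = chi2_contingency(observed)
print(round(chi2, 2), dof)  # chi-square statistic and degrees of freedom (4)
print(p_value < 0.05)       # True if independence is rejected at the 5% level
```

Note that for this particular table one theoretical frequency (\(20 \times 35 / 200 = 3.5\) for the Never/Low cell) falls below 5, so the application conditions of Section 3.5.3 are not strictly met; the sketch only illustrates the mechanics of the computation.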
3.6. The Strength of the Relationship Between Variables:
To calculate this strength, we use the contingency coefficient (\(C\)):
$$ C = \sqrt {\frac {\chi^{2}} {n+\chi^{2} } } $$
The value of \(C\) ranges from 0 to 1; the closer the value is to 1, the stronger the relationship between the variables.
For the coefficient of contingency to be applicable, it must meet certain conditions, which are:
Both variables must be normally distributed across the population;
Both variables each have three or more categories;
The sample size is relatively large (more than 30);
The Chi-square test is significant.
Calculating the Contingency Coefficient \(C\) for Our Example
Using the value of \(\chi^2\) calculated from the observed and theoretical frequencies in Table III.3.2:
\( \chi^2 = 0.275 \)
\( n = 165 \)
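Substituting into the formula:
$$ C = \sqrt{\frac{0.275}{165 + 0.275}} \approx 0.041 $$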
The contingency coefficient \(C\) is equal to 0.041, indicating a weak relationship between the variables "Preferred Media Type" and "Level of Education." There is little to no significant relationship between these variables in this example.
3.7. Correction Factor for the Contingency Coefficient:
The calculation of the contingency coefficient must undergo a correction that accounts for the size of the table in terms of rows and columns.
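A commonly used correction divides \(C\) by its maximum attainable value, which depends on \(q\), the smaller of the number of rows and the number of columns:
$$ C_{max} = \sqrt{\frac{q-1}{q}}, \qquad C_{corrected} = \frac{C}{C_{max}} $$
For our \(3 \times 3\) table, \(C_{max} = \sqrt{2/3} \approx 0.816\), so \(C_{corrected} \approx 0.041 / 0.816 \approx 0.05\).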
We notice that the new value of the coefficient does not differ greatly from the previous one, indicating the weak relationship between the two variables.
Brief Conclusion
This brief conclusion in the form of a bullet list should be considered in certain specific cases of calculating \(\chi^2\):
When the cross-tabulation consists exactly of two rows and two columns, the Fisher's exact test and Yates' corrected Chi-square test are used;
When the table consists of more than two rows and two columns, Pearson's Chi-square or the Likelihood Ratio Chi-square is used;
When the two variables are quantitative, the Cochran-Mantel-Haenszel test is used;
Pearson's correlation test handles only the relationship between quantitative variables;
If the variables are nominal: in addition to the Chi-square test, the contingency coefficient (c) [seen earlier], the Phi coefficient [for dichotomized or dichotomous variables], and Cramer's V coefficient [also seen in Block II, Session 2] can be used;
The risk test: applies only to a cross-tabulation with two rows and two columns.
Analysis of Variance
Introduction and Context
Unlike the T-test, ANOVA allows for the analysis of differences between two or more groups, regardless of their size. There is no technical limit to the number of groups that can be involved in the test.
As we will discuss below, ANOVA compares two types of differences: between-group differences and within-group differences. Since variance is one of the most powerful tools in statistics, we compute two types of variance, the between-group variance and the within-group variance, and it is the comparison between them that gives the method its name: analysis of variance.
We will now explain the terms of the F-statistic, namely: between-group variance and within-group variance.
The between-group variance can be considered as a mean difference between the means of each group and the mean of the means (the latter is called the grand mean as it represents the mean of all groups).
The within-group variance is, as we saw in Block 2, Session 1.3, the difference between each observation and the mean of its own group.
Following this order, we arrive at the following formula:
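With \(k\) denoting the number of groups and \(n_i\) the number of observations in group \(i\), the statistic reads:
$$ F = \frac{CM_{inter}}{CM_{intra}} = \frac{\dfrac{1}{k-1} \displaystyle\sum_{i=1}^{k} n_i \left( \overline{X}_i - \overline{X}_\text{total} \right)^2}{\dfrac{1}{N-k} \displaystyle\sum_{i=1}^{k} \sum_{j=1}^{n_i} \left( X_{ij} - \overline{X}_i \right)^2} $$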
\(\overline{X}_i\) : Mean of observations in group i.
\(\overline{X}_\text{total}\) : Mean of all observations (grand mean).
\(N\) : Total number of observations.
\(X_{ij}\) : Observation j in group i.
Let's now attempt to explain the formula for the \(F\) statistic in more detail, using an example in information and communication sciences to facilitate understanding.
A study focuses on the effectiveness of three communication strategies on social media: visual strategy (group 1), textual strategy (group 2), and mixed strategy (group 3). We measure user engagement (likes, shares, comments) for each strategy. In the tutorial session dedicated to this topic, we will provide a more detailed example, using raw data.
4.2. The Grand Mean \(\overline{X}_\text{total}\)
Calculating the grand mean serves as the best estimate of the mean of all groups, since the null hypothesis asserts that the groups all come from the same population. Therefore, \(\overline{X}_\text{total}\) represents, theoretically, the best estimate of \(\mu\), the population mean (some authors prefer the term: grand mean).
We calculate the grand mean as follows:
$$\overline{X}_\text{total} = \sum_{i=1}^{k} \frac{\overline{X_i}}{k}$$
where \(\overline{X}_\text{total}\) is the grand mean, \(\overline{X_i}\) is the mean obtained in each group i, and k is the number of groups.
In our example, suppose the means obtained for each strategy are \(\overline{X}_1 = 25\), \(\overline{X}_2 = 30\), and \(\overline{X}_3 = 20\). The grand mean is therefore:
$$\overline{X}_\text{total} = \frac{25 + 30 + 20}{3} = 25$$
This mean constitutes the best estimate we have of the population mean \(\mu\).
4.3. The Between-Group Difference
This difference is called the between-group sum of squares \(SC_{inter}\).
Since we have the result of the grand mean \(\overline{X}_\text{total}\), we can calculate the difference between the mean of each group and this grand mean, \(\overline{X_i} - \overline{X}_\text{total}\), and then sum these differences over all the groups. Each difference is weighted by the size of its group, \(n_i\), to give more importance to the groups with more observations, on the assumption that samples with more observations provide a more accurate estimate. Finally, to prevent positive and negative differences from cancelling each other out, each difference is squared before summing. The result is the between-group sum of squares:
$$ SC_{inter} = \sum_{i=1}^{k} n_i \left(\overline{X}_i - \overline{X}_\text{total}\right)^2 $$
In our example, if we have 10 observations for each strategy, then:
$$ SC_{inter} = 10(25 - 25)^2 + 10(30 - 25)^2 + 10(20 - 25)^2 = 0 + 250 + 250 = 500 $$
We cannot use this sum as it stands, because its magnitude reflects both the size of the deviations and the number of groups. To address this, we calculate the between-group mean square by dividing \(SC_{inter}\) by the between-group degrees of freedom, \(df_{inter} = k-1\) (\(k\) being the number of groups). Thus, we obtain:
$$ CM_{inter} = \frac{SC_{inter}}{df_{inter}} = \frac{500}{3-1} = 250 $$
4.4. The Within-Group Difference
In each sample, there is variation in the observations. This variability can be calculated within each group (sample) by calculating the within-group sum of squares using the following formula:
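$$ SC_{intra} = \sum_{i=1}^{k} \sum_{j=1}^{n_i} \left( X_{ij} - \overline{X}_i \right)^2 $$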
Note: The double summation \(\sum \sum\) means that we first sum the squared differences between each observation \(X_{ij}\) and the mean of its own group \(\overline{X}_i\), then sum all the obtained quantities. This step calculates the within-group sum of squares. This value can then be divided by the total number of observations (N) minus the number of groups (k), to obtain the mean of the within-group sum of squares \(CM_{intra}\). In our example, if we have:
For the visual strategy: \(SC_{intra_1} = 200\)
For the textual strategy: \(SC_{intra_2} = 300\)
For the mixed strategy: \(SC_{intra_3} = 250\)
The total within-group sum of squares is therefore:
$$SC_{intra} = 200 + 300 + 250 = 750$$
The within-group mean square is therefore:
$$CM_{intra} = \frac{750}{30 - 3} = 27.78$$
4.5. The \(F\) Test and Its Significance
Once the between-group and within-group variances are calculated, we can compute the F statistic, which is the ratio of the two:
$$ F = \frac{CM_{inter}}{CM_{intra}} = \frac{250}{27.78} \approx 9.00 $$
A high F statistic indicates that the observed differences between group means are greater than those expected by chance, suggesting that the groups likely do not come from the same population.
Explanation of the F Table and Interpretation of Results
To interpret the F statistic we have calculated (\(F \approx 9.00\)), it is necessary to compare it to a critical value from the F distribution table [ Table 7: Fisher-Snedecor Distributions (\(\alpha = 0.05\)), Statistical Annex ], also known as the Fisher-Snedecor table. This table provides the critical value based on the between-group degrees of freedom (\(df_{inter} = k - 1\)), the within-group degrees of freedom (\(df_{intra} = N - k\)), and a significance level \(\alpha\), usually set at 0.05.
In our example:
\(df_{inter} = 3 - 1 = 2\)
\(df_{intra} = 30 - 3 = 27\)
\(\alpha = 0.05\)
Here is an excerpt from the F table for \(\alpha = 0.05\):
(rows: \(v_2\) = within-group degrees of freedom; columns: \(v_1\) = between-group degrees of freedom)
\(v_2\) \ \(v_1\)    1        2        3
20              4.351    3.492    3.098
21              4.324    3.466    3.072
22              4.300    3.443    3.049
Consulting this table, for \(\alpha = 0.05\), \(df_{inter} = 2\), and \(df_{intra} = 27\), we find a critical value \(F_{critical}\) of approximately 3.35. Since our \(F \approx 9.00\) is well above this critical value, we reject the null hypothesis, suggesting that the differences between the groups are statistically significant. In other words, the strategy used (visual, textual, or mixed) has a significant effect on user engagement.
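To reproduce these calculations in Python, here is a minimal sketch based only on the summary statistics of the example (group means 25, 30, 20; 10 observations per group; within-group sums of squares 200, 300, 250); the critical value is read from scipy instead of the printed table, and the variable names are ours, not part of the course material.
from scipy import stats

# Summary statistics from the example
means = [25, 30, 20]                 # group means (visual, textual, mixed)
n_i = [10, 10, 10]                   # observations per group
sc_intra_groups = [200, 300, 250]    # within-group sums of squares

k = len(means)                       # number of groups
N = sum(n_i)                         # total number of observations
grand_mean = sum(n * m for n, m in zip(n_i, means)) / N

sc_inter = sum(n * (m - grand_mean) ** 2 for n, m in zip(n_i, means))
sc_intra = sum(sc_intra_groups)

cm_inter = sc_inter / (k - 1)        # between-group mean square
cm_intra = sc_intra / (N - k)        # within-group mean square
f_stat = cm_inter / cm_intra

f_crit = stats.f.ppf(0.95, k - 1, N - k)   # critical value at alpha = 0.05

print(f"SC_inter = {sc_inter}, CM_inter = {cm_inter}")
print(f"SC_intra = {sc_intra}, CM_intra = {cm_intra:.2f}")
print(f"F = {f_stat:.2f}, F_critical = {f_crit:.2f}")   # F = 9.00 > 3.35: reject H0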
Illustrative Graph
A box plot allows you to visualize the distribution of data for each group:
Figure III.4.1.: Box Plot of the Distribution.
Note. Factorial Analysis of Variance
Factorial ANOVA generalizes the one-way ANOVA procedure to several factors: it assesses the main effect of each independent variable as well as their joint effect (referred to as the interaction) on a dependent variable [the independent variables can have a theoretically unlimited number of levels, and the samples can be of any size].
For the purposes of this teaching, we will limit ourselves to two-way ANOVA in the planned guided work session, examining the effect of each of the two independent variables as well as their joint effect on the dependent variable.
Correlation & Linear Regression
Correlation
In a study, the researcher may need to examine the relationship between two quantitative variables. This involves checking the potential connection by comparing the two variables through graphical representation or numerical calculation. In this teaching, we discuss how to account for the relationship between two quantitative variables. If there is a relationship between two quantitative variables, we aim to establish the existence of a linear relationship and express it using a mathematical model.
5.1. The Relationship Between Two Quantitative Variables
In research, the relationship between two quantitative variables may arise from the theory under consideration; in that case, there is an initial hypothesis whose validity we seek to verify in our own investigation. Alternatively, the researcher may simply have noticed that the two variables seem to move together and be intrigued enough to test this possible relationship.
In data analysis, the concept of correlation refers to a process through which we can quantify the degree of association between variables.
Note: A detailed analysis, supported by data analysis software, makes it easier to bring to light potential relationships between research variables.
If a relationship between two quantitative variables is revealed, it is called correlation.
The existence of a correlation between two quantitative variables \(x\) and \(y\) allows us to predict \(y\) from \(x\).
We then say that \(y\) is a function of \(x\); mathematically, this can be written as:
$$ y = f(x) $$
In this expression, \(y\) is the dependent (explained) variable and \(x\) is the independent (explanatory) variable.
We use Pearson's correlation because it is the most widely used coefficient and provides information on both the magnitude and the direction of the relationship.
Pearson's correlation is a method aimed at producing a coefficient that reflects the degree of association between two quantitative variables (variables measured on interval or ratio scales). It produces a coefficient whose value ranges from -1 to +1.
Note: Pearson's correlation measures the degree of consistency between the standard Z values obtained from two measurements, which gives another formula for the coefficient: \( r_{x,y} = \frac {\sum_{i=1}^{N} Z_{x_i} Z_{y_i}} {N-1} \). We will discuss this variant in the session dedicated to this teaching.
Therefore, the starting point should be to determine the intensity of the relationship between the variables.
5.2. The Scatter Plot
In examining the relationship between two quantitative variables, the scatter plot (which some prefer to call the scatter diagram) is the appropriate graphical tool.
By graphically representing the pairs \((x_i ; y_i)\), that is \((x_1 , y_1), (x_2 , y_2), (x_3 , y_3), \ldots, (x_n , y_n)\), we obtain what is called a scatter plot.
Definition III.5.1: The Scatter Plot
A scatter plot is an essential graphical tool in statistical analysis, particularly used to explore and visualize the relationships between two quantitative variables. Each point in this plot represents an observation in the dataset, where the horizontal position of the point corresponds to the value of the first variable (often denoted \(X\)) and the vertical position to the value of the second variable (often denoted \(Y\)).
The scatter plot can be enhanced with additional elements such as trend lines, confidence intervals, or colors to differentiate subgroups of data, allowing for a more in-depth analysis of the studied relationships.
It is from the observation provided by the scatter plot that we will proceed to write the formula for the regression line, also known as the trend line or fitting line (this line comes from the least squares method).
The following are examples of scatter plots with the type and nature of the relationships they show.
Types and Nature of Relationships in a Scatter Plot
Positive Correlation
Note: This chart shows a positive correlation where values increase together. This means that as variable X increases, variable Y also tends to increase.
Negative Correlation
Note: This chart shows a negative correlation where values vary in opposite directions. This means that as variable X increases, variable Y tends to decrease.
Linear and Positive Relationship
Note: This chart illustrates a positive linear relationship. The points are distributed along an ascending line, indicating a proportional increase in the variables.
Linear and Negative Relationship
Note: This chart illustrates a negative linear relationship. The points are distributed along a descending line, indicating a proportional decrease in one variable relative to the other.
Non-Linear Relationship
Note: This chart shows a non-linear relationship. The points follow a curve, indicating a quadratic relationship between the variables, where changes in Y are not proportional to those in X.
No Relationship
Note: This chart shows no relationship. The points are scattered without any apparent trend, indicating that there is no significant correlation between variables X and Y.
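To produce such graphs yourself, here is a minimal matplotlib sketch with hypothetical data (the values and variable names are ours, for illustration only); it draws the scatter plot and overlays a least-squares trend line.
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical data: values of an independent variable X and a dependent variable Y
x = np.array([1, 2, 3, 4, 5, 6, 7, 8])
y = np.array([2.1, 3.0, 3.9, 5.2, 5.8, 7.1, 7.9, 9.0])

# Least-squares trend line (slope and intercept)
slope, intercept = np.polyfit(x, y, 1)

plt.scatter(x, y, label="observations")
plt.plot(x, slope * x + intercept, color="red", label="trend line")
plt.xlabel("X (independent variable)")
plt.ylabel("Y (dependent variable)")
plt.legend()
plt.show()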
5.3. The Strength of the Relationship
A correlation between two variables \(x\) and \(y\) exists if observations whose values of \(x\) \( (x_{1}, x_{2}, \ldots, x_{i}) \) are close to each other also tend to have values of \(y\) \( (y_{1}, y_{2}, \ldots, y_{j}) \) that are close to each other.
The Fit Curve
The fit curve is the curve we draw that best approximates the points (see the previous examples).
The curve is called a regression (or estimation) line if it is straight. The regression line characterizes the linear relationship between the variables.
The linear relationship between two variables \(x\) and \(y\) is expressed by the following formula:
$$ y = a x + b $$
where \(a\) and \(b\) are constants to be defined.
The Direction of a Linear Relationship
A relationship is called positive when both variables vary in the same direction; this relationship is characterized by an increasing line, and the relationship is said to be direct;
The relationship is negative (inverse) if the regression line is decreasing (sloping down from left to right);
A regression line that tends to be horizontal indicates the absence of any relationship. This does not mean that the two variables are not related by another form of relationship.
5.4. Covariance
To determine if two quantitative variables are related, we calculate the covariance \(Cov(x, y)\).
The covariance of the pair \((x, y)\) is the average of the products of the deviations from the means \(\bar{x}\) and \(\bar{y}\):
$$ Cov(x, y) = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) $$
After analyzing the scatter plot representing the two variables, we calculate the linear correlation coefficient (denoted \(r\)), which measures how closely the points of the scatter plot cluster around the regression line.
The closer the points are to the line, the higher the absolute value of the coefficient, and vice versa. Correlation thus helps determine whether there is a relationship between the two variables.
In statistics, the concept of correlation refers to a process through which we can quantify the degree of association between variables.
The dispersion of the points around the regression line is measured by the residual variance around the regression line. To quantify the association itself, we use Pearson's correlation.
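As a quick illustration of the covariance formula above, here is a small Python sketch with hypothetical data (values chosen by us for illustration); the sign of the covariance indicates the direction of the relationship.
import numpy as np

# Hypothetical paired observations
x = np.array([2, 3, 4, 5, 6])
y = np.array([4, 5, 6, 7, 8])

# Covariance as the average of the products of the deviations from the means
cov_xy = np.mean((x - x.mean()) * (y - y.mean()))

print(cov_xy)                          # positive: the variables vary in the same direction
print(np.cov(x, y, bias=True)[0, 1])   # same value obtained with numpy (division by n)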
Pearson's Correlation
Definition III.5.2: Pearson's Correlation
Pearson's correlation is a method aimed at producing a coefficient that reflects the degree of association between two variables. Pearson's correlation applies to interval or ratio scale variables, producing a coefficient that ranges from -1 to +1.
Pearson's correlation measures the degree of agreement between the Z-scores obtained from the two measurements.
Calculation of Pearson Correlation
We have two formulas, one using Z-scores and the other using the raw values of the variables.
Calculation of Pearson Correlation using Z-scores:
$$ r_{x,y} = \frac{\sum_{i=1}^{N} Z_{x_i} Z_{y_i}}{N-1} $$
Calculation of Pearson Correlation using the raw values:
$$ r = \frac{n \sum x_i y_i - \sum x_i \sum y_i}{\sqrt{\left(n \sum x_i^2 - \left(\sum x_i\right)^2\right)\left(n \sum y_i^2 - \left(\sum y_i\right)^2\right)}} $$
The Coefficient of Non-Determination:
Coefficient of Non-Determination (\(1 - R^2\)): It measures the proportion of the total variance in the dependent variable that remains unexplained after accounting for the effect of the independent variables. In other words, it indicates the amount of variance in the dependent variable that cannot be attributed to the regression model.
Example
Suppose we want to analyze the relationship between the daily usage duration of a social media platform and user satisfaction levels. Satisfaction is measured on a scale from 0 to 10, where 0 indicates no satisfaction and 10 indicates maximum satisfaction. Usage duration is measured in hours per day.
The data for 6 users is as follows:
User 1: Hours = 2, Satisfaction = 4
User 2: Hours = 3, Satisfaction = 5
User 3: Hours = 4, Satisfaction = 6
User 4: Hours = 5, Satisfaction = 7
User 5: Hours = 6, Satisfaction = 8
User 6: Hours = 7, Satisfaction = 9
Calculation of Cross Products:
User 1: 2 × 4 = 8
User 2: 3 × 5 = 15
User 3: 4 × 6 = 24
User 4: 5 × 7 = 35
User 5: 6 × 8 = 48
User 6: 7 × 9 = 63
Calculation of Sums:
Sum of hours: \(2 + 3 + 4 + 5 + 6 + 7 = 27\)
Sum of satisfactions: \(4 + 5 + 6 + 7 + 8 + 9 = 39\)
Sum of cross products: \(8 + 15 + 24 + 35 + 48 + 63 = 193\)
Sum of squares of hours: \(2^2 + 3^2 + 4^2 + 5^2 + 6^2 + 7^2 = 4 + 9 + 16 + 25 + 36 + 49 = 139\)
Sum of squares of satisfactions: \(4^2 + 5^2 + 6^2 + 7^2 + 8^2 + 9^2 = 16 + 25 + 36 + 49 + 64 + 81 = 271\)
Calculation of Pearson Correlation Coefficient (\(r\)):
Using the raw-value formula with \(n = 6\):
$$ r = \frac{n \sum xy - \sum x \sum y}{\sqrt{\left(n \sum x^2 - \left(\sum x\right)^2\right)\left(n \sum y^2 - \left(\sum y\right)^2\right)}} = \frac{6 \times 193 - 27 \times 39}{\sqrt{\left(6 \times 139 - 27^2\right)\left(6 \times 271 - 39^2\right)}} = \frac{105}{\sqrt{105 \times 105}} = 1 $$
Calculation of the Coefficient of Determination (\(R^2\)):
The coefficient of determination is:
\[
R^2 = r^2 = 1^2 = 1
\]
Interpretation: A coefficient of determination of 1 means that 100% of the variance in user satisfaction is explained by the duration of platform use. This indicates a perfect relationship between usage hours and satisfaction.
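The same result can be checked in Python with scipy, using the six pairs of values from the example (the variable names are ours):
from scipy import stats

# Data from the example: daily hours of use and satisfaction scores for 6 users
hours = [2, 3, 4, 5, 6, 7]
satisfaction = [4, 5, 6, 7, 8, 9]

r, p_value = stats.pearsonr(hours, satisfaction)

print(f"r = {r:.3f}, R^2 = {r**2:.3f}")   # r = 1.000 and R^2 = 1.000: perfect positive linear relationship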
Simple Linear Regression
Linear regression is a practical application of correlation: it is the technique used to predict the value of a variable \( y \) based on the correlation between the two variables \( x \) and \( y \).
5.6. Linear Fit by the Least Squares Method: Linear Regression
The least squares method involves determining the equation of the line that minimizes the sum of the squares of the deviations between each point in the cloud and the line.
Depending on whether the deviations are measured parallel to the y-axis or to the x-axis, we obtain the regression line of \( y \) on \( x \), \( y = a x + b \), or the regression line of \( x \) on \( y \), \( x = a' y + b' \).
The Regression Line of \( y \) on \( x \)
The regression line of \( y \) on \( x \) \( y = a x + b \), also noted as line \( D_{y/x} \) is defined as the line that minimizes the sum of the squares of the distances between each point in the cloud and \( D_{y/x} \), with distances measured parallel to the y-axis.
The goal of this operation is to determine the coefficients \( a \) and \( b \), which will be discussed in the following section.
Let \( p \) be a point in the cloud with coordinates \( x_{i}; y_{i} \). Let \( y'_{i} \) be the ordinate of the point on the regression line with abscissa \( x_{i} \):
$$ y'_{i} = a x_{i} + b $$
The square of the distance between \( p \) and the line \( D_{y/x} \) is \( (y_{i} - y'_{i})^2 \).
The sum of the squares of the distances between the various points in the cloud and the line \( D_{y/x} \) is \( \sum\limits_{i=1}^{n} (y_i - y'_{i})^2 \) with \( y'_{i} = a x_{i} + b \).
To determine the equation of the regression line of \( y \) on \( x \), we need to minimize \( \sum\limits_{i=1}^{n} (y_i - a x_{i} - b)^2 \). This sum is a function of the two variables \( a \) and \( b \).
Mathematically, the solution is: \( a = \frac {Cov(x, y)}{V(x)} \) and \( b = \bar{y} - a \bar{x} \).
The equation of the regression line becomes:
$$ Y = \frac{Cov(x, y)}{V(x)} x + (\bar{y} - a \bar{x}) $$
Note: The regression line has the following characteristics:
The slope of the line \( \frac{Cov(x, y)}{V(x)} \) has the same sign as the covariance (variance is always positive);
It passes through the mean point \( (\bar{x}, \bar{y}) \): \( \bar{y} = a \bar{x} + b \).
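A minimal numerical check of these formulas, with hypothetical data (our own values, for illustration): the slope \( a = Cov(x, y)/V(x) \) and intercept \( b = \bar{y} - a \bar{x} \) coincide with the line returned by numpy's least-squares fit.
import numpy as np

# Hypothetical paired data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])

# Coefficients of the regression line D_{y/x}: y = a x + b
a = np.mean((x - x.mean()) * (y - y.mean())) / np.var(x)   # Cov(x, y) / V(x)
b = y.mean() - a * x.mean()

# Check against numpy's least-squares fit
a_np, b_np = np.polyfit(x, y, 1)

print(a, b)        # slope and intercept from the covariance formulas
print(a_np, b_np)  # the same line obtained with np.polyfit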
The Regression Line of \( x \) on \( y \)
The regression line of \( x \) on \( y \) \( x = a' y + b' \), also noted as line \( D_{x/y} \) is defined as the line that minimizes the sum of the squares of the distances between each point in the cloud and \( D_{x/y} \), with distances measured parallel to the x-axis.
The goal of this operation is to determine the coefficients \( a' \) and \( b' \), which will be discussed in the following section.
Let \( p \) be a point in the cloud with coordinates \( x_{i}; y_{i} \). Let \( x'_{i} \) be the abscissa of the point on the regression line with ordinate \( y_{i} \): \( x'_{i} = a' y_{i} + b' \).
The square of the distance between \( p \) and the line \( D_{x/y} \) is \( (x'_{i} - x_{i})^2 \).
The sum of the squares of the distances between the various points in the cloud and the line \( D_{x/y} \) is \( \sum\limits_{i=1}^{n} (x_i - x'_{i})^2 \) with \( x'_{i} = a' y_{i} + b' \).
To determine the equation of the regression line of \( x \) on \( y \), we need to minimize \( \sum\limits_{i=1}^{n} (x_i - a' y_{i} - b')^2 \). This sum is a function of the two variables \( a' \) and \( b' \).
Mathematically, the solution is: \( a' = \frac {Cov(x, y)}{V(y)} \) and \( b' = \bar{x} - a' \bar{y} \).
The equation of the regression line becomes:
$$ X = \frac{Cov(x, y)}{V(y)} y + (\bar{x} - a' \bar{y}) $$
Note: The regression line has the following characteristics:
The slope of the line \( \frac{Cov(x, y)}{V(y)} \) has the same sign as the covariance (variance is always positive);
It passes through the mean point \( (\bar{x}, \bar{y}) \): \( \bar{x} = a' \bar{y} + b' \).
The two regression lines \( D_{y/x} \) and \( D_{x/y} \) intersect at the mean point \( (\bar{x}, \bar{y}) \).
Using the correlation coefficient, the mean, and the standard deviation of each variable, the formulas can be simplified as follows (note that in this writing the slope is denoted \( b \) and the intercept \( a \), following the convention \( y = a + bx \)):
For a population:
$$ \text{Slope} = b = r \left(\frac {\sigma_y} {\sigma_x}\right) $$
$$ y\text{-intercept} = a = \mu_{y} - b \mu_{x} $$
For a sample:
$$ \text{Slope} = b = r \left(\frac {s_y} {s_x}\right) $$
$$ y\text{-intercept} = a = \bar{y} - b \bar{x} $$
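As a quick numerical check, with hypothetical data (our own values), the slope written as \( r(\sigma_y/\sigma_x) \) gives the same result as \( Cov(x, y)/V(x) \):
import numpy as np

x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
y = np.array([12.0, 18.0, 33.0, 41.0, 49.0])

r = np.corrcoef(x, y)[0, 1]

slope_from_r = r * (np.std(y) / np.std(x))                             # r * (sigma_y / sigma_x)
slope_from_cov = np.mean((x - x.mean()) * (y - y.mean())) / np.var(x)  # Cov(x, y) / V(x)
intercept = y.mean() - slope_from_r * x.mean()

print(slope_from_r, slope_from_cov)   # identical slopes
print(intercept)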
Example
Suppose we want to study the relationship between the time spent on a website and the number of pages viewed by visitors. Time spent is measured in minutes, while the number of pages viewed is an integer.
The data for 8 visitors are as follows:
Visitor 1 : Time = 5 minutes, Pages = 8
Visitor 2 : Time = 15 minutes, Pages = 22
Visitor 3 : Time = 25 minutes, Pages = 32
Visitor 4 : Time = 35 minutes, Pages = 40
Visitor 5 : Time = 45 minutes, Pages = 50
Visitor 6 : Time = 55 minutes, Pages = 60
Visitor 7 : Time = 65 minutes, Pages = 68
Visitor 8 : Time = 75 minutes, Pages = 75
Calculation of Cross Products:
Visitor 1 : 5 × 8 = 40
Visitor 2 : 15 × 22 = 330
Visitor 3 : 25 × 32 = 800
Visitor 4 : 35 × 40 = 1400
Visitor 5 : 45 × 50 = 2250
Visitor 6 : 55 × 60 = 3300
Visitor 7 : 65 × 68 = 4420
Visitor 8 : 75 × 75 = 5625
Calculation of Sums:
Sum of times: \(5 + 15 + 25 + 35 + 45 + 55 + 65 + 75 = 320\)
Sum of pages: \(8 + 22 + 32 + 40 + 50 + 60 + 68 + 75 = 355\)
Sum of cross products: \(40 + 330 + 800 + 1400 + 2250 + 3300 + 4420 + 5625 = 18165\)
Sum of squares of times: \(5^2 + 15^2 + 25^2 + 35^2 + 45^2 + 55^2 + 65^2 + 75^2 = 17000\)
Sum of squares of pages: \(8^2 + 22^2 + 32^2 + 40^2 + 50^2 + 60^2 + 68^2 + 75^2 = 19521\)
Calculation of the Pearson Correlation Coefficient (\(r\)):
$$ r = \frac{8 \times 18165 - 320 \times 355}{\sqrt{\left(8 \times 17000 - 320^2\right)\left(8 \times 19521 - 355^2\right)}} = \frac{31720}{\sqrt{33600 \times 30143}} \approx 0.997 $$
Calculation of Coefficient of Determination (\(R^2\)):
The coefficient of determination is:
\[
R^2 = r^2 \approx 0.997^2 \approx 0.993
\]
Interpretation: A coefficient of determination of approximately 0.99 means that about 99% of the variance in the number of pages viewed is explained by the time spent on the site, indicating a very strong linear relationship. (Remember that \(r\), and therefore \(R^2\), can never exceed 1; obtaining a value greater than 1 always signals an error in the calculations or the data.)
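The full set of calculations for this example can be verified with a short numpy sketch (variable names are ours); it also gives the equation of the regression line, which the example does not compute by hand.
import numpy as np

# Data from the example: time spent (minutes) and pages viewed for 8 visitors
time = np.array([5, 15, 25, 35, 45, 55, 65, 75])
pages = np.array([8, 22, 32, 40, 50, 60, 68, 75])

print(time.sum(), pages.sum())        # 320 and 355
print((time * pages).sum())           # sum of cross products: 18165

r = np.corrcoef(time, pages)[0, 1]
slope, intercept = np.polyfit(time, pages, 1)

print(f"r = {r:.3f}, R^2 = {r**2:.3f}")                   # approximately 0.997 and 0.993
print(f"pages = {slope:.3f} * time + {intercept:.3f}")    # the fitted regression line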
Non-Parametric Tests
A non-parametric test is conducted to analyze data that do not necessarily follow a normal distribution or when the conditions for applying parametric tests are not met. Unlike parametric tests, which rely on assumptions about the distribution of the data, non-parametric tests are less constrained by these assumptions and can be applied to ordinal data or small sample sizes.
Non-parametric tests do not require the estimation of population parameters.
In the human and social sciences, we use four types of non-parametric tests:
Mann-Whitney Test
This test allows us to compare the means of two independent samples [the non-parametric equivalent of the Student's t-test];
Wilcoxon Test
The Wilcoxon test allows us to compare the means of two paired samples;
Kruskal-Wallis Test
This test allows us to compare the means of several samples [the non-parametric equivalent of one-way ANOVA];
Spearman Test
The Spearman test is a non-parametric correlation test.
To simplify the reading of this teaching material, we have adopted a straightforward writing plan for each test: a brief introduction, an explanation of how it works in theory, and then an explanatory example. In our students' final projects, the normality assumption is often simply assumed.
6.1. The Mann-Whitney Test
Context
Let there be two samples, independent and non-exhaustive, \(E_1\) and \(E_2\), with sample sizes \(n_1\) and \(n_2\), respectively.
We want to compare the two means with the null hypothesis \( H_{0} : \mu_{1} = \mu_{2}\).
Conditions and Test Procedures
To perform the Mann-Whitney test, proceed as follows:
Rank all the values of \(E_1 \cup E_2\) together in ascending order and assign each value its rank in this ordering; if there are ties, assign each the average of the ranks they occupy;
For each element \(x_i\) in \(E_1\), count the number of elements in \(E_2\) that are after \(x_i\);
Let \(m_1\) be the sum of all ranks associated with all elements of \(E_1\), and do the same for the other sample;
Note \(M = \min (m_{1}, m_{2}) \).
Decision Rule
Let \(M\) be the random variable that takes the value \(m\) as the result of the random experiment. We proceed as follows: consult the test table; in the appendix, Tables 8 and 9 provide the value \(m_{\alpha}\), based on \(n_1\), \(n_2\), and \(\alpha\), such that under the null hypothesis \(H_0\), \(P (M \leq m_{\alpha}) = \alpha\), for \(\alpha = 0.05\) and \(\alpha = 0.01\).
Reject the null hypothesis if \(m \leq m_\alpha\).
If \(n_1\) and \(n_2\) are outside the range of the tables and \(H_0\) is true, \(M\) approximately follows the normal distribution \(\mathcal{N}(\mu, \sigma)\)
With:
$$ \mu = \frac{n_{1}\, n_{2}}{2} \quad \text{and} \quad \sigma = \sqrt{\frac{n_{1}\, n_{2}\, (n_{1} + n_{2} + 1)}{12}} $$
Calculate the value of the standard normal variable: \(z = \frac{m-\mu}{\sigma} \) and conclude (see table 8) to reject \(H_0\) if \( | z | > z_\alpha \).
Example
In a study, a researcher tests, on a scale of 10, the perception scores of a YouTube channel after an advertising campaign:
Group 1 (\(E_1\)): 7, 8, 6, 9, 7
Group 2 (\(E_2\)): 6, 5, 6, 7, 8
■ Combine and rank the two groups: 5, 6, 6, 6, 7, 7, 7, 8, 8, 9;
■ Assign ranks (averaging tied ranks): 1; 3; 3; 3; 6; 6; 6; 8.5; 8.5; 10;
■ Calculate the sum of ranks for \(E_1\) (values 7, 8, 6, 9, 7): \(R_1 = 6 + 8.5 + 3 + 10 + 6 = 33.5\);
■ Calculate the sum of ranks for \(E_2\) (values 6, 5, 6, 7, 8): \(R_2 = 3 + 1 + 3 + 6 + 8.5 = 21.5\);
■ The smaller of the two sums is \(M = \min(33.5, 21.5) = 21.5\).
We will then compare this value to the Mann-Whitney table ( Appendix Statistics, Table: 8 ) to determine if the difference is significant.
We find that \(M = 21.5\) is greater than the critical value in the table, \(2\) (for \(n_1 = 5\), \(n_2 = 5\) and \(\alpha = 0.05\)), so we do not reject the null hypothesis: the data do not show a statistically significant difference in the perception of the channel between the two groups.
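The same comparison can be run with scipy (variable names are ours); note that scipy reports the U statistic rather than the rank sum used above, but the decision is the same.
from scipy import stats

# Perception scores from the example
group1 = [7, 8, 6, 9, 7]
group2 = [6, 5, 6, 7, 8]

# Two-sided Mann-Whitney test
u_stat, p_value = stats.mannwhitneyu(group1, group2, alternative="two-sided")

print(f"U = {u_stat}, p-value = {p_value:.3f}")
# The p-value is well above 0.05, so H0 is not rejected, as in the table-based decision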
6.2. The Wilcoxon Test
Context
Let there be two paired samples (where each value in one sample is associated with a value from the other sample).
We assume the null hypothesis \(H_{0} : \mu_{1} = \mu_{2}\).
Conditions and Test Procedures
To perform the Wilcoxon test, proceed as follows:
Start by calculating the differences between the paired values, removing zero differences; let \(N\) be the number of non-zero differences;
Rank these differences by ascending absolute values (ignoring the sign for ranking purposes);
Assign each difference its rank in this ordering; if there are ties, assign each a rank equal to the average of the ranks they occupy;
Calculate \(w_+\), the sum of ranks of positive differences, and \(w_-\), the sum of ranks of negative differences;
Let \(w = \min (w_{+}, w_{-})\), the smaller of the two values \(w_+\) and \(w_-\).
Decision Rule
Let \(W\) be the random variable that takes the value \(w\) after the random experiment:
If \(N \leq 25\), Table 10 provides the value \(w_\alpha\) based on \(N\), such that under \(H_0\), \(P (W \leq w_{\alpha}) = \alpha\) for \(\alpha = 0.05\) and \(\alpha = 0.01\); reject the null hypothesis if \(w \leq w_{\alpha}\);
If \(N > 25\), when \(H_0\) is true, \(W\) approximately follows the normal distribution \(\sim \mathcal{N}(\mu, \sigma)\), with:
$$ \mu = \frac{N (N+1)}{4} \quad \text{and} \quad \sigma = \sqrt{\frac{N (N+1) (2 N+1)}{24}} $$
Calculate the value of the standard normal variable: \(z = \frac{w - \mu}{\sigma}\) and conclude, using Table 10, to reject \(H_0\) if \(|z| > z_{\alpha}\).
Example
In a study, a Master's student tests the response times (in seconds) of users before and after the introduction of a new user interface. The results are as follows:
Before (seconds): 12, 15, 14, 10, 13
After (seconds): 10, 14, 13, 9, 12
■ Calculate the differences: -2, -1, -1, -1, -1;
■ Rank the absolute values of the differences: 1, 1, 1, 1, 2;
■ Assign ranks: 2.5; 2.5; 2.5; 2.5; 5;
■ Calculate the sum of ranks for positive differences \(w_+\) = 0, since there are no positive values;
■ Calculate the sum of ranks for negative differences \(w_-\) = 2.5 + 2.5 + 2.5 + 2.5 + 5 = 15;
■ The smaller of the two sums is \(w = \min(0, 15) = 0\).
We will then compare this value to the Wilcoxon table ( Appendix Statistics, Table: 10 ) to determine if the difference is significant.
We note that \(W = 0\) is equal to the critical value in the table: \(0\) (for \(N = 5\) and \(\alpha = 0.05\)), so we reject the null hypothesis, indicating that the new user interface has improved user response times.
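For comparison, here is the same test in scipy (variable names are ours); with only five pairs and tied differences, scipy may warn that an exact p-value cannot be computed, so the table-based decision above remains the reference for such small samples.
from scipy import stats

# Response times (seconds) before and after the new interface
before = [12, 15, 14, 10, 13]
after = [10, 14, 13, 9, 12]

# Paired Wilcoxon signed-rank test
w_stat, p_value = stats.wilcoxon(before, after)

print(f"W = {w_stat}, p-value = {p_value:.3f}")   # W = 0, the smaller rank sum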
6.3. The Kruskal-Wallis Test
Context
Let there be \(k\) independent and non-exhaustive samples: \(E_{1}, E_{2},........E_{k}\) with sizes: \(n_{1},n_{2},........n_{k}\).
The principle is to compare the \(k\) experimental means, which amounts to testing the null hypothesis \(H_{0} : \mu_{1} = \mu_{2} = ...... \mu_{k}\).
Test Conditions and Procedures
To perform the Kruskal-Wallis test, proceed as follows:
Rank all the values from these \(k\) samples in ascending order, then determine the rank of each value, following the same procedure as previous tests in case of ties;
For each sample \(E_i\), let \(r_i\) denote the sum of the ranks of the values in that sample;
Calculate the quantity:
$$ h = \frac{12}{n (n+1)} \left( \sum_{i=1}^{k} \frac{r_{i}^{2}}{n_{i}} \right) - 3 (n+1) $$
Note: \(n = \sum_{i=1}^{k} n_{i} \) represents the total sample size.
Decision Rule
Let \(H\) be the random variable that takes the value \(h\) at the end of the random experiment:
If the \(n_i\) are sufficiently large (classic threshold: \(n_i > 5 \) for all \(i\)), then, if \((H_0)\) is true, \(H\) follows the \(\chi^2\) distribution with \(k-1\) degrees of freedom;
In Table 5, read the value \(\chi_{\alpha}^{2}\) such that \( P (H \geq \chi_{\alpha}^{2} ) = \alpha \) and reject \((H_0)\) if \(h \geq \chi_{\alpha}^{2}\);
If the \(n_i\) are not sufficiently large, there are tables providing the value \(h_{\alpha}\), such that \(P ( H \geq h_{\alpha}) = \alpha \);
Reject \((H_0)\) if \(h \geq h_{\alpha}\). Table 13 provides \(h_\alpha\) for \(\alpha = 0.05\) and \(\alpha = 0.01\), for cases with three samples of sizes less than or equal to \(5\).
Example
A study is conducted to compare the effectiveness of three different awareness campaigns about safe social media usage. Each campaign is launched in a different region, and after one month, the participants' awareness level is assessed through a score out of 100. The three samples are independent, with sizes \(n_1 = 4\), \(n_2 = 5\), and \(n_3 = 6\). The goal is to compare the three campaigns to determine if there is a significant difference in average awareness levels. The null hypothesis is \( H_{0} : \mu_{1} = \mu_{2} = \mu_{3}\).
Suppose that, after ranking all 15 scores, the rank sums obtained for the three campaigns are \(r_1 = 52\), \(r_2 = 45\), and \(r_3 = 31.5\). The Kruskal-Wallis statistic is therefore:
$$ h = \frac{12}{15 \times 16} \left( \frac{52^2}{4} + \frac{45^2}{5} + \frac{31.5^2}{6} \right) - 3 \times 16 $$
$$ h = \frac{12}{240} \times \left( 676 + 405 + 165.375 \right) - 48 $$
$$ h = \frac{12}{240} \times 1246.375 - 48 $$
$$ h = 62.31875 - 48 $$
$$ h = 14.31875 $$
Decision
For a significance level \(\alpha = 0.05\) and \(k-1 = 2\) degrees of freedom (see the chi-square table, Table 5 of the Statistical Appendix), the critical value \(\chi_{\alpha}^{2}\) is approximately 5.991. Since \(h = 14.31875\) is greater than 5.991, we reject the null hypothesis \(H_0\) and conclude that there is a significant difference in participants' awareness levels according to the campaign used.
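Since the raw scores are not reproduced here, the sketch below simply recomputes \(h\) from the rank sums used above and reads the chi-square critical value with scipy (variable names are ours).
from scipy import stats

# Rank sums and group sizes used in the example
rank_sums = [52, 45, 31.5]
sizes = [4, 5, 6]
n = sum(sizes)

# Kruskal-Wallis statistic
h = 12 / (n * (n + 1)) * sum(r**2 / m for r, m in zip(rank_sums, sizes)) - 3 * (n + 1)

# Critical value of the chi-square distribution (alpha = 0.05, k - 1 = 2 degrees of freedom)
chi2_crit = stats.chi2.ppf(0.95, df=len(sizes) - 1)

print(f"h = {h:.5f}, critical value = {chi2_crit:.3f}")   # 14.31875 > 5.991: reject H0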
6.4. Spearman's Rank Correlation Coefficient
Context
In a population, consider two random variables \(X\) and \(Y\), and we want to test \(H_0\): Absence of correlation between \(X\) and \(Y\).
We generally have \(n\) pairs \((x_{i}, y_{i})\) of values of \(X\) and \(Y\) determined simultaneously.
In this case, we rank separately, in ascending order, the values \(x_{1}, x_{2}, \ldots, x_{n}\) and \(y_{1}, y_{2}, \ldots, y_{n}\).
Conditions and Procedures for the Test
To compute Spearman's rank correlation coefficient, proceed as follows:
■ Verify that the variables are ordinal, or if they are quantitative, they do not follow a normal distribution, or the relationships between the variables are not linear;
■ The data pairs must be independent.
■ The samples must be of sufficient size for the test to be valid. However, Spearman's test is robust to small samples.
■ Assign ranks to the values of each variable. In case of tied values, assign each value the average rank of the positions they occupy.
■ Calculate the differences between the ranks of each pair of observations.
■ Square each difference obtained.
■ Calculate the sum of the squares of the differences (\( \sum d_i^2 \)).
■ Apply Spearman's formula to obtain the correlation coefficient: $$ r_s = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)} $$ where \( n \) is the number of observations.
Decision Rule
To interpret Spearman's rank correlation coefficient \( r_s \), use the following criteria:
■ If \( r_s \) is close to +1 or -1, it indicates a strong positive or negative correlation, respectively;
■ If \( r_s \) is close to 0, it indicates an absence of correlation.
To test the null hypothesis \( H_0 \) that there is no correlation between the two variables, compare the observed value of \( r_s \) with the critical values from the Spearman table for a given significance level (often \( \alpha = 0.05 \)):
■ If \( |r_s| \) is greater than the critical value, reject \( H_0 \), indicating a statistically significant correlation;
■ If \( |r_s| \) is less than or equal to the critical value, do not reject \( H_0 \), meaning there is not enough evidence to conclude a significant correlation.
Example
A study is conducted to examine the relationship between the frequency of blog posts by an editorial team and the average reader engagement (number of comments per post). The objective is to determine if there is a correlation between these two ordinal variables. Results are obtained for 10 different posts.
Data and Ranks
The following data show the frequency of publication (in days) and the average engagement (number of comments) for each post. The values are ranked in ascending order to determine the ranks.
Calculation of Spearman's Rank Correlation Coefficient
Calculate the rank differences for each post, then square these differences. Finally, apply Spearman's rank correlation coefficient formula:
$$ r_s = 1 - \frac{6 \sum d_{i}^{2}}{n(n^2 - 1)} $$
where \( d_i \) is the difference between the ranks of each pair of observations, and \( n \) is the number of observations.
Interpretation
Spearman's rank correlation coefficient \( r_s = -0.991 \) indicates a strong negative correlation between the publication frequency and the average reader engagement. This means that as the frequency of publication increases, reader engagement decreases.
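Because the post-by-post values are given in the course table rather than reproduced here, the following sketch uses hypothetical data of the same shape (our own values); it shows that the rank-difference formula and scipy.stats.spearmanr give the same coefficient.
import numpy as np
from scipy import stats

# Hypothetical data: publication frequency (days between posts) and
# average engagement (comments per post) for 10 posts
frequency = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
engagement = [48, 45, 40, 38, 30, 28, 25, 20, 18, 15]

# Spearman's formula applied to the rank differences
rank_f = stats.rankdata(frequency)
rank_e = stats.rankdata(engagement)
d_squared = (rank_f - rank_e) ** 2
n = len(frequency)
r_s_formula = 1 - 6 * d_squared.sum() / (n * (n**2 - 1))

# Direct computation with scipy
r_s, p_value = stats.spearmanr(frequency, engagement)

print(r_s_formula, r_s)   # identical coefficients (here -1: strictly decreasing relationship)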
Summary
In this session, we explored the fundamental concepts of statistical inference, an essential branch of statistics that allows us to draw conclusions about a population from a sample of data. The session aims to familiarize you with the various tools and tests used in statistical inference.
During this session, we introduced key concepts such as parameter estimation, hypothesis testing, and different regression techniques. The goal is to understand how these concepts are applied to analyze data in various contexts.
Here is a review of the main concepts covered during the session:
Statistical inference: A set of methods used to draw conclusions about a population based on a sample;
Estimation: The process of determining approximate values of unknown population parameters based on sample data;
Hypothesis testing: A statistical method used to assess the plausibility of a hypothesis based on sampled data;
Chi-square test: A statistical test used to evaluate the independence between two categorical variables;
Analysis of Variance (ANOVA): A statistical technique that compares the means of several groups to determine if they come from the same population;
Correlation: A measure of the strength and direction of the relationship between two variables;
Regression: A statistical technique used to model and analyze the relationship between a dependent variable and one or more independent variables;
Non-parametric tests: A set of statistical tests that do not rely on specific distribution assumptions for the data;
Bibliography of the Block
The course material does not have a single final bibliography (in its online version); references are provided at the end of each Block.
Abboud, N., & Audroing, J. F. (1989). Probabilités et inférence statistique. Nathan supérieur-économie.
Acree, M. C. (2021). The myth of statistical inference. Springer Nature.
Bhushan, V. (1985). Inférence statistique. Presses Université Laval.
Casella, G., & Berger, R. (2024). Statistical inference. CRC Press.
Cox, D. R. (2006). Principles of statistical inference. Cambridge University Press.
Protassov, K. (2002). Analyse statistique des données expérimentales. Les Ulis: EDP Sciences.
Simard, C., & Desgreniers, E. (2015). Notions de statistique. Modulo.
Srivastava, M. K. (2009). Statistical Inference: Testing of Hypotheses. PHI Learning Pvt. Ltd.
Trosset, M. W. (2001). An introduction to statistical inference and Data analysis. Department of mathematics, College of William and Mary.
Summary Questions
What is statistical inference and why is it essential in data analysis?
What is the difference between a point estimate and an interval estimate? Provide an example of each.
How is a null hypothesis and an alternative hypothesis formulated in hypothesis testing? Explain with an example.
What does the Chi-square test involve and in what situations is it appropriate to use it?
What are the main steps to perform an analysis of variance (ANOVA) and what is its primary objective?
How is the Pearson correlation coefficient interpreted? What does a coefficient of -1, 0, and 1 mean?
What is the difference between simple linear regression and multiple regression? When should one be used over the other?
What are the advantages of non-parametric tests over parametric tests? Provide an example of a non-parametric test.
How can we check if the residuals of a regression model are normally distributed? Why is this important?
What are Type I and Type II errors in the context of hypothesis testing? How can they be minimized?
How should the results of an ANOVA be interpreted when rejecting the null hypothesis? What does this rejection mean in terms of comparing group means?
What are the conditions for applying the Chi-square test of independence and how are its results interpreted?
Quiz
The quiz consists of twenty questions related to the topic of statistical inference as well as the themes and concepts covered during the teaching session. To view and test your knowledge, click HERE :)
Course & TD Sheets
This session does not have downloadable sheets. We will have the opportunity to use exercise generators and the Python compiler during the directed work session dedicated to this.
Further Reading
To delve a bit deeper into the concepts related to statistical inference, you can refer to the following documents and videos:
Book
This book provides a simple explanation of the topics related to statistical inference in the human and social sciences. The book is available for free on cairn.info from your personal account: Méot, A. (2003). Introduction aux statistiques inférentielles: De la logique à la pratique. De Boeck Supérieur.
Course Material
This is the course material by Mr. Yves Tillé, widely used by students from various disciplines, presenting the concepts of probability and combinatorial analysis concisely. The course material can be downloaded freely by clicking HERE :) .
YouTube Channel
The channel explains the essential concepts of statistical inference through a series of episodes. Add it to your list
On the Course App
On the Course App, you will find the summary of this Block, as well as series of tutorials related to it.
You will also find references to multimedia content relevant to the Block.
In the Notifications section, an update is planned and will be based on the questions asked by students during the lectures and tutorials.
An update will also include corrections of previous session exams, which will be reviewed in the tutorials to prepare for the current year's exams.
Course Download
Using the link below, you can download the Flipbook in PDF format:
The Python Corner
In this Python Corner, you will find a table summarizing the essentials for statistical inference, with examples drawn from information and communication sciences.
Parameter: Point Estimate
Python Code:
import numpy as np
mean = np.mean(data)
print(mean)
Example: Estimate the average time spent by users on a social media site.
Calculation: mean = np.mean([15, 30, 45, 60, 75])
Explanation: The estimated average time spent per user is 45 minutes.
Parameter: One-Sample t-Test
Python Code:
from scipy import stats
t_stat, p_value = stats.ttest_1samp(data, popmean)
Example: Test whether the average visit duration on a news site differs from 50 minutes.
Calculation: t_stat, p_value = stats.ttest_1samp([15, 30, 45, 60, 75], 50)
Explanation: The t-test checks if the mean is significantly different from 50 minutes.
Parameter: Chi-Square Test
Python Code:
from scipy import stats
chi2, p_value = stats.chisquare(observed, f_exp=expected)
Example: Test the hypothesis that the distribution of article types (tech, culture, sport) on a site is balanced.
Calculation: chi2, p_value = stats.chisquare([50, 30, 20], f_exp=[100/3, 100/3, 100/3])
Explanation: The chi-square test checks if the observed distribution differs from the expected (balanced) distribution.
Parameter: Analysis of Variance (ANOVA)
Python Code:
from scipy import stats
f_stat, p_value = stats.f_oneway(group1, group2, group3)
Example: Compare the average time spent on three different news sites.
Calculation: f_stat, p_value = stats.f_oneway([20, 35, 50], [25, 40, 55], [30, 45, 60])
Explanation: ANOVA tests if the means of the three groups are significantly different.
Parameter: Pearson Correlation
Python Code:
from scipy import stats
corr, p_value = stats.pearsonr(x, y)
Example: Measure the correlation between the number of article shares and the number of comments.
Calculation: corr, p_value = stats.pearsonr([10, 20, 30], [2, 4, 6])
Explanation: The correlation coefficient indicates the strength and direction of the relationship between the two variables.
Parameter: Linear Regression
Python Code:
from sklearn.linear_model import LinearRegression
model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)
Example: Predict the number of visits based on the number of new articles published.
Calculation: model = LinearRegression().fit([[1], [2], [3]], [100, 150, 200])
Explanation: The regression model predicts the number of visits based on the number of articles published.
Parameter: Wilcoxon Test
Python Code:
from scipy import stats
w_stat, p_value = stats.wilcoxon(x, y)
Example: Compare user engagement on two different versions of a website.
Calculation: w_stat, p_value = stats.wilcoxon([3, 5, 7], [4, 6, 8])
Explanation: The Wilcoxon test is used to compare two paired non-parametric samples.
Discussion Forum
The forum allows you to engage in discussions about this session. You will notice a subscription button so you can follow conversations about research in the humanities and social sciences. It is also an opportunity for the instructor to address students' concerns and questions.