In the previous Block, we worked on univariate analysis, which involved examining variables one by one, separately. Univariate analysis is useful when conducting research aimed at describing or exploring a specific phenomenon.
We are now focusing on the other side of analysis, which is bivariate analysis. Bivariate analysis looks at variables in pairs, analyzing two variables simultaneously to examine the relationship between them and its strength.
The main areas of bivariate analysis involve identifying the relationship between two variables, quantifying the strength and direction of the relationship, modeling the relationship between two variables, and exploring the causal links between them. One topic of bivariate analysis concerns hypothesis testing, which we have placed in the last Block.
We have therefore divided this teaching into two sections: the first explores statistical distributions with two variables, and the second explains the statistical parameters of such a distribution.
During this session, we aim to achieve the following objectives, centered on mastering these notions:
Marginal distribution, conditional distribution, joint distribution, joint frequency, joint count, column-profile table, statistical independence, correlation, fitting (affine, linear, the least squares method), and the regression line.
A statistical distribution with two variables, also known as a bivariate distribution, is a representation that shows the relationship between two statistical variables. Unlike a univariate distribution, which deals with only one variable, the bivariate distribution allows us to examine how the two variables interact with each other.
The bivariate statistical distribution, as we mentioned earlier, is often represented graphically; however, statistical calculations serve to confirm what we observe. To simplify, we will say that the elements of interest in our teaching in this context are: the contingency table, the scatter plot, covariance (correlation), and linear regression. For the purposes of this session, we will limit ourselves to examining the first two elements only.
When two variables \(x\) and \(y\) are defined on a population composed of \(n\) individuals, the numerical representation can take the form of an elementary table or a contingency table. The following lines illustrate this idea:
An elementary table lists, for each individual \(i\) in the population, the values \(x_i\) and \(y_i\) of each of the studied variables in adjacent columns. One could say that it is a combination, a superposition, of two simple statistical tables (each with a single entry).
The following table represents an elementary table:
Individual \(i\) | Values of variable \(x\) | Values of variable \(y\) |
---|---|---|
\(1\) | \(x_1\) | \(y_1\) |
\(2\) | \(x_2\) | \(y_2\) |
\(3\) | \(x_3\) | \(y_3\) |
\(...\) | \(...\) | \(...\) |
\(n-2\) | \(x_{n-2}\) | \(y_{n-2}\) |
\(n-1\) | \(x_{n-1}\) | \(y_{n-1}\) |
\(n\) | \(x_n\) | \(y_n\) |
Example: The following table represents an elementary table featuring two variables: Gender and Education Level
Individual \(i\) | Gender | Education Level |
---|---|---|
1 | Male | Primary |
2 | Male | Secondary |
3 | Female | Secondary |
4 | Male | University |
5 | Female | University |
6 | Male | Middle |
7 | Female | Middle |
We can see that, for each individual in our example, the pair of modalities that describes them is placed side by side.
An elementary table is used when one wants to organize and present data in a way that allows for comparing the values of multiple variables for a set of individuals or observation units. An elementary table is not used for analysis; it is the bivariate equivalent of a single-entry statistical table.
The contingency table, also known as a cross table or correspondence table, defines a joint distribution. It relates a pair of variables \((x, y)\) by associating the frequencies corresponding to each pair of modalities.
A contingency table, unlike an elementary table, is used to present and analyze the relationship between two variables. It summarizes the frequencies of observations that lie at the intersection of the modalities of these two variables.
A contingency table is presented in the following format:
\(Modalities ~~of ~~x\) ╲ \(Modalities ~~of ~~y\) | \(y_1\) | \(y_2\) | \(y_3\) | \(...\) | \(y_p\) |
\(x_1\) | \(n_{11}\) | \(n_{12}\) | \(n_{13}\) | \(...\) | \(n_{1p}\) |
\(x_2\) | \(n_{21}\) | \(n_{22}\) | \(n_{23}\) | \(...\) | \(n_{2p}\) |
\(x_3\) | \(n_{31}\) | \(n_{32}\) | \(n_{33}\) | \(...\) | \(n_{3p}\) |
\(...\) | \(...\) | \(...\) | \(...\) | \(...\) | \(...\) |
\(x_k\) | \(n_{k1}\) | \(n_{k2}\) | \(n_{k3}\) | \(...\) | \(n_{kp}\) |
Note:
The variable \(x\) has \(k\) modalities listed in the left margin of the table [for an explanation of the nature and uses of statistical tables, see Lahanier-Reuter, D. (2003)]. Each row of the table corresponds to a modality of \(x\) and contains the counts for that modality; it is customary to denote the row index by the letter \(i\) (\(x_i\), with \(i\) ranging from 1 to \(k\)). The variable \(y\) is listed in the top margin of the contingency table, and each of its modalities corresponds to a column of the table; the column index is typically denoted by the letter \(j\) (\(y_j\), with \(j\) ranging from 1 to \(p\)).
Example:
The following table shows the relationship between education level and employment status for a group of people:
Education Level | Employed | Unemployed | Total |
---|---|---|---|
Primary | 40 | 10 | 50 |
Secondary | 60 | 20 | 80 |
Tertiary | 30 | 10 | 40 |
Total | 130 | 40 | 170 |
The joint distribution refers to the pairs \((x_i, y_j)\) formed from the left and top margins (the \(k\) rows and \(p\) columns). Simply put, it is the set of triples \((x_i, y_j, n_{ij})\), with \(i\) ranging from 1 to \(k\) and \(j\) from 1 to \(p\).
The joint frequency \(n_{ij}\) is the number of individuals presenting both the modality \(x_i\) of the variable \(x\) and the modality \(y_j\) of the variable \(y\).
The joint frequencies sum to the total count \(n\), as represented by the following formula:
$$\sum_{j=1}^{p} \sum_{i=1}^{k} n_{ij} = \sum_{i=1}^{k}\sum_{j=1}^{p} n_{ij} = n $$
Sometimes it is more useful to work with proportions instead of counts; each joint frequency is then divided by \(n\) (the total count), giving what is called the joint proportion \(f_{ij} = n_{ij}/n\).
As in a univariate table, the sum of the joint proportions is equal to 1.
$$\sum_{j=1}^{p} \sum_{i=1}^{k} f_{ij} = \sum_{i=1}^{k}\sum_{j=1}^{p} f_{ij} = 1 $$
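As a minimal sketch of these definitions in Python (assuming pandas is available; the table reproduces the education/employment example above):

```python
import pandas as pd

# Contingency table from the example: education level (rows) vs employment status (columns).
table = pd.DataFrame(
    {"Employed": [40, 60, 30], "Unemployed": [10, 20, 10]},
    index=["Primary", "Secondary", "Tertiary"],
)

n = table.to_numpy().sum()   # total count n = 170
f = table / n                # joint proportions f_ij = n_ij / n
print(f.round(3))
print("Sum of joint proportions:", f.to_numpy().sum())  # 1.0
```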
The marginal distribution concerns the distribution of a single variable (\(x\) or \(y\)). In other words, obtaining a marginal distribution means deducing the distribution of each variable considered in isolation, as seen in the previous lesson; it is the process of extracting single-entry tables.
Note: in our previous example linking education level to employment status, the Total row and the Total column of the table already display these two single-variable distributions.
The distribution related to the variable \(x\) alone is called the marginal distribution of \(x\); the distribution related to the variable \(y\) alone is called the marginal distribution of \(y\).
In the previous example, constructing the marginal distributions involves creating a single-entry table for the Education Level variable and another for the Employment Status variable. We will explain the theoretical principles of marginal distribution in the following lines.
The marginal distribution of the variable \(x\) is defined by the pairs \((x_i, n_{i.})\), where \(i = 1, 2, 3, \ldots, k\) [\(x_i\) is a modality of the variable \(x\) and \(n_{i.}\) is the corresponding count, referred to as the marginal frequency of the modality \(x_i\)]. This represents the number of individuals whose modality of \(x\) is \(x_i\), whatever their modality of \(y\); it is equal to the sum of the counts in the \(i\)-th row, as shown in the following formula:
$$ n_{i.} = n_{i1} + n_{i2} + n_{i3} + \ldots + n_{ip} = \sum_{j=1}^{p} n_{ij} $$
It should be noted that the sum of the marginal frequencies equals the total count \(n\) (also written \(n_{..}\)).
The following table shows the marginal distribution of the variable \(x\):
\(Modalities ~~of ~~the ~~variable ~~x\) | \(Marginal ~~Frequency\) |
\(x_1\) | \(n_{1.}\) |
\(x_2\) | \(n_{2.}\) |
\(x_3\) | \(n_{3.}\) |
\(...\) | \(...\) |
\(x_k\) | \(n_{k.}\) |
\(\sum\) | \(n\) |
The marginal proportion of a modality \(x_i\), denoted \(f_{i.}\), is defined as follows:
$$f_{i.} = \frac{n_{i.}}{n}$$
The sum of the marginal proportions \(f_{i.}\) is equal to 1 [\(\sum_{i=1}^{k} f_{i.} = 1\)].
The marginal distribution of the variable \(y\) consists of the pairs \((y_j, n_{.j})\) (\(j = 1, 2, 3, \ldots, p\)), where \(y_j\) is a modality of the variable \(y\) and \(n_{.j}\) is the corresponding count, referred to as the marginal frequency of the modality \(y_j\). This represents the number of individuals whose modality of \(y\) is \(y_j\), whatever their modality of \(x\); it is equal to the sum of the counts in the \(j\)-th column, as shown in the following formula:
$$ n_{.j} = n_{1j} + n_{2j} + n_{3j} + \ldots + n_{kj} = \sum_{i=1}^{k} n_{ij} $$
It should be noted that the sum of the marginal frequencies equals the total count \(n\) (also written \(n_{..}\)).
The following table shows the marginal distribution of the variable \(y\):
\(Modalities ~~of ~~the ~~variable ~~y\) | \(Marginal ~~Frequency\) |
\(y_1\) | \(n_{.1}\) |
\(y_2\) | \(n_{.2}\) |
\(y_3\) | \(n_{.3}\) |
\(...\) | \(...\) |
\(y_p\) | \(n_{.p}\) |
\(\sum\) | \(n\) |
The marginal proportion of a modality \(y_j\), denoted \(f_{.j}\), is defined as follows:
$$f_{.j} = \frac{n_{.j}}{n}$$
The sum of the marginal proportions \(f_{.j}\) is equal to 1 [\(\sum_{j=1}^{p} f_{.j} = 1\)].
Note: Contingency table and marginal distributions.
It should be noted that the marginal frequencies \(n_{i.}\) are displayed in an additional column, while the marginal frequencies \(n_{.j}\) are shown in an additional row of the joint distribution (x, y).
The following table illustrates this idea:
\(Modalities ~~of ~~x\) ╲ \(Modalities ~~of ~~y\) | \(y_1\) | \(y_2\) | \(y_3\) | \(...\) | \(y_p\) | \(\sum\) |
\(x_1\) | \(n_{11}\) | \(n_{12}\) | \(n_{13}\) | \(...\) | \(n_{1p}\) | \(n_{1.}\) |
\(x_2\) | \(n_{21}\) | \(n_{22}\) | \(n_{23}\) | \(...\) | \(n_{2p}\) | \(n_{2.}\) |
\(x_3\) | \(n_{31}\) | \(n_{32}\) | \(n_{33}\) | \(...\) | \(n_{3p}\) | \(n_{3.}\) |
\(...\) | \(...\) | \(...\) | \(...\) | \(...\) | \(...\) | \(...\) |
\(x_k\) | \(n_{k1}\) | \(n_{k2}\) | \(n_{k3}\) | \(...\) | \(n_{kp}\) | \(n_{k.}\) |
\(\sum\) | \(n_{.1}\) | \(n_{.2}\) | \(n_{.3}\) | \(...\) | \(n_{.p}\) | \(n\) |
We note that the marginal distribution of \(x\) is given by the left margin and the last column of the table, while that of the variable \(y\) is given by the top margin and the last row.
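A short sketch of the same idea in Python: the marginal distributions are simply the row and column totals of the contingency table (pandas assumed, reusing the education/employment example):

```python
import pandas as pd

table = pd.DataFrame(
    {"Employed": [40, 60, 30], "Unemployed": [10, 20, 10]},
    index=["Primary", "Secondary", "Tertiary"],
)

n_i = table.sum(axis=1)  # marginal distribution of x: row totals n_i.
n_j = table.sum(axis=0)  # marginal distribution of y: column totals n_.j
print(n_i)               # Primary 50, Secondary 80, Tertiary 40
print(n_j)               # Employed 130, Unemployed 40
print("n =", int(n_i.sum()))  # 170, the same as n_j.sum()
```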
Note: Marginal statistics.
Each contingency table consists of two univariate distributions. By creating marginal distribution tables, we can calculate the indices seen in the previous chapters (central tendency, dispersion, and position …). In the following chapters, we will need to calculate these indices and make the necessary interpretations for the analysis.
The conditional distribution of \(x\) given \(y\) is the distribution of \(x\) restricted to the individuals presenting the modality \(y_j\) of \(Y\). Similarly, the conditional distribution of \(y\) given \(x\) is the distribution of \(y\) restricted to the individuals presenting the modality \(x_i\) of \(X\).
The variable \(y\) has \(p\) modalities, so the population could be divided into \(p\) sub-populations (individuals identified by modality \(y_1\), those identified by modality \(y_2\), … up to those identified by \(y = y_p\)). For each sub-population, we can have what is called a conditional distribution.
Following this reasoning, we obtain \(p\) conditional distributions of \(x\) given \(y\):
The conditional distribution of \(x\) given \(y = y_{1}\);
The conditional distribution of \(x\) given \(y = y_{2}\);
The conditional distribution of \(x\) given \(y = y_{3}\);
................................................... ;
The conditional distribution of \(x\) given \(y = y_{p}\);
Each distribution is defined by a pair (\(x_i\), \(n_{ij}\)) [\(i\) ranging from \(1\) to \(k\) and \(j\) being fixed]. The following table represents the idea of a conditional distribution:
\(Modalities ~~of ~~x \) | \(Conditional ~~frequencies ~~n_{ij}\) |
\(x_1\) | \(n_{1j}\) |
\(x_2\) | \(n_{2j}\) |
\(x_3\) | \(n_{3j}\) |
\(...\) | \(...\) |
\(x_k\) | \(n_{kj}\) |
\(\sum\) | \(n_{.j}\) |
The total frequency is given by the formula:
$$ n_{.j} = n_{1j} + n_{2j} + n_{3j} + \ldots + n_{kj} = \sum_{i=1}^{k} n_{ij} $$
We can also calculate the conditional frequencies \(f_{xi/yj}\) using the formula:
$$ f_{xi/yj} = \frac{n_{ij}}{n_{.j}} $$
The sum of the conditional frequencies is equal to 1.
With the help of conditional distributions (or frequencies), we can create a column-profile table. A column-profile table contains the modalities of \(x\) in the left margin, the conditional frequencies of \(x\) given \(y = y_1\) in the first column, the conditional frequencies of \(x\) given \(y = y_2\) in the second column, …, the conditional frequencies of \(x\) given \(y = y_p\) in the last column. The following table illustrates the column-profile table.
\(Modalities ~~of ~~x\) ╲ \(Modalities ~~of ~~y\) | \(y_1\) | \(y_2\) | \(y_3\) | \(...\) | \(y_p\) | \(f_{i.}\) |
\(x_1\) | \(f_{x1/y1}\) | \(f_{x1/y2}\) | \(f_{x1/y3}\) | \(...\) | \(f_{x1/yp}\) | \(f_{1.}\) |
\(x_2\) | \(f_{x2/y1}\) | \(f_{x2/y2}\) | \(f_{x2/y3}\) | \(...\) | \(f_{x2/yp}\) | \(f_{2.}\) |
\(x_3\) | \(f_{x3/y1}\) | \(f_{x3/y2}\) | \(f_{x3/y3}\) | \(...\) | \(f_{x3/yp}\) | \(f_{3.}\) |
\(...\) | \(...\) | \(...\) | \(...\) | \(...\) | \(...\) | \(...\) |
\(x_k\) | \(f_{xk/y1}\) | \(f_{xk/y2}\) | \(f_{xk/y3}\) | \(...\) | \(f_{xk/yp}\) | \(f_{k.}\) |
\(\sum\) | \(1\) | \(1\) | \(1\) | \(...\) | \(1\) | \(1\) |
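In Python, a column-profile table can be sketched by dividing each column of the contingency table by its total (pandas assumed, same example as above):

```python
import pandas as pd

table = pd.DataFrame(
    {"Employed": [40, 60, 30], "Unemployed": [10, 20, 10]},
    index=["Primary", "Secondary", "Tertiary"],
)

# Column profiles: conditional frequencies f_{xi/yj} = n_ij / n_.j
col_profiles = table.div(table.sum(axis=0), axis=1)
print(col_profiles.round(3))
print(col_profiles.sum(axis=0))  # each column sums to 1
```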
The variable \(x\) has \(k\) modalities, so we can divide the population into \(k\) sub-populations (individuals identified by modality \(x_1\), those identified by modality \(x_2\), … up to those identified by \(x = x_k\)). For each sub-population, we can have what is called the conditional distribution of individuals according to the modalities of the variable \(y\).
Following this reasoning, we will obtain \(k\) conditional distributions of \(y\) given \(x\):
The conditional distribution of \(y\) given \(x = x_1\);
The conditional distribution of \(y\) given \(x = x_2\);
The conditional distribution of \(y\) given \(x = x_3\);
.................................................. ;
The conditional distribution of \(y\) given \(x = x_k\);
Each distribution is defined by a pair (\(y_j\), \(n_{ij}\)) [\(j\) ranging from \(1\) to \(p\) and \(i\) being fixed].
The following table represents the idea of a conditional distribution:
\(Modalities ~~of ~~y \) | \(Conditional ~~frequencies ~~n_{ij}\) |
\(y_1\) | \(n_{i1}\) |
\(y_2\) | \(n_{i2}\) |
\(y_3\) | \(n_{i3}\) |
\(...\) | \(...\) |
\(y_p\) | \(n_{ip}\) |
\(\sum\) | \(n_{i.}\) |
The total count is given by the formula:
$$ n_{i.} = n_{i1} + n_{i2} + n_{i3} + \ldots + n_{ip} = \sum_{j=1}^{p} n_{ij} $$
You can also calculate the conditional frequencies \(f_{yj/xi}\) using the formula: \(f_{yj/xi} = \frac{n_{ij}}{n_{i.}}\)
The sum of the conditional frequencies is equal to 1.
Using the conditional distributions (or frequencies), we can create a row-profile table. A row-profile table contains the modalities of \(x\) in the left margin; the conditional frequencies of \(y\) given \(x = x_1\) fill the first row, those of \(y\) given \(x = x_2\) the second row, …, and those of \(y\) given \(x = x_k\) the last row, so that each row sums to 1.
The following table illustrates the row-profile table:
\(Modalities ~~of ~~x\) ╲ \(Modalities ~~of ~~y\) | \(y_1\) | \(y_2\) | \(y_3\) | \(...\) | \(y_p\) | \(\sum\) |
\(x_1\) | \(f_{y1/x1}\) | \(f_{y2/x1}\) | \(f_{y3/x1}\) | \(...\) | \(f_{yp/x1}\) | \(1\) |
\(x_2\) | \(f_{y1/x2}\) | \(f_{y2/x2}\) | \(f_{y3/x2}\) | \(...\) | \(f_{yp/x2}\) | \(1\) |
\(x_3\) | \(f_{y1/x3}\) | \(f_{y2/x3}\) | \(f_{y3/x3}\) | \(...\) | \(f_{yp/x3}\) | \(1\) |
\(...\) | \(...\) | \(...\) | \(...\) | \(...\) | \(1\) |
\(x_k\) | \(f_{y1/xk}\) | \(f_{y2/xk}\) | \(f_{y3/xk}\) | \(...\) | \(f_{yp/xk}\) | \(1\) |
\(\sum\) | \(f_{.1}\) | \(f_{.2}\) | \(f_{.3}\) | \(...\) | \(f_{.p}\) | \(1\) |
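Symmetrically, a row-profile table divides each row by its total; a minimal pandas sketch:

```python
import pandas as pd

table = pd.DataFrame(
    {"Employed": [40, 60, 30], "Unemployed": [10, 20, 10]},
    index=["Primary", "Secondary", "Tertiary"],
)

# Row profiles: conditional frequencies f_{yj/xi} = n_ij / n_i.
row_profiles = table.div(table.sum(axis=1), axis=0)
print(row_profiles.round(3))
print(row_profiles.sum(axis=1))  # each row sums to 1
```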
The conditional average of \(x\) is calculated for each of the \(p\) conditional distributions of \(x\). The conditional average of \(x\) given \(y = y_j\) is generally denoted \({\bar{x}}_j\). It should be understood as a weighted average:
$${\bar{x}}_j\ =\ \frac{1}{n_{.j}}\ \sum_{i=1}^{k}{n_{ij}\ x_i}$$
By replacing the counts with frequencies, we get the following formula:
$${\bar{x}}_j\ =\ \sum_{i=1}^{k}{f_{xi/yj}\ x_i}$$
For the conditional average of \(y\) given \(x = x_i\), we have the following formula:
$${\bar{y}}_i\ =\ \frac{1}{n_{i.}}\ \sum_{j=1}^{p}{n_{ij}\ y_j}$$
By replacing the counts with frequencies, we get the following formula:
$${\bar{y}}_i = \sum_{j=1}^{p}{f_{yj/xi}\ y_j}$$
Note that for each variable, the marginal mean is equal to the mean of the conditional means. This relationship is expressed in the following formulas:
$$ \bar{x}\ =\ \frac{1}{n}\ \sum_{j=1}^{p}{n_{.j}\ {\bar{x}}_j} ~~~~ and ~~~~ \bar{y}\ =\ \frac{1}{n}\ \sum_{i=1}^{k}{n_{i.}\ {\bar{y}}_i}$$
The conditional variance of \(x\) given \(y = y_j\) is given by the formula:
$$ V_{j} (x) = \frac{1}{n_{.j}} \sum_{i=1}^{k} {n_{ij}} ~~(x_i - \bar{x}_{j}) ^ {2} = \frac{1}{n_{.j}} \sum_{i=1}^{k} {n}_{ij} ~~ {x}_{i}^{2} - {\bar{x}}_{j}^{2}$$
By replacing the counts with conditional frequencies (which already incorporate the division by \(n_{.j}\)), we get the following formula:
$$ V_{j}(x) = \sum_{i=1}^{k} f_{xi/yj}\,(x_i - \bar{x}_{j})^{2} = \sum_{i=1}^{k} f_{xi/yj}\ x_{i}^{2} - \bar{x}_{j}^{2}$$
The conditional standard deviation is calculated using the formula: \( \sigma_j(x) = \sqrt{V_j(x)} \)
For the variance and standard deviation of \(Y\) given \(X = x_i\), the formula is:
$$ V_{i} (y) = \frac{1}{n_{i.}} \sum_{j=1}^{p} {n_{ij}} ~~(y_j - \bar{y}_{i}) ^ {2} = \frac{1}{n_{i.}} \sum_{j=1}^{p} {n}_{ij} ~~ {y}_{j}^{2} - {\bar{y}}_{i}^{2}$$
By replacing the counts with conditional frequencies (which already incorporate the division by \(n_{i.}\)), we get the following formula:
$$ V_{i}(y) = \sum_{j=1}^{p} f_{yj/xi}\,(y_j - \bar{y}_{i})^{2} = \sum_{j=1}^{p} f_{yj/xi}\ y_{j}^{2} - \bar{y}_{i}^{2}$$
The conditional standard deviation is calculated using the formula: \( \sigma_i(y) = \sqrt{V_i(y)} \)
It is also noted that the marginal variance equals the sum of the mean of the conditional variances plus the variance of the conditional means, as shown by the following two formulas:
$$V(x) = \frac{1}{n} \sum_{j=1}^{p} n_{.j}\ V_j(x) + \frac{1}{n} \sum_{j=1}^{p} n_{.j}\ (\bar{x}_j - \bar{x})^2 $$
$$ V(y) = \frac{1}{n} \sum_{i=1}^{k} n_{i.}\ V_i(y) + \frac{1}{n} \sum_{i=1}^{k} n_{i.}\ (\bar{y}_i - \bar{y})^2 $$
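This decomposition can be checked numerically; the sketch below uses a small invented data set (the values are illustrative only) and verifies that the within and between terms add up to the marginal variance:

```python
import numpy as np
import pandas as pd

# Illustrative raw data: a numeric variable y observed in three groups of x.
df = pd.DataFrame({
    "x": ["A", "A", "A", "B", "B", "C", "C", "C", "C"],
    "y": [10, 12, 14, 20, 22, 30, 31, 29, 30],
})
n = len(df)

g = df.groupby("x")["y"]
n_i = g.count()            # conditional counts n_i.
means = g.mean()           # conditional means  ybar_i
variances = g.var(ddof=0)  # conditional variances V_i(y), population convention

within = (n_i * variances).sum() / n                       # mean of the conditional variances
between = (n_i * (means - df["y"].mean()) ** 2).sum() / n  # variance of the conditional means
print(within + between)     # equals ...
print(df["y"].var(ddof=0))  # ... the marginal variance V(y)
```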
Example: the table below extends the previous example; each cell now contains three repeated counts, whose dispersion we want to measure (the Total column and row are computed from the cell means):
Education Level | Employed | Unemployed | Total |
---|---|---|---|
Primary | 40, 42, 38 | 10, 12, 8 | 50 |
Secondary | 60, 65, 55 | 20, 22, 18 | 80 |
University | 30, 33, 27 | 10, 11, 9 | 40 |
Total | 130 | 40 | 170 |
Conditional distributions show the distribution of the counts of one variable for each modality of the other variable. For example, we examine the distribution of employment status conditionally on each education level.
Education Level | Employed (%) | Unemployed (%) |
---|---|---|
Primary | 80.0% | 20.0% |
Secondary | 75.0% | 25.0% |
University | 75.0% | 25.0% |
Conditional means allow us to calculate the average of the counts for each modality of the independent variable. Here are the means:
Education Level | Average for Employees | Average for Unemployed |
---|---|---|
Primary | 40.0 | 10.0 |
Secondary | 60.0 | 20.0 |
University | 30.0 | 10.0 |
Conditional variances and conditional standard deviations measure the dispersion of counts for each modality of the independent variable. The calculations are as follows:
Formulas (applied to the three repeated counts in each cell; note that the sample versions, with \(N - 1\), are used here):
Conditional Variance = \(\frac{\sum (x_i - \bar{x})^2}{N - 1}\)
Conditional Standard Deviation = \(\sqrt{\text{Conditional Variance}}\)
Education Level | Variance of Employees | Standard Deviation of Employees | Variance of Unemployed | Standard Deviation of Unemployed |
---|---|---|---|---|
Primary | 4 | 2 | 4 | 2 |
Secondary | 25 | 5 | 4 | 2 |
University | 9 | 3 | 1 | 1 |
Note: The variance and standard deviation values are calculated from the three repeated counts in each cell and show the dispersion of these counts for each education level.
Note: Between- and Within-Population Variance
The average of the conditional variances measures the dispersion within each of the subpopulations defined by the modalities of the conditioning variable (this is the intra-population, or within, variance). The second term, the variance of the conditional means, measures the dispersion of the conditional means of the different subpopulations around the marginal mean (this is the inter-population, or between, variance).
Use the bivariate table to analyze interactions between different variables. Add rows and columns, input the data, and calculate statistics such as conditional means, variances, and standard deviations. Try it now to enhance your statistical skills.
Two variables \(x\) and \(y\) are said to be independent when the conditional frequencies \(f_{yj/xi}\) are all equal to the marginal proportion \(f_{.j}\). Consequently, in the row-profile table all rows are identical; similarly, in the column-profile table the conditional frequencies \(f_{xi/yj}\) are all equal to the marginal proportion \(f_{i.}\).
It will be observed that in the contingency table, the rows are proportional to each other, and the same is true for the columns: \(n_{ij}\ =\ \frac{n_{i.}\ n_{.j}}{n} (f_{ij}\ =\ f_{i.}\ f_{.j})\).
Example: in the following table, the rows (and the columns) are proportional, so \(x\) and \(y\) are independent; one can check that each count equals \(\frac{n_{i.}\ n_{.j}}{n}\) (for instance, \(n_{11} = \frac{50 \times 60}{150} = 20\)):
 | \(y_1\) | \(y_2\) | \(\sum\) |
---|---|---|---|
\(x_1\) | 20 | 30 | 50 |
\(x_2\) | 40 | 60 | 100 |
\(\sum\) | 60 | 90 | 150 |
The statistical independence of \(X\) and \(Y\) results in:
identical row profiles and identical column profiles;
joint proportions that factor as \(f_{ij} = f_{i.}\ f_{.j}\) for every pair \((i, j)\);
for numerical variables, a zero covariance and a zero correlation coefficient.
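A quick numerical check of independence, assuming NumPy: compare each observed count with \(n_{i.}\ n_{.j} / n\) (the table is the independent example above):

```python
import numpy as np

table = np.array([[20, 30],
                  [40, 60]])
n = table.sum()
# Expected counts under independence: n_i. * n_.j / n
expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / n
print(expected)                      # identical to the observed table
print(np.allclose(table, expected))  # True: x and y are independent
```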
Variable \(y\) is functionally related to \(x\) when each modality (value or class) of \(x\) corresponds to exactly one modality (value or class) of the variable \(y\). It will be observed that, in the joint distribution table, there is only one non-zero count per row [the same reasoning applies, by analogy, to a variable \(x\) functionally related to \(y\)].
Two variables \(x\) and \(y\) are said to be in reciprocal functional dependency (reciprocally dependent) when each modality (value or class) of variable \(x\) corresponds to a modality (value or class) of variable \(y\) and vice versa. In this case, there will be as many rows as columns in the joint distribution table, and exactly one non-zero frequency per row and column.
Two variables are said to be in relative dependency when they are neither independent nor functionally related; this intermediate situation is the usual case in practice, and it motivates the measures of association introduced in the rest of the chapter.
The topics covered in this section (i.e. covariance, correlation, and adjustment) are introduced briefly. We will revisit them in more detail in the final Block of our Course, where we will discuss the principles of statistical inference. They are introduced here for mastering the vocabulary of bivariate data analysis.
Covariance measures the joint variability of two numerical variables: it is positive when the variables tend to deviate from their means in the same direction, negative when they deviate in opposite directions, and close to zero when no linear link appears between them.
Covariance is given by the following formula:
\[ Cov(x,y) = \frac{1}{n} \sum_{\substack{1 \leq i \leq k \\ 1 \leq j \leq p }} n_{ij} ~ (x_{i} - \bar{x}) (y_{j} - \bar {y}) \]
The previous equation can be simplified as follows:
\[ Cov(x,y) = \frac{1}{n} \left( \sum_{\substack{1 \leq i \leq k \\ 1 \leq j \leq p }} n_{ij} ~~ x_{i} y_{j} \right ) - \bar{x} \bar {y} \]
Covariance has the following properties:
\(Cov(x, x) = V(x)\) and \(Cov(x, y) = Cov(y, x)\);
it reacts to changes of scale: \(Cov(ax + b, cy + d) = ac\ Cov(x, y)\);
its absolute value is bounded: \(\lvert Cov(x, y) \rvert \leq \sigma(x)\ \sigma(y)\);
it is zero when the variables are independent.
From covariance, we can calculate the linear correlation coefficient of \(X\) and \(Y\), which is defined by the following formula:
\[ r = \frac {Cov(x,y)} {\sigma(X) ~ \sigma(Y)} \]
The value of \(r\) is invariant under changes of origin and of (positive) scale; it ranges between \(-1\) and \(+1\) and takes the value zero when the variables are independent (the converse does not hold).
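A minimal NumPy sketch, on invented paired values, of the covariance (with the \(1/n\) convention used in the formulas above) and of the correlation coefficient:

```python
import numpy as np

# Illustrative paired observations (all weights n_ij equal to 1).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

cov = np.mean((x - x.mean()) * (y - y.mean()))  # Cov(x, y), 1/n convention
r = cov / (x.std() * y.std())                   # linear correlation coefficient
print(round(cov, 3), round(r, 3))
print(round(np.corrcoef(x, y)[0, 1], 3))        # same r computed by NumPy
```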
Adjustment involves fitting a statistical model to a set of observed data. The goal is to find a function or model that best represents the relationship between the variables in the data.
There are several adjustment methods: Linear Regression and Non-linear Regression, Coefficient of Determination, Curve Fitting, and Residual Analysis. In this Course, we focus on the method known as Least Squares, and we will provide a brief definition and the logic behind the method in the following lines.
In a scatter plot \((x_{1}, y_{1})\), \((x_{2}, y_{2})\)... \((x_{p}, y_{p})\) with weighting coefficients \(n_1\), \(n_2\), ... \(n_p\) equal to 1, we identify a form of functional relationship between \(x\) and \(y\) depending on the appearance of the scatter plot (this relationship can take one of the forms: \(y = ax + b\); \(y = ax^{b}\), etc.).
The principle of adjustment is to determine the parameter values that minimize the distance between the points and the curve that represents the chosen model to account for the functional relationship.
Adjustment is said to be affine (\(y = ax + b\)), or linear (\(y = ax\)) when the points are, in a certain way, aligned; this is referred to as fitting by a straight line.
In the least squares method, the goal is to minimize the distance \(S\) defined as the sum of the squares of the vertical deviations: \(S = \displaystyle\sum_{i=1}^{p} n_{i} (y_{i} - ax_{i} - b)^{2}\)
The following figure illustrates the idea of affine adjustment using the least squares method:
The regression line of \(Y\) with respect to \(X\) has a slope \(a\) given by the equation: \[ a = \frac {Cov (X, Y)}{V(X)} \]
The regression line passes through the mean point \(M (\bar{x} , \bar{y})\), which gives the intercept \(b = \bar{y} - a\,\bar{x}\); we will revisit the topic of regression in the session dedicated to it.
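A sketch of the least squares fit on the same invented values as above: the slope comes from \(a = Cov(x,y)/V(x)\) and the intercept from the fact that the line passes through \((\bar{x}, \bar{y})\):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

a = np.mean((x - x.mean()) * (y - y.mean())) / np.var(x)  # slope a = Cov(x, y) / V(x)
b = y.mean() - a * x.mean()                               # line through the mean point
print(round(a, 3), round(b, 3))
print(np.polyfit(x, y, 1))  # same coefficients via NumPy's least squares fit
```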
In this section, we will describe the most commonly used statistical indices for highlighting the association between two variables (independent and dependent). The goal of this section is to help the reader choose the appropriate test based on the nature of the two variables involved and the logic of the empirical research at hand. The choice of a test or type of measure precedes the inference operation (which we will address in the final Block of our module); indeed, some associations can be described without generalizing from the sample to the population.
The following table summarizes the main measures of association for contingency tables:
Measure of Association | Table Dimension | Nature of Associated Variables | Result |
---|---|---|---|
Phi | \(2 \times 2\) | Nominal x nominal | -1 to +1 |
Phi | \(2 \times 2\) or more | Nominal x nominal | 0 to 1 |
Contingency Coefficient | \(2 \times 2\) or more | Nominal x nominal | 0 to 1 |
Cramér's V | \(2 \times 2\) or more | Nominal x nominal | 0 to 1 |
Lambda | \(2 \times 2\) or more | Nominal x nominal | Percentage of Error Reduction |
Kappa | \(2 \times 2\) or more | Nominal x nominal | -1 to +1 |
Gamma | \(2 \times 2\) or more | Ordinal x ordinal | -1 to +1 |
Kendall's Tau | \(2 \times 2\) or more | Ordinal x ordinal | -1 to +1 |
Eta | \(2 \times 2\) or more | Nominal x cardinal | 0 to 1 |
The Phi coefficient measures the strength of association between two dichotomous variables in a 2x2 contingency table.
The coefficient is given by the following formula, where \(a, b, c, d\) denote the four cells of the table read row by row:
$$ \phi = \frac{ad - bc}{\sqrt{(a+b)(c+d)(a+c)(b+d)}} $$
Interest: This indicator is used to evaluate the relationship between two binary categorical variables, which is useful in case studies where variables can only take two values.
Usage Conditions: Data must be presented in a 2x2 contingency table, with dichotomous variables.
Advantages:
Disadvantages:
Practical Example:
Consider the following contingency table:
 | Present | Absent |
---|---|---|
Exposed | 40 | 10 |
Not Exposed | 20 | 30 |
Calculation of the Phi coefficient:
$$ \phi = \frac{(40 \cdot 30) - (20 \cdot 10)}{\sqrt{(40+10)(20+30)(40+20)(10+30)}} $$
$$ \phi = \frac{1200 - 200}{\sqrt{50 \cdot 50 \cdot 60 \cdot 40}} $$
$$ \phi = \frac{1000}{\sqrt{6\,000\,000}} $$
$$ \phi = \frac{1000}{2449.49} \approx 0.41 $$
Interpretation: A value of φ ≈ 0.41 indicates a moderate positive association between the variables (φ always lies between -1 and +1).
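The computation can be scripted directly from the four cells of the table; a NumPy sketch:

```python
import numpy as np

# 2x2 table from the example: rows = exposure, columns = presence.
table = np.array([[40, 10],
                  [20, 30]])
(a, b), (c, d) = table
phi = (a * d - b * c) / np.sqrt((a + b) * (c + d) * (a + c) * (b + d))
print(round(phi, 3))  # about 0.408
```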
The contingency coefficient assesses the strength of association between two categorical variables using the chi-square of contingency.
To calculate the coefficient, use the formula, where \(\chi^2\) is the chi-square statistic and \(n\) the sample size:
$$ C = \sqrt{\frac{\chi^2}{n + \chi^2}} $$
Interest: This coefficient measures the association between two variables, regardless of the size of the contingency table.
Usage Conditions: Applicable to contingency tables of any size.
Advantages:
Disadvantages:
Practical Example:
Consider the following contingency table:
 | Yes | No |
---|---|---|
Modality 1 | 40 | 10 |
Modality 2 | 20 | 30 |
Assume the calculated chi-square is 18.5 and the sample size is 100:
Calculation of the contingency coefficient:
$$ C = \sqrt{\frac{18.5}{100 + 18.5}} $$
$$ C = \sqrt{\frac{18.5}{118.5}} $$
$$ C = \sqrt{0.156} $$
$$ C \approx 0.395 $$
Interpretation: A value of C = 0.395 indicates a moderate association between the variables.
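A sketch with SciPy, which computes the chi-square statistic for us (here directly from the example table, rather than assuming χ² = 18.5):

```python
import numpy as np
from scipy.stats import chi2_contingency

table = np.array([[40, 10],
                  [20, 30]])
chi2, p, dof, expected = chi2_contingency(table, correction=False)
n = table.sum()
C = np.sqrt(chi2 / (n + chi2))  # contingency coefficient
print(round(chi2, 2), round(C, 3))
```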
Cramér's V coefficient measures the association between two categorical variables for a contingency table of any size.
To calculate the indicator, use the following formula, where \(q = \min(k, p)\) is the smaller of the numbers of rows and columns:
$$ V = \sqrt{\frac{\chi^2}{n\,(q - 1)}} $$
Interest: Useful for measuring association in contingency tables of different sizes.
Usage Conditions: The contingency table must be of any size with categorical data.
Advantages:
Disadvantages:
Practical Example:
Consider the following contingency table:
 | Cat 1 | Cat 2 | Cat 3 |
---|---|---|---|
Group A | 30 | 10 | 20 |
Group B | 20 | 40 | 10 |
Assume the calculated chi-square is 24 and the sample size is 150; the table has 2 rows and 3 columns, so \(q = \min(2, 3) = 2\):
Calculation of Cramér's V coefficient:
$$ V = \sqrt{\frac{24}{150 \cdot (2 - 1)}} $$
$$ V = \sqrt{\frac{24}{150}} $$
$$ V = \sqrt{0.16} $$
$$ V = 0.40 $$
Interpretation: A value of V = 0.40 indicates a moderate association between the variables.
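A sketch of the same computation with SciPy, taking \(q = \min(k, p)\) from the table's shape (the χ² is computed from the table rather than assumed):

```python
import numpy as np
from scipy.stats import chi2_contingency

table = np.array([[30, 10, 20],
                  [20, 40, 10]])
chi2, p, dof, expected = chi2_contingency(table, correction=False)
n = table.sum()
q = min(table.shape)               # min(number of rows, number of columns)
V = np.sqrt(chi2 / (n * (q - 1)))  # Cramér's V
print(round(chi2, 2), round(V, 3))
```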
The Lambda coefficient measures the proportional reduction in error when predicting the dependent variable from the independent variable.
Interest: Used to evaluate the effectiveness of the independent variable in predicting the dependent variable.
Usage Conditions: Applicable to categorical data.
Advantages:
Disadvantages:
Practical Example:
Consider the following contingency table:
Success | Failure | |
---|---|---|
Group A | 35 | 15 |
Group B | 25 | 25 |
Calculation of the Lambda coefficient (taking the outcome, Success or Failure, as the dependent variable):
$$ \lambda = \frac{E_1 - E_2}{E_1} $$
Where \(E_1\) is the number of errors made when always predicting the overall modal category of the dependent variable (\(E_1 = N\) minus the largest column total), and \(E_2\) is the number of errors made when predicting the modal category within each row.
$$ E_1 = 100 - 60 = 40 $$
$$ E_2 = (50 - 35) + (50 - 25) = 40 $$
$$ \lambda = \frac{40 - 40}{40} = 0 $$
Interpretation: A value of λ = 0 indicates that, in this example, knowing the group does not reduce the error made when predicting the dependent variable.
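A sketch of the error-reduction logic in NumPy (rows are the groups, columns the dependent outcome):

```python
import numpy as np

table = np.array([[35, 15],
                  [25, 25]])
n = table.sum()
e1 = n - table.sum(axis=0).max()                    # errors predicting the overall mode
e2 = (table.sum(axis=1) - table.max(axis=1)).sum()  # errors predicting the mode within each row
lam = (e1 - e2) / e1
print(lam)  # 0.0 for this table: the predictor brings no error reduction
```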
The Kappa coefficient assesses the agreement between two judges or measurement instruments, taking into account the agreement due to chance.
The coefficient is calculated as follows, where \(P_o\) is the proportion of observed agreement and \(P_e\) the proportion of agreement expected by chance:
$$ \kappa = \frac{P_o - P_e}{1 - P_e} $$
Interest: Useful for assessing the reliability of repeated measurements or judgments between two evaluators.
Usage Conditions: Data must be categorical with evaluations by two judges or instruments.
Advantages:
Disadvantages:
Practical Example:
Consider the following table of agreement between two evaluators:
 | Eval 1: Yes | Eval 1: No |
---|---|---|
Eval 2: Yes | 50 | 10 |
Eval 2: No | 5 | 35 |
Proportion of observed agreement (Po):
$$ P_o = \frac{50 + 35}{100} = 0.85 $$
Proportion of agreement due to chance (Pe):
$$ P_e = \frac{(50 + 10)(50 + 5) + (10 + 35)(5 + 35)}{100^2} $$
$$ P_e = \frac{60 \cdot 55 + 45 \cdot 40}{10000} $$
$$ P_e = \frac{3300 + 1800}{10000} $$
$$ P_e = \frac{5100}{10000} = 0.51 $$
Calculation of the Kappa coefficient:
$$ \kappa = \frac{0.85 - 0.51}{1 - 0.51} $$
$$ \kappa = \frac{0.34}{0.49} $$
$$ \kappa \approx 0.69 $$
Interpretation: A value of κ = 0.69 indicates substantial agreement between the evaluators, beyond chance agreement.
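The same computation scripted in NumPy (observed agreement on the diagonal, chance agreement from the margins):

```python
import numpy as np

# Agreement table: rows = evaluator 2, columns = evaluator 1.
table = np.array([[50, 10],
                  [5, 35]])
n = table.sum()
po = np.trace(table) / n                                   # observed agreement P_o
pe = (table.sum(axis=1) * table.sum(axis=0)).sum() / n**2  # chance agreement P_e
kappa = (po - pe) / (1 - pe)
print(round(po, 2), round(pe, 2), round(kappa, 2))  # 0.85 0.51 0.69
```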
The Gamma coefficient measures the strength and direction of association between two ordinal variables.
To calculate the coefficient, use the formula, where \(P\) is the number of concordant pairs and \(Q\) the number of discordant pairs:
$$ \gamma = \frac{P - Q}{P + Q} $$
Interest: Useful for analyzing relationships between ordinal variables in contingency tables.
Usage Conditions: Data must be ordinal.
Advantages:
Disadvantages:
Practical Example:
Consider the following table of ordinal data:
 | Category 1 | Category 2 |
---|---|---|
Order 1 | 12 | 8 |
Order 2 | 7 | 13 |
Calculation of concordant and discordant pairs:
Concordant pairs (P) are pairs of individuals ranked in the same order on both variables; in a 2x2 table they come from the product of the diagonal cells:
$$ P = 12 \cdot 13 = 156 $$
Discordant pairs (Q) are pairs ranked in opposite orders, given by the product of the off-diagonal cells:
$$ Q = 8 \cdot 7 = 56 $$
Calculation of the Gamma coefficient:
$$ \gamma = \frac{P - Q}{P + Q} $$
$$ \gamma = \frac{156 - 56}{156 + 56} $$
$$ \gamma = \frac{100}{212} \approx 0.47 $$
Interpretation: A value of γ ≈ 0.47 indicates a moderate positive association between the ordinal variables.
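For a 2x2 ordinal table, the pair counts reduce to products of cells; a NumPy sketch:

```python
import numpy as np

table = np.array([[12, 8],
                  [7, 13]])
P = table[0, 0] * table[1, 1]  # concordant pairs: same order on both variables
Q = table[0, 1] * table[1, 0]  # discordant pairs: opposite orders
gamma = (P - Q) / (P + Q)
print(P, Q, round(gamma, 3))   # 156 56 0.472
```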
The Kendall tau coefficient evaluates the strength and direction of association between two ordinal variables, taking into account concordant and discordant pairs.
To calculate the coefficient, use the formula (Kendall's \(\tau_a\)), where \(P\) and \(Q\) are the numbers of concordant and discordant pairs and \(n(n-1)/2\) is the total number of pairs:
$$ \tau_a = \frac{P - Q}{n(n-1)/2} $$
Interest: Useful for analyzing ordinal relationships, taking into account concordant and discordant pairs.
Usage Conditions: Data must be ordinal.
Advantages:
Disadvantages:
Practical Example:
Consider the following table of ordinal data:
 | Category A | Category B |
---|---|---|
Order 1 | 15 | 5 |
Order 2 | 10 | 20 |
Calculation of concordant and discordant pairs:
Concordant pairs (P), the product of the diagonal cells:
$$ P = 15 \cdot 20 = 300 $$
Discordant pairs (Q), the product of the off-diagonal cells:
$$ Q = 5 \cdot 10 = 50 $$
Calculation of Kendall's Tau (here \(n = 50\), so the number of pairs is \(n(n-1)/2 = 1225\)):
$$ \tau_a = \frac{P - Q}{n(n-1)/2} $$
$$ \tau_a = \frac{300 - 50}{1225} $$
$$ \tau_a = \frac{250}{1225} \approx 0.20 $$
Interpretation: A value of τ ≈ 0.20 indicates a weak positive association between the ordinal variables.
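A sketch of the tau-a computation for the same kind of grouped 2x2 table (NumPy assumed):

```python
import numpy as np

table = np.array([[15, 5],
                  [10, 20]])
n = table.sum()
P = table[0, 0] * table[1, 1]        # concordant pairs: 300
Q = table[0, 1] * table[1, 0]        # discordant pairs: 50
tau_a = (P - Q) / (n * (n - 1) / 2)  # all n(n-1)/2 pairs in the denominator
print(round(tau_a, 3))               # about 0.204
```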
The Eta coefficient measures the strength of the association between a categorical variable and a continuous variable.
Interest: Useful for measuring the association between a categorical variable and a continuous variable.
Usage Conditions: Data must include one categorical variable and one continuous variable.
Advantages:
Disadvantages:
Practical Example:
Consider the following data:
Group | Scores | Count | Within-Group Sum of Squares |
---|---|---|---|
A | 10, 12, 14, 16 | 4 | 20 |
B | 20, 22, 24, 26 | 4 | 20 |
Calculation of the sum of squares between groups (SSB):
$$ SSB = 4 \cdot (\bar{x}_A - \bar{x}_T)^2 + 4 \cdot (\bar{x}_B - \bar{x}_T)^2 $$
Where \( \bar{x}_A = 13 \), \( \bar{x}_B = 23 \), and \( \bar{x}_T = 18 \)
$$ SSB = 4 \cdot (13 - 18)^2 + 4 \cdot (23 - 18)^2 $$
$$ SSB = 4 \cdot 25 + 4 \cdot 25 $$
$$ SSB = 100 + 100 = 200 $$
Calculation of the Total Sum of Squares (SST), which adds the within-group sums of squares to SSB:
$$ SST = \sum (x_i - \bar{x}_T)^2 = 200 + 20 + 20 = 240 $$
Calculation of the Eta Coefficient:
$$ \eta^2 = \frac{SSB}{SST} = \frac{200}{240} \approx 0.83 $$
$$ \eta = \sqrt{0.83} \approx 0.91 $$
Interpretation: A value of η ≈ 0.91 (η² ≈ 0.83) indicates that about 83% of the variance in the scores is explained by the group.
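The sums of squares can be checked with a short NumPy sketch on the same scores:

```python
import numpy as np

groups = {
    "A": np.array([10, 12, 14, 16]),
    "B": np.array([20, 22, 24, 26]),
}
scores = np.concatenate(list(groups.values()))
grand_mean = scores.mean()  # 18.0

# Between-group and total sums of squares.
ssb = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups.values())  # 200
sst = ((scores - grand_mean) ** 2).sum()                                   # 240
eta = np.sqrt(ssb / sst)
print(round(ssb, 1), round(sst, 1), round(eta, 3))  # 200.0 240.0 0.913
```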
Statistical distributions with two variables allow for analyzing the relationship between two variables using various tools and tables. These analyses help to understand how the variables are related and under what conditions these relationships apply.
The Course does not have a final bibliography (in its online version); references are included at the end of each Block.
The MCQ consists of sixteen questions covering certain parts of the Course; at the end, you will receive your score together with the answer key.
To access the MCQ, click on the following icon:
This session does not have downloadable notes. During the directed work session dedicated to this topic, we will revisit the fundamental questions of bivariate analysis using the bivariate table editor and the Python compiler.
To further your learning on this topic, you can consult the following links:
On the Course App, you will find the summary of the current Block, as well as series of Directed Work related to it.
You will also find references to multimedia content relevant to the Block.
In the Notifications section, an update is planned and will be based on questions asked by students during the Course and Directed Work sessions.
An update will also address the exams from previous sessions, which will be corrected during the directed work sessions to prepare for the current year's exams.
In this Python corner, we have integrated, using an example, the commands related to the essentials covered in this session.
The commands below cover three conditional statistics on a platform-usage example:
The conditional mean of time spent on the platform for users in the "Student" category (how much time, on average, students spend on the platform);
The conditional variance of time spent for users in the "Professional" category (the dispersion of the time professionals spend on the platform);
The conditional standard deviation of time spent for users who watch videos (the dispersion of time spent watching videos on the platform).
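A minimal self-contained sketch of the three computations described above; the data set and the column names ("category", "content", "time_spent") are invented for illustration:

```python
import pandas as pd

# Hypothetical platform-usage data; values are illustrative only.
df = pd.DataFrame({
    "category":   ["Student", "Student", "Professional", "Professional", "Student"],
    "content":    ["Video", "Text", "Video", "Text", "Video"],
    "time_spent": [35, 20, 50, 40, 45],  # minutes per session
})

# 1) Conditional mean of time spent for the "Student" category.
student_mean = df.loc[df["category"] == "Student", "time_spent"].mean()

# 2) Conditional variance of time spent for the "Professional" category.
professional_var = df.loc[df["category"] == "Professional", "time_spent"].var()

# 3) Conditional standard deviation of time spent for users who watch videos.
video_std = df.loc[df["content"] == "Video", "time_spent"].std()

print(student_mean, professional_var, video_std)
```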
Using the link below, you can download the Flipbook in PDF format:
The forum allows you to discuss this first session. You will notice the presence of a subscription button so that you can follow discussions about research in the humanities and social sciences. It is also an opportunity for the instructor to address students' concerns and questions.