In the previous Block, we worked on univariate analysis, which involved examining variables one by one, separately. Univariate analysis is useful when conducting research aimed at describing or exploring a specific phenomenon.
We are now focusing on the other side of analysis, which is bivariate analysis. Bivariate analysis looks at variables in pairs, analyzing two variables simultaneously to examine the relationship between them and its strength.
The main areas of bivariate analysis involve identifying the relationship between two variables, quantifying the strength and direction of the relationship, modeling the relationship between two variables, and exploring the causal links between them. One topic of bivariate analysis concerns hypothesis testing, which we have placed in the last Block.
We have therefore divided this teaching into two sections: the first explores statistical distributions with two variables, and the second explains the statistical parameters of such a distribution.
During this session, we aim to achieve the following objectives, centered on mastering these notions:
Marginal distribution, conditional distribution, joint distribution, joint frequency, joint count, column-profile table, statistical independence, correlation, fitting (affine, linear, the least squares method), and the regression line.
A statistical distribution with two variables, also known as a bivariate distribution, is a representation that shows the relationship between two statistical variables. Unlike a univariate distribution, which deals with only one variable, the bivariate distribution allows us to examine how the two variables interact with each other.
The bivariate statistical distribution, as we mentioned earlier, is often represented graphically; however, statistical calculations serve to confirm what we observe. To simplify, we will say that the elements of interest in our teaching in this context are: the contingency table, the scatter plot, covariance (correlation), and linear regression. For the purposes of this session, we will limit ourselves to examining the first two elements only.
When two variables \(x\) and \(y\) are defined on a population composed of \(n\) individuals, the numerical representation can take the form of an elementary table or a contingency table. The following lines illustrate this idea:
An elementary table lists, for each individual \(i\) in the population, the values \(x_i\) and \(y_i\) of each of the studied variables in adjacent columns. One could say that it is a combination, a superposition, of two simple statistical tables (each with a single entry).
The following table represents an elementary table:
Individual \(i\) | Values of variable \(x\) | Values of variable \(y\) |
---|---|---|
\(1\) | \(x_1\) | \(y_1\) |
\(2\) | \(x_2\) | \(y_2\) |
\(3\) | \(x_3\) | \(y_3\) |
\(...\) | \(...\) | \(...\) |
\(n-2\) | \(x_{n-2}\) | \(y_{n-2}\) |
\(n-1\) | \(x_{n-1}\) | \(y_{n-1}\) |
\(n\) | \(x_n\) | \(y_n\) |
Example: The following table represents an elementary table featuring two variables: Gender and Education Level
Individual \(i\) | Gender | Education Level |
---|---|---|
1 | Male | Primary |
2 | Male | Secondary |
3 | Female | Secondary |
4 | Male | University |
5 | Female | University |
6 | Male | Middle |
7 | Female | Middle |
We can see that, for each individual in our example, the pair of modalities that describes them is placed side by side.
An elementary table is used when one wants to organize and present data in a way that allows for comparing the values of multiple variables for a set of individuals or observation units. An elementary table is not used for analysis; it is the bivariate equivalent of a single-entry statistical table.
The contingency table, also known as a cross table or correspondence table, defines a joint distribution. It relates a pair of variables \((x, y)\) by associating the frequencies corresponding to each pair of modalities.
A contingency table, unlike an elementary table, is used to present and analyze the relationship between two variables. It summarizes the frequencies of observations that lie at the intersection of the modalities of these two variables.
A contingency table is presented in the following format:
\(Modalities ~~of ~~x\) ╲ \(Modalities ~~of ~~y\) | \(y_1\) | \(y_2\) | \(y_3\) | \(...\) | \(y_p\) |
\(x_1\) | \(n_{11}\) | \(n_{12}\) | \(n_{13}\) | \(...\) | \(n_{1p}\) |
\(x_2\) | \(n_{21}\) | \(n_{22}\) | \(n_{23}\) | \(...\) | \(n_{2p}\) |
\(x_3\) | \(n_{31}\) | \(n_{32}\) | \(n_{33}\) | \(...\) | \(n_{3p}\) |
\(...\) | \(...\) | \(...\) | \(...\) | \(...\) | \(...\) |
\(x_k\) | \(n_{k1}\) | \(n_{k2}\) | \(n_{k3}\) | \(...\) | \(n_{kp}\) |
Note:
The variable \(x\) has \(k\) modalities listed in the left margin of the table [for an explanation of the nature and uses of statistical tables, see Lahanier-Reuter, D. (2003)]. Each row of the table corresponds to a modality of \(x\) and contains the counts for that modality; it is customary to denote the row index by the letter \(i\) (\(x_i\), with \(i\) ranging from 1 to \(k\)). The variable \(y\) is listed in the top margin of the contingency table, and each of its modalities corresponds to a column of the table; the column index is typically denoted by the letter \(j\) (\(y_j\), with \(j\) ranging from 1 to \(p\)).
Example:
The following table shows the relationship between education level and employment status for a group of people:
Education Level | Employed | Unemployed | Total |
---|---|---|---|
Primary | 40 | 10 | 50 |
Secondary | 60 | 20 | 80 |
Tertiary | 30 | 10 | 40 |
Total | 130 | 40 | 170 |
The joint distribution refers to the pairs \((x_i, y_j)\) formed from the left and top margins (the \(k\) rows and \(p\) columns). Simply put, it is the set of triples \((x_i, y_j, n_{ij})\), with \(i\) ranging from 1 to \(k\) and \(j\) from 1 to \(p\).
The joint frequency \(n_{ij}\) is the number of individuals presenting both the modality \(x_i\) of the variable \(x\) and the modality \(y_j\) of the variable \(y\).
The joint frequencies sum to the total count \(n\), as represented by the following formula:
$$\sum_{j=1}^{p} \sum_{i=1}^{k} n_{ij} = \sum_{i=1}^{k}\sum_{j=1}^{p} n_{ij} = n $$
Sometimes it is more useful to work with proportions instead of counts; each joint frequency is then divided by \(n\) (the total count), giving what is called the joint proportion \(f_{ij} = n_{ij}/n\).
As in a univariate table, the sum of the joint proportions is equal to 1.
$$\sum_{j=1}^{p} \sum_{i=1}^{k} f_{ij} = \sum_{i=1}^{k}\sum_{j=1}^{p} f_{ij} = 1 $$
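As a minimal sketch of these definitions in Python (assuming pandas is available; the table reproduces the education/employment example above):

```python
import pandas as pd

# Contingency table from the example: education level (rows) vs employment status (columns).
table = pd.DataFrame(
    {"Employed": [40, 60, 30], "Unemployed": [10, 20, 10]},
    index=["Primary", "Secondary", "Tertiary"],
)

n = table.to_numpy().sum()   # total count n = 170
f = table / n                # joint proportions f_ij = n_ij / n
print(f.round(3))
print("Sum of joint proportions:", f.to_numpy().sum())  # 1.0
```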
The marginal distribution concerns the distribution of a single variable (\(x\) or \(y\)). In other words, obtaining a marginal distribution means deducing the distribution of each variable considered in isolation, as seen in the previous lesson; it is the process of extracting single-entry tables.
Note: in our previous example linking education level to employment status, the Total row and the Total column of the table already display these two single-variable distributions.
The distribution related to the variable \(x\) alone is called the marginal distribution of \(x\); the distribution related to the variable \(y\) alone is called the marginal distribution of \(y\).
In the previous example, constructing the marginal distributions involves creating a single-entry table for the Education Level variable and another for the Employment Status variable. We will explain the theoretical principles of marginal distribution in the following lines.
The marginal distribution of the variable \(x\) is defined by the pairs \((x_i, n_{i.})\), where \(i = 1, 2, 3, \ldots, k\) [\(x_i\) is a modality of the variable \(x\) and \(n_{i.}\) is the corresponding count, referred to as the marginal frequency of the modality \(x_i\)]. This represents the number of individuals whose modality of \(x\) is \(x_i\), whatever their modality of \(y\); it is equal to the sum of the counts in the \(i\)-th row, as shown in the following formula:
$$ n_{i.} = n_{i1} + n_{i2} + n_{i3} + \ldots + n_{ip} = \sum_{j=1}^{p} n_{ij} $$
It should be noted that the sum of the marginal frequencies equals the total count \(n\) (also written \(n_{..}\)).
The following table shows the marginal distribution of the variable \(x\):
\(Modalities ~~of ~~the ~~variable ~~x\) | \(Marginal ~~Frequency\) |
\(x_1\) | \(n_{1.}\) |
\(x_2\) | \(n_{2.}\) |
\(x_3\) | \(n_{3.}\) |
\(...\) | \(...\) |
\(x_k\) | \(n_{k.}\) |
\(\sum\) | \(n\) |
The marginal proportion of a modality \(x_i\), denoted \(f_{i.}\), is defined as follows:
$$f_{i.} = \frac{n_{i.}}{n}$$
The sum of the marginal proportions \(f_{i.}\) is equal to 1 [\(\sum_{i=1}^{k} f_{i.} = 1\)].
The marginal distribution of the variable \(y\) consists of the pairs \((y_j, n_{.j})\) (\(j = 1, 2, 3, \ldots, p\)), where \(y_j\) is a modality of the variable \(y\) and \(n_{.j}\) is the corresponding count, referred to as the marginal frequency of the modality \(y_j\). This represents the number of individuals whose modality of \(y\) is \(y_j\), whatever their modality of \(x\); it is equal to the sum of the counts in the \(j\)-th column, as shown in the following formula:
$$ n_{.j} = n_{1j} + n_{2j} + n_{3j} + \ldots + n_{kj} = \sum_{i=1}^{k} n_{ij} $$
It should be noted that the sum of the marginal frequencies equals the total count \(n\) (also written \(n_{..}\)).
The following table shows the marginal distribution of the variable \(y\):
\(Modalities ~~of ~~the ~~variable ~~y\) | \(Marginal ~~Frequency\) |
\(y_1\) | \(n_{.1}\) |
\(y_2\) | \(n_{.2}\) |
\(y_3\) | \(n_{.3}\) |
\(...\) | \(...\) |
\(y_p\) | \(n_{.p}\) |
\(\sum\) | \(n\) |
The marginal proportion of a modality \(y_j\), denoted \(f_{.j}\), is defined as follows:
$$f_{.j} = \frac{n_{.j}}{n}$$
The sum of the marginal proportions \(f_{.j}\) is equal to 1 [\(\sum_{j=1}^{p} f_{.j} = 1\)].
Note: Contingency table and marginal distributions.
It should be noted that the marginal frequencies \(n_{i.}\) are displayed in an additional column, while the marginal frequencies \(n_{.j}\) are shown in an additional row of the joint distribution (x, y).
The following table illustrates this idea:
\(Modalities ~~of ~~x\) ╲ \(Modalities ~~of ~~y\) | \(y_1\) | \(y_2\) | \(y_3\) | \(...\) | \(y_p\) | \(\sum\) |
\(x_1\) | \(n_{11}\) | \(n_{12}\) | \(n_{13}\) | \(...\) | \(n_{1p}\) | \(n_{1.}\) |
\(x_2\) | \(n_{21}\) | \(n_{22}\) | \(n_{23}\) | \(...\) | \(n_{2p}\) | \(n_{2.}\) |
\(x_3\) | \(n_{31}\) | \(n_{32}\) | \(n_{33}\) | \(...\) | \(n_{3p}\) | \(n_{3.}\) |
\(...\) | \(...\) | \(...\) | \(...\) | \(...\) | \(...\) | \(...\) |
\(x_k\) | \(n_{k1}\) | \(n_{k2}\) | \(n_{k3}\) | \(...\) | \(n_{kp}\) | \(n_{k.}\) |
\(\sum\) | \(n_{.1}\) | \(n_{.2}\) | \(n_{.3}\) | \(...\) | \(n_{.p}\) | \(n\) |
We note that the marginal distribution of \(x\) is given by the left margin and the last column of the table, while that of the variable \(y\) is given by the top margin and the last row.
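A short sketch of the same idea in Python: the marginal distributions are simply the row and column totals of the contingency table (pandas assumed, reusing the education/employment example):

```python
import pandas as pd

table = pd.DataFrame(
    {"Employed": [40, 60, 30], "Unemployed": [10, 20, 10]},
    index=["Primary", "Secondary", "Tertiary"],
)

n_i = table.sum(axis=1)  # marginal distribution of x: row totals n_i.
n_j = table.sum(axis=0)  # marginal distribution of y: column totals n_.j
print(n_i)               # Primary 50, Secondary 80, Tertiary 40
print(n_j)               # Employed 130, Unemployed 40
print("n =", int(n_i.sum()))  # 170, the same as n_j.sum()
```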
Note: Marginal statistics.
Each contingency table consists of two univariate distributions. By creating marginal distribution tables, we can calculate the indices seen in the previous chapters (central tendency, dispersion, and position …). In the following chapters, we will need to calculate these indices and make the necessary interpretations for the analysis.
The conditional distribution of \(x\) given \(y\) is the distribution of \(x\) restricted to the individuals presenting the modality \(y_j\) of \(Y\). Similarly, the conditional distribution of \(y\) given \(x\) is the distribution of \(y\) restricted to the individuals presenting the modality \(x_i\) of \(X\).
The variable \(y\) has \(p\) modalities, so the population could be divided into \(p\) sub-populations (individuals identified by modality \(y_1\), those identified by modality \(y_2\), … up to those identified by \(y = y_p\)). For each sub-population, we can have what is called a conditional distribution.
Following this reasoning, we obtain \(p\) conditional distributions of \(x\) given \(y\):
The conditional distribution of \(x\) given \(y = y_{1}\);
The conditional distribution of \(x\) given \(y = y_{2}\);
The conditional distribution of \(x\) given \(y = y_{3}\);
................................................... ;
The conditional distribution of \(x\) given \(y = y_{p}\);
Each distribution is defined by a pair (\(x_i\), \(n_{ij}\)) [\(i\) ranging from \(1\) to \(k\) and \(j\) being fixed]. The following table represents the idea of a conditional distribution:
\(Modalities ~~of ~~x \) | \(Conditional ~~frequencies ~~n_{ij}\) |
\(x_1\) | \(n_{1j}\) |
\(x_2\) | \(n_{2j}\) |
\(x_3\) | \(n_{3j}\) |
\(...\) | \(...\) |
\(x_k\) | \(n_{kj}\) |
\(\sum\) | \(n_{.j}\) |
The total frequency is given by the formula:
$$ n_{.j} = n_{1j} + n_{2j} + n_{3j} + \ldots + n_{kj} = \sum_{i=1}^{k} n_{ij} $$
We can also calculate the conditional frequencies \(f_{xi/yj}\) using the formula:
$$ f_{xi/yj} = \frac{n_{ij}}{n_{.j}} $$
The sum of the conditional frequencies is equal to 1.
With the help of conditional distributions (or frequencies), we can create a column-profile table. A column-profile table contains the modalities of \(x\) in the left margin, the conditional frequencies of \(x\) given \(y = y_1\) in the first column, the conditional frequencies of \(x\) given \(y = y_2\) in the second column, …, the conditional frequencies of \(x\) given \(y = y_p\) in the last column. The following table illustrates the column-profile table.
\(Modalities ~~of ~~x\) ╲ \(Modalities ~~of ~~y\) | \(y_1\) | \(y_2\) | \(y_3\) | \(...\) | \(y_p\) | \(f_{i.}\) |
\(x_1\) | \(f_{x1/y1}\) | \(f_{x1/y2}\) | \(f_{x1/y3}\) | \(...\) | \(f_{x1/yp}\) | \(f_{1.}\) |
\(x_2\) | \(f_{x2/y1}\) | \(f_{x2/y2}\) | \(f_{x2/y3}\) | \(...\) | \(f_{x2/yp}\) | \(f_{2.}\) |
\(x_3\) | \(f_{x3/y1}\) | \(f_{x3/y2}\) | \(f_{x3/y3}\) | \(...\) | \(f_{x3/yp}\) | \(f_{3.}\) |
\(...\) | \(...\) | \(...\) | \(...\) | \(...\) | \(...\) | \(...\) |
\(x_k\) | \(f_{xk/y1}\) | \(f_{xk/y2}\) | \(f_{xk/y3}\) | \(...\) | \(f_{xk/yp}\) | \(f_{k.}\) |
\(\sum\) | \(1\) | \(1\) | \(1\) | \(...\) | \(1\) | \(1\) |
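In Python, a column-profile table can be sketched by dividing each column of the contingency table by its total (pandas assumed, same example as above):

```python
import pandas as pd

table = pd.DataFrame(
    {"Employed": [40, 60, 30], "Unemployed": [10, 20, 10]},
    index=["Primary", "Secondary", "Tertiary"],
)

# Column profiles: conditional frequencies f_{xi/yj} = n_ij / n_.j
col_profiles = table.div(table.sum(axis=0), axis=1)
print(col_profiles.round(3))
print(col_profiles.sum(axis=0))  # each column sums to 1
```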
The variable \(x\) has \(k\) modalities, so we can divide the population into \(k\) sub-populations (individuals identified by modality \(x_1\), those identified by modality \(x_2\), … up to those identified by \(x = x_k\)). For each sub-population, we can have what is called the conditional distribution of individuals according to the modalities of the variable \(y\).
Following this reasoning, we will obtain \(k\) conditional distributions of \(y\) given \(x\):
The conditional distribution of \(y\) given \(x = x_1\);
The conditional distribution of \(y\) given \(x = x_2\);
The conditional distribution of \(y\) given \(x = x_3\);
.................................................. ;
The conditional distribution of \(y\) given \(x = x_k\);
Each distribution is defined by a pair (\(y_j\), \(n_{ij}\)) [\(j\) ranging from \(1\) to \(p\) and \(i\) being fixed].
The following table represents the idea of a conditional distribution:
\(Modalities ~~of ~~y \) | \(Conditional ~~frequencies ~~n_{ij}\) |
\(y_1\) | \(n_{i1}\) |
\(y_2\) | \(n_{i2}\) |
\(y_3\) | \(n_{i3}\) |
\(...\) | \(...\) |
\(y_p\) | \(n_{ip}\) |
\(\sum\) | \(n_{i.}\) |
The total count is given by the formula:
$$ n_{i.} = n_{i1} + n_{i2} + n_{i3} + \ldots + n_{ip} = \sum_{j=1}^{p} n_{ij} $$
You can also calculate the conditional frequencies \(f_{yj/xi}\) using the formula: \(f_{yj/xi} = \frac{n_{ij}}{n_{i.}}\)
The sum of the conditional frequencies is equal to 1.
Using the conditional distributions (or frequencies), we can create a row-profile table. A row-profile table contains the modalities of \(x\) in the left margin; the conditional frequencies of \(y\) given \(x = x_1\) fill the first row, those of \(y\) given \(x = x_2\) the second row, …, and those of \(y\) given \(x = x_k\) the last row, so that each row sums to 1.
The following table illustrates the row-profile table:
\(Modalities ~~of ~~x\) ╲ \(Modalities ~~of ~~y\) | \(y_1\) | \(y_2\) | \(y_3\) | \(...\) | \(y_p\) | \(\sum\) |
\(x_1\) | \(f_{y1/x1}\) | \(f_{y2/x1}\) | \(f_{y3/x1}\) | \(...\) | \(f_{yp/x1}\) | \(1\) |
\(x_2\) | \(f_{y1/x2}\) | \(f_{y2/x2}\) | \(f_{y3/x2}\) | \(...\) | \(f_{yp/x2}\) | \(1\) |
\(x_3\) | \(f_{y1/x3}\) | \(f_{y2/x3}\) | \(f_{y3/x3}\) | \(...\) | \(f_{yp/x3}\) | \(1\) |
\(...\) | \(...\) | \(...\) | \(...\) | \(...\) | \(1\) |
\(x_k\) | \(f_{y1/xk}\) | \(f_{y2/xk}\) | \(f_{y3/xk}\) | \(...\) | \(f_{yp/xk}\) | \(1\) |
\(\sum\) | \(f_{.1}\) | \(f_{.2}\) | \(f_{.3}\) | \(...\) | \(f_{.p}\) | \(1\) |
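Symmetrically, a row-profile table divides each row by its total; a minimal pandas sketch:

```python
import pandas as pd

table = pd.DataFrame(
    {"Employed": [40, 60, 30], "Unemployed": [10, 20, 10]},
    index=["Primary", "Secondary", "Tertiary"],
)

# Row profiles: conditional frequencies f_{yj/xi} = n_ij / n_i.
row_profiles = table.div(table.sum(axis=1), axis=0)
print(row_profiles.round(3))
print(row_profiles.sum(axis=1))  # each row sums to 1
```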
The conditional average of \(x\) is calculated for each of the \(p\) conditional distributions of \(x\). The conditional average of \(x\) given \(y = y_j\) is generally denoted \({\bar{x}}_j\). It should be understood as a weighted average:
$${\bar{x}}_j\ =\ \frac{1}{n_{.j}}\ \sum_{i=1}^{k}{n_{ij}\ x_i}$$
By replacing the counts with frequencies, we get the following formula:
$${\bar{x}}_j\ =\ \sum_{i=1}^{k}{f_{xi/yj}\ x_i}$$
For the conditional average of \(y\) given \(x = x_i\), we have the following formula:
$${\bar{y}}_i\ =\ \frac{1}{n_{i.}}\ \sum_{j=1}^{p}{n_{ij}\ y_j}$$
By replacing the counts with frequencies, we get the following formula:
$${\bar{y}}_i = \sum_{j=1}^{p}{f_{yj/xi}\ y_j}$$
Note that for each variable, the marginal mean is equal to the mean of the conditional means. This relationship is expressed in the following formulas:
$$ \bar{x}\ =\ \frac{1}{n}\ \sum_{j=1}^{p}{n_{.j}\ {\bar{x}}_j} ~~~~ and ~~~~ \bar{y}\ =\ \frac{1}{n}\ \sum_{i=1}^{k}{n_{i.}\ {\bar{y}}_i}$$
The conditional variance of \(x\) given \(y = y_j\) is given by the formula:
$$ V_{j} (x) = \frac{1}{n_{.j}} \sum_{i=1}^{k} {n_{ij}} ~~(x_i - \bar{x}_{j}) ^ {2} = \frac{1}{n_{.j}} \sum_{i=1}^{k} {n}_{ij} ~~ {x}_{i}^{2} - {\bar{x}}_{j}^{2}$$
By replacing the counts with conditional frequencies (which already incorporate the division by \(n_{.j}\)), we get the following formula:
$$ V_{j}(x) = \sum_{i=1}^{k} f_{xi/yj}\,(x_i - \bar{x}_{j})^{2} = \sum_{i=1}^{k} f_{xi/yj}\ x_{i}^{2} - \bar{x}_{j}^{2}$$
The conditional standard deviation is calculated using the formula: \( \sigma_j(x) = \sqrt{V_j(x)} \)
For the variance and standard deviation of \(Y\) given \(X = x_i\), the formula is:
$$ V_{i} (y) = \frac{1}{n_{i.}} \sum_{j=1}^{p} {n_{ij}} ~~(y_j - \bar{y}_{i}) ^ {2} = \frac{1}{n_{i.}} \sum_{j=1}^{p} {n}_{ij} ~~ {y}_{j}^{2} - {\bar{y}}_{i}^{2}$$
By replacing the counts with conditional frequencies (which already incorporate the division by \(n_{i.}\)), we get the following formula:
$$ V_{i}(y) = \sum_{j=1}^{p} f_{yj/xi}\,(y_j - \bar{y}_{i})^{2} = \sum_{j=1}^{p} f_{yj/xi}\ y_{j}^{2} - \bar{y}_{i}^{2}$$
The conditional standard deviation is calculated using the formula: \( \sigma_i(y) = \sqrt{V_i(y)} \)
It is also noted that the marginal variance equals the sum of the mean of the conditional variances plus the variance of the conditional means, as shown by the following two formulas:
$$V(x) = \frac{1}{n} \sum_{j=1}^{p} n_{.j}\ V_j(x) + \frac{1}{n} \sum_{j=1}^{p} n_{.j}\ (\bar{x}_j - \bar{x})^2 $$
$$ V(y) = \frac{1}{n} \sum_{i=1}^{k} n_{i.}\ V_i(y) + \frac{1}{n} \sum_{i=1}^{k} n_{i.}\ (\bar{y}_i - \bar{y})^2 $$
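This decomposition can be checked numerically; the sketch below uses a small invented data set (the values are illustrative only) and verifies that the within and between terms add up to the marginal variance:

```python
import numpy as np
import pandas as pd

# Illustrative raw data: a numeric variable y observed in three groups of x.
df = pd.DataFrame({
    "x": ["A", "A", "A", "B", "B", "C", "C", "C", "C"],
    "y": [10, 12, 14, 20, 22, 30, 31, 29, 30],
})
n = len(df)

g = df.groupby("x")["y"]
n_i = g.count()            # conditional counts n_i.
means = g.mean()           # conditional means  ybar_i
variances = g.var(ddof=0)  # conditional variances V_i(y), population convention

within = (n_i * variances).sum() / n                       # mean of the conditional variances
between = (n_i * (means - df["y"].mean()) ** 2).sum() / n  # variance of the conditional means
print(within + between)     # equals ...
print(df["y"].var(ddof=0))  # ... the marginal variance V(y)
```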
Example: the table below extends the previous example; each cell now contains three repeated counts, whose dispersion we want to measure (the Total column and row are computed from the cell means):
Education Level | Employed | Unemployed | Total |
---|---|---|---|
Primary | 40, 42, 38 | 10, 12, 8 | 50 |
Secondary | 60, 65, 55 | 20, 22, 18 | 80 |
University | 30, 33, 27 | 10, 11, 9 | 40 |
Total | 130 | 40 | 170 |
Conditional distributions show the distribution of the counts of one variable for each modality of the other variable. For example, we examine the distribution of employment status conditionally on each education level.
Education Level | Employed (%) | Unemployed (%) |
---|---|---|
Primary | 80.0% | 20.0% |
Secondary | 75.0% | 25.0% |
University | 75.0% | 25.0% |
Conditional means allow us to calculate the average of the counts for each modality of the independent variable. Here are the means:
Education Level | Average for Employees | Average for Unemployed |
---|---|---|
Primary | 40.0 | 10.0 |
Secondary | 60.0 | 20.0 |
University | 30.0 | 10.0 |
Conditional variances and conditional standard deviations measure the dispersion of counts for each modality of the independent variable. The calculations are as follows:
Formulas (applied to the three repeated counts in each cell; note that the sample versions, with \(N - 1\), are used here):
Conditional Variance = \(\frac{\sum (x_i - \bar{x})^2}{N - 1}\)
Conditional Standard Deviation = \(\sqrt{\text{Conditional Variance}}\)
Education Level | Variance of Employees | Standard Deviation of Employees | Variance of Unemployed | Standard Deviation of Unemployed |
---|---|---|---|---|
Primary | 4 | 2 | 4 | 2 |
Secondary | 25 | 5 | 4 | 2 |
University | 9 | 3 | 1 | 1 |
Note: The variance and standard deviation values are calculated from the three repeated counts in each cell and show the dispersion of these counts for each education level.
Note: Between- and Within-Population Variance
The average of the conditional variances measures the dispersion within each of the subpopulations defined by the modalities of the conditioning variable (this is the intra-population, or within, variance). The second term, the variance of the conditional means, measures the dispersion of the conditional means of the different subpopulations around the marginal mean (this is the inter-population, or between, variance).
Use the bivariate table to analyze interactions between different variables. Add rows and columns, input the data, and calculate statistics such as conditional means, variances, and standard deviations. Try it now to enhance your statistical skills.
Two variables \(x\) and \(y\) are said to be independent when the conditional frequencies \(f_{yj/xi}\) are all equal to the marginal proportion \(f_{.j}\). Consequently, in the row-profile table all rows are identical; similarly, in the column-profile table the conditional frequencies \(f_{xi/yj}\) are all equal to the marginal proportion \(f_{i.}\).
It will be observed that in the contingency table, the rows are proportional to each other, and the same is true for the columns: \(n_{ij}\ =\ \frac{n_{i.}\ n_{.j}}{n} (f_{ij}\ =\ f_{i.}\ f_{.j})\).
Example: in the following table, the rows (and the columns) are proportional, so \(x\) and \(y\) are independent; one can check that each count equals \(\frac{n_{i.}\ n_{.j}}{n}\) (for instance, \(n_{11} = \frac{50 \times 60}{150} = 20\)):
 | \(y_1\) | \(y_2\) | \(\sum\) |
---|---|---|---|
\(x_1\) | 20 | 30 | 50 |
\(x_2\) | 40 | 60 | 100 |
\(\sum\) | 60 | 90 | 150 |
The statistical independence of \(X\) and \(Y\) results in:
identical row profiles and identical column profiles;
joint proportions that factor as \(f_{ij} = f_{i.}\ f_{.j}\) for every pair \((i, j)\);
for numerical variables, a zero covariance and a zero correlation coefficient.
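A quick numerical check of independence, assuming NumPy: compare each observed count with \(n_{i.}\ n_{.j} / n\) (the table is the independent example above):

```python
import numpy as np

table = np.array([[20, 30],
                  [40, 60]])
n = table.sum()
# Expected counts under independence: n_i. * n_.j / n
expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / n
print(expected)                      # identical to the observed table
print(np.allclose(table, expected))  # True: x and y are independent
```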
Variable \(y\) is functionally related to \(x\) when each modality (value or class) of \(x\) corresponds to exactly one modality (value or class) of the variable \(y\). It will be observed that, in the joint distribution table, there is only one non-zero count per row [the same reasoning applies, by analogy, to a variable \(x\) functionally related to \(y\)].
Two variables \(x\) and \(y\) are said to be in reciprocal functional dependency (reciprocally dependent) when each modality (value or class) of variable \(x\) corresponds to a modality (value or class) of variable \(y\) and vice versa. In this case, there will be as many rows as columns in the joint distribution table, and exactly one non-zero frequency per row and column.
Two variables are said to be in relative dependency when they are neither independent nor functionally related; this intermediate situation is the usual case in practice, and it motivates the measures of association introduced in the rest of the chapter.
The topics covered in this section (i.e. covariance, correlation, and adjustment) are introduced briefly. We will revisit them in more detail in the final Block of our Course, where we will discuss the principles of statistical inference. They are introduced here for mastering the vocabulary of bivariate data analysis.
Covariance measures the joint variability of two numerical variables: it is positive when the variables tend to deviate from their means in the same direction, negative when they deviate in opposite directions, and close to zero when no linear link appears between them.
Covariance is given by the following formula:
\[ Cov(x,y) = \frac{1}{n} \sum_{\substack{1 \leq i \leq k \\ 1 \leq j \leq p }} n_{ij} ~ (x_{i} - \bar{x}) (y_{j} - \bar {y}) \]
The previous equation can be simplified as follows:
\[ Cov(x,y) = \frac{1}{n} \left( \sum_{\substack{1 \leq i \leq k \\ 1 \leq j \leq p }} n_{ij} ~~ x_{i} y_{j} \right ) - \bar{x} \bar {y} \]
Covariance has the following properties:
\(Cov(x, x) = V(x)\) and \(Cov(x, y) = Cov(y, x)\);
it reacts to changes of scale: \(Cov(ax + b, cy + d) = ac\ Cov(x, y)\);
its absolute value is bounded: \(\lvert Cov(x, y) \rvert \leq \sigma(x)\ \sigma(y)\);
it is zero when the variables are independent.
From covariance, we can calculate the linear correlation coefficient of \(X\) and \(Y\), which is defined by the following formula:
\[ r = \frac {Cov(x,y)} {\sigma(X) ~ \sigma(Y)} \]
The value of \(r\) is invariant under changes of origin and of (positive) scale; it ranges between \(-1\) and \(+1\) and takes the value zero when the variables are independent (the converse does not hold).
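A minimal NumPy sketch, on invented paired values, of the covariance (with the \(1/n\) convention used in the formulas above) and of the correlation coefficient:

```python
import numpy as np

# Illustrative paired observations (all weights n_ij equal to 1).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

cov = np.mean((x - x.mean()) * (y - y.mean()))  # Cov(x, y), 1/n convention
r = cov / (x.std() * y.std())                   # linear correlation coefficient
print(round(cov, 3), round(r, 3))
print(round(np.corrcoef(x, y)[0, 1], 3))        # same r computed by NumPy
```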
Adjustment involves fitting a statistical model to a set of observed data. The goal is to find a function or model that best represents the relationship between the variables in the data.
There are several adjustment methods: Linear Regression and Non-linear Regression, Coefficient of Determination, Curve Fitting, and Residual Analysis. In this Course, we focus on the method known as Least Squares, and we will provide a brief definition and the logic behind the method in the following lines.
In a scatter plot \((x_{1}, y_{1})\), \((x_{2}, y_{2})\)... \((x_{p}, y_{p})\) with weighting coefficients \(n_1\), \(n_2\), ... \(n_p\) equal to 1, we identify a form of functional relationship between \(x\) and \(y\) depending on the appearance of the scatter plot (this relationship can take one of the forms: \(y = ax + b\); \(y = ax^{b}\), etc.).
The principle of adjustment is to determine the parameter values that minimize the distance between the points and the curve that represents the chosen model to account for the functional relationship.
Adjustment is said to be affine (\(y = ax + b\)), or linear (\(y = ax\)) when the points are, in a certain way, aligned; this is referred to as fitting by a straight line.
In the least squares method, the goal is to minimize the distance \(S\) defined as the sum of the squares of the vertical deviations: \(S = \displaystyle\sum_{i=1}^{p} n_{i} (y_{i} - ax_{i} - b)^{2}\)
The following figure illustrates the idea of affine adjustment using the least squares method:
The regression line of \(Y\) with respect to \(X\) has a slope \(a\) given by the equation: \[ a = \frac {Cov (X, Y)}{V(X)} \]
The regression line passes through the mean point \(M (\bar{x} , \bar{y})\), which gives the intercept \(b = \bar{y} - a\,\bar{x}\); we will revisit the topic of regression in the session dedicated to it.
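A sketch of the least squares fit on the same invented values as above: the slope comes from \(a = Cov(x,y)/V(x)\) and the intercept from the fact that the line passes through \((\bar{x}, \bar{y})\):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

a = np.mean((x - x.mean()) * (y - y.mean())) / np.var(x)  # slope a = Cov(x, y) / V(x)
b = y.mean() - a * x.mean()                               # line through the mean point
print(round(a, 3), round(b, 3))
print(np.polyfit(x, y, 1))  # same coefficients via NumPy's least squares fit
```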
In this section, we will describe the most commonly used statistical indices for highlighting the association between two variables (independent and dependent). The goal of this section is to help the reader choose the appropriate test based on the nature of the two variables involved and the logic of the empirical research at hand. The choice of a test or type of measure precedes the inference operation (which we will address in the final Block of our module); indeed, some associations can be described without generalizing from the sample to the population.
The following table summarizes the main measures of association for contingency tables:
Measure of Association | Table Dimension | Nature of Associated Variables | Result |
---|---|---|---|
Phi | \(2 \times 2\) | Nominal x nominal | -1 to +1 |
Phi | \(2 \times 2\) or more | Nominal x nominal | 0 to 1 |
Contingency Coefficient | \(2 \times 2\) or more | Nominal x nominal | 0 to 1 |
Cramér's V | \(2 \times 2\) or more | Nominal x nominal | 0 to 1 |
Lambda | \(2 \times 2\) or more | Nominal x nominal | Percentage of Error Reduction |
Kappa | \(2 \times 2\) or more | Nominal x nominal | -1 to +1 |
Gamma | \(2 \times 2\) or more | Ordinal x ordinal | -1 to +1 |
Kendall's Tau | \(2 \times 2\) or more | Ordinal x ordinal | -1 to +1 |
Eta | \(2 \times 2\) or more | Nominal x cardinal | 0 to 1 |
The Phi coefficient measures the strength of association between two dichotomous variables in a 2x2 contingency table.
The coefficient is given by the following formula, where \(a, b, c, d\) denote the four cells of the table read row by row:
$$ \phi = \frac{ad - bc}{\sqrt{(a+b)(c+d)(a+c)(b+d)}} $$
Interest: This indicator is used to evaluate the relationship between two binary categorical variables, which is useful in case studies where variables can only take two values.
Usage Conditions: Data must be presented in a 2x2 contingency table, with dichotomous variables.
Advantages:
Disadvantages:
Practical Example:
Consider the following contingency table:
 | Present | Absent |
---|---|---|
Exposed | 40 | 10 |
Not Exposed | 20 | 30 |
Calculation of the Phi coefficient:
$$ \phi = \frac{(40 \cdot 30) - (20 \cdot 10)}{\sqrt{(40+10)(20+30)(40+20)(10+30)}} $$
$$ \phi = \frac{1200 - 200}{\sqrt{50 \cdot 50 \cdot 60 \cdot 40}} $$
$$ \phi = \frac{1000}{\sqrt{6\,000\,000}} $$
$$ \phi = \frac{1000}{2449.49} \approx 0.41 $$
Interpretation: A value of φ ≈ 0.41 indicates a moderate positive association between the variables (φ always lies between -1 and +1).
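The computation can be scripted directly from the four cells of the table; a NumPy sketch:

```python
import numpy as np

# 2x2 table from the example: rows = exposure, columns = presence.
table = np.array([[40, 10],
                  [20, 30]])
(a, b), (c, d) = table
phi = (a * d - b * c) / np.sqrt((a + b) * (c + d) * (a + c) * (b + d))
print(round(phi, 3))  # about 0.408
```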
The contingency coefficient assesses the strength of association between two categorical variables using the chi-square of contingency.
To calculate the coefficient, use the formula, where \(\chi^2\) is the chi-square statistic and \(n\) the sample size:
$$ C = \sqrt{\frac{\chi^2}{n + \chi^2}} $$
Interest: This coefficient measures the association between two variables, regardless of the size of the contingency table.
Usage Conditions: Applicable to contingency tables of any size.
Advantages:
Disadvantages:
Practical Example:
Consider the following contingency table:
 | Yes | No |
---|---|---|
Modality 1 | 40 | 10 |
Modality 2 | 20 | 30 |
Assume the calculated chi-square is 18.5 and the sample size is 100:
Calculation of the contingency coefficient:
$$ C = \sqrt{\frac{18.5}{100 + 18.5}} $$
$$ C = \sqrt{\frac{18.5}{118.5}} $$
$$ C = \sqrt{0.156} $$
$$ C \approx 0.395 $$
Interpretation: A value of C = 0.395 indicates a moderate association between the variables.
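A sketch with SciPy, which computes the chi-square statistic for us (here directly from the example table, rather than assuming χ² = 18.5):

```python
import numpy as np
from scipy.stats import chi2_contingency

table = np.array([[40, 10],
                  [20, 30]])
chi2, p, dof, expected = chi2_contingency(table, correction=False)
n = table.sum()
C = np.sqrt(chi2 / (n + chi2))  # contingency coefficient
print(round(chi2, 2), round(C, 3))
```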
Cramér's V coefficient measures the association between two categorical variables for a contingency table of any size.
To calculate the indicator, use the following formula, where \(q = \min(k, p)\) is the smaller of the numbers of rows and columns:
$$ V = \sqrt{\frac{\chi^2}{n\,(q - 1)}} $$
Interest: Useful for measuring association in contingency tables of different sizes.
Usage Conditions: The contingency table must be of any size with categorical data.
Advantages:
Disadvantages:
Practical Example:
Consider the following contingency table:
 | Cat 1 | Cat 2 | Cat 3 |
---|---|---|---|
Group A | 30 | 10 | 20 |
Group B | 20 | 40 | 10 |
Assume the calculated chi-square is 24 and the sample size is 150; the table has 2 rows and 3 columns, so \(q = \min(2, 3) = 2\):
Calculation of Cramér's V coefficient:
$$ V = \sqrt{\frac{24}{150 \cdot (2 - 1)}} $$
$$ V = \sqrt{\frac{24}{150}} $$
$$ V = \sqrt{0.16} $$
$$ V = 0.40 $$
Interpretation: A value of V = 0.40 indicates a moderate association between the variables.
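A sketch of the same computation with SciPy, taking \(q = \min(k, p)\) from the table's shape (the χ² is computed from the table rather than assumed):

```python
import numpy as np
from scipy.stats import chi2_contingency

table = np.array([[30, 10, 20],
                  [20, 40, 10]])
chi2, p, dof, expected = chi2_contingency(table, correction=False)
n = table.sum()
q = min(table.shape)               # min(number of rows, number of columns)
V = np.sqrt(chi2 / (n * (q - 1)))  # Cramér's V
print(round(chi2, 2), round(V, 3))
```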
The Lambda coefficient measures the proportional reduction in error when predicting the dependent variable from the independent variable.
Interest: Used to evaluate the effectiveness of the independent variable in predicting the dependent variable.
Usage Conditions: Applicable to categorical data.
Advantages:
Disadvantages:
Practical Example:
Consider the following contingency table:
Success | Failure | |
---|---|---|
Group A | 35 | 15 |
Group B | 25 | 25 |
Calculation of the Lambda coefficient (taking the outcome, Success or Failure, as the dependent variable):
$$ \lambda = \frac{E_1 - E_2}{E_1} $$
Where \(E_1\) is the number of errors made when always predicting the overall modal category of the dependent variable (\(E_1 = N\) minus the largest column total), and \(E_2\) is the number of errors made when predicting the modal category within each row.
$$ E_1 = 100 - 60 = 40 $$
$$ E_2 = (50 - 35) + (50 - 25) = 40 $$
$$ \lambda = \frac{40 - 40}{40} = 0 $$
Interpretation: A value of λ = 0 indicates that, in this example, knowing the group does not reduce the error made when predicting the dependent variable.
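A sketch of the error-reduction logic in NumPy (rows are the groups, columns the dependent outcome):

```python
import numpy as np

table = np.array([[35, 15],
                  [25, 25]])
n = table.sum()
e1 = n - table.sum(axis=0).max()                    # errors predicting the overall mode
e2 = (table.sum(axis=1) - table.max(axis=1)).sum()  # errors predicting the mode within each row
lam = (e1 - e2) / e1
print(lam)  # 0.0 for this table: the predictor brings no error reduction
```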
The Kappa coefficient assesses the agreement between two judges or measurement instruments, taking into account the agreement due to chance.
The coefficient is calculated as follows, where \(P_o\) is the proportion of observed agreement and \(P_e\) the proportion of agreement expected by chance:
$$ \kappa = \frac{P_o - P_e}{1 - P_e} $$
Interest: Useful for assessing the reliability of repeated measurements or judgments between two evaluators.
Usage Conditions: Data must be categorical with evaluations by two judges or instruments.
Advantages:
Disadvantages:
Practical Example:
Consider the following table of agreement between two evaluators:
 | Eval 1: Yes | Eval 1: No |
---|---|---|
Eval 2: Yes | 50 | 10 |
Eval 2: No | 5 | 35 |
Proportion of observed agreement (Po):
$$ P_o = \frac{50 + 35}{100} = 0.85 $$
Proportion of agreement due to chance (Pe):
$$ P_e = \frac{(50 + 10)(50 + 5) + (10 + 35)(5 + 35)}{100^2} $$
$$ P_e = \frac{60 \cdot 55 + 45 \cdot 40}{10000} $$
$$ P_e = \frac{3300 + 1800}{10000} $$
$$ P_e = \frac{5100}{10000} = 0.51 $$
Calculation of the Kappa coefficient:
$$ \kappa = \frac{0.85 - 0.51}{1 - 0.51} $$
$$ \kappa = \frac{0.34}{0.49} $$
$$ \kappa \approx 0.69 $$
Interpretation: A value of κ = 0.69 indicates substantial agreement between the evaluators, beyond chance agreement.
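The same computation scripted in NumPy (observed agreement on the diagonal, chance agreement from the margins):

```python
import numpy as np

# Agreement table: rows = evaluator 2, columns = evaluator 1.
table = np.array([[50, 10],
                  [5, 35]])
n = table.sum()
po = np.trace(table) / n                                   # observed agreement P_o
pe = (table.sum(axis=1) * table.sum(axis=0)).sum() / n**2  # chance agreement P_e
kappa = (po - pe) / (1 - pe)
print(round(po, 2), round(pe, 2), round(kappa, 2))  # 0.85 0.51 0.69
```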
The Gamma coefficient measures the strength and direction of association between two ordinal variables.
To calculate the coefficient, use the formula, where \(P\) is the number of concordant pairs and \(Q\) the number of discordant pairs:
$$ \gamma = \frac{P - Q}{P + Q} $$
Interest: Useful for analyzing relationships between ordinal variables in contingency tables.
Usage Conditions: Data must be ordinal.
Advantages:
Disadvantages:
Practical Example:
Consider the following table of ordinal data:
 | Category 1 | Category 2 |
---|---|---|
Order 1 | 12 | 8 |
Order 2 | 7 | 13 |
Calculation of concordant and discordant pairs:
Concordant pairs (P) are pairs of individuals ranked in the same order on both variables; in a 2x2 table they come from the product of the diagonal cells:
$$ P = 12 \cdot 13 = 156 $$
Discordant pairs (Q) are pairs ranked in opposite orders, given by the product of the off-diagonal cells:
$$ Q = 8 \cdot 7 = 56 $$
Calculation of the Gamma coefficient:
$$ \gamma = \frac{P - Q}{P + Q} $$
$$ \gamma = \frac{156 - 56}{156 + 56} $$
$$ \gamma = \frac{100}{212} \approx 0.47 $$
Interpretation: A value of γ ≈ 0.47 indicates a moderate positive association between the ordinal variables.
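For a 2x2 ordinal table, the pair counts reduce to products of cells; a NumPy sketch:

```python
import numpy as np

table = np.array([[12, 8],
                  [7, 13]])
P = table[0, 0] * table[1, 1]  # concordant pairs: same order on both variables
Q = table[0, 1] * table[1, 0]  # discordant pairs: opposite orders
gamma = (P - Q) / (P + Q)
print(P, Q, round(gamma, 3))   # 156 56 0.472
```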
The Kendall tau coefficient evaluates the strength and direction of association between two ordinal variables, taking into account concordant and discordant pairs.
To calculate the coefficient, use the formula (Kendall's \(\tau_a\)), where \(P\) and \(Q\) are the numbers of concordant and discordant pairs and \(n(n-1)/2\) is the total number of pairs:
$$ \tau_a = \frac{P - Q}{n(n-1)/2} $$
Interest: Useful for analyzing ordinal relationships, taking into account concordant and discordant pairs.
Usage Conditions: Data must be ordinal.
Advantages:
Disadvantages:
Practical Example:
Consider the following table of ordinal data:
 | Category A | Category B |
---|---|---|
Order 1 | 15 | 5 |
Order 2 | 10 | 20 |
Calculation of concordant and discordant pairs:
Concordant pairs (P), the product of the diagonal cells:
$$ P = 15 \cdot 20 = 300 $$
Discordant pairs (Q), the product of the off-diagonal cells:
$$ Q = 5 \cdot 10 = 50 $$
Calculation of Kendall's Tau (here \(n = 50\), so the number of pairs is \(n(n-1)/2 = 1225\)):
$$ \tau_a = \frac{P - Q}{n(n-1)/2} $$
$$ \tau_a = \frac{300 - 50}{1225} $$
$$ \tau_a = \frac{250}{1225} \approx 0.20 $$
Interpretation: A value of τ ≈ 0.20 indicates a weak positive association between the ordinal variables.
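A sketch of the tau-a computation for the same kind of grouped 2x2 table (NumPy assumed):

```python
import numpy as np

table = np.array([[15, 5],
                  [10, 20]])
n = table.sum()
P = table[0, 0] * table[1, 1]        # concordant pairs: 300
Q = table[0, 1] * table[1, 0]        # discordant pairs: 50
tau_a = (P - Q) / (n * (n - 1) / 2)  # all n(n-1)/2 pairs in the denominator
print(round(tau_a, 3))               # about 0.204
```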
The Eta coefficient measures the strength of the association between a categorical variable and a continuous variable.
Interest: Useful for measuring the association between a categorical variable and a continuous variable.
Usage Conditions: Data must include one categorical variable and one continuous variable.
Advantages:
Disadvantages:
Practical Example:
Consider the following data:
Group | Scores | Count | Within-Group Sum of Squares |
---|---|---|---|
A | 10, 12, 14, 16 | 4 | 20 |
B | 20, 22, 24, 26 | 4 | 20 |
Calculation of the sum of squares between groups (SSB):
$$ SSB = 4 \cdot (\bar{x}_A - \bar{x}_T)^2 + 4 \cdot (\bar{x}_B - \bar{x}_T)^2 $$
Where \( \bar{x}_A = 13 \), \( \bar{x}_B = 23 \), and \( \bar{x}_T = 18 \)
$$ SSB = 4 \cdot (13 - 18)^2 + 4 \cdot (23 - 18)^2 $$
$$ SSB = 4 \cdot 25 + 4 \cdot 25 $$
$$ SSB = 100 + 100 = 200 $$
Calculation of the Total Sum of Squares (SST), which adds the within-group sums of squares to SSB:
$$ SST = \sum (x_i - \bar{x}_T)^2 = 200 + 20 + 20 = 240 $$
Calculation of the Eta Coefficient:
$$ \eta^2 = \frac{SSB}{SST} = \frac{200}{240} \approx 0.83 $$
$$ \eta = \sqrt{0.83} \approx 0.91 $$
Interpretation: A value of η ≈ 0.91 (η² ≈ 0.83) indicates that about 83% of the variance in the scores is explained by the group.
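The sums of squares can be checked with a short NumPy sketch on the same scores:

```python
import numpy as np

groups = {
    "A": np.array([10, 12, 14, 16]),
    "B": np.array([20, 22, 24, 26]),
}
scores = np.concatenate(list(groups.values()))
grand_mean = scores.mean()  # 18.0

# Between-group and total sums of squares.
ssb = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups.values())  # 200
sst = ((scores - grand_mean) ** 2).sum()                                   # 240
eta = np.sqrt(ssb / sst)
print(round(ssb, 1), round(sst, 1), round(eta, 3))  # 200.0 240.0 0.913
```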
Statistical distributions with two variables allow for analyzing the relationship between two variables using various tools and tables. These analyses help to understand how the variables are related and under what conditions these relationships apply.
The Course does not have a final bibliography (in its online version); references are included at the end of each Block.
The MCQ consists of sixteen questions covering certain parts of the Course; at the end, you will receive your score together with the answer key.
To access the MCQ, click on the following icon:
This session does not have downloadable notes. During the directed work session dedicated to this topic, we will revisit the fundamental questions of bivariate analysis using the bivariate table editor and the Python compiler.
To further your learning on this topic, you can consult the following links:
On the Course App, you will find the summary of the current Block, as well as series of Directed Work related to it.
You will also find references to multimedia content relevant to the Block.
In the Notifications section, an update is planned and will be based on questions asked by students during the Course and Directed Work sessions.
An update will also address the exams from previous sessions, which will be corrected during the directed work sessions to prepare for the current year's exams.
In this Python corner, we have integrated, using an example, the commands related to the essentials covered in this session.
The commands below cover three conditional statistics on a platform-usage example:
The conditional mean of time spent on the platform for users in the "Student" category (how much time, on average, students spend on the platform);
The conditional variance of time spent for users in the "Professional" category (the dispersion of the time professionals spend on the platform);
The conditional standard deviation of time spent for users who watch videos (the dispersion of time spent watching videos on the platform).
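A minimal self-contained sketch of the three computations described above; the data set and the column names ("category", "content", "time_spent") are invented for illustration:

```python
import pandas as pd

# Hypothetical platform-usage data; values are illustrative only.
df = pd.DataFrame({
    "category":   ["Student", "Student", "Professional", "Professional", "Student"],
    "content":    ["Video", "Text", "Video", "Text", "Video"],
    "time_spent": [35, 20, 50, 40, 45],  # minutes per session
})

# 1) Conditional mean of time spent for the "Student" category.
student_mean = df.loc[df["category"] == "Student", "time_spent"].mean()

# 2) Conditional variance of time spent for the "Professional" category.
professional_var = df.loc[df["category"] == "Professional", "time_spent"].var()

# 3) Conditional standard deviation of time spent for users who watch videos.
video_std = df.loc[df["content"] == "Video", "time_spent"].std()

print(student_mean, professional_var, video_std)
```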
Using the link below, you can download the Flipbook in PDF format:
The forum allows you to discuss this first session. You will notice the presence of a subscription button so that you can follow discussions about research in the humanities and social sciences. It is also an opportunity for the instructor to address students' concerns and questions.