" alt="top image of the page, it is used for decoration.">



Introduction and Session Overview

This first session of the second Block focuses on the analysis of the data distribution through numerical and visual representations. These representations constitute one of the elementary operations of data analysis (and of descriptive statistics), whose goal is to describe the data distribution using a set of indicators.

Organizing and representing data is the step that follows the data collection process. The available data must undergo a series of transformations to make them more readable and consequently exploitable for statistical analysis purposes.

In descriptive statistics, the first step of the analysis involves examining variables one by one, hence the term univariate analysis. Univariate analysis operates in two complementary ways: data representation (construction of distribution tables and graphs) and the calculation of relative and descriptive measures (central tendency, position, and dispersion measures). Numerical and graphical representations of the distribution provide a global view of the studied phenomenon and its direction. Next, a certain number of indices will be calculated to simplify the mass of information contained in our distribution, allowing for interpretation according to the hypotheses of our research.

In a research project, it is not always necessary to take into account various statistical indicators to fully account for the studied phenomenon. The experience of research, understood in a broad sense, will provide informed insight regarding the decisions to be made about the elements to present in our data analysis.

Descriptive statistics are a set of indices aimed at expressing in numbers the main characteristics of the data distribution, with the objective of interpreting them. Univariate descriptive statistics can be grouped into three main categories: central tendency measures, position measures, and dispersion measures. Other indices can also be calculated, and we will provide examples at the end of this section.

This session and the following two (measures of dispersion and measures of position) focus on characterizing the frequency distribution obtained after grouping raw data. The grouping and sorting of raw data provide an overall picture of the distribution, and the measures aim to describe the distribution's characteristics. Data analysis relies on a language of indices to summarize the essential information contained in a data set. Statistical indices offer an overview of the studied data.

When we determine the parameters of central tendency, they help us understand what happens in the middle of our data distribution. However, with identical central tendency indices, two or more data sets can be distributed differently; we say that these data sets deviate from the central tendency values. Even with the same central tendency parameters, observations can be very close or very far from the central values. Therefore, we cannot rely solely on central values to understand and analyze our data set. At this point, we will focus on parameters that describe the dispersion of data around the center. Central tendency measures provide information about the center of the distribution but have limitations when the data are dispersed to the point where these measures do not accurately reflect the observed data. Dispersion measures can be considered complementary to those that describe what happens at the center of the distribution. Finally, position measures provide the specific position of a data point within the distribution.

During the three sessions that make up this course, you will be guided by a Guide Example, which is a fictional survey about the use of the social network Facebook. This example will help you understand, using its data, how to calculate and interpret univariate analysis data. To access the Guide Example, click on the button located to the right of this main text.

In this course, you will find two types of editors and commands that allow you to work with the Python language.
Two types of editors are provided: a Graph Editor and a Data Editor. The Graph Editor comes in two versions, one for representing a qualitative variable and another for plotting diagrams for quantitative variables. To calculate the statistical indices that form the basis of univariate analysis, two types of Data Editors have also been incorporated; they allow you to calculate the parameters discussed in this course using the data you input.
At the end of this course, you will find a downloadable Python Command Sheet, which will help you better use the language to calculate the indicators covered during the course and build the appropriate graphs. You can rely on the GSE (Google Search Engine) located in the navigation bar for further explanations, and additional guidance is provided in the 'For Further Learning' section.


Session Objectives

During this session, we aim to achieve the following objectives:

  • Numerical representation of data:
    In this course, we aim to familiarize ourselves with statistical tables. Statistical tables are the first to be established when looking for a general, global overview of the data. We will focus on the typology and content of a (univariate) statistical table and also try to lay the foundations for constructing statistical tables according to the most commonly used bibliographic styles in social sciences and humanities (the reader will find a whole section dedicated to bibliographic writing following APA, CHICAGO, and TURABIAN standards in the Appendix section);

  • Graphical representation of data:
    Graphs and diagrams are another way to present data. In this course, we will explain how to construct this graphical representation of data according to the researcher’s interest and objectives. We will also explain how to work with graphs to meet the requirements of a given bibliographic style. Students can use the graph editor included in this course to simulate, practice, or simply learn in-depth using the examples provided in the editor;

  • Define the central tendency indices: Mode, Median, and Mean
    Recognizing and defining the parameters of central tendency is an important task in data analysis. This work will allow you to make an informed choice when interpreting the data from the survey. We have included a digital editor that calculates each index based on the inserted data, allowing you to practice at any time;

  • Calculate indices and produce them in a data analysis software
    As mentioned in the very first course, we have favored an analytical approach that involves calculating the statistical indices by hand and then using software, in order to better understand the origin of each index and its role in the analysis process. Alongside the Python compiler, you can also work with the online versions of SPSS and JAMOVI (my thanks to the teams behind these two data-science tools for their support);

  • Define and calculate the dispersion and position indices:
    This session is also an opportunity to calculate dispersion measures, and we will see how to carry out this work. As mentioned earlier, you can use the digital editor for real-life examples;

  • Apply the acquired knowledge to a specific case:
    The guide example inserted at the very beginning of the course will serve as our first immersion into the world of data interpretation. We will try to give meaning to the obtained results;

  • Interpret the results:
    Data analysis partly relies on the interpretation of the indices. We will see how to incorporate this interpretation into your future analyses. This objective is crucial, as by the end of this course, you should be able to move from the simple calculation of an index to its interpretation, reinforcing your statements when composing your survey;

Concepts and Themes to be Covered During the Session

Data representation, statistical table, diagram, graph, table typology, diagram typology, measures of central tendency, mode, median, mean, modal class, uniform distribution, unimodal distribution, bimodal distribution, multimodal distribution, median class, range, variance, mean deviation, standard deviation, coefficient of variation, rank, middle rank, percentile rank (deciles – percentiles – quartiles), z-score, T-score.


Block Presentation

1. Data Distribution

Raw Data. To perform a univariate analysis, one must have raw data describing the characteristics of the population (or sample) concerned. Raw data are untreated data, and compiling them into a data distribution (also called a statistical series) constitutes the first step in the presentation, treatment, and descriptive analysis of the data.

In order to establish a data distribution, the frequencies that make up each category (or value) of the studied variable must be determined.

1.1. Frequency Tables

When raw data is organized into a table, we obtain what is called a data table.

In social sciences and humanities, tables are essential tools for research. They serve certain functions that allow for a better understanding of the studied phenomenon.

There are three types of statistical tables: data tables (for now, we will limit ourselves to an introduction to the nature of these tables; the student will find more explanations in the section dedicated to computer data processing using a dedicated program), variable distribution tables, and contingency tables, which are relevant to this chapter and will be addressed in the following sections.

The Data Table

In research, data tables are the first to be constructed. They are used for data processing, commonly referred to as Flat Sorting. A data table is extensive, and each cell provides information that characterizes the subject.

A data table primarily consists of two margins: the list of subjects and the list of variables. The list of subjects is a numerical or alphabetical list used to enumerate and identify the subjects under study (the questionnaire numbers in our case). The list of variables provides information about the characteristics of the variables considered: VAR001, VAR002, etc. The data table thus constitutes an enumerative list of the raw data collected from a large set of subjects.

Variable Distribution Table

A variable distribution table is constructed after extracting one or more columns from the data table.

A variable distribution table is a juxtaposition table; it faithfully reproduces the content of one or more columns from the data table taken separately.
A variable distribution table generally consists of three margins:

  • A margin that groups the modalities or values of the variable, denoted \(x_i\);
  • A margin that identifies the respective frequencies of each modality (or value), denoted \(n_i\) [possibly another column for frequencies denoted \(f_i\) or percentages denoted \(\%\) ];
  • A Total margin, sometimes identified using the Sigma symbol (\(\sum\)), which indicates the sum of frequencies, totals, or percentages (other sums can be calculated and will be developed in the following sections).

The table below illustrates the concept of a variable distribution table using our example [the mechanism is the same for all types of variables]:

Age (years) \(n_i\) \(\%\)
20 40 40
21 20 20
22 15 15
23 10 10
24 15 15
\(\sum\) 100 100
Table II.1.1. Distribution of respondents by age.
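A distribution table like Table II.1.1 can also be produced programmatically. The following sketch assumes the pandas library and an illustrative list of raw ages reproducing the guide example; it is one possible way to build the table, not the official Python Command Sheet:

```python
# Minimal sketch: building a variable distribution table with pandas.
# The raw 'ages' list below is an illustrative stand-in for the survey data.
import pandas as pd

ages = [20] * 40 + [21] * 20 + [22] * 15 + [23] * 10 + [24] * 15  # 100 respondents

table = (
    pd.Series(ages, name="Age (years)")
    .value_counts()              # frequencies n_i
    .sort_index()
    .to_frame(name="n_i")
)
table["%"] = 100 * table["n_i"] / table["n_i"].sum()   # relative frequencies
table.loc["Total"] = table.sum()                        # the Sigma row
print(table)
```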
1.2. Graphical and Numerical Representations

The preliminary creation of a table simplifies the construction of the graph. The choice of a type of graph depends on the nature of the variable, its measurement scale, and the type of data grouping performed.

Qualitative Variable

There are two types of graphs for representing a qualitative variable: bar chart and pie chart.

Definition II.1.1: Bar Chart

A bar chart has two perpendicular axes. On the horizontal axis (the axis of the variable's categories), the categories of the variable are represented by segments of equal width, ensuring they are separated by equal spaces. On the vertical axis (number of units, percentages), the frequencies (or percentages) are plotted. For each segment associated with a category, a rectangle is constructed with a height proportional to the frequency (or percentage) of the category, according to an appropriate scale.

Definition II.1.2: Pie Chart

A pie chart is a chart consisting of a circle divided into sectors, each sector having a central angle proportional to the quantity represented (Dodge, 2007, 129-130). Pie charts are mainly used to present data that, when combined, form a whole.

Consider the following frequency table [taken from our example]:

Gender \(n_i\)
Male 60
Female 40
\(\sum\) 100
Table II.1.2. Distribution of respondents by gender.

We obtain the following two graphical representations of this distribution:

Figure II.1.1. Bar chart of the distribution of respondents by gender. Figure II.1.2. Pie chart of the distribution of respondents by gender.
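Charts such as Figures II.1.1 and II.1.2 can be reproduced in Python. The sketch below assumes the matplotlib library and uses the counts from Table II.1.2; the titles and layout are illustrative choices:

```python
# Minimal sketch: bar chart and pie chart for the Gender distribution (Table II.1.2).
import matplotlib.pyplot as plt

categories = ["Male", "Female"]
counts = [60, 40]

fig, (ax_bar, ax_pie) = plt.subplots(1, 2, figsize=(9, 4))

# Bar chart: one rectangle per category, height proportional to the frequency
ax_bar.bar(categories, counts, width=0.5)
ax_bar.set_xlabel("Gender")
ax_bar.set_ylabel("Frequency (n_i)")
ax_bar.set_title("Distribution of respondents by gender")

# Pie chart: one sector per category, angle proportional to the frequency
ax_pie.pie(counts, labels=categories, autopct="%1.0f%%", startangle=90)
ax_pie.set_title("Distribution of respondents by gender")

plt.tight_layout()
plt.show()
```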
Presentation Standards for a Graph
The following guidelines are standard across most bibliographic styles of writing. The Appendix provides details on various standards related to each writing style.

  • It should have a title similar to that of a variable distribution table;
  • The graph should be numbered;
  • Modalities should be listed (or the axes and units of measurement named), or a legend explaining the symbols used should be added;
  • A comment should be included.
Explore the Chart Editor

Discover the chart editor for qualitative variables. Click the link below to try introducing modalities and data. Learn and master the basics interactively and playfully.

Access the Editor

All editors are accessible in the Appendix of this course.


Quantitative Variable

The representation and processing of a quantitative variable is more complex than that of a qualitative variable. The graphical representation of a quantitative variable mainly depends on two parameters: the number of observations (relative to the population N, or the sample n), and the number of values that the studied variable can take (regardless of whether it is discrete or continuous).

The quantitative variable falls into three main categories of data: isolated data, data grouped by values, and data grouped by class.

Isolated Data

We refer to isolated data when the size (N) of the population is less than 20 units. This represents a small amount of data. Note that this condition has no theoretical basis; it is based on practical experience.

In the case of isolated data, constructing a frequency distribution table has no particular value; the data are simply arranged in ascending order.

Data Grouped by Values

Data is said to be grouped by values when the number of distinct values of the variable is low compared to the size of the population N (or the sample size n), and the latter is greater than 20.

The treatment of data grouped by values remains the same as for qualitative variables. Constructing a frequency distribution table follows the same logic, with one exception: the modality column is replaced by a values column.

The numerical presentation of a discrete quantitative variable is thus done using a variable distribution table where the first column contains the values of the variable, the second column is the frequency column. If necessary, a third column for percentages can be added.

The following table presents the distribution of a discrete quantitative variable.

Number of Siblings Frequency \(\%\)
0 10 10
1 17 17
2 40 40
3 20 20
4 7 7
5 6 6
\(\sum\) 100 100
Table II.1.3. Distribution of the sample by number of siblings
Graphical Representation

A discrete quantitative variable whose data are grouped by values is represented using a bar chart. A bar chart consists of two perpendicular axes: the various values of the variable are plotted on the horizontal axis, and the frequencies (or percentages) on the vertical axis. Perpendicular to the value axis, opposite each value, a straight line segment, called a bar, is drawn, with a length proportional to the frequency or percentage of that value.

Explore the Chart Editor

Discover the chart editor for discrete quantitative variables. Click the link below to try introducing values and data. Learn and master the basics interactively and playfully.

Access the Editor

All editors are available in the Appendix section of this course.

Grouped Data in Classes

Data are grouped into classes when the number of values of the variable is close to N or n (and n is greater than or equal to 20). In the case of highly variable data, it becomes very difficult (if not impossible) to treat them as isolated values, and constructing a frequency distribution table for the variable becomes unnecessary because we would end up with a large number of frequencies equal to 1. It is therefore appropriate to group them into classes while adhering to certain principles.

To group such data, they should be included in classes.

The numerical presentation of data grouped into classes consists of a distribution table with one column for the classes and a second column for the frequencies.

The following table represents the distribution of class-grouped data for the variable "age" from our example.

Age (Years) Frequency \(\%\)
[20-21[ 40 40
[21-22[ 20 20
[22-23[ 15 15
[23-24[ 10 10
[24-25[ 15 15
\(\sum\) 100 100
Table II.1.4. Distribution of the sample by age

Note: For the purposes of other statistical index calculations, it may be useful to add columns to the previous table, especially for calculating amplitudes, class centers, etc.

Graphical Representation

A frequency distribution of data grouped into classes can be represented in two ways: the histogram and the frequency polygon.

Diagrams

Definition II.1.3: Histogram

The histogram is a graphical representation of the distribution of quantitative data. It consists of vertical bars whose height is proportional to the frequency or proportion of values in each class interval.

Definition II.1.4: Frequency Polygon

The frequency polygon is a broken line that connects the tops of the histogram bars. It allows for a continuous visualization of the data distribution and highlights trends and variations.

The graphical representation of the variable age in our example provides the following results:

Figure II.1.3: Histogram of the distribution
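A histogram and its frequency polygon, similar to Figure II.1.3, can be drawn from the grouped age data of Table II.1.4. The sketch below assumes matplotlib; the styling choices are illustrative:

```python
# Minimal sketch: histogram and frequency polygon for the age classes [20-21[ ... [24-25[.
import matplotlib.pyplot as plt

edges = [20, 21, 22, 23, 24, 25]   # class boundaries
freqs = [40, 20, 15, 10, 15]       # frequencies n_i
centers = [(a + b) / 2 for a, b in zip(edges[:-1], edges[1:])]   # class centers c_i

fig, ax = plt.subplots()
# Histogram: one bar per class, height proportional to the frequency
ax.bar(centers, freqs, width=1.0, edgecolor="black", label="Histogram")
# Frequency polygon: broken line joining the tops of the bars at the class centers
ax.plot(centers, freqs, marker="o", color="red", label="Frequency polygon")
ax.set_xlabel("Age (years)")
ax.set_ylabel("Frequency (n_i)")
ax.set_xticks(edges)
ax.legend()
plt.show()
```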

Explore the Graph Editor

Discover the graph editor for continuous quantitative variables. Click the link below to try adding classes and data. Learn and master the basics interactively and playfully.

Access the Editor

All editors are accessible in the Appendix section of this course.


Note: For continuous quantitative variables, we typically calculate several indices related to statistical indicators, such as amplitude and class center, cumulative frequencies (both increasing and decreasing). The next session, which covers measures of central tendency, will revisit these calculations.



2. Central Tendency Measures

Central tendency measures aim to highlight the center of the frequency distribution. The measures of central tendency are: mode, median, and mean.

Note: In this session, we will present the main measures of central tendency used in the analysis of data in the humanities and social sciences. This presentation is not arbitrary, as it will be used to study, in a more practical manner, the interpretation of the data contained in the first learning booklet of the analysis software.

At the end of this section, you will find the Spreadsheet that allows you to calculate all central tendency, dispersion, and position parameters. The same spreadsheet is available in the Appendix section of this course.

2.1. The Mode \(M_o\)

The mode (denoted \(M_o\)) is the simplest measure of central tendency to understand.

Note: The mode is the only measure of central tendency that can be evaluated regardless of the nature of the variable. For a qualitative variable, calculating the median or mean is not meaningful.

Definition II.2.1: The Mode

The mode represents the modality (or value) with the highest frequency. When a data series has two modalities with the highest frequency, it is called a Bimodal series.

In our survey, for the variable Gender (VAR001), the Mode is: Male, as it is the most represented modality in terms of frequency (60%).
For the variable number of siblings, the Mode is: 2 (siblings), with 40% of the total frequency.

Calculating the Mode for a Qualitative Variable
For a qualitative variable, the mode represents the most frequent modality in the frequency distribution.
Visually, the modality is represented by the tallest bar in the bar chart or the largest sector in the pie chart.

Example:
In our survey, we have already plotted two charts representing the variable Gender. From the pie chart or bar chart representation, we can see that the Mode of our series is the gender: Male

Figure II.2.1. Distribution of the sample by Gender

Calculating the Mode for a Quantitative Variable
If the variable being studied is quantitative, the mode represents the most frequent value in the statistical series.
Depending on the type of data, the mode can be directly calculated or estimated.

Case of Isolated Data

In the case of isolated data, the mode is the value with the greatest number of occurrences.

Example

In the following series: 1, 1, 1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 6, 6. The Mode \(M_o\) is the number 4 because it has the highest frequency (it repeats the most).
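The mode of such an isolated series can be checked directly in Python with the standard-library statistics module, for example:

```python
# Minimal sketch: finding the mode(s) of the isolated series above.
from statistics import multimode

data = [1, 1, 1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 6, 6]
print(multimode(data))   # [4] -> the value 4 has the highest frequency
```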

Case of Grouped Data by Values

The definition remains the same as before: the Mode is the value with the highest frequency (percentage) in the distribution table. Graphically, it is represented by the tallest segment in the bar chart.

Example

In our SPSS output work, we can establish the following table showing the variable Number of Siblings:

Figure II.2.2. Distribution of the sample by number of siblings

In the bar chart of the distribution, we can also see that the tallest bar represents a number of siblings equal to: 2

Table II.2.1. Distribution of the sample by number of siblings

Case of Grouped Data

In the case of grouped data, it is not possible to determine a unique value for the Mode, but the modal class can be identified.
Without knowing the exact value of the Mode, the class center \( (c_i) \) is generally used as an estimated value for the Mode.

Definition II.2.2: Class Center \( (c_i) \)

The class center represents the central point of a class in a frequency distribution

To calculate the class center, use the following formula:

\(\text{Class Center} = \frac{\text{Lower Bound} + \text{Upper Bound}}{2}\)

Where:
Lower Bound: the lower limit of the class,
Upper Bound: the upper limit of the class.

Some authors calculate the exact value of the Mode using a formula. For this course, it is not essential to perform this calculation.

Visually, the Mode can be identified in the histogram; it corresponds to the class center of the tallest bar in the histogram.
The following example represents the histogram of the age variable from our dataset.

Figure II.2.3. Histogram of the age variable


We observe that the modal class is the one ranging from 20 to 22 years, which means that the mode of our series is 21 (the center of this class).
A distribution that does not have a visually apparent mode is called a uniform distribution.

2.2. The Median \(M_d\)

The Median (denoted \(M_d\)) is another measure of central tendency that is of particular interest to researchers. The median is the value (or modality) that divides the data into two equal parts.

Definition II.2.3: The Median \(M_d\)

The median is the value (or modality) that divides the data into two equal parts.

In our example, for the variable Study Level (VAR003), the median is: second year, since 55% (more than 50%, i.e., half) of the surveyed students are in the first or second year.

To evaluate or calculate the median, the data must be ordered. In the case of a nominal qualitative variable, such an operation cannot be performed; the median only makes sense for an ordinal qualitative variable or a quantitative variable.

Case of Isolated Data

In the case of isolated quantitative data, the median is the central data if the number of observations is odd. If the number of observations is even, the median will be at the midpoint between the two central values, as indicated in the following two formulas:

Odd \(N\) (or \(n\)): $$\color{RoyalBlue}{{\text{Md}}} = \left(\frac{n+1}{2}\right)^{\text{th}} \text{ data} $$

Example:
In the following series: 1, 1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 5, we observe that there are 13 observations. Since \(n\), the number of observations, is odd, the median is the value in the seventh position, namely 2, as shown below:

\(\underbrace{1, 1, 1, 1, 2, 2}_\text{Six observations} ~~~~ \underbrace{2}_\text{The median} ~~~~\underbrace{3, 3, 4, 4, 4, 5}_\text{Six observations}\)

Even \(N\) (or \(n\)) : $$M_d = \text{the midpoint between the } \left(\frac{n}{2}\right)^{\text{th}} \text{ and } \left(\frac{n}{2}+1\right)^{\text{th}} \text{ data} $$

Example:
In the previous series, we will add a number, for example, 6: 1, 1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 5, 6. We now observe that there are 14 observations. Since \(n\), representing the number of observations, is an even number, we choose the median as the midpoint between the 7th and 8th observations, as shown below:

\(\underbrace{1, 1, 1, 1, 2, 2}_\text{Six observations} ~~~~ \underbrace{\frac{2 + 3}{2}}_\text{The median} ~~~~ \underbrace{3, 4, 4, 4, 5, 6}_\text{Six observations}\)

Therefore, we say that the median is: 2.5.
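Both results can be verified with the standard-library statistics module:

```python
# Minimal sketch: the median of the two isolated series above.
from statistics import median

odd_series = [1, 1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 5]   # 13 observations
even_series = odd_series + [6]                           # 14 observations

print(median(odd_series))    # 2    (7th ordered value)
print(median(even_series))   # 2.5  (midpoint of the 7th and 8th values)
```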

Case of Grouped Data by Values

The definition remains the same as for an ordinal qualitative variable. To determine the median, first calculate the cumulative percentages of the data distribution; then identify the first value where 50% or more of the cumulative percentages are represented.

Note: When the cumulative percentage of a value corresponds exactly to 50%, the median will be the number located midway between the value with a 50% cumulative percentage and the next value.

Example:
In our example guide, we will calculate the Median for the variable number of siblings, resulting in the following:


Figure II.2.4. Median value of the number of siblings variable, SPSS calculation


Case of Grouped Data by Classes

To determine the median in the case of grouped data by classes, the following steps are generally followed:

  • Determine the median class and its lower bound: the median class is the first class whose cumulative percentage exceeds 50%;
  • Determine the frequency of the median class;
  • Determine the cumulative frequency of the class preceding the median class.

Then, apply the following formula:

\(M_d= L+\left[\frac{ { \frac{n}{2}} - \sum f_{inf}}{f_{Md}} \right] \times c\)

where \(L\) is the lower bound of the median class, \(\sum f_{inf}\) the cumulative frequency of the classes preceding it, \(f_{Md}\) the frequency of the median class, and \(c\) the class width.

Note: The same formula can be applied by replacing frequencies with percentages.

In our example, we calculate the median for the age variable as follows:

\(M_d= 21+\left[ \frac {50 - 40} {20} \right] \times 1 = 21.5\)
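The same computation can be scripted. The sketch below implements the textbook formula above; the function name and argument layout are illustrative, not taken from a particular library:

```python
# Minimal sketch of the grouped-data median formula.
def grouped_median(edges, freqs):
    """edges: class boundaries [b0, b1, ..., bk]; freqs: frequencies per class."""
    n = sum(freqs)
    half = n / 2
    cum = 0
    for i, f in enumerate(freqs):
        if cum + f >= half:      # first class whose cumulative frequency reaches n/2
            lower = edges[i]
            width = edges[i + 1] - edges[i]
            return lower + (half - cum) / f * width
        cum += f

# Age classes [20-21[ ... [24-25[ from the guide example
print(grouped_median([20, 21, 22, 23, 24, 25], [40, 20, 15, 10, 15]))   # 21.5
```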
2.3. The Mean

The mean is one of the most commonly used measures of central tendency in human sciences research.
The term Mean encompasses several types: geometric mean, quadratic mean, etc. In our case, we use the term Mean to refer to the arithmetic mean calculated in data analysis.

Definition II.2.4: The Arithmetic Mean \(\bar x\)

The arithmetic mean is the ratio of the sum of the values (weighted by their frequencies) to the number of observations (N).

In our example, for the variable number of siblings (VAR004), the arithmetic mean is equal to: 2 (siblings).

Note: We speak of the simple (unweighted) arithmetic mean when dealing with a variable whose data are few enough not to require weighting. This form is rarely used in research.

When calculating the population mean, we use the notation \(\mu_x\) (read as mu sub x), while the notation \(\bar x\) (read as x bar) is used for a sample. This distinction is important to remember, as it will be needed during inference operations.

Generally, and as with the median, the mean cannot be calculated for a nominal qualitative variable, but it can be for an ordinal qualitative variable if certain conditions are met. The calculation of the mean is especially meaningful for a quantitative variable.

In simple terms, the mean is given by the following relationship:

For a population:

\(\mu_x = \frac {Sum~ of~ all~ data~ in~ the~ population} {Total~ number~ of~ data~ in~ the~ population} \)

For a sample:

\(\bar x = \frac {Sum~ of~ all~ data~ in~ the~ sample} {Total~ number~ of~ data~ in~ the~ sample} \)

Case of a Qualitative Variable

In the case of an ordinal qualitative variable, the mean can be calculated provided that the variable's modalities have been previously coded. The definition of the mean calculation remains the same as mentioned above.

Case of a Quantitative Variable

The calculation of the mean for a quantitative variable follows the same logic described above. However, there is a difference when the considered variable is a continuous quantitative variable, which involves using class centers as in the calculation of the median discussed previously.

Calculation of the Mean for Isolated Data

For isolated data, the mean is calculated by summing all the values of the variable and dividing by their number, as expressed in the following formula:

For the population:

\(\mu_x = \frac {\sum \chi_i } {N}\)

For the sample:

\(\bar \chi = \frac {\sum \chi_i } {n}\)


Example: Consider the following series: 1, 3, 4, 5, 5, 6, 6, 7, 8, 9, 11, 12, 13, 30, 30.

Applying the definition of the arithmetic mean, we get the following result:

\(\bar \chi = \frac {\sum \chi_i } {n} = \frac {1+3+4+5+5+6+6+7+8+9+11+12+13+30+30 } {15} = \frac {150} {15} = 10\)


Calculation of the Mean for Grouped Data by Values
When the data is grouped by values, the formula applies with an adjustment: the mean is weighted by the respective frequencies \((n_i)\) representing each value.

Thus, the formula to use for data grouped by values is written as follows:

For the population:

\(\mu_x = \frac {\sum n_i \chi_i } {N}\)

For the sample:

\(\bar \chi = \frac {\sum n_i \chi_i } {n}\)

Example: We will calculate here, using SPSS software, the arithmetic mean for data grouped by values; the same procedure applies to data grouped by class.

In this example, we will calculate the arithmetic mean for the variable number of siblings from our example guide.
In SPSS, the manipulation yields the following result:


Figure II.2.6. Value of the mean for the variable number of siblings, SPSS calculation
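The same weighted mean can be computed directly from the frequency table of the number of siblings (Table II.1.3); a minimal sketch in plain Python:

```python
# Minimal sketch: weighted arithmetic mean for data grouped by values.
values = [0, 1, 2, 3, 4, 5]        # x_i (number of siblings)
freqs = [10, 17, 40, 20, 7, 6]     # n_i

n = sum(freqs)
mean = sum(x * f for x, f in zip(values, freqs)) / n
print(mean)   # 2.15, i.e. about 2 siblings
```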

Calculation of the Mean for Grouped Data by Classes

In the case of data grouped by classes, the value of the mean is weighted according to the class midpoints, providing an approximate result for the mean. The mean is calculated using the following formulas:

For the population:

\(\mu_x = \frac {\sum n_i c_i } {N}\)

For the sample:

\( \bar \chi = \frac {\sum n_i c_i } {n}\)

Example: Using the age distribution of Table II.1.4 (classes [20-21[ to [24-25[), the class centers are 20.5, 21.5, 22.5, 23.5, and 24.5, so the mean is:

\(\bar x = \frac{(40 \times 20.5)+(20 \times 21.5)+(15 \times 22.5)+(10 \times 23.5)+(15 \times 24.5)}{100} = \frac{2190}{100} = 21.9\) years.
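The same calculation in Python, computing the class centers \(c_i\) explicitly (a sketch, not the official command sheet):

```python
# Minimal sketch: approximate mean for data grouped into classes, using class centers.
edges = [20, 21, 22, 23, 24, 25]
freqs = [40, 20, 15, 10, 15]

centers = [(a + b) / 2 for a, b in zip(edges[:-1], edges[1:])]   # c_i
n = sum(freqs)
mean = sum(c * f for c, f in zip(centers, freqs)) / n
print(mean)   # 21.9 years
```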

Usage of Parameters
  • The mode is mainly used for qualitative variables, and for distributions that are bimodal or multimodal, where the mean or the median would be less informative;

  • The median is preferred for ordinal variables and for quantitative variables whose distribution is asymmetric (skewed) or contains extreme values;

  • In the case of a quantitative variable with a symmetric distribution, the most appropriate measure is the mean, as it allows for inference due to its stability;

  • When, in the case of a symmetric distribution, all three measures (mean, median, mode) are close, the mean is often used as it is the most representative.


3. Dispersion Parameters


Measures of central tendency provide information about the center of the distribution. However, they have limitations when the data in a distribution are sufficiently dispersed that these measures do not faithfully represent the observed data. Dispersion measures can be considered complementary to those describing what happens at the center of the distribution.
In the humanities and social sciences, we mainly use two types of dispersion measures: the range and the standard deviation (the latter being derived from the variance).

3.1. Range

The range is defined as the difference between the largest and smallest values in a statistical series.

Case of Isolated Data

Using the definition, we obtain a simpler formula:

\( \text{Range} = \text{Largest value} - \text{Smallest value} \)
Example and Explanation

In a descriptive statistics exam, a teacher corrected the papers of two groups, each consisting of forty (40) students. We will reproduce the scores of each group.

Group 1
0 0 1 1 2
3 3 4 5 6
6 7 8 9 9
9 10 10 10 10
10 10 11 11 12
12 13 13 14 15
15 16 16 17 17
18 18 19 20 20
Group 2
4 4 4 4 5
5 5 6 6 6
6 7 7 7 7
8 10 10 10 10
10 10 10 10 10
11 11 11 14 14
14 15 15 15 16
16 16 17 17 17

We notice that both series have exactly the same central values, namely: a Mode of 10, a mean of 10, and a median of 10 as well.

However, when calculating the range for each of the two series, we obtain the following results:

\( \text{Range} ~~ (\text{Group 1}) ~~: ~~ E_{G1} ~~ = ~~ 20 ~~ - ~~ 0 ~~ = ~~ 20. \)

\( \text{Range} ~~ (\text{Group 2}) ~~: ~~ E_{G2} ~~ = ~~ 17 ~~ - ~~ 4 ~~ = ~~ 13. \)

We observe, from the range calculation, that in the first group the scores vary from 0 to 20, resulting in a range that is larger than that observed in the second group, which is 13. The dispersion of the scores obtained by the students is greater in the first group than in the second.

Although it is only a rough measure, the range is simple to evaluate and provides a first impression of the variability of the data.
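Since the range is simply the maximum minus the minimum, it takes one line per group in Python; the lists reproduce the two score tables above:

```python
# Minimal sketch: the range of each group of exam scores.
group_1 = [0, 0, 1, 1, 2, 3, 3, 4, 5, 6, 6, 7, 8, 9, 9, 9, 10, 10, 10, 10,
           10, 10, 11, 11, 12, 12, 13, 13, 14, 15, 15, 16, 16, 17, 17, 18,
           18, 19, 20, 20]
group_2 = [4, 4, 4, 4, 5, 5, 5, 6, 6, 6, 6, 7, 7, 7, 7, 8, 10, 10, 10, 10,
           10, 10, 10, 10, 10, 11, 11, 11, 14, 14, 14, 15, 15, 15, 16, 16,
           16, 17, 17, 17]

print(max(group_1) - min(group_1))   # 20
print(max(group_2) - min(group_2))   # 13
```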

Case of Data Grouped by Values

Regardless of the type of value grouping considered, the definition of the range remains the same:

\( \text{Range} ~~ = ~~ \text{Largest value} ~~ - ~~ \text{Smallest value} \)

Example and Explanation

In the following table, which presents the number of children per household, we can easily assess the range :

\(x_i\) 0 1 2 3 4 5 \(\sum\)
\(n_i\) 22 40 18 12 05 03 \(100\)

The highest value being \(5\) and the lowest value being \(0\), we note that the range is \(5 - 0 = 5\). The same observation can be made regarding the first example: two series with the same range do not necessarily have the same variability.

Case of Data Grouped by Classes

In the case of data grouped into classes, the range is calculated based on the class boundaries. The range of the sample is equal to the difference between the upper boundary of the last class and the lower boundary of the first class.

\( \text{Range} ~~ = ~~ \text{Upper boundary of the last class} ~~ - ~~ \text{Lower boundary of the first class} \)

Example and Explanation

In the following example, which represents the distribution of a sample according to the variable age, we will evaluate the range as follows:

\(Age_{years}\) [ 20 - 30 [ [ 30 - 40 [ [ 40 - 50 [ [ 50 - 60 [ [ 60 - 70 [ [ 70 - 80 [ \(\sum\)
\(n_i\) 10 20 40 15 14 11 \(100\)

The upper boundary of the last class being 80 and the lower boundary of the first class being 20, the range is \(80 - 20 = 60\). As before, two series with the same range do not necessarily have the same variability.


3.2. Variance

The variance \(\sigma_x^2\) is, like the standard deviation \(\sigma_x\) and the coefficient of variation (\(C_v\)), an indicator of dispersion around the mean.

The principle of calculating variance (and standard deviation) involves estimating the average difference (or what is also called the mean deviation) of each observation from the arithmetic mean of these observations.

Note that the deviations from the mean are positive and negative values that cancel each other out: their sum is always equal to 0. Variance addresses this by working with the squared deviations, which are zero or positive.

The variance of a variable \(x\), denoted \(\sigma_x^2\) (read as \( \sigma^2 \text{ subscript } x \)), can be calculated for both the population and the sample.

Definition II.3.1: Variance \(\sigma_x^2\)

The variance of a variable (x) is equal to the average of the squared deviations between the values of the variable and the mean.

Note: The greater the deviations of the values from the mean, the higher the variance, and therefore the greater the dispersion around the mean (and vice versa).

Case of Individual Data

If the data are individual, the \(N\) values of the variable are denoted \(x_1, x_2, x_3, x_4, x_5, ..., x_N\). The variance is obtained by dividing the sum of the squared deviations between the data and the population mean by the number of data points:

\(\sigma_{x}^{2} = \frac {\sum\limits_{i=1}^{N} (x_i - \mu_x)^2} {N}\)


Note regarding the calculation of variance for a sample. For a sample, the formula for calculating variance needs to be adjusted. This modification accounts for sampling error and the fact that the sample is smaller than the population.

The correction for sampling bias is obtained by dividing the sum of squared deviations by (n − 1) rather than n. Thus, the variance calculated for a sample is called the sample variance and is denoted \(s_{x}^{2}\).

The formula to use for calculating the variance of individual data in a sample is:

\(s_{x}^{2} = \frac {\sum\limits_{i=1}^{n} (x_i - \bar x)^2} {n-1}\)


Remark (variance calculation). When using the shortcut formula, do not subtract the square of the arithmetic mean directly from the sum of the \(x_{i}^{2}\): the sum of squares must first be divided by the number of observations.

By simplifying the variance formula, we end up with the following equation, known as the König-Huygens theorem:

\(\sigma_{x}^{2} = \frac{1}{n} \sum\limits_{i} {x_i}^{2} - \bar{x}^2\)
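The identity can be checked numerically, along with the population/sample distinction, using the standard-library statistics module; the short series is purely illustrative:

```python
# Minimal sketch: population vs. sample variance and the Koenig-Huygens identity.
from statistics import pvariance, variance

data = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(data)
mean = sum(data) / n

# Definition: average of the squared deviations from the mean (population form)
var_def = sum((x - mean) ** 2 for x in data) / n
# Koenig-Huygens: mean of the squares minus the square of the mean
var_kh = sum(x ** 2 for x in data) / n - mean ** 2

print(var_def, var_kh)   # 4.0 4.0 -> identical
print(pvariance(data))   # 4.0     (population variance, divides by n)
print(variance(data))    # ~4.57   (sample variance, divides by n - 1)
```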

Example and Explanation

Let's try to calculate the variance for each group from the previous example: (we will use the table to better compute the terms of the equation)

Calculating the Variance for Group 1:

We have seen that the arithmetic mean is \(\bar{x} = 10\). The following table includes an additional row for the values of \(x_i^2\):

Group 1

\(x\) 0 0 1 1 2 3 3 4 5 6 6 7 8 9 9 9 10 10 10 10
\(x^{2}\) 0 0 1 1 4 9 9 16 25 36 36 49 64 81 81 81 100 100 100 100
\(x\) 10 10 10 11 11 12 12 13 13 14 15 15 16 16 17 17 18 18 19 20
\(x^{2}\) 100 100 100 121 121 144 144 169 169 196 225 225 256 256 289 289 324 324 361 400

Applying the formula, we get the following result: \(\sigma_{x}^{2} = \frac{5206}{40} - 10^{2} = 30.15\)

We will now proceed with calculating the variance for the second group to compare the results between the two groups:

Calculating the Variance for Group 2:

For Group 2, we will replicate the same procedure, and we will end up with the following result:

Group 2

\(x\) 4 4 4 4 5 5 5 6 6 6 6 7 7 7 7 8 10 10 10 10
\(x^{2}\) 16 16 16 16 25 25 25 36 36 36 36 49 49 49 49 64 100 100 100 100
\(x\) 10 10 10 10 10 11 11 11 14 14 14 15 15 15 16 16 16 17 17 17
\(x^{2}\) 100 100 100 100 100 121 121 121 196 196 196 225 225 225 256 256 256 289 289 289

Applying the formula, we get the following result: \(17.6\)

Note

It is worth noting that variance, like range, is sensitive to the variability of observations. Just as with the range, the variance of the scores in Group 1 is higher than that of Group 2.

Case of Data Grouped by Values

Consider a quantitative variable \(x\) observed on a population of size \(N\) (or a sample of size \(n\)), taking the distinct values \(x_1, x_2, x_3, ..., x_k\) with frequencies \(n_1, n_2, n_3, ..., n_k\). The variance is calculated following the same logic, each squared deviation being weighted by \(n_i\).

For a Population:
We remain within the same definition; variance is the weighted average of the squared deviations from the mean.

\(\sigma_{x}^{2} = \frac {\sum\limits_{i= 1}^{k} {n_i} (x_i - \mu_x)^2} {N}\)

For a Sample:

\(s_{x}^{2} = \frac {\sum\limits_{i=1}^{k} n_i (x_i - \bar x)^2} {n-1}\)

We can once again use the König-Huygens theorem, simplifying it to obtain the following formula:

For a Population:

\(\sigma_{x}^{2} = (\frac {1}{N} \sum\limits_{i=1}^{k} n_i {x_i}^2) - \mu_x^2\)

For a Sample:

\(\sigma_{x}^{2} = (\frac {1}{n} \sum\limits_{i=1}^{k} n_i {x_i}^2) - \bar x^2\)
Example and Explanation

Let's revisit the example of the number of children per household. We will add two columns to our table: in the first column, we will calculate \((x_i - \bar{x})^2\) and in the second column, \(n_i \cdot (x_i - \bar{x})^2\):

\(x_i\) \(n_i\) \((x_i-\bar{x})^2\) \(n_i \cdot (x_i - \bar{x})^2\)
0 22 2.16 47.54
1 40 0.22 8.84
2 18 0.28 5.06
3 12 2.34 28.09
4 5 6.40 32.00
5 3 12.46 37.38
N 100 --- 158.91

The mean of the series is equal to \(\bar{x} = 1.47\). Applying the variance formula, we get the following result:

\(\sigma_{x}^{2} = \frac{158.91}{100} \approx 1.59\)
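The same weighted computation in Python (a sketch using the children-per-household table above):

```python
# Minimal sketch: population variance for data grouped by values (children per household).
values = [0, 1, 2, 3, 4, 5]     # x_i
freqs = [22, 40, 18, 12, 5, 3]  # n_i

N = sum(freqs)
mean = sum(x * f for x, f in zip(values, freqs)) / N                 # 1.47
var = sum(f * (x - mean) ** 2 for x, f in zip(values, freqs)) / N    # weighted squared deviations
print(round(mean, 2), round(var, 2))   # 1.47 1.59
```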
Case of Values Grouped by Classes

The definition remains the same except that values are replaced by class midpoints, noted as \(c_i\).

For a Population:

\(\sigma_{x}^{2} = \frac {\sum n_i (c_i - \mu_x)^2} {N}\)

For a Sample:

\(s_{x}^{2} = \frac {\sum n_i (c_i - \bar x)^2} {n-1} \)

For the example regarding age, we get the result: \(\sigma_{x}^{2} = 283.84\).

3.3. The Standard Deviation

Variance has the drawback of being expressed in squared units (in our previous example: students' scores squared, number of children squared, etc.), which makes it less directly interpretable.

To bring variance to the same scale as the mean, we take the square root of it, resulting in a measure expressed in the same unit as the variable being studied: the standard deviation.

Definition II.3.2: The Standard Deviation \(\sigma_x\)

The standard deviation is defined as the square root of the variance (denoted as \(\sigma_x\) in the case of the population, read as sigma x, and \(S_x\) in the case of the sample). The standard deviation measures the average deviation between a value of the variable and the mean of the variable, effectively expressing it in the same unit of measurement.

Explanation: Referring to the variance of the age variable, the standard deviation would be: \(\sigma_x = \sqrt{283.84} = 16.84\). Thus, we observe that the age variable shows significant variation in our sample.

Use of Parameters:
  • The Range is simple to calculate and understand; it provides a quick idea of the difference between the extreme values of a dataset. However, it is sensitive to extreme or outlier values and does not provide information about the distribution of other values.

  • Variance uses all the data from the sample (or population), giving a comprehensive measure of dispersion. It remains a fundamental measure for many statistical tests and models (we will discuss this in the course on Inference). One weakness of variance is that its units are the squares of the data units, which can be difficult to interpret. Like the range, variance is also sensitive to extreme values.

  • The standard deviation is easy to interpret because it is expressed in the same units as the original data. However, like variance, it is sensitive to extreme values.

4. Position Parameters


Position statistics allow us to make comparisons, positioning one or more observations relative to the mean or to the entire set of observations.

There are different position statistics; we are interested in three of them: quantiles (in particular percentiles), ranks (notably the percentile rank), and the standard score (Z score).

4.1. Quantiles (Percentile Ranks)

A frequency distribution can be divided into a chosen number of parts. The percentile rank indicates the position of an observation (or value) relative to, and in comparison with, all other observations.

The operation of dividing the frequency distribution into certain parts is referred to as quantiles.

Quantiles are defined by analogy with the median. The most commonly used quantiles divide the frequency distribution into four (quartiles), five (quintiles), ten (deciles), and one hundred (percentiles).

A quantile of order \(\alpha\) %, denoted \(q_{\alpha}\), is the value of the variable for which the associated cumulative relative frequency is equal to \(\alpha\) %.

For frequencies: $$ F (q_{\alpha}) = \alpha \% $$

For counts: $$ N (q_{\alpha}) = \alpha \% \cdot n $$

In this section, we will limit ourselves to percentiles; other quantiles will be mentioned (with their formulas), and the reader can refer to them as needed.

Calculating the Percentile Rank
By definition, the percentile rank of a value is the percentage of observations falling below it, plus half the percentage of observations falling exactly on it.
The calculation of the percentile rank is done using the statistical table, after calculating the cumulative frequencies (percentages), and then performing an arithmetic correction to obtain the value of the percentile rank.
Percentile ranks are used in standardized tests, also known as norm-referenced tests, such as IQ tests, TOEFL, SAT, GRE, and GMAT, etc. By definition, standardized tests or norm-referenced tests are assessments designed to be administered and scored consistently for all participants.

Percentiles
Percentiles are values that divide the frequency distribution into 100 equal parts.
For example, the 18th percentile, denoted C18 (C subscript 18), is the value below which 18% of the data falls (and 82% of the data is above).

A percentile of order α is denoted Cα (where α represents the value below which α % of the data falls).
The calculation of a quantile is similar to that of the median, except that 50% is replaced by α %.

Case of isolated data
Calculating a percentile in the case of isolated data is quite simple: if the \(N\) observations correspond to 100 % of the data, then \(\alpha\) % corresponds to a position \(p\) in the ordered series. The rule of three gives:

Starting from the following equivalence:

$$\frac {\alpha} {100} = \frac {p}{N} \xrightarrow{\hspace{3cm}} p = \frac {N \alpha}{100}$$

In the worked examples below, the closely related convention \(p = \frac{\alpha (n+1)}{100}\) is used, with linear interpolation between the two surrounding values whenever \(p\) is not an integer.

Note, examples, and explanations:

  • Example 1 (five scores):

    Suppose we have the following scores for 5 students:

    \[ 45, 50, 55, 60, 65 \]

    We will calculate the 40th percentile (P40).

    Calculation Steps:

    1. Sort the data in ascending order.
    2. Calculate the position of the percentile: \[ \text{Position} = 40 \times \left( \frac{5 + 1}{100} \right) = 40 \times 0.06 = 2.4 \] The position 2.4 means that the 40th percentile lies between the 2nd and 3rd scores.
    3. Interpolate to find the exact value:

      The corresponding values are:

      • 2nd value: 50
      • 3rd value: 55

      Linear interpolation is done as follows:

      \[ P40 = \text{Value at the lower position} + (\text{Fractional part of the position} \times \text{Difference between the values}) \] \[ P40 = 50 + (0.4 \times (55 - 50)) = 50 + (0.4 \times 5) = 50 + 2 = 52 \]

    Result: The 40th percentile for these data is 52. This means that 40% of the students have a score of 52 or less.

  • Example 2 (eight scores):

    We have the following scores for 8 students:

    \[ 48, 55, 58, 60, 65, 68, 72, 75 \]

    We will calculate the 75th percentile (P75).

    Calculation Steps:

    1. Sort the data in ascending order.
    2. Calculate the position of the percentile: \[ \text{Position} = 75 \times \left( \frac{8 + 1}{100} \right) = 75 \times 0.09 = 6.75 \] The position 6.75 means that the 75th percentile is located between the 6th and 7th scores.
    3. Interpolate to find the exact value:

      The corresponding values are:

      • 6th value: 68
      • 7th value: 72

      Linear interpolation is done as follows:

      \[ P75 = \text{Value at the lower position} + (\text{Fractional part of the position} \times \text{Difference between the values}) \] \[ P75 = 68 + (0.75 \times (72 - 68)) = 68 + (0.75 \times 4) = 68 + 3 = 71 \]

    Result: The 75th percentile for these data is 71. This means that 75% of the students have a score of 71 or less.

  • Case of grouped data by values

    The logic of the calculation remains the same as in the case of isolated data. The percentile of order α is the first value for which the cumulative percentage exceeds α %. If there is a value for which the cumulative percentage is exactly equal to α %, the percentile is the number located halfway between this value and the next value.

    To calculate the percentile of order α, we use the formula for calculating the median for a discrete quantitative variable.


    Case of grouped data by classes

    In the case of data grouped into classes, we will use the formula for calculating the median as discussed previously.

    The calculation of the percentile α will involve finding the value that exceeds α%.

    To accurately calculate the percentile α %, replace 50% with α % and select the class containing Cα (not the median class).

    $$C_{\alpha}= b_{c\alpha} + \left [\frac {\frac{\alpha}{100}\, n - F_{c\alpha-1}} {F_{c\alpha}}\right] \times L_{c\alpha} $$

    where \(b_{c\alpha}\) is the lower boundary of the class containing \(C_{\alpha}\), \(F_{c\alpha-1}\) the cumulative frequency of the preceding classes, \(F_{c\alpha}\) the frequency of that class, and \(L_{c\alpha}\) its width.

    Data: Suppose we have the following grouped scores for 40 students:

    Classes Frequency (f)
    [ 0 - 10 [ 5
    [ 10 - 20 [ 8
    [ 20 - 30 [ 12
    [ 30 - 40 [ 10
    [ 40 - 50 [ 5
    Σ 40

    We will calculate the 70th percentile (P70).

    Calculation Steps:

    1. Calculate the total number of observations (N): \[ N = \sum f = 40 \]
    2. Calculate the position of the percentile: \[ \text{Position} = \frac{70}{100} \times N = 0.70 \times 40 = 28 \]
    3. Identify the class interval containing the percentile:

      We need to calculate the cumulative frequency until we reach position 28:

      • For the class 0-10: \( F_1 = 5 \)
      • For the class 10-20: \( F_2 = 5 + 8 = 13 \)
      • For the class 20-30: \( F_3 = 13 + 12 = 25 \)
      • For the class 30-40: \( F_4 = 25 + 10 = 35 \)

      The 28th observation falls within the cumulative frequency of 35, corresponding to the interval [30, 40].

    4. Apply the formula to calculate the percentile: \[ P_{70} = b_{c\alpha} + \left [\frac {\frac{70}{100} \times N - F_{c\alpha-1}} {F_{c\alpha}}\right] \times L_{c\alpha} \] where:
      • \( b_{c\alpha} = 30 \) (the lower boundary of the class interval containing the percentile)
      • \( F_{c\alpha-1} = 25 \) (the cumulative frequency before the class interval containing the percentile)
      • \( F_{c\alpha} = 10 \) (the frequency of the class interval containing the percentile)
      • \( L_{c\alpha} = 10 \) (the size of the class interval)
      \[ P70 = 30 + \left( \frac{28 - 25}{10} \right) \times 10 \] \[ P70 = 30 + \left( \frac{3}{10} \right) \times 10 \] \[ P70 = 30 + 0.3 \times 10 \] \[ P70 = 30 + 3 = 33 \]

    Result: The 70th percentile for these grouped data is 33. This means that 70% of the students have a score less than or equal to 33.

    Note: Use of Percentiles

    The percentile rank is a simple statistic to calculate and interpret; however, it can be an inadequate measure when the distribution is not symmetrical, particularly when the sample size is small. The percentile rank does not consider statistical indices (Mean and standard deviation) in its interpretation, making it sensitive to the shape of the data distribution.
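The grouped-data percentile formula above can be scripted as follows; the function name is illustrative, and the example reproduces the table of 40 scores (P70 = 33):

```python
# Minimal sketch of the grouped-data percentile formula.
def grouped_percentile(edges, freqs, alpha):
    """alpha in percent, e.g. 70 for the 70th percentile."""
    n = sum(freqs)
    target = alpha * n / 100            # position of the percentile
    cum = 0
    for i, f in enumerate(freqs):
        if cum + f >= target:           # class containing the percentile
            lower = edges[i]
            width = edges[i + 1] - edges[i]
            return lower + (target - cum) * width / f
        cum += f

edges = [0, 10, 20, 30, 40, 50]
freqs = [5, 8, 12, 10, 5]
print(grouped_percentile(edges, freqs, 70))   # 33.0
```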


    4.2. Rank

    Rank helps determine the position of a single data point. There are usually three types of rank: absolute rank, fifth rank, and percentile rank.

    The absolute rank indicates the position of an observation relative to the extreme observations, the statistical series having been arranged in ascending or descending order. The fifth rank is a number between 1 and 5 indicating which interval a data point falls into when the distribution is divided into five equal parts.

    In our course, we will focus only on the percentile rank.

    Percentile Rank:

    By definition, the percentile rank is the percentage of data points below it. The percentile rank is expressed as an integer with a value between 1 and 99. Determining the percentile rank is the reverse operation of determining the percentile.

    Example and Explanation:

    The following table shows the grouped scores for 40 students:

    Classes Frequency (f)
    [ 0 - 10 [ 5
    [ 10 - 20 [ 8
    [ 20 - 30 [ 12
    [ 30 - 40 [ 10
    [ 40 - 50 [ 5
    Σ 40

    We will calculate the percentile rank for a value of 35.

    Calculation Steps:

    1. Identify the class interval containing the value:

      The value 35 falls within the interval [30, 40[.

    2. Calculate the cumulative frequency before the interval containing the value:
      • For the class 0-10: \( F_1 = 5 \)
      • For the class 10-20: \( F_2 = 5 + 8 = 13 \)
      • For the class 20-30: \( F_3 = 13 + 12 = 25 \)
      • \( F \) for the interval [30, 40[ before 30 is \( F_3 = 25 \)
    3. Formula:

      $$\text{Percentile rank} = \text{integer part of} \left[ \left( \frac{F + \frac{(x - b_{r})}{a_i} \times f_r}{N} \right) \times 100 \right]$$
    4. Apply the formula to calculate the percentile rank, where:
      • \( b_{r} = 30 \) (the lower bound of the class interval containing the value)
      • \( F = 25 \) (the cumulative frequency before the class interval containing the value)
      • \( f_r = 10 \) (the frequency of the class interval containing the value)
      • \( x = 35 \) (the value for which we are calculating the percentile rank)
      • \( a_i = 10 \) (the class interval width)
      • \( N = 40 \) (the total number of observations)
      \[ P = \left( \frac{25 + \frac{(35 - 30)}{10} \times 10}{40} \right) \times 100 \] \[ P = \left( \frac{25 + \frac{5}{10} \times 10}{40} \right) \times 100 \] \[ P = \left( \frac{25 + 0.5 \times 10}{40} \right) \times 100 \] \[ P = \left( \frac{25 + 5}{40} \right) \times 100 \] \[ P = \left( \frac{30}{40} \right) \times 100 \] \[ P = 0.75 \times 100 \] \[ P = 75 \]

    Result: The value of 35 is at the 75th percentile. This means that 75% of the students have a score less than or equal to 35.

    The percentile rank can be determined directly using the ogive.

    Definition II.4.1: The Ogive

    The ogive is a graph that represents the cumulative frequency of the data. It allows us to visualize the cumulative distribution and estimate percentiles or percentile ranks. The horizontal axis \((x)\) represents the values or classes, and the vertical axis \((y)\) represents the cumulative frequency.

    We will calculate the percentile rank for a value of 35 using the ogive.

    Steps to Calculate the Percentile Rank from the Ogive:

    1. Plot the ogive:

      Calculate the cumulative frequency for each class and plot the points corresponding to the upper bounds of each class and their cumulative frequency.

      Figure II.4.1. Ogive of the grouped data.

    2. Determine the percentile rank from the ogive:

      Locate the value 35 on the x-axis. Draw a vertical line from 35 to the ogive. Draw a horizontal line from the intersection to the y-axis to read the percentile rank.

    Percentile Rank Calculation:

    From the ogive, the value of 35 corresponds to a cumulative frequency of 30.

    The percentile rank for a value of 35 is thus:

    \[ P = \left( \frac{30}{40} \right) \times 100 = 75 \]

    Result: The value of 35 is at the 75th percentile. This means that 75% of the students have a score less than or equal to 35.
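The interpolation read off the ogive can also be computed directly. A sketch of the percentile-rank calculation for grouped data, reproducing the result for the value 35 (the function name is illustrative):

```python
# Minimal sketch: percentile rank of a value within data grouped into classes.
def percentile_rank(edges, freqs, x):
    n = sum(freqs)
    cum = 0
    for i, f in enumerate(freqs):
        upper = edges[i + 1]
        if x < upper:                                          # class containing x
            lower = edges[i]
            below = cum + (x - lower) / (upper - lower) * f    # interpolated count below x
            return int(below / n * 100)
        cum += f
    return 100

edges = [0, 10, 20, 30, 40, 50]
freqs = [5, 8, 12, 10, 5]
print(percentile_rank(edges, freqs, 35))   # 75
```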

    4.3. The Z Score

    The Z score allows us to represent the position of an observation relative to the unit of measurement that is the standard deviation.

    By definition, the Z score is the distance between a data point and the mean, expressed in standard deviations.

    Definition II.4.2: The Z Score

    The Z score, also known as the Z value or standardized score, is a statistical measure that indicates how many standard deviations a data point is above or below the mean of the dataset. In other words, the Z score allows us to standardize different values within a dataset, enabling comparisons between data from different distributions or sets.

    Formula:

    The \(Z\) score for a value \(x\) is calculated using the following formula:

    $$ Z = \frac {Value~~of~~the~~data~~ - Mean} {Standard~~Deviation}$$ This formula can be rewritten as: $$ Z = \frac {x - M} { \sigma} $$

    where:

    • \(x\) is the value of the observation;
    • \(M\) (or \(\bar{x}\)) is the arithmetic mean of the population (or sample);
    • \(\sigma\) is the standard deviation of the dataset (sample or population).
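
    As a minimal sketch, the formula translates directly into Python; the helper function below is ours, and scipy.stats.zscore standardizes a whole dataset at once. For illustration we use the first five Research Methodology scores from the example that follows.

      import numpy as np
      from scipy import stats

      def z_score(x, mean, std):
          # Number of standard deviations separating x from the mean
          return (x - mean) / std

      data = [60, 70, 80, 90, 50]
      m, s = np.mean(data), np.std(data)         # population mean and standard deviation
      print(z_score(90, m, s))                   # z-score of a single value (about 1.41)
      print(stats.zscore(data))                  # z-scores of every value at once
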
    Example and Explanation:

    The two tables below show the respective scores of twenty students in two modules: Research Methodology in Human and Social Sciences and Presentation and Data Analysis.

    The goal is to rank the students on their combined results in the two modules, using the mean, variance, and standard deviation of the scores in each module.

    Comparison of Scores with the Z Score
    Student | Research Methodology in Social Sciences
    Student 1 | 60
    Student 2 | 70
    Student 3 | 80
    Student 4 | 90
    Student 5 | 50
    Student 6 | 85
    Student 7 | 75
    Student 8 | 45
    Student 9 | 65
    Student 10 | 55
    Student 11 | 70
    Student 12 | 95
    Student 13 | 65
    Student 14 | 55
    Student 15 | 85
    Student 16 | 75
    Student 17 | 65
    Student 18 | 55
    Student 19 | 60
    Student 20 | 80

    Student | Presentation and Data Analysis
    Student 1 | 65
    Student 2 | 75
    Student 3 | 85
    Student 4 | 95
    Student 5 | 55
    Student 6 | 80
    Student 7 | 90
    Student 8 | 50
    Student 9 | 70
    Student 10 | 60
    Student 11 | 75
    Student 12 | 95
    Student 13 | 65
    Student 14 | 55
    Student 15 | 80
    Student 16 | 70
    Student 17 | 65
    Student 18 | 55
    Student 19 | 60
    Student 20 | 85
    Statistical Calculations
    Research Methodology Module:

    \[ \text{Mean} = \frac{60 + 70 + 80 + 90 + 50 + 85 + 75 + 45 + 65 + 55 + 70 + 95 + 65 + 55 + 85 + 75 + 65 + 55 + 60 + 80}{20} = 70 \]

    \[ \text{Variance} = \frac{\sum (x_i - \mu)^2}{n} = 200 \]

    \[ \text{Standard Deviation} = \sqrt{200} = 14.14 \]


    Presentation and Data Analysis Module:

    \[ \text{Mean} = \frac{65 + 75 + 85 + 95 + 55 + 80 + 90 + 50 + 70 + 60 + 75 + 95 + 65 + 55 + 80 + 70 + 65 + 55 + 60 + 85}{20} = 72.5 \]

    \[ \text{Variance} = \frac{\sum (x_i - \mu)^2}{n} = 206.25 \]

    \[ \text{Standard Deviation} = \sqrt{206.25} = 14.36 \]


    Z Score

    Using the Z-score formula, we will calculate a Z-score for each student in each module; we will then average the two Z-scores (add them and divide by two) and use this average Z-score to rank the students. A Python sketch of the whole procedure follows the ranking table below.

    Student | Z Score (Research Methodology in Social Sciences) | Z Score (Presentation and Data Analysis) | Average Z Score
    Student 1 | -0.71 | -0.52 | -0.62
    Student 2 | 0.00 | 0.17 | 0.08
    Student 3 | 0.71 | 0.87 | 0.79
    Student 4 | 1.41 | 1.57 | 1.49
    Student 5 | -1.41 | -1.22 | -1.32
    Student 6 | 1.06 | 0.52 | 0.79
    Student 7 | 0.35 | 1.22 | 0.78
    Student 8 | -1.77 | -1.57 | -1.67
    Student 9 | -0.35 | -0.17 | -0.26
    Student 10 | -1.06 | -0.87 | -0.97
    Student 11 | 0.00 | 0.17 | 0.08
    Student 12 | 1.77 | 1.57 | 1.67
    Student 13 | -0.35 | -0.52 | -0.44
    Student 14 | -1.06 | -1.22 | -1.14
    Student 15 | 1.06 | 0.52 | 0.79
    Student 16 | 0.35 | -0.17 | 0.09
    Student 17 | -0.35 | -0.52 | -0.44
    Student 18 | -1.06 | -1.22 | -1.14
    Student 19 | -0.71 | -0.87 | -0.79
    Student 20 | 0.71 | 0.87 | 0.79

    Student Ranking

    After calculating the average Z-scores, we can rank the students from the highest to the lowest average Z-score.

    Rank | Student | Average Z-Score
    1 | Student 12 | 1.67
    2 | Student 4 | 1.49
    3 | Student 3 | 0.79
    4 | Student 6 | 0.79
    5 | Student 15 | 0.79
    6 | Student 20 | 0.79
    7 | Student 7 | 0.78
    8 | Student 16 | 0.09
    9 | Student 2 | 0.08
    10 | Student 11 | 0.08
    11 | Student 9 | -0.26
    12 | Student 13 | -0.44
    13 | Student 17 | -0.44
    14 | Student 1 | -0.62
    15 | Student 19 | -0.79
    16 | Student 10 | -0.97
    17 | Student 14 | -1.14
    18 | Student 18 | -1.14
    19 | Student 5 | -1.32
    20 | Student 8 | -1.67

    Table II.4.1. Student ranking according to their average Z-Scores
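
    The whole procedure can be reproduced with the short NumPy sketch below (standardize each module, average the two Z-scores per student, then rank in descending order). The scores are those listed in the two tables above; since the means and standard deviations are recomputed directly from these raw scores, the results may differ somewhat from the rounded values printed in the tables.

      import numpy as np

      methodology = [60, 70, 80, 90, 50, 85, 75, 45, 65, 55,
                     70, 95, 65, 55, 85, 75, 65, 55, 60, 80]
      data_analysis = [65, 75, 85, 95, 55, 80, 90, 50, 70, 60,
                       75, 95, 65, 55, 80, 70, 65, 55, 60, 85]

      def z_scores(scores):
          scores = np.asarray(scores, dtype=float)
          return (scores - scores.mean()) / scores.std()   # population standard deviation

      average_z = (z_scores(methodology) + z_scores(data_analysis)) / 2

      # Rank the students by decreasing average Z-score
      for rank, idx in enumerate(np.argsort(-average_z), start=1):
          print(f"Rank {rank}: Student {idx + 1} (average Z = {average_z[idx]:.2f})")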

    Use of Parameters:
    • Quantiles provide a detailed view of the data distribution and are less sensitive to outliers; however, they can be less intuitive to understand and interpret, and they require more involved calculations for large data sets;

    • The Percentile Rank (and percentiles in general) allows individual values to be compared with the rest of the data set; it is also useful for analyzing the distribution and identifying extreme values. On the other hand, it can be influenced by the sample size, and it requires ranking calculations and interpolation for non-uniform data sets;

    • The Z-Score (also called the standard score or Z-value) standardizes different distributions so that they can be compared; it indicates the relative position of a value with respect to the mean, in units of standard deviation, and is useful for detecting outliers. It requires knowing the mean and standard deviation of the data set and is less intuitive for non-specialists; the short sketch below contrasts a percentile rank and a Z-score for the same value.
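
    A minimal sketch contrasting these two position measures on the same small, made-up dataset, using scipy.stats.percentileofscore for the percentile rank and the Z-score formula seen above:

      import numpy as np
      from scipy import stats

      data = [45, 50, 55, 55, 60, 65, 65, 70, 75, 80, 85, 90, 95]
      value = 75

      # Percentile rank: share of observations less than or equal to the value
      p_rank = stats.percentileofscore(data, value, kind="weak")

      # Z-score: distance from the mean in standard deviations
      z = (value - np.mean(data)) / np.std(data)

      print(f"Percentile rank of {value}: {p_rank:.0f}")
      print(f"Z-score of {value}: {z:.2f}")
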
    Explore the Discrete Data Editor

    Discover the data editor for discrete quantitative variables. Click the link below to try entering table data to calculate statistical parameters. Learn and master the basics interactively and playfully.

    Access the Editor

    All editors are accessible in the Annex section of this Course.


    Explore the Continuous Data Editor

    Discover the data editor for continuous quantitative variables. Click the link below to try entering class data from a table to calculate statistical parameters. Learn and master the basics interactively and playfully.

    Access the Editor

    All editors are accessible in the Annex section of this Course.


    Summary

    In this course, we have just covered the various indices that allow us to describe a data set.

    Central tendency indices are present in most documents related to data analysis. Central tendency indicators can be seen as a first approach to understanding the overall information that defines the identity of our population or survey sample.
    Central tendency parameters also help guide the future analysis of our data. Therefore, it is important to grasp their significance:

    • Central tendency indices are important for gaining a general overview of data analysis;
    • The Mode is the simplest index to calculate, providing information on the most frequent occurrence in our sample;
    • The Median is an index that tells us the value located at the middle of our frequency distribution;
    • The Mean gives the average value of the observations, taking their frequencies into account (a short check with Python's standard statistics module follows this list).
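
    As a quick illustration, Python's standard statistics module computes these three indices directly on a small, made-up series:

      import statistics

      data = [5, 7, 7, 8, 9, 10, 12]              # small illustrative series

      print(statistics.mode(data))                # 7: the most frequent value
      print(statistics.median(data))              # 8: the middle value of the ordered series
      print(statistics.mean(data))                # about 8.29: the sum divided by the number of values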

    In this course, we have also covered how to calculate and interpret measures of dispersion.

    Measures of dispersion, when associated with measures of central tendency, provide a preliminary approach to analyzing our survey data. It is crucial to master the process:

    • Dispersion parameters help us understand what happens around the mean;
    • The Range is the difference between the highest and lowest values in a statistical series;
    • The variance of a variable is equal to the mean of the squared deviations between the values of the variable and the mean;
    • The standard deviation is the square root of the variance (a short NumPy check of these measures follows this list).
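
    The sketch below checks these three dispersion measures with NumPy on the same illustrative series (np.ptp returns the range, i.e. the maximum minus the minimum):

      import numpy as np

      data = [5, 7, 7, 8, 9, 10, 12]

      print(np.ptp(data))                         # range: 12 - 5 = 7
      print(np.var(data))                         # variance: mean of the squared deviations from the mean
      print(np.std(data))                         # standard deviation: square root of the variance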

    Univariate analysis will also involve the interpretation of measures of position. These allow us to know and identify the exact location of an observation in our statistical series:

    • Quantiles help us understand the distribution of the data by dividing it into equal segments; they are used to detect outliers by comparing extreme values, and they facilitate comparison between different data distributions by providing uniform benchmarks;
    • Ranks provide a simple and intuitive understanding of the relative position of an observation in a dataset; they are used in many non-parametric statistical tests (such as the Wilcoxon test and the Kruskal-Wallis test) that do not require the assumption of normality of the data;
    • The Z-score allows different data distributions to be standardized, and thus compared even if they have different scales; it is useful for converting different score scales into a common scale, making comparison easier (a short sketch of these three measures follows this list).
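
    A minimal sketch of these three position measures with NumPy and SciPy on the same illustrative series; scipy.stats.rankdata assigns the ranks used by the non-parametric tests mentioned above:

      import numpy as np
      from scipy import stats

      data = [5, 7, 7, 8, 9, 10, 12]

      print(np.percentile(data, [25, 50, 75]))    # quartiles Q1, Q2 (the median) and Q3
      print(stats.rankdata(data))                 # ranks (tied values share an average rank)
      print(stats.zscore(data))                   # Z-score of each value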

    Block Bibliography

    The course does not have a final bibliography (in its online version); references are inserted at the end of each block.

    • Agresti, A., Franklin, C., & Klingenberg, B. (2023). Statistics: The art and science of learning from data (5th ed.). Pearson.
    • Bluman, A. G. (2023). Elementary statistics: A step by step approach, a brief version (with extra additional topics) (8th ed.). McGraw-Hill Education.
    • Brase, C. H., Brase, C. P., Dolor, J., & Seibert, J. (2023). Understandable statistics: Concepts and methods (13th ed.). Cengage Learning.
    • Carroll, S. R., & Carroll, D. J. (2023). Simplifying statistics for graduate students. Rowman & Littlefield Publishers.
    • Field, A. (2024). Discovering statistics using IBM SPSS statistics (6th ed.). SAGE Publications Ltd.
    • James, G., Witten, D., Hastie, T., Tibshirani, R., & Taylor, J. (2023). An introduction to statistical learning: With applications in Python (Springer Texts in Statistics) (1st ed.). Springer.
    • Kahl, A. (2023). Introductory statistics. Bentham Science Publishers.
    • Larson, R. (2023). Elementary statistics: Picturing the world (8th ed.). Pearson.
    • Otsuka, J. (2023). Thinking about statistics: The philosophical foundations. Routledge.
    • Yau, N. (2024). Visualize this: The FlowingData guide to design, visualization, and statistics (2nd ed.). Wiley.
    Synthesis Questions

    The following questions allow you to review the knowledge discussed during the Block. We will have a discussion during the tutorial sessions.

    • What is the arithmetic mean, and how is it calculated?
    • What are the limitations of using the mean as a measure of central tendency?
    • How does the median differ from the mean, and in what situations is it preferred?
    • What is the mode, and when is it used as a measure of central tendency?
    • How do you interpret the standard deviation as a measure of dispersion?
    • Compare and contrast variance and standard deviation in terms of use and interpretation.
    • What are quartiles, and how are they used to describe the position of data?
    • Explain the importance of outliers in interpreting measures of central tendency and dispersion.
    • How are measures of central tendency, dispersion, and position used together to describe a data distribution?

    M.C.Q.

    The MCQ consists of twelve questions covering certain parts of the course. At the end, you will receive your evaluation as well as the answer key.

    To access the MCQ, click on the Quizizz icon below.

    Course Notes & Tutorials

    In this section, you will be able to download notes related to the current course.

    Sheet 1 Python Libraries: In this note, you will get to know the Python libraries dedicated to data analysis (Pandas, NumPy, Matplotlib). These libraries will help you create diagrams and calculate univariate statistical parameters. Click HERE to download the note.

    Going Further

    To further your learning from this Block, you can consult the following links:

    • Book
      A highly interesting book that encourages reflection on the use of statistics and data analysis in the study of social phenomena: Eyraud, C. (2024). *Les données chiffrées en sciences sociales: Du matériau brut à la connaissance des phénomènes sociaux.* Armand Colin. [available for free by logging into the university account].

    • Book
      This book is a collection of tutorials ranging from basic concepts to more advanced exercises that are part of our Course program: Monino, J. (2017). *TD de statistique descriptive.* Dunod. https://doi.org/10.3917/dunod.monin.2017.01. [available for free by logging into the university account].

    • Video
      A link to a YouTube channel that explains the basics of descriptive statistics in various episodes:

    On the Course App

    On the Course App, you will find a summary of this Block, as well as related tutorial series.
    There are also links to multimedia content relevant to the Block.
    In the Notifications section, an update is planned based on the questions raised by students during lectures and tutorials.
    An update also covers exams from previous sessions, which will be reviewed in tutorial sessions to prepare for the current year's exams.

    Python Corner

    In this Python Corner, you will learn how to calculate the descriptive statistics parameters covered in the Course, and then how to create the corresponding charts and diagrams.

    Below you will find the data for the three types of variables, which you can copy and paste into the online Python editor, Trinket.

    The explanations are contained in the booklet that you can download from the Course Notes & Tutorials section above; it details what you need to master to calculate univariate statistical parameters.

    Data for a Qualitative Characteristic
      [ "Red", "Blue", "Green", "Yellow", "Red", "Blue", "Green", "Yellow", "Red", "Blue",
      "Green", "Yellow", "Red", "Blue", "Green", "Yellow", "Red", "Blue", "Green", "Yellow",
      "Red", "Blue", "Green", "Yellow", "Red", "Blue", "Green", "Yellow", "Red", "Blue",
      "Green", "Yellow", "Red", "Blue", "Green", "Yellow", "Red", "Blue", "Green", "Yellow",
      "Red", "Blue", "Green", "Yellow", "Red", "Blue", "Green", "Yellow", "Red", "Blue" ]
          
    Data for a Discrete Quantitative Characteristic
      [ 5, 7, 9, 12, 5, 8, 6, 10, 15, 8,
      7, 11, 13, 14, 5, 6, 9, 7, 10, 12,
      11, 8, 6, 13, 14, 15, 7, 8, 9, 10,
      11, 12, 13, 14, 15, 6, 5, 8, 9, 7,
      12, 11, 10, 9, 6, 7, 8, 11, 13, 14 ]
          
    Data for a Continuous Quantitative Characteristic
      [ 5.2, 7.5, 9.1, 12.3, 5.8, 8.4, 6.9, 10.2, 15.6, 8.1,
      7.7, 11.5, 13.4, 14.2, 5.9, 6.1, 9.3, 7.8, 10.6, 12.4,
      11.9, 8.7, 6.5, 13.1, 14.7, 15.4, 7.2, 8.5, 9.7, 10.9,
      11.3, 12.1, 13.9, 14.6, 15.1, 6.2, 5.4, 8.6, 9.8, 7.1,
      12.7, 11.4, 10.3, 9.5, 6.7, 7.9, 8.8, 11.6, 13.2, 14.9 ]
          
    List of Python Commands for Descriptive Statistics

    The following list contains the most commonly used Python commands for calculating descriptive statistical parameters and creating diagrams. As mentioned earlier, the booklet contains more details and explanations about the use of libraries and related commands.

    In the next session, we will see how to import your data directly from other formats.

    Calculation of Univariate Statistical Parameters
    Each entry below gives the parameter, the corresponding commands, and a short explanation.

    Mean
      import numpy as np
      data = [5.2, 7.5, ...]
      mean = np.mean(data)
      print(f"The mean of the data is {mean:.2f}")
    Importing the NumPy library and calculating the mean of the data.

    Median
      import numpy as np
      data = [5.2, 7.5, ...]
      median = np.median(data)
      print(f"The median of the data is {median:.2f}")
    Importing the NumPy library and calculating the median of the data.

    Mode
      from scipy import stats
      data = [5.2, 7.5, ...]
      mode = stats.mode(data, keepdims=False).mode  # scalar mode (SciPy 1.9 or later)
      print(f"The mode of the data is {mode}")
    Importing the SciPy library and calculating the mode of the data.
    Standard Deviation
      import numpy as np
      data = [5.2, 7.5, ...]
      std_dev = np.std(data)
      print(f"The standard deviation of the data is {std_dev:.2f}")
    Importing the NumPy library and calculating the standard deviation of the data.

    Variance
      import numpy as np
      data = [5.2, 7.5, ...]
      variance = np.var(data)
      print(f"The variance of the data is {variance:.2f}")
    Importing the NumPy library and calculating the variance of the data.

    Quartiles
      import numpy as np
      data = [5.2, 7.5, ...]
      quartiles = np.percentile(data, [25, 50, 75])
      print(f"The quartiles of the data are {quartiles}")
    Importing the NumPy library and calculating the quartiles of the data.

    Deciles
      import numpy as np
      data = [5.2, 7.5, ...]
      deciles = np.percentile(data, np.arange(10, 100, 10))  # 10, 20, ..., 90
      print(f"The deciles of the data are {deciles}")
    Importing the NumPy library and calculating the deciles of the data.
    Two Parameters
      import numpy as np
      data = [5.2, 7.5, ...]
      mean = np.mean(data)
      median = np.median(data)
      print(f"The mean is {mean:.2f}, the median is {median:.2f}")
    Importing the NumPy library and calculating both the mean and the median of the data.

    Three Parameters
      import numpy as np
      data = [5.2, 7.5, ...]
      mean = np.mean(data)
      median = np.median(data)
      std_dev = np.std(data)
      print(f"The mean is {mean:.2f}, the median is {median:.2f}, the standard deviation is {std_dev:.2f}")
    Importing the NumPy library and calculating the mean, median, and standard deviation of the data.
    Pie Chart
      import matplotlib.pyplot as plt
      labels = ['A', 'B', 'C']
      sizes = [15, 30, 45]
      plt.pie(sizes, labels=labels)
      plt.show()
    Importing the Matplotlib library and creating a pie chart.

    Bar Chart
      import matplotlib.pyplot as plt
      labels = ['A', 'B', 'C']
      sizes = [15, 30, 45]
      plt.bar(labels, sizes)
      plt.show()
    Importing the Matplotlib library and creating a bar chart.

    Stem Plot
      import matplotlib.pyplot as plt
      labels = ['A', 'B', 'C']
      sizes = [15, 30, 45]
      plt.stem(labels, sizes)
      plt.show()
    Importing the Matplotlib library and creating a stem plot.
    Histogram
      import matplotlib.pyplot as plt
      data = [5.2, 7.5, ...]
      plt.hist(data, bins=10)
      plt.show()
    Importing the Matplotlib library and creating a histogram.

    Frequency Polygon
      import matplotlib.pyplot as plt
      import numpy as np
      data = [5.2, 7.5, ...]
      counts, bins = np.histogram(data, bins=10)
      bin_centers = 0.5 * (bins[:-1] + bins[1:])
      plt.plot(bin_centers, counts, '-o')
      plt.show()
    Importing the Matplotlib library and creating a frequency polygon.
    Cumulative Frequency Curve
      import matplotlib.pyplot as plt
      import numpy as np
      data = [5.2, 7.5, ...]
      counts, bins = np.histogram(data, bins=10)
      cum_counts = np.cumsum(counts)  # cumulative frequencies
      plt.plot(bins[1:], cum_counts, '-o')
      plt.show()
    Importing the Matplotlib library and creating a cumulative frequency curve (ogive).

    Box Plot
      import matplotlib.pyplot as plt
      data = [5.2, 7.5, ...]
      plt.boxplot(data)
      plt.show()
    Importing the Matplotlib library and creating a box plot.

    Scatter Plot
      import matplotlib.pyplot as plt
      x = [5.2, 7.5, ...]
      y = [7.5, 8.6, ...]
      plt.scatter(x, y)
      plt.show()
    Importing the Matplotlib library and creating a scatter plot.
    Download the Course

    Using the link below, you can download the Flipbook in PDF format.

    Discussion Forum

    The forum allows you to discuss this first session. You will notice the presence of a subscription button so you can follow discussions about research in humanities and social sciences. It is also an opportunity for the instructor to address students' concerns and questions.