This first session of the second Block focuses on the analysis (numerical and visual representations) of the data distribution. This set of representations constitutes one of the elementary operations in data analysis (as well as in descriptive statistics), whose goal is to describe the data distribution using a set of indicators.
Organizing and representing data is the step that follows the data collection process. The available data must undergo a series of transformations to make them more readable and consequently exploitable for statistical analysis purposes.
In descriptive statistics, the first step of the analysis involves examining variables one by one, hence the term univariate analysis. Univariate analysis proceeds in two complementary ways: data representation (construction of distribution tables and graphs) and the calculation of descriptive measures (central tendency, position, and dispersion). Numerical and graphical representations of the distribution provide a global view of the studied phenomenon and its general tendency. A number of indices are then calculated to summarize the mass of information contained in the distribution, allowing for interpretation according to the hypotheses of our research.
In a research project, it is not always necessary to take into account various statistical indicators to fully account for the studied phenomenon. The experience of research, understood in a broad sense, will provide informed insight regarding the decisions to be made about the elements to present in our data analysis.
Descriptive statistics are a set of indices aimed at expressing in numbers the main characteristics of the data distribution, with the objective of interpreting them. Univariate descriptive statistics can be grouped into three main categories: central tendency measures, position measures, and dispersion measures. Other indices can also be calculated, and we will provide examples at the end of this section.
This session and the following two (measures of dispersion and measures of position) focus on characterizing the frequency distribution obtained after grouping raw data. The grouping and sorting of raw data provide an overall picture of the distribution, and the measures aim to describe the distribution's characteristics. Data analysis relies on a language of indices to summarize the essential information contained in a data set. Statistical indices offer an overview of the studied data.
Determining the parameters of central tendency helps us understand what happens in the middle of our data distribution. However, two or more data sets with identical central tendency indices can be distributed very differently: observations can lie very close to, or very far from, the central values. We therefore cannot rely solely on central values to understand and analyze a data set; we also need parameters that describe the dispersion of the data around the center. Dispersion measures can thus be considered complementary to those that describe what happens at the center of the distribution. Finally, position measures provide the specific position of a data point within the distribution.
During the three sessions that make up this course, you will be guided by a Guide Example, which is a fictional survey about the use of the social network Facebook. This example will help you understand, using its data, how to calculate and interpret univariate analysis data. To access the Guide Example, click on the button located to the right of this main text.
In this course, you will find two types of editors and commands that allow you to work with the Python language.
The editors: Graph Editor and Data Editor. Two types of Graph Editors have been integrated, one for representing a qualitative variable and another for plotting diagrams for quantitative variables. To calculate the statistical indices that form the basis of univariate analysis, two types of Data Editors have also been incorporated. These will allow you to calculate the parameters discussed in this course using the data you input.
At the end of this course, you will find a downloadable Python Command Sheet, which will help you better use the language to calculate the indicators covered during the course and build the appropriate graphs. You can rely on the GSE (Google Search Engine) located in the navigation bar for further explanations, and additional guidance is provided in the 'For Further Learning' section.
During this session, we aim to achieve the following objectives:
Data representation, statistical table, diagram, graph, table typology, diagram typology, measures of central tendency, mode, median, mean, modal class, uniform distribution, unimodal distribution, bimodal distribution, multimodal distribution, median class, range, variance, mean deviation, standard deviation, coefficient of variation, rank, middle rank, percentile rank (deciles – percentiles – quartiles), z-score, T-score.
Raw Data. To perform a univariate analysis, one must have raw data describing the characteristics of the population (or sample) concerned. Raw data are untreated data, and compiling them into a data distribution (also called a statistical series) constitutes the first step in the presentation, treatment, and descriptive analysis of the data.
In order to establish a data distribution, the frequencies that make up each category (or value) of the studied variable must be determined.
When raw data is organized into a table, we obtain what is called a data table.
In social sciences and humanities, tables are essential tools for research. They serve certain functions that allow for a better understanding of the studied phenomenon.
There are three types of statistical tables: data tables (for now, we will limit ourselves to an introduction to the nature of these tables; the student will find more explanations in the section dedicated to computer data processing using a dedicated program), variable distribution tables, and contingency tables, which are relevant to this chapter and will be addressed in the following sections.
In research, data tables are the first to be constructed. They are used for data processing, commonly referred to as Flat Sorting. A data table is extensive, and each cell provides information that characterizes the subject.
A data table primarily consists of two margins: the list of subjects and the list of variables. The list of subjects is a numerical or alphabetical list used to enumerate and identify the subjects under study: the questionnaire numbers in our case. The list of variables provides information about the characteristics of the considered variables: VAR001, VAR002, etc. Thus, the data table constitutes an enumerative list of raw data collected from a large set of subjects.
A variable distribution table is constructed after extracting one or more columns from the data table.
A variable distribution table is a juxtaposition table; it faithfully reproduces the content of one or more columns from the data table taken separately.
A variable distribution table generally consists of three margins:
The table below illustrates the concept of a variable distribution table using our example [the mechanism is the same for all types of variables]:
Age (years) | \(n_i\) | \(\%\) |
---|---|---|
20 | 40 | 40 |
21 | 20 | 20 |
22 | 15 | 15 |
23 | 10 | 10 |
24 | 15 | 15 |
\(\sum\) | 100 | 100 |
The preliminary creation of a table simplifies the construction of the graph. The choice of a type of graph depends on the nature of the variable, its measurement scale, and the type of data grouping performed.
There are two types of graphs for representing a qualitative variable: bar chart and pie chart.
A bar chart has two perpendicular axes. On the horizontal axis (the axis of the variable's categories), the categories of the variable are represented by segments of equal width, ensuring they are separated by equal spaces. On the vertical axis (number of units, percentages), the frequencies (or percentages) are plotted. For each segment associated with a category, a rectangle is constructed with a height proportional to the frequency (or percentage) of the category, according to an appropriate scale.
A pie chart is a chart consisting of a circle divided into sectors, each sector having a central angle proportional to the quantity represented (Dodge, 2007, 129-130). Pie charts are mainly used to present data that, when combined, form a whole.
Consider the following frequency table [taken from our example]:
Gender | \(n_i\) |
---|---|
Male | 60 |
Female | 40 |
\(\sum\) | 100 |
We obtain the following results for the two types of graphical representations associated with it:
Figure II.1.1. Column chart of the distribution of respondents by gender.
Figure II.1.2. Pie chart of the distribution of respondents by gender.
Discover the chart editor for qualitative variables. Click the link below to try introducing modalities and data. Learn and master the basics interactively and playfully.
Access the Editor. All editors are accessible in the Appendix of this course.
The representation and processing of a quantitative variable is more complex than that of a qualitative variable. The graphical representation of a quantitative variable mainly depends on two parameters: the number of observations (relative to the population N, or the sample n), and the number of values that the studied variable can take (regardless of whether it is discrete or continuous).
The quantitative variable falls into three main categories of data: isolated data, data grouped by values, and data grouped by class.
We refer to isolated data when the size (N) of the population is less than 20 units. This represents a small amount of data. Note that this condition has no theoretical basis; it is based on practical experience.
In the case of isolated data, constructing a frequency distribution table has no particular value; the data are simply arranged in ascending order.
Data is said to be grouped by values when the number of distinct values of the variable is low compared to the size of the population N (or the sample size n), and the latter is greater than 20.
The treatment of data grouped by values remains the same as for qualitative variables. Constructing a frequency distribution table follows the same logic, with one exception: the modality column is replaced by a values column.
The numerical presentation of a discrete quantitative variable is thus done using a variable distribution table where the first column contains the values of the variable, the second column is the frequency column. If necessary, a third column for percentages can be added.
The following table presents the distribution of a discrete quantitative variable.
Number of Siblings | Frequency | \(\%\) |
---|---|---|
0 | 10 | 10 |
1 | 17 | 17 |
2 | 40 | 40 |
3 | 20 | 20 |
4 | 7 | 7 |
5 | 6 | 6 |
\(\sum\) | 100 | 100 |
A discrete quantitative variable is represented, when the data are grouped by distinct values, using a bar chart. A bar chart consists of two perpendicular axes: on the horizontal axis, the various values of the variable are plotted, and on the vertical axis, the corresponding frequencies (or percentages) are plotted. Perpendicular to the value axis, and opposite each value, a straight line segment, called a bar, is drawn, with a length proportional to the frequency or percentage of the value.
Discover the chart editor for discrete quantitative variables. Click the link below to try introducing values and data. Learn and master the basics interactively and playfully.
Access the Editor. All editors are available in the Appendix section of this course.
Data are grouped into classes when the number of values of the variable is close to N or n (and n is greater than or equal to 20). In the case of highly variable data, it becomes very difficult (if not impossible) to treat them as isolated values, and constructing a frequency distribution table for the variable becomes unnecessary because we would end up with a large number of frequencies equal to 1. It is therefore appropriate to group them into classes while adhering to certain principles.
To group such data, they should be included in classes.
The numerical presentation of data grouped into classes consists of a distribution table with one column for the classes and a second column for the frequencies.
The following table represents the distribution of class-grouped data for the variable "age" from our example.
Age (Years) | Frequency | \(\%\) |
---|---|---|
[20-21[ | 40 | 40 |
[21-22[ | 20 | 20 |
[22-23[ | 15 | 15 |
[23-24[ | 10 | 10 |
[24-25[ | 15 | 15 |
\(\sum\) | 100 | 100 |
Note: For the purposes of other statistical index calculations, it may be useful to add columns to the previous table, especially for calculating amplitudes, class centers, etc.
A frequency distribution of data grouped into classes can be represented in two ways: the histogram and the frequency polygon.
The histogram is a graphical representation of the distribution of quantitative data. It consists of vertical bars whose height is proportional to the frequency or proportion of values in each class interval.
The frequency polygon is a broken line that connects the tops of the histogram bars. It allows for a continuous visualization of the data distribution and highlights trends and variations.
The graphical representation of the variable age in our example provides the following results:
Discover the graph editor for continuous quantitative variables. Click the link below to try adding classes and data. Learn and master the basics interactively and playfully.
Access the Editor. All editors are accessible in the Appendix section of this course.
Note: For continuous quantitative variables, we typically calculate several additional indices, such as the class width (amplitude), the class centers, and the cumulative frequencies (both increasing and decreasing). The next session, which covers measures of central tendency, will revisit these calculations.
Central tendency measures aim to highlight the center of the frequency distribution. The measures of central tendency are: mode, median, and mean.
Note: In this session, we will present the main measures of central tendency used in the analysis of data in the humanities and social sciences. This presentation is not arbitrary, as it will be used to study, in a more practical manner, the interpretation of the data contained in the first learning booklet of the analysis software.
At the end of this section, you will find the Spreadsheet that allows you to calculate all central tendency, dispersion, and position parameters. The same spreadsheet is available in the Appendix section of this course.
The mode (denoted \(M_O\)) is the simplest measure of central tendency to understand.
Note: The mode is the only measure of central tendency that can be evaluated regardless of the nature of the variable. For a qualitative variable, calculating the median or mean is not meaningful.
The mode represents the modality (or value) with the highest frequency. When a data series has two modalities with the highest frequency, it is called a Bimodal series.
In our survey, for the variable Gender (VAR001), the Mode is: Male, as it is the most represented modality in terms of frequency (60%).
For the variable number of siblings, the Mode is: 2 (siblings), which accounts for 40% of the total frequency.
Calculating the Mode for a Qualitative Variable
For a qualitative variable, the mode represents the most frequent modality in the frequency distribution.
Visually, the modality is represented by the tallest bar in the bar chart or the largest sector in the pie chart.
Example:
In our survey, we have already plotted two charts representing the variable Gender. From the pie chart or bar chart representation, we can see that the Mode of our series is the gender: Male
Calculating the Mode for a Quantitative Variable
If the variable being studied is quantitative, the mode represents the most frequent value in the statistical series.
Depending on the type of data, the mode can be directly calculated or estimated.
In the case of isolated data, the mode is the value with the greatest number of occurrences.
In the following series: 1, 1, 1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 6, 6. The Mode \(M_o\) is the number 4 because it has the highest frequency (it repeats the most).
The definition remains the same as before: the Mode is the value with the highest frequency (percentage) in the distribution table. Graphically, it is represented by the tallest segment in the bar chart.
From our SPSS output, we can establish the following table for the variable Number of Siblings:
Figure II.2.2. Distribution of the sample by number of siblings
In the bar chart of the distribution, we can also see that the tallest bar represents a number of siblings equal to: 2
Table II.2.1. Distribution of the sample by number of siblings
In the case of grouped data, it is not possible to determine a unique value for the Mode, but the modal class can be identified.
Without knowing the exact value of the Mode, the class center \( (c_i) \) is generally used as an estimated value for the Mode.
The class center represents the central point of a class in a frequency distribution
To calculate the class center, use the following formula:
\(\text{Class Center} = \frac{\text{Lower Bound} + \text{Upper Bound}}{2}\)
Where:
Lower Bound: the lower limit of the class,
Upper Bound: the upper limit of the class.
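For instance, applying this formula to the first class of the age table used in this course gives:

$$c_1 = \frac{20 + 21}{2} = 20.5$$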
Some authors calculate the exact value of the Mode using a formula. For this course, it is not essential to perform this calculation.
Visually, the Mode can be identified in the histogram; it corresponds to the class center of the tallest bar in the histogram.
The following example represents the histogram of the age variable from our dataset.
Figure II.2.3. Histogram of the age variable
We observe that the modal class is the one ranging from 20 to 22 years, which means that the mode of our series is 21 (the center of this class).
A distribution that does not have a visually apparent mode is called a uniform distribution.
The Median (denoted \(M_d\)) is another measure of central tendency that is of particular interest to researchers. The median is the value (or modality) that divides the data into two equal parts.
In our example, for the variable Study Level (VAR003), the median is: second year, since 55% (more than 50%, i.e., half) of the surveyed students are in the first or second year.
To evaluate or calculate the median, the data must be ordered. In the case of a nominal qualitative variable, such an operation cannot be performed; the median only makes sense for an ordinal qualitative variable or a quantitative variable.
In the case of isolated quantitative data, the median is the central data if the number of observations is odd. If the number of observations is even, the median will be at the midpoint between the two central values, as indicated in the following two formulas:
Odd \(N\) (or \(n\)): $$M_d = \left(\frac{n+1}{2}\right)^{\text{th}} \text{ data point}$$
Example:
In the following series: 1, 1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 5, we observe that there are 13 observations. Since \(n\), the number of observations, is odd, the median is 2, the value in the seventh position, as shown below:
Even \(N\) (or \(n\)): $$M_d = \text{the midpoint between the } \left(\frac{n}{2}\right)^{\text{th}} \text{ and } \left(\frac{n}{2}+1\right)^{\text{th}} \text{ data points}$$
Example:
In the previous series, we will add a number, for example, 6: 1, 1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 5, 6. We now observe that there are 14 observations. Since \(n\), representing the number of observations, is an even number, we choose the median as the midpoint between the 7th and 8th observations, as shown below:
Therefore, we say that the median is: 2.5.
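As a quick check, both medians can be computed with Python's built-in `statistics` module (the variable names below are ours):

```python
import statistics

odd_series = [1, 1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 5]   # 13 observations
even_series = odd_series + [6]                           # 14 observations

print(statistics.median(odd_series))    # 2   (the 7th value)
print(statistics.median(even_series))   # 2.5 (midpoint of the 7th and 8th values)
```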
The definition remains the same as for an ordinal qualitative variable. To determine the median, first calculate the cumulative percentages of the data distribution; then identify the first value where 50% or more of the cumulative percentages are represented.
Note: When the cumulative percentage of a value corresponds exactly to 50%, the median will be the number located midway between the value with a 50% cumulative percentage and the next value.
Example:
In our example guide, we will calculate the Median for the variable number of siblings, resulting in the following:
Figure II.2.4. Median value of the number of siblings variable, SPSS calculation
To determine the median in the case of data grouped by classes, the following steps are generally followed: compute \(n/2\); calculate the cumulative frequencies; and identify the median class, i.e., the first class whose cumulative frequency reaches or exceeds \(n/2\).
Then, apply the following formula, where \(b_{md}\) is the lower boundary of the median class, \(F_{md-1}\) the cumulative frequency of the preceding class, \(f_{md}\) the frequency of the median class, and \(L_{md}\) its width: $$M_d = b_{md} + \left[\frac{\frac{n}{2} - F_{md-1}}{f_{md}}\right] \times L_{md}$$
Note: The same formula can be applied by replacing frequencies with percentages.
In our example, we calculate the median for the age variable as follows:
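A worked sketch using the age table of our example and the formula above: \(n/2 = 50\), the cumulative frequencies are 40, 60, 75, 85, 100, so the median class is [21-22[ and

$$M_d = 21 + \frac{50 - 40}{20} \times 1 = 21.5 \text{ years}$$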
The mean is one of the most commonly used measures of central tendency in human sciences research.
The term Mean encompasses several types: geometric mean, quadratic mean, etc. In our case, we use the term Mean to refer to the arithmetic mean calculated in data analysis.
The arithmetic mean is the ratio of the sum of the values (weighted by their frequencies) to the number of observations (N).
In our example, for the variable number of siblings (VAR004), the arithmetic mean is equal to 2.15, i.e., about 2 siblings.
Note: We speak of the simple (unweighted) arithmetic mean when dealing with a variable whose data are few in number and do not require weighting. This form is rarely, if ever, used in research.
When calculating the population mean, we use the notation \(\mu_x\) (read as mu sub x), while the notation \(\bar x\) (read as x bar) is used for a sample. This distinction is important to remember, as it will be needed during inference operations.
Generally, and as with the median, the mean cannot be calculated for a nominal qualitative variable, but it can be for an ordinal qualitative variable if certain conditions are met. The calculation of the mean is especially meaningful for a quantitative variable.
In simple terms, the mean is given by the following relationship:
For a population: $$\mu_x = \frac{\sum_{i=1}^{N} x_i}{N}$$
For a sample: $$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$
In the case of an ordinal qualitative variable, the mean can be calculated provided that the variable's modalities have been previously coded. The definition of the mean calculation remains the same as mentioned above.
The calculation of the mean for a quantitative variable follows the same logic described above. However, there is a difference when the considered variable is a continuous quantitative variable, which involves using class centers as in the calculation of the median discussed previously.
Calculation of the Mean for Isolated Data
For isolated data, the mean is calculated by summing all the values of the variable and dividing by their number, as expressed in the following formula:
For the population: $$\mu_x = \frac{x_1 + x_2 + \dots + x_N}{N} = \frac{\sum x_i}{N}$$
For the sample: $$\bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n} = \frac{\sum x_i}{n}$$
Example: Consider the following series: 1, 3, 4, 5, 5, 6, 6, 7, 8, 9, 11, 12, 13, 30, 30.
Applying the definition of the arithmetic mean, we get the following result: \(\bar{x} = \frac{150}{15} = 10\).
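The same calculation can be checked in Python (the variable name is ours):

```python
import statistics

scores = [1, 3, 4, 5, 5, 6, 6, 7, 8, 9, 11, 12, 13, 30, 30]

print(sum(scores))                # 150
print(sum(scores) / len(scores))  # 10.0
print(statistics.mean(scores))    # 10
```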
Calculation of the Mean for Grouped Data by Values
When the data is grouped by values, the formula applies with an adjustment: the mean is weighted by the respective frequencies \((n_i)\) representing each value.
Thus, the formula to use for data grouped by values is written as follows:
For the population: $$\mu_x = \frac{\sum n_i x_i}{N}$$
For the sample: $$\bar{x} = \frac{\sum n_i x_i}{n}$$
Example: We will calculate here, using SPSS software, the arithmetic mean for data grouped by values; the same procedure applies to data grouped by class.
In this example, we will calculate the arithmetic mean for the variable number of siblings from our example guide.
In SPSS, the manipulation yields the following result:
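As a hand check of the same result, using the frequency table of the number of siblings given earlier:

$$\bar{x} = \frac{\sum n_i x_i}{n} = \frac{(0)(10)+(1)(17)+(2)(40)+(3)(20)+(4)(7)+(5)(6)}{100} = \frac{215}{100} = 2.15$$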
Calculation of the Mean for Grouped Data by Classes
In the case of data grouped by classes, the value of the mean is weighted according to the class midpoints, providing an approximate result for the mean. The mean is calculated using the following formulas:
For the population: $$\mu_x = \frac{\sum n_i c_i}{N}$$
For the sample: $$\bar{x} = \frac{\sum n_i c_i}{n}$$
Example: as an illustration, we calculate the arithmetic mean for the age variable of our example (a continuous quantitative variable grouped into classes), as sketched below.
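A worked sketch using the class centers of the age table (20.5, 21.5, 22.5, 23.5, 24.5) weighted by their frequencies:

$$\bar{x} = \frac{\sum n_i c_i}{n} = \frac{(40)(20.5)+(20)(21.5)+(15)(22.5)+(10)(23.5)+(15)(24.5)}{100} = \frac{2190}{100} = 21.9 \text{ years}$$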
Measures of central tendency provide information about the center of the distribution. However, they have limitations when the data in a distribution are sufficiently dispersed that these measures do not faithfully represent the observed data. Dispersion measures can be considered complementary to those describing what happens at the center of the distribution.
In the humanities, we use two main types of dispersion measures: the range and the standard deviation.
The range is defined as the difference between the largest and smallest values in a statistical series.
Using the definition, we obtain a simple formula: $$R = x_{\max} - x_{\min}$$
In a descriptive statistics exam, a teacher corrected the papers of two groups, each consisting of forty (40) students. We will reproduce the scores of each group.
Group 1 | ||||
---|---|---|---|---|
0 | 0 | 1 | 1 | 2 |
3 | 3 | 4 | 5 | 6 |
6 | 7 | 8 | 9 | 9 |
9 | 10 | 10 | 10 | 10 |
10 | 10 | 11 | 11 | 12 |
12 | 13 | 13 | 14 | 15 |
15 | 16 | 16 | 17 | 17 |
18 | 18 | 19 | 20 | 20 |
Group 2 | ||||
---|---|---|---|---|
4 | 4 | 4 | 4 | 5 |
5 | 5 | 6 | 6 | 6 |
6 | 7 | 7 | 7 | 7 |
8 | 10 | 10 | 10 | 10 |
10 | 10 | 10 | 10 | 10 |
11 | 11 | 11 | 14 | 14 |
14 | 15 | 15 | 15 | 16 |
16 | 16 | 17 | 17 | 17 |
We notice that both series have exactly the same central values, namely: a Mode of 10, a mean of 10, and a median of 10 as well.
However, when calculating the range for each of the two series, we obtain the following results: Group 1: \(R = 20 - 0 = 20\); Group 2: \(R = 17 - 4 = 13\).
We observe from the range calculation that the scores in the first group vary from 0 to 20, giving a range of 20, which is larger than the range of 13 observed in the second group. The dispersion of the scores obtained by the students is therefore greater in the first group than in the second.
Although simple to evaluate, the range only provides an initial impression of the variability of the data.
Regardless of the type of value grouping considered, the definition of the range remains the same:
In the following table, which presents the number of children per household, we can easily assess the range :
\(x_i\) | 0 | 1 | 2 | 3 | 4 | 5 | \(\sum\) |
\(n_i\) | 22 | 40 | 18 | 12 | 05 | 03 | \(100\) |
The highest value being \(5\) and the lowest value being \(0\), we note that the range is \(5 - 0 = 5\). The same observation can be made regarding the first example: two series with the same range do not necessarily have the same variability.
In the case of data grouped into classes, the range is calculated based on the class boundaries. The range of the sample is equal to the difference between the upper boundary of the last class and the lower boundary of the first class.
In the following example, which represents the distribution of a sample according to the variable age, we will evaluate the range as follows:
\(Age_{years}\) | [ 20 - 30 [ | [ 30 - 40 [ | [ 40 - 50 [ | [ 50 - 60 [ | [ 60 - 70 [ | [ 70 - 80 [ | \(\sum\) |
\(n_i\) | 10 | 20 | 40 | 15 | 14 | 11 | \(100\) |
The upper boundary of the last class being 80 and the lower boundary of the first class being 20, the range is \(80 - 20 = 60\). The same observation as in the first example applies: two series with the same range do not necessarily have the same variability.
The variance \(\sigma_x^2\) is, like the standard deviation \(\sigma_x\) and the coefficient of variation (\(C_v\)), an indicator of dispersion around the mean.
The principle behind the variance (and the standard deviation) is to evaluate the average deviation of each observation from the arithmetic mean of the observations.
Note that the deviations from the mean are positive and negative values that cancel each other out: their sum is always equal to 0. The variance addresses this by squaring the deviations, which are then either zero or positive.
The variance of a variable \(x\), denoted \(\sigma_x^2\) (read as \( \sigma^2 \text{ subscript } x \)), can be calculated for both the population and the sample.
The variance of a variable (x) is equal to the average of the squared deviations between the values of the variable and the mean.
Note: The larger the deviations of the values from the mean, the higher the variance (and vice versa), and thus the greater the dispersion around the mean.
If the data are individual, the \(N\) values of the variable are denoted: \(x_1, x_2, x_3, x_4, x_5, \dots, x_N\). The variance is obtained by dividing the sum of the squared deviations between the data and the population mean by the number of data points, as shown in the following formula (click on the formula for more details): $$\sigma_x^2 = \frac{\sum (x_i - \mu_x)^2}{N}$$
Note regarding the calculation of variance for a sample. For a sample, the formula for calculating variance needs to be adjusted. This modification accounts for sampling error and the fact that the sample is smaller than the population.
The correction for sampling bias is obtained by dividing the sum of squared deviations by (n − 1) rather than n. Thus, the variance calculated for a sample is called the sample variance and is denoted \(S_{x}^{2}\).
The formula to use for calculating the variance for individual data in a sample is as follows (click on the formula for more details): $$S_x^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$$
Remark (variance calculation). When using the shortcut formula, care should be taken not to subtract the square of the arithmetic mean directly from the sum of the \(x_{i}^{2}\): the sum of squares must first be divided by the number of observations, and only then is the squared mean subtracted.
By simplifying the variance formula, we end up with the following equation, known as the König-Huygens theorem: $$\sigma_x^2 = \frac{\sum x_i^2}{N} - \mu_x^2$$
Let's try to calculate the variance for each group from the previous example: (we will use the table to better compute the terms of the equation)
We have seen that the arithmetic mean \(\bar{x} = 10\). The following table includes an additional row for calculating \(x_i^2\):
Group 1
\(x\) | 0 | 0 | 1 | 1 | 2 | 3 | 3 | 4 | 5 | 6 | 6 | 7 | 8 | 9 | 9 | 9 | 10 | 10 | 10 | 10 |
\(x^{2}\) | 0 | 0 | 1 | 1 | 4 | 9 | 9 | 16 | 25 | 36 | 36 | 49 | 64 | 81 | 81 | 81 | 100 | 100 | 100 | 100 |
\(x\) | 10 | 10 | 10 | 11 | 11 | 12 | 12 | 13 | 13 | 14 | 15 | 15 | 16 | 16 | 17 | 17 | 18 | 18 | 19 | 20 |
\(x^{2}\) | 100 | 100 | 100 | 121 | 121 | 144 | 144 | 169 | 169 | 196 | 225 | 225 | 256 | 256 | 289 | 289 | 324 | 324 | 361 | 400 |
Applying the formula, we get the following result: \(29.4\)
We will now proceed with calculating the variance for the second group to compare the results between the two groups:
For Group 2, we will replicate the same procedure, and we will end up with the following result:
Group 2
\(x\) | 4 | 4 | 4 | 4 | 5 | 5 | 5 | 6 | 6 | 6 | 6 | 7 | 7 | 7 | 7 | 8 | 10 | 10 | 10 | 10 |
\(x^{2}\) | 16 | 16 | 16 | 16 | 25 | 25 | 25 | 36 | 36 | 36 | 36 | 49 | 49 | 49 | 49 | 64 | 100 | 100 | 100 | 100 |
\(x\) | 10 | 10 | 10 | 10 | 10 | 11 | 11 | 11 | 14 | 14 | 14 | 15 | 15 | 15 | 16 | 16 | 16 | 17 | 17 | 17 |
\(x^{2}\) | 100 | 100 | 100 | 100 | 100 | 121 | 121 | 121 | 196 | 196 | 196 | 225 | 225 | 225 | 256 | 256 | 256 | 289 | 289 | 289 |
Applying the formula, we get the following result: \(17.6\)
Note
It is worth noting that variance, like range, is sensitive to the variability of observations. Just as with the range, the variance of the scores in Group 1 is higher than that of Group 2.
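As a quick check of the Group 2 result with NumPy (the variable name below is ours; `np.var` and `np.std` divide by \(N\), i.e., they compute the population variance and standard deviation):

```python
import numpy as np

# Group 2 scores, as listed in the table above
group2 = [4, 4, 4, 4, 5, 5, 5, 6, 6, 6, 6, 7, 7, 7, 7, 8, 10, 10, 10, 10,
          10, 10, 10, 10, 10, 11, 11, 11, 14, 14, 14, 15, 15, 15, 16, 16,
          16, 17, 17, 17]

print(np.mean(group2))  # 10.0
print(np.var(group2))   # 17.6      (population variance)
print(np.std(group2))   # about 4.2 (square root of the variance)
```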
Consider a quantitative variable \(x\) defined on a population of \(N\) individuals, taking \(k\) distinct values \(x_1, x_2, \dots, x_k\) with frequencies \(n_1, n_2, \dots, n_k\) (so that \(\sum n_i = N\)). The variance is calculated following the same logic, noting that each squared deviation is weighted by \(n_i\).
For a Population: $$\sigma_x^2 = \frac{\sum n_i (x_i - \mu_x)^2}{N}$$
We remain within the same definition; variance is the weighted average of the squared deviations from the mean.
For a Sample: $$S_x^2 = \frac{\sum n_i (x_i - \bar{x})^2}{n - 1}$$
We can once again use the König-Huygens theorem, simplifying it to obtain the following formula:
For a Population: $$\sigma_x^2 = \frac{\sum n_i x_i^2}{N} - \mu_x^2$$
For a Sample: $$S_x^2 = \frac{\sum n_i x_i^2 - n\bar{x}^2}{n - 1}$$
Let's revisit the example of the number of children per household. We will add two columns to our table: in the first column, we will calculate \((x_i - \bar{x})^2\) and in the second column, \(n_i \cdot (x_i - \bar{x})^2\):
\(x_i\) | \(n_i\) | \((x_i-\bar{x})^2\) | \(n_i . (x_i - \bar{x})^2\) |
0 | 22 | 2.16 | 47.54 |
1 | 40 | 0.22 | 8.84 |
2 | 18 | 0.28 | 5.06 |
3 | 12 | 2.34 | 28.09 |
4 | 5 | 6.40 | 32.00 |
5 | 3 | 12.46 | 37.38 |
N | 100 | --- | 158.91 |
The mean of the series is equal to \(1.47\). Applying the variance formula, we get the following result: \(\sigma_x^2 = \frac{158.91}{100} = 1.59\).
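The same computation can be checked quickly with NumPy's weighted average (the variable names below are ours):

```python
import numpy as np

values = np.array([0, 1, 2, 3, 4, 5])      # number of children per household
counts = np.array([22, 40, 18, 12, 5, 3])  # frequencies, N = 100

mean = np.average(values, weights=counts)                    # 1.47
variance = np.average((values - mean) ** 2, weights=counts)  # about 1.59
print(mean, variance)
```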
The definition remains the same except that values are replaced by class midpoints, noted as \(c_i\).
For a Population: $$\sigma_x^2 = \frac{\sum n_i (c_i - \mu_x)^2}{N}$$
For a Sample: $$S_x^2 = \frac{\sum n_i (c_i - \bar{x})^2}{n - 1}$$
For the example regarding age, we get the result: \(\sigma_{x}^{2} = 283.84\).
Variance has the drawback of being expressed in squared units (in our previous example: students' scores squared, number of children squared, etc.), which makes it less directly interpretable.
To bring variance to the same scale as the mean, we take the square root of it, resulting in a measure expressed in the same unit as the variable being studied: the standard deviation.
The standard deviation is defined as the square root of the variance (denoted as \(\sigma_x\) in the case of the population, read as sigma x, and \(S_x\) in the case of the sample). The standard deviation measures the average deviation between a value of the variable and the mean of the variable, effectively expressing it in the same unit of measurement.
Explanation: Referring to the variance of the age variable, the standard deviation would be: \(\sigma_x = \sqrt{283.84} = 16.84\). Thus, we observe that the age variable shows significant variation in our sample.
Position statistics allow us to make comparisons, positioning one or more observations relative to the mean or to the entire set of observations.
There are different positioning statistics, and we are interested in three of them: quantiles (or percentile ranks), the absolute rank, and the reference value.
A frequency distribution can be divided into a chosen number of parts. The percentile rank indicates the position of an observation (or value) relative to, and in comparison with, all other observations.
The values that divide the frequency distribution into a given number of equal parts are called quantiles.
Quantiles are defined by analogy with the median. The most commonly used quantiles divide the frequency distribution into four (quartiles), five (quintiles), ten (deciles), and one hundred (percentiles).
A quantile of order \(\alpha\,\%\), denoted \(q_{\alpha}\), is the value of the variable for which the associated cumulative frequency is equal to \(\alpha\,\%\):
For relative frequencies: $$ F(q_{\alpha}) = \alpha\,\% $$
For counts: $$ N(q_{\alpha}) = \frac{\alpha}{100} \cdot n $$
In this section, we will limit ourselves to percentiles; other quantiles will be mentioned (with their formulas), and the reader can refer to them as needed.
Calculating the Percentile Rank
By definition, the percentile rank of a value is the percentage of observations falling below that value, plus half the percentage of observations falling exactly on it.
The calculation of the percentile rank is done using the statistical table, after calculating the cumulative frequencies (percentages), and then performing an arithmetic correction to obtain the value of the percentile rank.
Percentile ranks are used in standardized tests, also known as norm-referenced tests, such as IQ tests, TOEFL, SAT, GRE, and GMAT, etc. By definition, standardized tests or norm-referenced tests are assessments designed to be administered and scored consistently for all participants.
Percentiles
Percentiles are values that divide the frequency distribution into 100 equal parts.
For example, the 18th percentile, denoted C18 (C subscript 18), is the value below which 18% of the data falls (and 82% of the data is above).
A percentile of order \(\alpha\) is denoted \(C_{\alpha}\); it is the value below which \(\alpha\,\%\) of the data fall.
The calculation of a quantile is similar to that of the median, except that 50% is replaced by α %.
Case of isolated data
Calculating a percentile in the case of isolated data is quite simple: if \(N\) corresponds to 100% of the data, then \(\alpha\,\%\) corresponds to \(p\) data points, where \(p\) gives the position of the percentile in the ordered series.
The rule of three applied to this kind of calculation is:
Starting from the following equivalence:
$$\frac{\alpha}{100} = \frac{p}{N} \quad\Longrightarrow\quad p = \frac{N \alpha}{100}$$
Note, examples, and explanations:
Suppose we have the following scores for 5 students:
\[ 45, 50, 55, 60, 65 \]
We will calculate the 40th percentile (P40).
Calculation Steps:
The position of P40 falls between the 2nd and 3rd observations (position 2.4, hence a fractional part of 0.4).
The corresponding values are the 2nd and 3rd observations: 50 and 55.
Linear interpolation is done as follows:
\[ P40 = \text{Value at the lower position} + (\text{Fractional part of the position} \times \text{Difference between the values}) \]
\[ P40 = 50 + (0.4 \times (55 - 50)) = 50 + (0.4 \times 5) = 50 + 2 = 52 \]
Result: The 40th percentile for these data is 52. This means that 40% of the students have a score of 52 or less.
We have the following scores for 8 students:
\[ 48, 55, 58, 60, 65, 68, 72, 75 \]
We will calculate the 75th percentile (P75).
Calculation Steps:
The position of P75 falls between the 6th and 7th observations (position 6.75, hence a fractional part of 0.75); the corresponding values are 68 and 72.
Linear interpolation is done as follows:
\[ P75 = \text{Value at the lower position} + (\text{Fractional part of the position} \times \text{Difference between the values}) \]
\[ P75 = 68 + (0.75 \times (72 - 68)) = 68 + (0.75 \times 4) = 68 + 3 = 71 \]
Result: The 75th percentile for these data is 71. This means that 75% of the students have a score of 71 or less.
Case of grouped data by values
To calculate the percentile of order α, we use the formula for calculating the median for a discrete quantitative variable.
Case of grouped data by classes
In the case of data grouped into classes, we will use the formula for calculating the median as discussed previously.
The calculation of the percentile α will involve finding the value that exceeds α%.
To calculate the percentile \(C_{\alpha}\), replace 50% with \(\alpha\,\%\) and select the class containing \(C_{\alpha}\) (not the median class): $$C_{\alpha} = b_{c\alpha} + \left[\frac{\frac{\alpha}{100}\, n - F_{c\alpha-1}}{f_{c\alpha}}\right] \times L_{c\alpha}$$ where \(b_{c\alpha}\) is the lower boundary of the class containing \(C_{\alpha}\), \(F_{c\alpha-1}\) the cumulative frequency of the preceding class, \(f_{c\alpha}\) the frequency of that class, and \(L_{c\alpha}\) its width.
Data: Suppose we have the following grouped scores for 40 students:
Classes | Frequency (f) |
---|---|
[ 0 - 10 [ | 5 |
[ 10 - 20 [ | 8 |
[ 20 - 30 [ | 12 |
[ 30 - 40 [ | 10 |
[ 40 - 50 [ | 5 |
Σ | 40 |
We will calculate the 70th percentile (P70).
Calculation Steps:
The position of P70 is 70% × 40 = 28. We then calculate the cumulative frequencies until we reach this position: 5, 13, 25, 35, 40.
The 28th observation falls within the class whose cumulative frequency is 35, that is, the interval [30 - 40[.
Applying the formula: \(C_{70} = 30 + \frac{28 - 25}{10} \times 10 = 33\). Result: The 70th percentile for these grouped data is 33. This means that 70% of the students have a score less than or equal to 33.
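A minimal Python sketch of this procedure (the function and variable names are ours; the class boundaries are passed as a simple list):

```python
def grouped_percentile(bounds, freqs, alpha):
    """Percentile of order alpha (%) for data grouped into classes."""
    n = sum(freqs)
    position = alpha / 100 * n          # position of the percentile (here 28)
    cumulative = 0
    for i, f in enumerate(freqs):
        if cumulative + f >= position:  # class containing the percentile
            lower = bounds[i]
            width = bounds[i + 1] - bounds[i]
            return lower + (position - cumulative) / f * width
        cumulative += f

# Scores of the 40 students grouped into [0-10[, [10-20[, ..., [40-50[
print(grouped_percentile([0, 10, 20, 30, 40, 50], [5, 8, 12, 10, 5], 70))  # 33.0
```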
The percentile rank is a simple statistic to calculate and interpret; however, it can be an inadequate measure when the distribution is not symmetrical, particularly when the sample size is small. The percentile rank does not consider statistical indices (Mean and standard deviation) in its interpretation, making it sensitive to the shape of the data distribution.
Rank helps determine the position of a single data point. There are usually three types of rank: absolute rank, fifth rank, and percentile rank.
The absolute rank is a positioning statistic that indicates the position of an observation relative to the observations at the two ends of the data set, the statistical series being arranged in ascending or descending order. The fifth rank is a number between 1 and 5 indicating which interval a data point falls into when the distribution is divided into five equal parts.
In our course, we will focus only on the percentile rank.
By definition, the percentile rank of a value is the percentage of data points that fall below it. It is expressed as an integer between 1 and 99. Determining a percentile rank is the reverse operation of determining a percentile.
Example and Explanation:
The following table shows the grouped scores for 40 students:
Classes | Frequency (f) |
---|---|
[ 0 - 10 [ | 5 |
[ 10 - 20 [ | 8 |
[ 20 - 30 [ | 12 |
[ 30 - 40 [ | 10 |
[ 40 - 50 [ | 5 |
Σ | 40 |
We will calculate the percentile rank for a value of 35.
Calculation Steps:
The value 35 falls within the interval [30 - 40[, whose lower boundary is 30, width 10, and frequency 10; the cumulative frequency of the preceding classes is 25.
Formula:
$$\text{Percentile rank} = \text{integer part of} \left[ \left( \frac{X_{r} - b_{r}}{L_{r}} \times f_{r} + F_{r-1} \right) \times \frac{100}{n} \right] = \text{integer part of} \left[ \left( \frac{35 - 30}{10} \times 10 + 25 \right) \times \frac{100}{40} \right] = 75$$
Result: The value of 35 is at the 75th percentile. This means that 75% of the students have a score less than or equal to 35.
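The reverse operation can be sketched in Python as follows (again with our own names; the value is interpolated linearly inside its class, as in the formula above):

```python
def percentile_rank(bounds, freqs, x):
    """Percentile rank of the value x for data grouped into classes."""
    n = sum(freqs)
    below = 0
    for i, f in enumerate(freqs):
        lower, upper = bounds[i], bounds[i + 1]
        if x >= upper:
            below += f                                   # whole class below x
        elif x > lower:
            below += (x - lower) / (upper - lower) * f   # fraction of the class below x
    return int(below / n * 100)

print(percentile_rank([0, 10, 20, 30, 40, 50], [5, 8, 12, 10, 5], 35))  # 75
```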
The percentile rank can be determined directly using the ogive.
The ogive is a graph that represents the cumulative frequency of the data. It allows us to visualize the cumulative distribution and estimate percentiles or percentile ranks. The horizontal axis \((x)\) represents the values or classes, and the vertical axis \((y)\) represents the cumulative frequency.
We will calculate the percentile rank for a value of 35 using the ogive.
Steps to Calculate the Percentile Rank from the Ogive:
Calculate the cumulative frequency for each class and plot the points corresponding to the upper bounds of each class and their cumulative frequency.
Locate the value 35 on the x-axis. Draw a vertical line from 35 to the ogive. Draw a horizontal line from the intersection to the y-axis to read the percentile rank.
Percentile Rank Calculation:
From the ogive, the value of 35 corresponds to a cumulative frequency of 30.
The percentile rank for a value of 35 is thus:
\[ P = \left( \frac{30}{40} \right) \times 100 = 75 \]
Result: The value of 35 is at the 75th percentile. This means that 75% of the students have a score less than or equal to 35.
The Z score allows us to represent the position of an observation relative to the unit of measurement that is the standard deviation.
By definition, the Z score is the distance between a data point and the mean, expressed in standard deviations.
The Z score, also known as the Z value or standardized score, is a statistical measure that indicates how many standard deviations a data point is above or below the mean of the dataset. In other words, the Z score allows us to standardize different values within a dataset, enabling comparisons between data from different distributions or sets.
Formula:
The \(Z\) score for a value \(x\) is calculated using the following formula:
$$ Z = \frac{\text{value of the observation} - \text{mean}}{\text{standard deviation}} $$ This formula can be rewritten as: $$ Z = \frac{x - M}{\sigma} $$ where \(x\) is the value of the observation, \(M\) the mean of the distribution, and \(\sigma\) its standard deviation.
The two tables below show the respective scores of twenty students in two modules: Research Methodology in Human and Social Sciences and Presentation and Data Analysis.
The goal is to rank the students according to their combined results in the two modules, using the mean, variance, and standard deviation of the scores.
Student | Research Methodology in Social Sciences |
---|---|
Student 1 | 60 |
Student 2 | 70 |
Student 3 | 80 |
Student 4 | 90 |
Student 5 | 50 |
Student 6 | 85 |
Student 7 | 75 |
Student 8 | 45 |
Student 9 | 65 |
Student 10 | 55 |
Student 11 | 70 |
Student 12 | 95 |
Student 13 | 65 |
Student 14 | 55 |
Student 15 | 85 |
Student 16 | 75 |
Student 17 | 65 |
Student 18 | 55 |
Student 19 | 60 |
Student 20 | 80 |
Student | Presentation and Data Analysis |
---|---|
Student 1 | 65 |
Student 2 | 75 |
Student 3 | 85 |
Student 4 | 95 |
Student 5 | 55 |
Student 6 | 80 |
Student 7 | 90 |
Student 8 | 50 |
Student 9 | 70 |
Student 10 | 60 |
Student 11 | 75 |
Student 12 | 95 |
Student 13 | 65 |
Student 14 | 55 |
Student 15 | 80 |
Student 16 | 70 |
Student 17 | 65 |
Student 18 | 55 |
Student 19 | 60 |
Student 20 | 85 |
\[ \text{Mean} = \frac{\displaystyle \scriptsize 60 + 70 + 80 + 90 + 50 + 85 + 75 + 45 + 65 + 55 + 70 + 95 + 65 + 55 + 85 + 75 + 65 + 55 + 60 + 80}{\scriptsize 20} = 70 \]
\[ \text{Variance} = \frac{\sum (x_i - \mu)^2}{n} = 200 \]
\[ \text{Standard Deviation} = \sqrt{200} = 14.14 \]
\[ \text{Mean} = \frac{ \displaystyle \scriptsize 65 + 75 + 85 + 95 + 55 + 80 + 90 + 50 + 70 + 60 + 75 + 95 + 65 + 55 + 80 + 70 + 65 + 55 + 60 + 85}{\displaystyle \scriptsize 20} = 72.5 \]
\[ \text{Variance} = \frac{\sum (x_i - \mu)^2}{n} = 206.25 \]
\[ \text{Standard Deviation} = \sqrt{206.25} = 14.36 \]
Using the Z-score formula, we calculate the Z score for each module; we then average the two (adding them and dividing by two) to obtain the average Z score, which we use to rank the students' results.
Student | Z Score (Research Methodology in Social Sciences) | Z Score (Presentation and Data Analysis) | Average Z Score |
---|---|---|---|
Student 1 | -0.71 | -0.52 | -0.62 |
Student 2 | 0.00 | 0.17 | 0.08 |
Student 3 | 0.71 | 0.87 | 0.79 |
Student 4 | 1.41 | 1.57 | 1.49 |
Student 5 | -1.41 | -1.22 | -1.32 |
Student 6 | 1.06 | 0.52 | 0.79 |
Student 7 | 0.35 | 1.22 | 0.78 |
Student 8 | -1.77 | -1.57 | -1.67 |
Student 9 | -0.35 | -0.17 | -0.26 |
Student 10 | -1.06 | -0.87 | -0.97 |
Student 11 | 0.00 | 0.17 | 0.08 |
Student 12 | 1.77 | 1.57 | 1.67 |
Student 13 | -0.35 | -0.52 | -0.44 |
Student 14 | -1.06 | -1.22 | -1.14 |
Student 15 | 1.06 | 0.52 | 0.79 |
Student 16 | 0.35 | -0.17 | 0.09 |
Student 17 | -0.35 | -0.52 | -0.44 |
Student 18 | -1.06 | -1.22 | -1.14 |
Student 19 | -0.71 | -0.87 | -0.79 |
Student 20 | 0.71 | 0.87 | 0.79 |
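A quick check of the first row of this table in Python, using the means and standard deviations computed above (70 and 14.14 for the first module, 72.5 and 14.36 for the second):

```python
# Student 1 scored 60 in the first module and 65 in the second
z1 = (60 - 70) / 14.14     # -0.707..., i.e. -0.71 in the table
z2 = (65 - 72.5) / 14.36   # -0.522..., i.e. -0.52 in the table
average_z = (z1 + z2) / 2  # -0.61..., reported as -0.62 in the table
print(z1, z2, average_z)
```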
After calculating the average Z-Scores, we can obtain this ranking.
Rank | Student | Average Z-Score |
---|---|---|
1 | Student 12 | 1.67 |
2 | Student 4 | 1.49 |
3 | Student 3 | 0.79 |
4 | Student 6 | 0.79 |
5 | Student 15 | 0.79 |
6 | Student 20 | 0.79 |
7 | Student 7 | 0.78 |
8 | Student 16 | 0.09 |
9 | Student 2 | 0.08 |
10 | Student 11 | 0.08 |
11 | Student 9 | -0.26 |
12 | Student 13 | -0.44 |
13 | Student 17 | -0.44 |
14 | Student 1 | -0.62 |
15 | Student 19 | -0.79 |
16 | Student 10 | -0.97 |
17 | Student 14 | -1.14 |
18 | Student 18 | -1.14 |
19 | Student 5 | -1.32 |
20 | Student 8 | -1.67 |
Table II.4.1. Student ranking according to their average Z-Scores
Discover the data editor for discrete quantitative variables. Click the link below to try entering table data to calculate statistical parameters. Learn and master the basics interactively and playfully.
Access the Editor. All editors are accessible in the Appendix section of this course.
Discover the data editor for continuous quantitative variables. Click the link below to try entering class data from a table to calculate statistical parameters. Learn and master the basics interactively and playfully.
Access the Editor. All editors are accessible in the Appendix section of this course.
In this course, we have just covered the various indices that allow us to describe a data set.
Central tendency indices are present in most documents related to data analysis. Central tendency indicators can be seen as a first approach to understanding the overall information that defines the identity of our population or survey sample.
Central tendency parameters also help guide the future analysis of our data. Therefore, it is important to grasp their significance:
In this course, we have also covered how to calculate and interpret measures of dispersion.
Measures of dispersion, when associated with measures of central tendency, provide a preliminary approach to analyzing our survey data. It is crucial to master the process:
Univariate analysis will also involve the interpretation of measures of position. These allow us to know and identify the exact location of an observation in our statistical series:
The course does not have a final bibliography (in its online version); references are inserted at the end of each block.
The following questions allow you to review the knowledge discussed during the Block. We will have a discussion during the tutorial sessions.
The MCQ consists of twelve questions covering certain parts of the course. At the end, you will receive your evaluation as well as the answer key.
To access the MCQ, click on the following icon:
In this section, you will be able to download notes related to the current course.
Sheet 1 Python Libraries: In this note, you will get to know the Python libraries dedicated to data analysis (Pandas, NumPy, Matplotlib). These libraries will help you create diagrams and calculate univariate statistical parameters. Click HERE to download the note.
To further your learning from this first Block, you can consult the following links:
On the Course App, you will find a summary of this Block, as well as related tutorial series.
There are also links to multimedia content relevant to the Block.
In the Notifications section, an update is planned based on the questions raised by students during lectures and tutorials.
An update also covers exams from previous sessions, which will be reviewed in tutorial sessions to prepare for the current year's exams.
In this Python Corner, you will learn how to calculate the descriptive statistics parameters covered in the Course, and then how to create the corresponding charts and diagrams.
Below you will find the data for the three types of variables, which you can copy and paste into the online Python editor, Trinket.
The explanations are contained in the booklet that you can download in the Course & Tutorials Notes section above. The booklet includes detailed explanations on what you need to master for calculating univariate statistical parameters.
[ "Red", "Blue", "Green", "Yellow", "Red", "Blue", "Green", "Yellow", "Red", "Blue", "Green", "Yellow", "Red", "Blue", "Green", "Yellow", "Red", "Blue", "Green", "Yellow", "Red", "Blue", "Green", "Yellow", "Red", "Blue", "Green", "Yellow", "Red", "Blue", "Green", "Yellow", "Red", "Blue", "Green", "Yellow", "Red", "Blue", "Green", "Yellow", "Red", "Blue", "Green", "Yellow", "Red", "Blue", "Green", "Yellow", "Red", "Blue" ]
[ 5, 7, 9, 12, 5, 8, 6, 10, 15, 8, 7, 11, 13, 14, 5, 6, 9, 7, 10, 12, 11, 8, 6, 13, 14, 15, 7, 8, 9, 10, 11, 12, 13, 14, 15, 6, 5, 8, 9, 7, 12, 11, 10, 9, 6, 7, 8, 11, 13, 14 ]
[ 5.2, 7.5, 9.1, 12.3, 5.8, 8.4, 6.9, 10.2, 15.6, 8.1, 7.7, 11.5, 13.4, 14.2, 5.9, 6.1, 9.3, 7.8, 10.6, 12.4, 11.9, 8.7, 6.5, 13.1, 14.7, 15.4, 7.2, 8.5, 9.7, 10.9, 11.3, 12.1, 13.9, 14.6, 15.1, 6.2, 5.4, 8.6, 9.8, 7.1, 12.7, 11.4, 10.3, 9.5, 6.7, 7.9, 8.8, 11.6, 13.2, 14.9 ]
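As a starting point, here is a minimal sketch showing how such a list can be used with the commands in the table below (the variable name `data` is just a placeholder for whichever list you paste in):

```python
import numpy as np
import matplotlib.pyplot as plt

# The discrete quantitative data given above
data = [5, 7, 9, 12, 5, 8, 6, 10, 15, 8, 7, 11, 13, 14, 5, 6, 9, 7, 10, 12,
        11, 8, 6, 13, 14, 15, 7, 8, 9, 10, 11, 12, 13, 14, 15, 6, 5, 8, 9, 7,
        12, 11, 10, 9, 6, 7, 8, 11, 13, 14]

print(np.mean(data), np.median(data), np.std(data))  # central tendency and dispersion
print(np.percentile(data, [25, 50, 75]))             # quartiles

plt.hist(data, bins=5)  # histogram of the values
plt.show()
```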
The following table lists the most commonly used Python commands for calculating descriptive statistical parameters and creating diagrams. In these commands, `data`, `counts`, `labels`, `categories`, `values`, `centers`, `x`, and `y` are placeholders for your own lists, such as those given above. As mentioned earlier, the booklet contains more details and explanations about the use of the libraries and related commands.
In the next session, we will see how to import your data directly from other formats.
Parameters | Command | Explanation |
---|---|---|
Mean | `import numpy as np; np.mean(data)` | Importing the NumPy library and calculating the mean of the data. |
Median | `import numpy as np; np.median(data)` | Importing the NumPy library and calculating the median of the data. |
Mode | `from scipy import stats; stats.mode(data)` | Importing the SciPy library and calculating the mode of the data. |
Standard Deviation | `import numpy as np; np.std(data)` | Importing the NumPy library and calculating the standard deviation of the data. |
Variance | `import numpy as np; np.var(data)` | Importing the NumPy library and calculating the variance of the data. |
Quartiles | `import numpy as np; np.percentile(data, [25, 50, 75])` | Importing the NumPy library and calculating the quartiles of the data. |
Deciles | `import numpy as np; np.percentile(data, np.arange(10, 100, 10))` | Importing the NumPy library and calculating the deciles of the data. |
Two Parameters | `import numpy as np; print(np.mean(data), np.median(data))` | Importing the NumPy library and calculating both the mean and the median of the data. |
Three Parameters | `import numpy as np; print(np.mean(data), np.median(data), np.std(data))` | Importing the NumPy library and calculating the mean, median, and standard deviation of the data. |
Pie Chart | `import matplotlib.pyplot as plt; plt.pie(counts, labels=labels); plt.show()` | Importing the Matplotlib library and creating a pie chart. |
Bar Chart | `import matplotlib.pyplot as plt; plt.bar(categories, counts); plt.show()` | Importing the Matplotlib library and creating a bar chart. |
Stem Plot | `import matplotlib.pyplot as plt; plt.stem(values, counts); plt.show()` | Importing the Matplotlib library and creating a stem plot. |
Histogram | `import matplotlib.pyplot as plt; plt.hist(data, bins=5); plt.show()` | Importing the Matplotlib library and creating a histogram. |
Frequency Polygon | `import matplotlib.pyplot as plt; plt.plot(centers, counts, marker='o'); plt.show()` | Importing the Matplotlib library and creating a frequency polygon. |
Cumulative Frequency Curve | `import matplotlib.pyplot as plt; plt.hist(data, bins=5, cumulative=True, histtype='step'); plt.show()` | Importing the Matplotlib library and creating a cumulative frequency curve. |
Box Plot | `import matplotlib.pyplot as plt; plt.boxplot(data); plt.show()` | Importing the Matplotlib library and creating a box plot. |
Scatter Plot | `import matplotlib.pyplot as plt; plt.scatter(x, y); plt.show()` | Importing the Matplotlib library and creating a scatter plot. |
Using the link below, you can download the Flipbook in PDF format:
The forum allows you to discuss this first session. You will notice the presence of a subscription button so you can follow discussions about research in humanities and social sciences. It is also an opportunity for the instructor to address students' concerns and questions.