


Introduction and session summary

In this session, we will cover the basics of statistical language. Any analysis work relies on a thorough understanding of the components of statistical reasoning.

Data analysis involves looking at a population and taking a representative sample, which is then analysed. Data analysis focuses on the characteristics or values of a statistical unit. The sum of these analyses enables the researcher to make an informed decision about his or her research hypotheses.

This session will cover the main decisions and procedures involved in selecting individuals for a survey. The researcher must determine the sub-population to be surveyed, select a research tool, and decide on the means of administering it.

We will define the two concepts of sample and sampling, then discuss the main types and techniques of sampling.

The data analysis process begins with identifying and classifying the variables under study. This second course will teach you how to define the basic concepts that are essential to data analysis in the humanities. We will review the various definitions accepted in statistics and data science, and address the numerical representations (statistical tables) and visual representations (graphs) of the collected data. We will then focus on working with measurement scales, which should be considered the most fundamental work on data.

During this session, we will have the opportunity to introduce the basics of sampling and probability calculation (as well as combinatorial analysis). We have deliberately used the term introduce because we will return to it in more detail in the second part of this course. This introduction follows the summary of the previous session and will, of course, be the subject of a tutorial session. The summaries and reminders are accessible under the Course & Tutorial Sheets section as well as in the Appendices section.

Starting from this session, you will notice the appearance of a new icon, which indicates that we will learn the basics of the Python language for data analysis. As for WYSIWYG software such as SPSS or JAMOVI, we will begin learning it from the next session.
This initiation primarily aims to provide the student with the means to design a raw data analysis plan. On the other hand, the second semester of teaching focuses on the use of data analysis software. To this end, we will propose an alternative to the classical approach of using WYSIWYG software by integrating the command line analysis approach through the philosophy of a program like Python.


Objectives of the session

During this session, we aim to achieve the following objectives:

  • Define data analysis and statistics: Data analysis allows extracting information using, to a large extent, statistical reasoning. In this course, we will try to understand the relationship between data and decision-making;
  • Understand the basic concepts of statistical jargon: The language of statistics allows mastering the basics of data analysis, so it is important to understand its full content (the set of concepts to be discussed is introduced below);
  • Determine the nature of a variable: Data analysis focuses exclusively on variables, with data processing performed on qualitative or quantitative variables;
  • Determine the types of measurement scales: Measurement scales facilitate the univariate and bivariate analysis of data. We will focus on this concept because it is important for the rest of the work throughout this course;
  • Define and explain the types of sampling and their ramifications: In this session, we will provide definitions and explanations related to the two major families of sampling: probabilistic and non-probabilistic (as well as the techniques these two concepts encompass);
  • Familiarize yourself, in the tutorials, with the basic concepts of probabilities in a finite space: In the tutorial session dedicated to this course, we will learn the basic concepts that form the foundation of sampling;
  • Install Python and perform a first task: In the section dedicated to Python, we will learn how to install this tool which is fundamentally aimed, with its libraries, at data analysis.

Concepts and themes to be covered during the session

Statistics, Population, Sample, Statistical unit, Characteristic, Modality, Value, Measurement scales (nominal, ordinal, interval, ratio).


1. What is data analysis?

Data analysis is the foundation of empirical knowledge. In a research process, data analysis plays an important role in determining the validity of a theoretical approach (verificatory) or of knowledge derived from fieldwork (generative).

Data enables us to gain knowledge and understanding of a phenomenon, and when processed systematically, it aids in decision-making. Synthesised data links theoretical approaches with empirical knowledge. It is the researcher's responsibility to interpret the meaning of the data. Quantification, like qualitative analysis, is a social process because it allows us to question the identified categories. An analysis based on a survey cannot be conducted without examining the social process that underlies the survey.

The epistemological postures that a researcher must adopt in their analytical work demonstrate the richness and depth of such an undertaking. Data analysis has been formalised based on numerical and statistical knowledge, a process that began in the seventeenth century. There is also a less formalised process that is oriented towards subjectivity and considers information in its qualitative aspect.

The following slide traces the history of data analysis, highlighting key dates as well as the most important and recurring themes and issues.


Slide II.1. A Brief History of Data Analysis.
(Thanks to Knightlab for the work they do.)

This second course will provide an understanding of the fundamentals of data analysis language. The course is focused on quantitative data, but we will discuss the nature of qualitative data and their analysis methods later on.

2. The population and statistical units

The population refers to the individuals or statistical units that are the focus of the research. It is crucial to have a clear and unambiguous definition of the population to accurately determine the membership of a statistical unit in this set.

The term 'population' is not exclusively reserved for human beings. It can refer to the inhabitants of a neighbourhood, village or city, to the articles published by a newspaper regarding a given event, or to the grades obtained by students in a teaching module.

A sample is a subset of the population chosen for the purposes of research or a survey. The sample may or may not be representative, depending on the procedures used to extract it from the parent population. The terms probabilistic and non-probabilistic sampling describe the procedures used to draw the sample.

The following figure illustrates the concepts of Population, Sample and Statistical Unit.

Figure I.2.1. Population and Sample

3. Variables

A variable is a characteristic or condition that can vary or take different values depending on the individual or sample. A datum is a single measurement or observation made on a variable in a population. The recorded observations for each characteristic constitute the data set.

Definition I.2.1

A variable is a measurable characteristic that can be analysed and assigned several different values.

During a survey, the researcher may ask questions to find out each respondent's gender (male or female), age (expressed in years) and number of siblings. These three elements are examples of what are known in statistics as variables. The data are therefore all the observations made, for each of the variables, on each statistical unit; this collection is called a statistical series. Each statistical series is characterised by the nature of the variables it contains.

A variable can be characterised by the type of value that defines it. There are two main types of variable: qualitative (categorical) variables and quantitative variables.
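As a minimal illustration (with hypothetical data), a statistical series can be represented in Python, where each record describes one statistical unit and each key is a variable:

```python
# A hypothetical statistical series: each record is one statistical unit.
respondents = [
    {"gender": "female", "age": 21, "siblings": 2},
    {"gender": "male", "age": 24, "siblings": 0},
    {"gender": "female", "age": 22, "siblings": 3},
]

# "gender" is a qualitative (categorical) variable: its data are categories.
genders = [r["gender"] for r in respondents]
# "age" is a quantitative variable: its data are numerical values.
ages = [r["age"] for r in respondents]

print(sorted(set(genders)))   # the modalities observed for "gender"
print(sum(ages) / len(ages))  # a numerical summary only makes sense for quantitative data
```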

Qualitative variables

When we ask people about their gender or their parents' professions, we get answers that represent categories: male, female; manager, professional, self-employed, unemployed, and so on. People questioned in this way can be grouped into categories according to the answers they give. When the data generated by a variable represent categories, it is said to be qualitative.

Definition I.2.2

A categorical variable (also known as a qualitative variable) is a variable whose associated data are categories. The set of possible categories that the variable can take is called its modalities.

Quantitative variables

The data we obtain for the age variable are numerical values, so we say that the age variable is a quantitative variable.

Definition I.2.3

A quantitative variable is a variable whose associated data are real numerical values.

A quantitative variable can be discrete or continuous.

The discrete quantitative variable is associated with counting and has whole number values. Examples include the number of siblings, children in a household, or students sitting an exam. The values of a discrete variable are separate and indivisible, with no intermediate values between adjacent values. An example of a discrete variable is a dice roll.

A continuous quantitative variable can be divided into an infinite number of fractions. Examples of such variables include time, age, weight, and height. The values of these variables belong to a numerical interval.
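The difference between the two can be made concrete with a short Python sketch (the values are hypothetical):

```python
# Discrete quantitative variable: counting — whole-number values only,
# with no possible value between two adjacent values.
children_per_household = [0, 1, 2, 2, 3]

# Continuous quantitative variable: measuring — any value within an
# interval is possible.
heights_cm = [158.2, 171.75, 164.9]

# Between any two distinct continuous values there is always another value:
mid = (heights_cm[0] + heights_cm[1]) / 2
print(mid)
```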

4. Measurement scales

A measurement scale corresponds to a set of well-defined statistical procedures that may legitimately be applied to a variable. Measurement scales and their categories were introduced by the American psychologist S. S. Stevens in 1946. Stevens believed that measurement scales are determined by both the empirical operations involved in the measurement process and the formal mathematical properties of the scales.

The categories used to measure a variable form a measurement scale, and the relationship between the different categories determines the different types of scale.

Data analysis is based on mathematical principles, especially those of applied mathematics. Measurement scales are a function of the properties of numbers.

There are four types of measurement scale :

4.1. Nominal scale

The nominal scale is used to name the category to which the observation belongs. Each observation in a nominal variable belongs to only one category. The information contained in the nominal variable has no mathematical properties.

In a nominal scale, the observations are distributed exhaustively (they concern all the possibilities) and exclusively (they do not overlap); the order of the modalities is not important.

Examples: names, eye colour, ethnic origin, city of birth, field of study, postcode, credit card number, international telephone code, ISBN/DOI, etc.

4.2. Ordinal scale

The ordinal scale is a variant of the nominal scale. The modalities and values of a variable are classified according to a criterion. The measurement scale is ordinal because there is a gradation in the categories used.

The ordinal scale is used to measure the position of each observation relative to the other observations on a variable; the position is called the rank. The order relationship is transitive. Example: level of education, race results.

In an ordinal scale, the categories are arranged in a meaningful order, but the differences between ranks are not themselves measured.

4.3. Interval scale - relative scale

The interval scale indicates, on an abstract scale, the interval between the position of a statistical unit and a position arbitrarily assigned the value zero (an arbitrary zero). Examples: I.Q., time measurement, the Likert scale (for measuring attitudes).

4.4. Ratio scale

On a ratio scale, the value zero indicates the absence of the characteristic under study; zero is an absolute zero. Examples: height, age.

The following table shows the properties applicable to measurement scales:

Table I.2.1. The properties of measurement scales

| Scale | Basic empirical operations | Mathematical group structure | Applicable statistical calculations | Tests for relationships between variables |
|---|---|---|---|---|
| Nominal | Determination of equality | Permutation group: \(x' = f(x)\) | Absolute and relative frequency | Chi-square, contingency coefficient, Phi coefficient, Lambda, linear regression |
| Ordinal | Determination of greater or less | Isotonic group: \(x' = f(x)\), where \(f(x)\) is any monotonically increasing function | Those of the nominal scale, plus: median, position measures | Those of the nominal scale, plus: rank correlation, other non-parametric tests, ordinal logistic regression |
| Interval | Determination of equality of intervals or differences | General linear group: \(x' = ax + b\) | Those of the previous two scales, plus: measures of central tendency and dispersion | Those of the previous two scales, plus: analysis of variance, Pearson correlation, simple and multiple regression |
| Ratio | Determination of equality of ratios | Similarity group: \(x' = ax\) | All | All |
Source: adapted from Stevens (1946, p. 678).
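As a sketch of the table's logic, Python's standard statistics module can illustrate which summary statistics each scale licenses (the data are hypothetical, and the numerical coding of education levels is an assumption made for illustration):

```python
from statistics import mode, median, mean

# Hypothetical observations on three variables of different scales.
eye_colour = ["brown", "blue", "brown", "green"]  # nominal
education = [2, 1, 3, 2]   # ordinal, coded 1=primary, 2=secondary, 3=tertiary
height_cm = [158.2, 171.7, 164.9, 180.0]          # ratio

# Following the table, each scale licenses more calculations than the last:
print(mode(eye_colour))   # nominal: frequencies and the mode only
print(median(education))  # ordinal: the above, plus median/rank statistics
print(mean(height_cm))    # interval/ratio: the above, plus mean, dispersion, etc.
```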

5. What is sampling?

Sampling is the process of extracting a portion of the population of interest in order to carry out a survey. In this sense the sample must be chosen to fairly represent the characteristics of the parent population. The latter is made up of individuals, also known as statistical units.

The methods used to select samples are known as sampling plans.

A good sampling plan involves the use of so-called probabilistic methods, the aim of which is to limit subjective judgement in the choice of units for carrying out a survey. [For definitions of statistical terms, see, for example, the online glossary on the statcan website by clicking HERE :) ]. Samples drawn using probability methods are called probabilistic or random samples.

Non-probability sampling, on the other hand, is based on selection by non-random means. This may be useful for certain studies, but it provides only a weak basis for generalisation.

In both families of sampling, the term sampling bias is often used to highlight an inconsistency in the technical implementation of sampling procedures. It is true that randomly drawn samples minimise bias, but it is also true that methods of extrapolation from a probability sample to the population must take account of the method used to draw the sample; otherwise, bias may occur.

6. Sample

A sample is a subset of the elements or members of a population. The use of a sample study makes it possible to collect information from (or on) the elements in such a way that the result can represent the information of the population from which it was extracted. [Another advantage of sampling is that it is an efficient and cost-effective tool for collecting data, one that can reduce costs while improving the quality of the results obtained.]

Each sample is assessed on the basis of two properties: its design and its implementation.

  • With regard to the design of the sample, the researcher must ensure a sampling distribution that will allow him or her to use confidence intervals and confidence levels. The researcher must also use a probability design, in which each element of the population has a known, non-zero probability of being selected (which is one of the conditions of the probability sample that we will see later in this course).
  • Each sampling plan must be implemented with a view to devoting resources to investigating each item in order to collect data from it (this is referred to as the direct and indirect collection technique). The response rate, which is the objective sought through the development of a good sampling plan, must reflect the proportion of the initial number of elements in the sample for which information has been obtained. Even if probability sampling is used, a low coverage rate may invalidate the use of sample estimates because of concerns that the loss of information is systematic rather than random.

7. Representativeness and sampling errors

Representativeness (of the sample) is a term used to describe the extent to which the information collected can be applied to the population and with what level of risk of error. In other words, the extent to which the characteristics of the small group of statistical units in the sample can reflect those of the population under study. (When we talk about population in research, this does not necessarily mean a certain number of people).

A population can be made up of objects, people or even events (e.g. sick people, cars, companies, etc). A complete list of cases in a population is called a sampling frame. This list can be more or less precise. A sample is therefore a certain number of cases selected from the sampling frame and on which we wish to carry out a more in-depth study.

How large should a sample be? There is no statistically valid and definitive answer to this question, which always comes to mind. In sampling theory, it is said that the larger the sample, the smaller the sampling error.

On the other hand, smaller samples can be easier to manage and have fewer non-sampling errors. The larger the samples, the more expensive they are to produce and the longer they take to implement. The researcher should bear in mind that determining the sample size is a task that requires a certain amount of practice and informed judgement.
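The trade-off described above can be made concrete with the standard error of the mean, \(\sigma / \sqrt{n}\): quadrupling the sample size only halves the sampling error. A minimal sketch, assuming a hypothetical population standard deviation of 10:

```python
import math

# Standard error of the mean: SE = sigma / sqrt(n).
sigma = 10.0  # hypothetical population standard deviation
errors = {n: sigma / math.sqrt(n) for n in (25, 100, 400)}
print(errors)  # the error shrinks with n, but with diminishing returns
```

This is why very large samples are costly relative to the precision they add.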

Sampling errors (also known as bias) can occur in the course of a survey. Let's just say it's almost inevitable, even in the case of specialist organisations! It is up to the researcher to familiarise himself with these pitfalls and to find a way of reconciling the objectives of his work with the requirements of his field of investigation. The main sources of sampling bias can be summarised as follows :

  • Errors related to the sampling plan ;
  • Errors related to the sample data ;
  • Errors, or selection bias;
  • Non-responses;
  • Response-related errors.

These elements will be discussed during the guided working session, which is partly devoted to them.


8. Sampling methods

There are two main types of sampling: probability (random) sampling and non-probability (non-random) sampling.

Probabilistic sampling techniques give the most reliable representation of the whole population, whereas non-probabilistic techniques, which rely on the judgement of the researcher, cannot be used to make generalisations (inferences) about the whole population.

The main point in this section is not to give a detailed presentation of the types of sampling - the student will have the opportunity to deal with this element in the research methods module - but to understand the principle that governs probabilities and that underlies the differentiation between the two types. In a way, this section extends the very first tutorial session devoted to mathematical reminders.

Random (probability) sampling

Probability sampling is based on the use of random methods to select the sample. Probabilistic selection procedures aim to ensure that each item (statistical unit) has a known, non-zero chance of being selected; in simple random designs, every unit, and every possible combination of units, has an equal chance of being selected.

Simple random sampling

Simple random sampling is a special case of random sampling. It involves selecting units by a random mechanism, so that each unit has an equal and independent chance of being selected.

Simple random sampling is used when the population is uniform or has common characteristics in all cases (for example, students from the same faculty, employees of a company, issues of a newspaper). A simple form of random selection would be to assign sequential numbers to the entities in the population, which would constitute a sampling frame, and then use a table of random numbers available in most statistics books or computer-generated tables.
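The procedure described above can be sketched in a few lines with Python's standard random module (the frame of 500 numbered units is hypothetical, and the seed is fixed only to make the draw reproducible):

```python
import random

# Sampling frame: sequential numbers assigned to each unit in the
# (hypothetical) population of 500 units.
frame = list(range(1, 501))

random.seed(42)  # fixed seed so the draw is reproducible
sample = random.sample(frame, k=30)  # each unit has an equal chance
print(sample[:5])
```

`random.sample` draws without replacement, so no unit can appear twice in the sample.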

Systematic sampling

Systematic sampling consists of selecting units with a fixed interval called the sampling step (K).

There are two common applications of systematic sampling :

  • The first is where there is a list of units in the population of interest that can be used as a sampling frame. In this case, the procedure consists of selecting each element of the list with the regular interval represented by (k) ;
  • The second is used when a list does not exist or it is impossible to create one, but sampling is carried out by selecting a flow in the survey area. In this case, the units to be sampled are selected at random.

Systematic sampling is preferred for its practical simplicity. It is an alternative to simple random sampling and can be used when the population is very large and has no known characteristics, or when the population is known to be very uniform (for example, students at the same level, in the same faculty).
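A minimal sketch of the first application, where a list serves as the sampling frame (the frame and sizes are hypothetical):

```python
import random

# Systematic sampling: select every k-th unit from the frame, starting
# at a random position inside the first interval.
frame = list(range(1, 201))   # hypothetical frame of N = 200 units
n = 20                        # desired sample size
k = len(frame) // n           # sampling step: k = N / n = 10

random.seed(1)
start = random.randrange(k)   # random start between 0 and k - 1
sample = frame[start::k]      # then a fixed interval of k
print(k, len(sample))
```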

Stratified sampling

Stratified sampling is used when the population is (or can be) subdivided into distinct categories or strata (e.g. students at several levels of education). The presence of different strata in a population makes it possible to carry out a simple random sample within these sub-groups.

There are two types of stratified sampling: proportional and non-proportional. Proportional stratified sampling involves controlling the proportions of the sample in each stratum (subgroup) to equal the proportions of the population. If the strata are correlated with the survey measures, this will increase the precision of the survey estimates.

Non-proportional stratified sampling involves the application of different sampling fractions in different strata. The aim is often to increase the sample size by one or more important subgroups for which separate estimates are required. In this situation, non-proportional stratification generally reduces the precision of estimates for the whole population under study, but increases the precision of estimates for the oversampled stratum.
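Proportional stratified sampling can be sketched as a simple random draw within each stratum, with the sample allocated in proportion to stratum size (the strata names and sizes are hypothetical):

```python
import random

# Hypothetical population stratified by level of study.
strata = {
    "licence": list(range(600)),
    "master": list(range(300)),
    "doctorate": list(range(100)),
}
N = sum(len(units) for units in strata.values())  # N = 1000
n = 100                                           # total sample size

random.seed(7)
sample = {}
for name, units in strata.items():
    # Proportional allocation: the stratum's share of the sample
    # equals its share of the population.
    n_h = round(n * len(units) / N)
    sample[name] = random.sample(units, n_h)

print({name: len(drawn) for name, drawn in sample.items()})
```

For a non-proportional design, one would simply replace the proportional allocation `n_h` with a deliberately larger size for the stratum to be oversampled.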

Note - Other types of random sampling are not covered in this course, mainly cluster sampling and multi-stage cluster sampling. These two types of sampling require practical applications that will be carried out in the tutorials of the quantitative approaches module at Master 1 level.

Non-probability sampling

Non-probability sampling is based on selection by non-random means. This may be useful for some studies, but it provides only a weak basis for generalising results.

Accidental sampling

Accidental sampling involves selecting sampling units that are easily accessible to the researcher. The researcher will use common sense and observation to select the units to be sampled.

Accidental sampling has the advantage of being simple to design and inexpensive. Sometimes, this form of sampling can be the most effective way of accessing a hard-to-reach population.

Accidental sampling can be used to collect qualitative or quantitative data.

Snowball sampling

Snowball sampling can be defined as a technique for gathering individuals to be surveyed through the identification of a key individual who is asked to provide the identities (contact details) of other participants who will eventually take part in the survey. Snowball sampling is mainly used in the case of sensitive or intimate subjects.

Quota sampling

Quota sampling is a non-random technique in which the researcher fixes, in advance, the number of individuals to be surveyed in each category (quota), so that the composition of the sample reflects that of the population for the chosen characteristics.
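A minimal sketch of the quota logic (the quotas and the order of arrivals are hypothetical):

```python
# Quota sampling sketch: respondents are accepted as they arrive until
# each quota is filled.
quotas = {"female": 3, "male": 3}
accepted = {"female": 0, "male": 0}
rejected = 0

arrivals = ["female", "female", "male", "female", "female",
            "male", "female", "male", "male"]
for gender in arrivals:
    if accepted[gender] < quotas[gender]:
        accepted[gender] += 1   # non-random: first come, first served
    else:
        rejected += 1           # quota already full for this category

print(accepted, rejected)
```

The selection within each quota is not random, which is why quota samples do not support probabilistic inference.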

Summary

We have seen in this course that raw data must undergo a series of conceptual and numerical transformations. The work of analysis begins once the data has been prepared.

  • The population is the set of individuals of interest in the survey;
  • A variable can be qualitative or quantitative;
  • A measurement scale contains information about the variable. There are four types: nominal, ordinal, interval, and ratio;
  • There are two types of sampling, probabilistic and non-probabilistic. The researcher will choose one based on the objectives of the research and technical considerations related to the feasibility of fieldwork.
Bibliography of the Block

The Course does not have a final bibliography (in its online version); references are inserted at the end of each Block.

  • Bazeley, P. (2017). Integrating analyses in mixed methods research. Sage Publications.
  • Blaikie, N. (2018). Approaches to social enquiry: Advancing knowledge. Polity.
  • Bryman, A. (2016). Social research methods (5th ed.). Oxford University Press.
  • Field, A. (2017). Discovering statistics using IBM SPSS statistics (5th ed.). Sage Publications.
  • Flick, U. (2018). An introduction to qualitative research (6th ed.). Sage Publications.
  • Leavy, P. (2017). Research design: Quantitative, qualitative, mixed methods, arts-based, and community-based participatory research approaches. Guilford Press.
  • Lewis-Beck, M. S., Bryman, A., & Liao, T. F. (Eds.). (2019). The Sage encyclopedia of social science research methods. Sage Publications.
  • Miles, M. B., Huberman, A. M., & Saldaña, J. (2019). Qualitative data analysis: A methods sourcebook (4th ed.). Sage Publications.
  • Silverman, D. (2020). Qualitative research (5th ed.). Sage Publications.
  • Stevens, S. S. (1946). On the theory of scales of measurement. Science, 103(2684), 677-680.
Summary Questions

The following questions will enable you to take stock of the knowledge covered during the block; they will be discussed during the tutorial sessions.

  • What does a data analyst's work involve?
  • Why do we analyze data?
  • Why do we use sampling?
  • In a table, compare the advantages and limitations of the two main types of sampling.

M.C.Q.

The MCQ consists of ten questions, at the end of which you will receive your assessment and the answers.

To access the MCQ, click on the following icon : quizizz

Course Sheets & TD

In this section, you will be able to download sheets related to the current course.

Sheet 1 Table of Greek Letters : This sheet contains all the Greek letters necessary to understand the language of data analysis. The table also includes a column for pronunciation and another for the usage agreed upon for each letter. Click HERE to download the table.

Sheet 2 TD Sheet : This second tutorial sheet reviews the main concepts of frequency and proportion calculations, following the first memo from the initial session. The TD sheet can be downloaded by clicking this Link.

Sheet 3 Table of Mathematical Symbols: This table contains the most commonly used mathematical symbols (in our course). During the tutorial session dedicated to this course, we will work with some of these symbols. This memo should be kept as it will be needed for the remainder of our teaching. Link.

Further information

To learn more about this second Block, you can consult the following material:

  • Book chapter
    Albarello, L., Bourgeois, É., Guyot, J. (2010). Statistique descriptive: Un outil pour les praticiens chercheurs. De Boeck Supérieur. [available free of charge by logging in to the university account].

  • Book chapter
    Guéguen, N. (2022). Statistique pour psychologues: Cours, QCM et exercices corrigés. Dunod. [available free of charge by logging in to the university account].

  • Book chapter
    Monino, J. (2017). TD de statistique descriptive. Dunod. [available free of charge by logging in to the university account].

  • Video
    A link to a YouTube channel whose episodes explain the basics of data analysis: GradCoach
  • Video
    Another link, to a video that discusses in very simple terms the main concepts related to sampling: Learn free

On the Course App

On the Course App, you will find a summary of this block, as well as the related series of tutorials.
There are also links to multimedia content relevant to the block.
An update is planned for the Notifications section, based on questions raised by students during lectures and tutorials.
There will also be an update of exams from previous sessions, which will be corrected in tutorial sessions in preparation for the current year's exams.

The Python Corner

In this very first Python corner, you will learn how to download and install the language.

During the directed work session dedicated to this course, you will become more familiar with what are called algorithms; designing an analysis is partly related to learning the internal logic of a program to be deployed.

To avoid cluttering the text, we present the installation procedures in the following accordion; at the end, you can test your setup by running your first program.

You can also work using the online compiler graciously provided to us by trinket whom we warmly thank.

Python is a high-level, interpreted, and general-purpose programming language, designed in part for Data Science. Python enables advanced work on data, including exploration, cleaning, and manipulation. Python is free, easy to learn, and has numerous well-developed libraries dedicated to data analysis. It is cross-platform: available on Windows, macOS, Linux, Raspberry Pi, and more.

Python was created by Guido van Rossum in 1991.


Photograph by Daniel Stroud, first retouched version uploaded by User:Deedub1983, second retouching by User:HarJIT., CC BY-SA 4.0, via Wikimedia Commons

The name Python doesn't come from the snake, but from the British comedy group Monty Python. Guido van Rossum is a fan of "Monty Python's Flying Circus" and chose this name to reflect the playful nature of the language.

To download Python, go to the official website: https://www.python.org/, on the homepage, click on Downloads to get the latest version of the language.


Figure II.2. Python.org homepage

Python Installation Summary

Once the program download is complete, click on Run (depending on your system).
Python will display a window summarizing all the necessary installation information:

Click on the Install Now button to launch the installation utility.


Figure II.3. Installation Summary Window

Once the installation is complete, go to the search field at the bottom right of the taskbar (Windows), enter the word: IDLE, then click on the OK button. Windows will return the installed version of Python (Python 3.12.4, in our case), click on it to access IDLE.

A new window will open: IDLE Shell, which displays information about the installed version of Python and other accessible information by typing certain commands. For example, type "copyright", then perform a mathematical operation and display text to ensure everything is functioning correctly. See the next window.


Figure II.4. Python IDLE

We have chosen to install PyCharm, one of the most popular Python IDEs.
To install PyCharm, visit the website of its publisher JetBrains at: https://www.jetbrains.com/, click on the Developer Tools menu, select PyCharm from the list, as shown in the following figure:


Figure II.5. JetBrains website homepage

Click on the Download button, then select the appropriate operating system, and finally choose the Community version which is free. Click once again on the Download button to start downloading the IDE.

Once the download is complete, run the program. The installation window will appear; take note of the program's installation folder if necessary:

Click on the Next button. A new window will appear: the installation options window, where it is advisable to check all the boxes. Then click Next; the rest of the installation process is quite simple, and nothing else will be required.

Once installed, run the program, which will display the PyCharm welcome window. Click on the Create button to create a new project.


Figure II.6. Welcome to PyCharm window

Creating a New Python Project

In the New Project window, make sure the project has a name (MyProject1, in our case), a directory where the new project files will be stored, and finally an interpreter (the version of Python we installed) to execute the lines of code we are going to write.

When creating this new project, we have the choice to leave the Create a main.py script checkbox as it is, or uncheck it and create a new file with the desired name after the project is created.


Figure II.7. Creating a New Python Project

Click on the Create button to launch our very first Python project.


Figure II.8. Configuring a New Python Project

Notice that our entire Python project is contained within the main.py file.

Once created, we will attempt to test our main.py file by writing a line of code:

def calculate_sum(numbers):
    return sum(numbers)  # the built-in sum() adds up the elements of the list

numbers = [1, 2, 3, 4, 5]
total = calculate_sum(numbers)
print("The sum of numbers is:", total)  # prints: The sum of numbers is: 15

To execute the entire code, press the shortcut Shift + F10, or click the triangle at the top of the editor window; alternatively, right-click in the editor and choose Run 'main'.

To undertake data analysis work using Python, it is necessary to use what are called, in the jargon, libraries (also known as packages).

A library is a collection of functions, classes, objects, etc., for working on a specific theme.
In our case, we will need, for example, the following libraries: Pandas, NumPy, Matplotlib, Seaborn, SciPy ....

By installing Python and its development environment, we have not yet installed the libraries needed for our work. We still need to download and install them before using them in our fieldwork.
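To give a first flavour of what these libraries offer, here is a minimal sketch using a tiny, made-up data set of four exam scores. It assumes Pandas and NumPy are already installed (for instance via the Anaconda platform described below):

```python
import numpy as np
import pandas as pd

# A tiny illustrative data set: exam scores for four students
scores = pd.DataFrame({"score": [12, 15, 9, 18]})

print(scores["score"].mean())  # arithmetic mean: prints 13.5
# Population standard deviation (ddof=0 divides by N rather than N-1)
print(np.sqrt(scores["score"].var(ddof=0)))
```

These two or three lines already perform the kind of descriptive statistics we will carry out throughout this course.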

One solution is to download and install ANACONDA, a platform that integrates Python libraries dedicated to data analysis, along with an integrated development environment, and much more. We will discover all of this in the following paragraphs.

Downloading and Installing ANACONDA

To download ANACONDA, go to the official platform website at: https://www.anaconda.com/. On the homepage, your operating system is detected automatically; click the Download button.


Figure II.9. ANACONDA Platform Homepage

Once the download is complete, launch the platform installation wizard.



Figure II.10. ANACONDA Installation Wizard

Introduction to the ANACONDA platform

Upon launching the ANACONDA platform, a welcome window appears, depicted in the following figure. It contains all the components of the platform, which we will describe in the following paragraphs:


Figure II.11. ANACONDA Welcome Window

The following carousel provides key information about each component of ANACONDA:

Python Coding Online

The following window, an Iframe from Trinket, allows you to test your Python code (for assistance, you can ask questions in the PanelBot at the top of this page). Trinket is an online platform that enables users to code, share, and embed interactive programs in their web browsers.

Course Download

By using the links below, you can download the course in PDF format:

Discussion Forum

The forum allows you to discuss this second session. You will notice a subscription button so you can follow discussions on research in humanities and social sciences. This is also an opportunity for the instructor to address students' concerns and questions.