PS4 – Mean, Median and Mode

Central tendency

A measure of central tendency is a single value that describes the center of a probability distribution. It reflects the tendency of the values of a random variable to cluster around a central point. The most common measures are the mean, the median, and the mode.

Mean

The mean, usually called the ‘average’, is the sum of all the data points divided by the number of data points. It is calculated using the following formulas:

$$\mu = \frac{\sum_{i=1}^{N} x_i}{N} \qquad \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$

Where:

  • $\mu$ is the mean for a population
  • $\bar{x}$ is the mean for a sample
  • N is the number of items in the data set (population)
  • n is the number of items in the data set (sample)
  • $\sum x_i$ is the sum of all the data points

Example

Given the data set [10, 12, 14, 16], calculate the mean.
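As a minimal sketch of the calculation, the snippet below sums the data points and divides by their count. The function name `mean` is only illustrative, and the four values are treated here as the complete data set.

```python
def mean(values):
    """Return the arithmetic mean: the sum of the values divided by their count."""
    if not values:
        raise ValueError("mean requires at least one data point")
    return sum(values) / len(values)


data = [10, 12, 14, 16]
# (10 + 12 + 14 + 16) / 4 = 52 / 4
print(mean(data))  # 13.0
```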

Median

The median of a data set is the value separating the higher half from the lower half. When we line up all the data points in the set from least to greatest, the median is the “middle” value: the number in the middle when the count of data points is odd, or the mean of the pair of numbers in the middle when the count is even.

Example 1

Given the data set [1, 3, 3, 6, 7, 8, 9], find the median.
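The procedure can be sketched in a few lines of Python: sort the values, then take the middle one (or the mean of the two middle ones when the count is even). The function name `median` is an illustrative choice, not something defined in the text.

```python
def median(values):
    """Return the middle value of the sorted data (mean of the two middle values if the count is even)."""
    if not values:
        raise ValueError("median requires at least one data point")
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


data = [1, 3, 3, 6, 7, 8, 9]
# Seven values, so the median is the 4th value in sorted order.
print(median(data))  # 6
```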




PS3 – Histograms and Stem-and-leaf plots

Histogram

A histogram, also called a frequency histogram, is a graphical representation of the distribution of a numerical data set. It is an estimate of the probability distribution of a continuous variable, which is why there are no gaps between the bars. A histogram differs from a bar graph in that a bar graph relates two variables, whereas a histogram relates only one.

It is preferable to use a histogram instead of a bar graph when you have too many data points to plot individually.

Example: you want to use census data to make a graph of the number of people of each age at a home party. Before creating the histogram, you might first group together 0 − 14 year-olds, 15 − 29 year-olds, 30 − 44 year-olds, etc. Each of these age intervals should have the same size or length.

Ages     1-5   6-9   10-13   14-17   18-21   22-25
Number   5     11    23      24      9       4

Table 3.1. The distribution of people’s ages at a home party.

Figure 1. The way the data (people’s ages) are spread out in the histogram is called the distribution.

Relative frequency histogram

We can transform table 3.1 into a relative frequency histogram by converting the counts into relative frequencies, that is, by dividing each count by the total number of data points (76 here). A relative frequency histogram is the same as a regular histogram, except that the values are displayed as a proportion or percentage of the total.

Ages        1-5      6-9      10-13    14-17    18-21    22-25
Frequency   0.0658   0.1447   0.3026   0.3158   0.1184   0.0526

Table 3.2. The relative frequency distribution of people’s ages at a home party.
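A short sketch of the conversion, using the counts from table 3.1: each count is divided by the total number of people. The variable names are illustrative.

```python
# Counts per age interval, taken from table 3.1.
counts = {"1-5": 5, "6-9": 11, "10-13": 23, "14-17": 24, "18-21": 9, "22-25": 4}

total = sum(counts.values())  # 76 people in total

# Relative frequency = count in the interval / total count.
relative = {ages: n / total for ages, n in counts.items()}

for ages, freq in relative.items():
    print(f"{ages:>6}  {freq:.4f}")
```

The relative frequencies always add up to 1 (or 100 %), which is a quick way to check the table.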

Stem-and-leaf plot

A stem-and-leaf display, or stem-and-leaf plot, is another way to present quantitative data in a graphical format. It is similar to a histogram in that both types of charts group data points together to help visualize the shape of a distribution. Both are very helpful in exploratory data analysis for seeing how many data points fall into a certain category or range.

Example: let’s say we have the finishing scores of golfers in a round of tournament golf: 66, 67, 67, 68, 68, 68, 68, 69, 69, 69, 69, 70, 70, 71, 71, 72, 73, 75, 101, 102, 111

Stem   Leaf
6      6, 7, 7, 8, 8, 8, 8, 9, 9, 9, 9
7      0, 0, 1, 1, 2, 3, 5
10     1, 2
11     1

Table 3.3. A stem plot of the scores. The “stems” are the numbers on the left, whereas the “leaves” are those on the right.

Key: 7|0 = 70
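One way to build such a plot programmatically is to split each score into its last digit (the leaf) and the remaining leading digits (the stem). The sketch below assumes non-negative whole-number scores; the function name `stem_and_leaf` is hypothetical.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Group values by stem (all digits but the last), using the last digit as the leaf."""
    plot = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)  # e.g. 101 -> stem 10, leaf 1
        plot[stem].append(leaf)
    return dict(plot)


scores = [66, 67, 67, 68, 68, 68, 68, 69, 69, 69, 69,
          70, 70, 71, 71, 72, 73, 75, 101, 102, 111]

for stem, leaves in stem_and_leaf(scores).items():
    print(f"{stem:>2} | {', '.join(str(leaf) for leaf in leaves)}")
```

Because the values are sorted first, the stems and the leaves within each stem come out in increasing order.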

 




PS2 – Data distributions

Joint distribution

A joint distribution is a data table (similar to a relative frequency table) that shows the distribution of one set of data against the distribution of another set of data, in percentages.

                       Weight lost (lbs)
Miles walked per day   0-2   2-4   4-6   6-8   +8
1-3                    4%    2%    21%   1%    1%
3-5                    12%   8%    6%    2%    8%
5-7                    1%    12%   1%    0%    10%
+7                     2%    3%    4%    1%    1%

Table 2.1. A data table for a group of 50 individuals, recording the average number of hours each participant spent walking each day over the course of the study and the total number of pounds each participant lost over that same period.

Table 2.1 is an example of a joint distribution. It shows that 4 % of the group, which would be 2 out of the 50 people studied, spent between 1 and 3 hours per day walking and lost between 0 and 2 pounds.

Marginal distribution

If we add totals to table 2.1 (by totalling up the data in each row and column), we get the following data table. The row totals and the column totals are the marginal distributions of the two variables:

                       Weight lost (lbs)
Miles walked per day   0-2   2-4   4-6   6-8   +8    Total
1-3                    4%    2%    21%   1%    1%    29%
3-5                    12%   8%    6%    2%    8%    36%
5-7                    1%    12%   1%    0%    10%   24%
+7                     2%    3%    4%    1%    1%    11%
Total                  19%   25%   32%   4%    20%   100%

Table 2.2. Data table with marginal distributions.
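The totals can be reproduced with a few lines of Python. This is only a sketch: the joint percentages are copied from table 2.1, and the dictionary layout and variable names are illustrative choices.

```python
# Joint distribution from table 2.1 (percentages); one row per walking interval.
weight_bins = ["0-2", "2-4", "4-6", "6-8", "+8"]
joint = {
    "1-3": [4, 2, 21, 1, 1],
    "3-5": [12, 8, 6, 2, 8],
    "5-7": [1, 12, 1, 0, 10],
    "+7": [2, 3, 4, 1, 1],
}

# Marginal distribution of walking: sum across each row.
row_totals = {walk: sum(cells) for walk, cells in joint.items()}

# Marginal distribution of weight lost: sum down each column.
column_totals = {
    weight: sum(cells[i] for cells in joint.values())
    for i, weight in enumerate(weight_bins)
}

print(row_totals)     # {'1-3': 29, '3-5': 36, '5-7': 24, '+7': 11}
print(column_totals)  # {'0-2': 19, '2-4': 25, '4-6': 32, '6-8': 4, '+8': 20}
```

Each marginal distribution sums to 100 %, since it accounts for every participant exactly once.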

Conditional distribution

A conditional distribution is the distribution of one variable given that the value of the other variable is already known.

                       Weight lost (lbs)
Miles walked per day   0-2   2-4   4-6   6-8   +8    Total
1-3                    44%   22%   21%   11%   2%    100%
3-5                    51%   19%   15%   14%   1%    100%
5-7                    61%   19%   10%   0%    10%   100%
+7                     22%   38%   14%   5%    21%   100%

Table 2.3. Data table with 4 different conditional distributions.

Table 2.3 shows that, of the people who spent 1 − 3 hours walking per day, 44 % lost 0 − 2 pounds, 22 % lost 2 − 4 pounds, 21 % lost 4 − 6 pounds, 11 % lost 6 − 8 pounds and only 2 % lost more than 8 pounds. This distribution is conditional on 1 − 3 walking hours.
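One common way to obtain a conditional distribution from a joint distribution is to divide each cell in a row by that row's total, so that every row sums to 100 %. The sketch below applies this to the joint percentages of table 2.1; the names are illustrative, and the figures it produces follow from table 2.1, so they need not match the illustrative percentages shown in table 2.3.

```python
# Joint distribution from table 2.1 (percentages); one row per walking interval.
weight_bins = ["0-2", "2-4", "4-6", "6-8", "+8"]
joint = {
    "1-3": [4, 2, 21, 1, 1],
    "3-5": [12, 8, 6, 2, 8],
    "5-7": [1, 12, 1, 0, 10],
    "+7": [2, 3, 4, 1, 1],
}

# Condition on the walking interval: divide each cell by its row total.
conditional = {
    walk: [100 * cell / sum(cells) for cell in cells]
    for walk, cells in joint.items()
}

for walk, percentages in conditional.items():
    row = "  ".join(f"{p:5.1f}%" for p in percentages)
    print(f"{walk:>4}  {row}")
```

Dividing by the column totals instead would give the distributions conditional on the weight lost.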

If we flip the two distributions, taking the miles walked per day distribution for each weight loss interval and calculating the percentages within each column, we get the following data table:

                       Weight lost (lbs)
Miles walked per day   0-2    2-4    4-6    6-8    +8
1-3                    40%    52%    21%    15%    11%
3-5                    12%    8%     16%    29%    8%
5-7                    10%    33%    3%     31%    21%
+7                     38%    7%     60%    25%    60%
Total                  100%   100%   100%   100%   100%

Table 2.4. Data table with 5 different conditional distributions.

 




PS1 – Data Visualization

This article is Chapter I from the author’s book Statistics and Probability Flashcards.

Definitions

Individuals and variables

In a dataset, the individuals are the items with one or more properties, called variables. Individuals can be events, cases, objects, people, etc.

Student (individuals) Height (variables)
John 190 cm
Ali 175 cm
Paul 165 cm
Clara 160 cm

Table 1.1. Example of a data set with items and variables.

Individuals and variables are called data. Table 1.1 is called a data table.

Here’s another example of a data table containing other variables:

Student Height Weight Likes football
John 190 cm 100 kg Yes
Ali 175 cm 90 kg No
Paul 165 cm 60 kg No
Clara 160 cm 63 kg Yes

Table 1.2. Example of a data set with items and more than 1 variable category.

Variables can be categorical or quantitative. In table 1.1 there is one quantitative variable (height), whereas in table 1.2 there are two quantitative variables (height and weight) and one categorical variable (likes football).

Quantitative variables are numerical variables: counts, percents, or numbers.

Categorical variables are non-numerical variables. Their values are represented with words rather than numbers.

The data sets presented in table 1.1 and table 1.2 are called one-way data because each has just a single kind of individual (item) with one or many properties attached to it.

How to build a data table?

When you build a data table, it is important to think about whether you have more individuals or more variables.

In tables 1.1 and 1.2 the number of individuals listed was greater than the number of variables. If we have many variables but only a few individuals, it is advisable to list the individuals across the top and the variables down the left side.

John Ali
Height 190 cm 175 cm
Weight 90 kg 75 kg
Likes football Yes No
Likes pizza Yes Yes

Table 1.3. Since there are more variables than individuals, listing the variables vertically makes the data table easier to read than listing all the variables horizontally.

Data visualization

Bar graphs and pie charts

Two of the simplest ways to summarize and graphically represent data are bar graphs and pie charts.

Bar graphs use a series of rectangular bars to show absolute values or proportions for each of the data categories. Pie charts show how much each data category represents as a part or proportion of the whole, using a circular format with different-sized “slices” for different percentages of the total.

Rank Country Oil production (bbl/day)
01 USA 15,043,000
02 Saudi Arabia (OPEC) 12,000,000
03 Russia 10,800,000
04 Iraq (OPEC) 4,451,516
05 Iran (OPEC) 3,990,956
06 China 3,980,650
07 Canada 3,662,694
08 United Arab Emirates (OPEC) 3,106,077
09 Kuwait (OPEC) 2,923,825
10 Brazil 2,515,459

Table 1.4. Top 10 world Oil producers (“Production of Crude Oil including Lease Condensate 2019” U.S. Energy Information Administration)

Figure 1. Bar chart – Top 10 world Oil producers (“Production of Crude Oil including Lease Condensate 2019” U.S. Energy Information Administration)

Notice that the oil producers (countries) are listed across the bottom of the bar graph, with the oil production (bbl/day) up the left side.

The countries are the individuals, and the count is a quantitative variable because it represents the numeric property of each of the individuals. The bar graph is one of the best ways to represent this data because it quickly gives an overview of which countries produce the most oil.

Figure 2. Pie chart – Top 10 world Oil producers (“Production of Crude Oil including Lease Condensate 2019” U.S. Energy Information Administration)

Now we can quickly see that the United States accounts for the largest share of the total daily oil production, more than any other country. Saudi Arabia occupies second place, and Brazil is the world’s 10th biggest oil producer.
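The size of each pie slice is simply that country's production divided by the total of all ten. The sketch below computes those shares from the figures in table 1.4; the variable names are illustrative, and the shares are relative to these ten producers only, not to world production.

```python
# Oil production in bbl/day, taken from table 1.4.
production = {
    "USA": 15_043_000,
    "Saudi Arabia": 12_000_000,
    "Russia": 10_800_000,
    "Iraq": 4_451_516,
    "Iran": 3_990_956,
    "China": 3_980_650,
    "Canada": 3_662_694,
    "United Arab Emirates": 3_106_077,
    "Kuwait": 2_923_825,
    "Brazil": 2_515_459,
}

total = sum(production.values())

# Share of the top-10 total represented by each country (one pie slice each).
shares = {country: barrels / total for country, barrels in production.items()}

for country, share in shares.items():
    print(f"{country:<22}{share:6.1%}")
```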

Venn diagrams

A Venn diagram is a diagram that shows all possible logical relations between a finite collection of different sets. It can be built from a two-way table.

Good Cheap Fast Total
Expensive 10 0 10 20
Low quality 0 10 10 20
Slow delivery 10 10 0 20
Best choice 10 10 10 30
Other 20 20 20 60
Total 50 50 50 150

Table 1.5. Two-way data table.

Figure 3. Venn diagram

Box-and-whisker plots

Box-and-whisker plots (also called box plots) are a method for graphically depicting groups of numerical data through their quartiles. They are very useful when you want to show the median and spread of the data (see chapter IV) at the same time.

Assuming that we have the following data set: [1, 2, 2, 2, 3, 3, 4, 6, 8, 8, 10, 11, 11, 16]:

Figure 4. Box-and-whisker chart

The horizontal line in the center of the box is the median of the data set, so the median of the data set represented in the chart above is 5.

The dot at the end of the bottom whisker is the minimum of the data set, and the dot at the end of the top whisker is the maximum of the data set. So in this plot, we can say that the minimum is 1 and the maximum is 16, so the range is 16 − 1 = 15.

The IQR (interquartile range) is given by the ends of the box. Since the box above extends from 2 to 10.25, the IQR is 10.25 − 2 = 8.25.

We can summarize the information given by the Box-and-whisker chart above in the following table:

Min Q1 Median Q3 Max
1 2 5 10.25 16
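The five values in the summary can be checked with Python's standard library. Quartile conventions differ between tools; this sketch assumes the “exclusive” method of `statistics.quantiles`, which happens to reproduce the quartiles quoted above for this data set. Other methods may give slightly different values for Q1 and Q3.

```python
import statistics

data = [1, 2, 2, 2, 3, 3, 4, 6, 8, 8, 10, 11, 11, 16]

# Cut the data into 4 equal groups; the three cut points are Q1, the median and Q3.
q1, med, q3 = statistics.quantiles(data, n=4, method="exclusive")

summary = {"min": min(data), "q1": q1, "median": med, "q3": q3, "max": max(data)}
iqr = q3 - q1                      # interquartile range: the length of the box
data_range = max(data) - min(data)  # distance between the ends of the whiskers

print(summary)  # {'min': 1, 'q1': 2.0, 'median': 5.0, 'q3': 10.25, 'max': 16}
print(iqr, data_range)  # 8.25 15
```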



List of Top foods rich in Antioxidants

The majority of living beings need oxygen to ensure their existence, yet oxygen can produce free radicals, also called reactive oxygen species (ROS), that are toxic to the integrity of the body’s cells. Organisms have a system of antioxidants and enzymes that work together to prevent damage to cell components such as DNA, lipids, and proteins.

Many studies have attempted to assess the impact of taking antioxidant dietary supplements on the prevention of different diseases.

The best-known antioxidants are β-carotene (provitamin A), ascorbic acid (vitamin C), tocopherol (vitamin E), polyphenols and lycopene. Polyphenols include flavonoids (widespread among plants), tannins (in cocoa, coffee, tea, grapes, etc.), anthocyanins (especially in red fruits) and phenolic acids (in cereals, fruits and vegetables).

Antioxidants in beverages

Antioxidant content (mmol/100 g) a) n Min Max
Apple juice 0.27 11 0.12 0.60
Black tea, prepared 1.0 5 0.75 1.21
Cocoa with milk 0.37 4 0.26 0.45
Coffee, prepared filter and boiled 2.5 31 1.24 4.20
Cranberry juice 0.92 5 0.75 1.01
Espresso, prepared 14.2 2 12.64 15.83
Grape juice 1.2 6 0.69 1.74
Green tea, prepared 1.5 17 0.57 2.62
Orange juice 0.64 16 0.47 0.81
Pomegranate juice 2.1 2 1.59 2.57
Prune juice 1.0 3 0.83 1.13
Red wine 2.5 27 1.78 3.66
Tomato juice 0.48 14 0.19 1.06

a) Mean value when n > 1

Antioxidants in nuts, legumes and grain products

Antioxidant content (mmol/100 g) a) n Min Max
Barley, pearl and flour 1.0 4 0.74 1.19
Beans 0.8 25 0.11 1.97
Bread, with fiber/whole meal 0.5 3 0.41 0.63
Buckwheat, white flour 1.4 2 1.08 1.73
Buckwheat, whole meal flour 2.0 2 1.83 2.24
Chestnuts, with pellicle 4.7 1
Crisp bread, brown 1.1 3 0.93 1.13
Maize, white flour 0.6 3 0.32 0.88
Millet 1.3 1
Peanuts, roasted, with pellicle 2.0 1
Pecans, with pellicle 8.5 7 6.32 10.62
Pistachios 1.7 7 0.78 4.98
Sunflower seeds 6.4 2 5.39 7.50
Walnuts, with pellicle 21.9 13 13.13 33.29
Wheat bread, toasted 0.6 3 0.52 0.59
Whole wheat bread, toasted 1.0 2 0.93 1.00

a) Mean value when n > 1

Antioxidants in spices and herbs

Antioxidant content (mmol/100 g) a) n Min Max
Allspice, dried ground 100.4 2 99.28 100.40
Basil, dried 19.9 5 9.86 30.86
Bay leaves, dried 27.8 2 24.29 31.29
Cinnamon sticks and whole bark 26.5 3 6.84 40.14
Cinnamon, dried ground 77.0 7 17.65 139.89
Clove, dried, whole and ground 277.3 6 175.31 465.32
Dill, dried ground 20.2 3 15.94 24.47
Estragon, dried ground 43.8 3 43.22 44.75
Ginger, dried 20.3 5 11.31 24.37
Mint leaves, dried 116.4 2 71.95 160.82
Nutmeg, dried ground 26.4 5 15.83 43.52
Oregano, dried ground 63.2 9 40.30 96.64
Rosemary, dried ground 44.8 5 24.34 66.92
Saffron, dried ground 44.5 3 23.83 61.72
Saffron, dried whole stigma 17.5 3 7.02 24.83
Sage, dried ground 44.3 3 34.88 58.80
Thyme, dried ground 56.3 3 42.00 63.75

a) Mean value when n > 1

Source: Carlsen MH, Halvorsen BL, Holte K, et al. The total antioxidant content of more than 3100 foods, beverages, spices, herbs and supplements used worldwide. Nutrition Journal. 2010;9:3. doi:10.1186/1475-2891-9-3.