- Statistics is the branch of mathematics that deals with the collection, presentation, analysis and interpretation of numerical data.
- Data can be primary (collected firsthand by the investigator) or secondary (obtained from existing sources).
- Raw data is organised using frequency distribution tables with tally marks, class intervals and class marks.
- Graphical representations include bar graphs (discrete/categorical data, gaps between bars) and histograms (continuous data, NO gaps between bars).
- A frequency polygon is formed by joining the mid-points of the tops of histogram bars; it can also be drawn directly without a histogram.
- Three measures of central tendency: Mean (arithmetic average), Median (middle value when sorted), Mode (most frequent value).
- Board weightage: ~6–8 marks/year — typically one graphical question (2–3 marks) and one central-tendency calculation (3–4 marks).
1. Primary vs Secondary Data
Data is any collection of facts, numbers, measurements or other information gathered for a purpose.
- Primary data: Information collected first-hand by the investigator themselves — through surveys, direct measurements, experiments. The collector knows the context, source and accuracy exactly. Example: a teacher recording attendance each day.
- Secondary data: Information taken from existing sources not originally created by the investigator. Examples: census reports, newspaper statistics, government publications, internet databases, published research.
Raw data is data exactly as it was collected: unsorted and unorganised, often a jumble of numbers. The first step in statistics is always to organise this raw data.
Arrayed data: When raw data is arranged in ascending or descending order it becomes an array. Arranging the data this way is required before finding the median and makes it easier to build a frequency table.
Range: The difference between the largest and smallest values in the data. It gives a rough idea of how spread out the data is.
The blood groups of 30 students are recorded: A, B, O, O, AB, O, A, O, B, A, O, B, A, O, O, A, AB, O, A, A, O, O, AB, B, A, O, B, A, B, O. This list is raw primary data. A frequency table immediately reveals that O is the most common blood group (12 students) and AB is the rarest (3 students).
2. Frequency Distribution — Tally Marks, Class Intervals, Class Width, Class Mark
A frequency distribution table lists each value (or range of values) with the number of times it occurs. This compresses raw data into a clear summary.
Tally marks
While building the table, we go through the data one value at a time. Instead of writing numerals, we draw tally marks: four vertical strokes, with the fifth drawn diagonally across them to close the bundle (like a "gate"). Each complete bundle represents 5, so counting in bundles is quick and less error-prone.
Ungrouped frequency distribution
Lists every distinct individual value with its frequency. Best when the range of values is small (e.g., blood groups A, B, O, AB).
Grouped frequency distribution
Groups data into class intervals. Used when the range is large (e.g., marks 0–100). Each class interval has a lower class limit and upper class limit.
| Term | Definition | Example (class 20–30) |
|---|---|---|
| Lower class limit | Smaller boundary of the class | 20 |
| Upper class limit | Larger boundary of the class | 30 |
| Class width (class size) | Upper limit minus lower limit | 30 – 20 = 10 |
| Class mark (mid-point) | (Lower + Upper) divided by 2 | (20 + 30) / 2 = 25 |
| Frequency | Number of observations in the class | e.g., 8 students scored 20–30 |
Exclusive (continuous) form vs inclusive form
Exclusive form: 10–20, 20–30, ... Each class includes its lower limit and excludes its upper limit, so a value of exactly 20 goes into 20–30, not 10–20. Convenient for continuous data.
Inclusive form: 1–10, 11–20, ... means both limits are included. To draw a histogram from inclusive data, first make the classes continuous: subtract half the gap between consecutive classes from each lower limit and add it to each upper limit. Here the gap is 1, so subtract 0.5 and add 0.5, giving 0.5–10.5, 10.5–20.5, ...
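The two conventions can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed method: the helper names `to_boundaries` and `count_in_classes` and the list of marks are hypothetical, chosen only to show where a boundary value such as 20 lands.

```python
def to_boundaries(inclusive_classes):
    """Turn inclusive classes like (1, 10), (11, 20) into continuous boundaries."""
    # Gap between one class's upper limit and the next class's lower limit.
    gap = inclusive_classes[1][0] - inclusive_classes[0][1]
    half = gap / 2
    return [(lo - half, hi + half) for lo, hi in inclusive_classes]


def count_in_classes(values, classes):
    """Count values per exclusive class: lower limit included, upper excluded."""
    counts = [0] * len(classes)
    for v in values:
        for i, (lo, hi) in enumerate(classes):
            if lo <= v < hi:
                counts[i] += 1
                break
    return counts


boundaries = to_boundaries([(1, 10), (11, 20), (21, 30)])

# Hypothetical marks that include the boundary value 20 twice.
marks = [12, 20, 20, 25, 10, 19, 29]
frequencies = count_in_classes(marks, [(10, 20), (20, 30)])

print(boundaries)   # continuous class boundaries
print(frequencies)  # both 20s are counted in the class 20-30
```

Because the test is `lo <= v < hi`, a value equal to an upper limit is never counted in the lower class, which is exactly the exclusive-form rule.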
Blood groups of 30 students: A, B, O, O, AB, O, A, O, B, A, O, B, A, O, O, A, AB, O, A, A, O, O, AB, B, A, O, B, A, B, O.
(In the tally column, a struck-through group stands for a crossed bundle of five.)
| Blood Group | Tally | Frequency |
|---|---|---|
| A | ~~\|\|\|\|~~ \|\|\|\| | 9 |
| B | ~~\|\|\|\|~~ \| | 6 |
| O | ~~\|\|\|\|~~ ~~\|\|\|\|~~ \|\| | 12 |
| AB | \|\|\| | 3 |
| Total | | 30 |
Conclusion: O is most common; AB is rarest.
The daily income (in Rs) of 50 workers: 100–120 (12 workers), 120–140 (14), 140–160 (8), 160–180 (6), 180–200 (10). Class width = 20. Class marks: 110, 130, 150, 170, 190.
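As a quick check of the class widths and class marks quoted above, here is a small Python sketch. The variable names are illustrative; the intervals and frequencies are the ones from the daily-income example.

```python
# Class intervals (exclusive form) and frequencies from the daily-income example.
classes = [(100, 120), (120, 140), (140, 160), (160, 180), (180, 200)]
frequencies = [12, 14, 8, 6, 10]

# Class width = upper limit - lower limit.
class_widths = [hi - lo for lo, hi in classes]

# Class mark = (lower limit + upper limit) / 2.
class_marks = [(lo + hi) / 2 for lo, hi in classes]

# The frequencies should add up to the number of workers.
total_workers = sum(frequencies)

print(class_widths)
print(class_marks)
print(total_workers)
```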
3. Bar Graphs vs Histograms
Both use rectangular bars and look similar at first glance, but represent fundamentally different data types with one critical visual difference.
| Feature | Bar Graph | Histogram |
|---|---|---|
| Data type | Discrete or categorical | Continuous (grouped into intervals) |
| Gaps between bars | Yes — bars do NOT touch | No — bars are adjacent, no gaps |
| Width of bars | Uniform, chosen for appearance | Equal to class width; must reflect the interval |
| X-axis labels | Categories (subjects, months, cities) | Class intervals on a continuous numerical scale |
| Y-axis | Frequency or count | Frequency (or frequency density when class widths differ) |
| Area meaning | Not meaningful | Area of each bar is proportional to its frequency; with equal class widths, height alone shows frequency |
Why no gaps in a histogram? Class intervals are continuous — 10–20 ends exactly where 20–30 begins. There is no gap in the data scale, so there must be no gap between the bars.
| Daily income (Rs) | Number of workers |
|---|---|
| 100–120 | 12 |
| 120–140 | 14 |
| 140–160 | 8 |
| 160–180 | 6 |
| 180–200 | 10 |
Steps: (1) Draw a continuous x-axis from 100 to 200. (2) Mark the y-axis as frequency. (3) Draw adjacent bars of heights 12, 14, 8, 6, 10, each bar touching the next. (4) The class 120–140 has the tallest bar, showing that more workers earn in this range than in any other.
Note: If the class intervals were unequal (e.g., 100–120, 120–150), we would use frequency density (frequency divided by class width) on the y-axis so the area still represents frequency.
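To see why frequency density keeps the areas honest, consider this short Python sketch. The three unequal classes and their frequencies are made up for illustration and do not come from the income table.

```python
# Hypothetical unequal classes and frequencies, for illustration only.
classes = [(100, 120), (120, 150), (150, 200)]
frequencies = [12, 18, 20]

widths = [hi - lo for lo, hi in classes]

# Frequency density = frequency / class width; this becomes the bar height.
densities = [f / w for f, w in zip(frequencies, widths)]

# Bar area = height * width, which recovers the frequency of each class.
areas = [d * w for d, w in zip(densities, widths)]

for (lo, hi), d in zip(classes, densities):
    print(f"{lo}-{hi}: density {d:.2f}")
```

In this made-up data the first two bars end up the same height even though their frequencies differ (12 and 18), because the second class is one and a half times as wide.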
A histogram shows the ages of 360 patients admitted over a year, with classes 10–20, 20–30, 30–40, 40–50, 50–60, 60–70 and frequencies 90, 50, 60, 80, 50, 30 respectively. The 10–20 age group has the most admissions (90 patients). Total = 360.
4. Frequency Polygon
A frequency polygon is a line graph that conveys the same information as a histogram but uses connected points instead of bars. Its main advantage: two frequency polygons can be drawn on the same axes to compare two distributions.
Method 1 — From a histogram
- Mark the class mark (mid-point) at the top centre of each bar.
- Join consecutive mid-points with straight line segments.
- Extend the line to the mid-point of a "phantom" class with zero frequency before the first class and after the last class, touching the x-axis. This closes the polygon.
Method 2 — Direct (without histogram)
- Compute the class mark of each class interval.
- Plot the points (class mark, frequency) for each class.
- Also plot (mid-point of phantom class before first, 0) and (mid-point of phantom class after last, 0).
- Join all points in order with straight line segments.
Marks (out of 100) of 51 students in a test:
| Marks | Frequency | Class mark |
|---|---|---|
| 0–10 | 5 | 5 |
| 10–20 | 10 | 15 |
| 20–30 | 4 | 25 |
| 30–40 | 6 | 35 |
| 40–50 | 7 | 45 |
| 50–60 | 3 | 55 |
| 60–70 | 2 | 65 |
| 70–80 | 2 | 75 |
| 80–90 | 3 | 85 |
| 90–100 | 9 | 95 |
Plot (5,5), (15,10), (25,4), (35,6), (45,7), (55,3), (65,2), (75,2), (85,3), (95,9). Add anchor points (-5, 0) and (105, 0). Join all with straight line segments to form a closed polygon.
The polygon shows the highest peak in the 10–20 range and a secondary peak at 90–100.
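The direct method can also be written as a short Python sketch that builds the list of vertices, anchor points included. The function name `polygon_points` is hypothetical, and the sketch assumes all classes have the same width.

```python
def polygon_points(classes, frequencies):
    """Return frequency-polygon vertices, with a zero-frequency anchor at each end."""
    width = classes[0][1] - classes[0][0]  # assumes equal class widths
    marks = [(lo + hi) / 2 for lo, hi in classes]
    # Class marks of the phantom classes before the first and after the last class.
    first_anchor = (marks[0] - width, 0)
    last_anchor = (marks[-1] + width, 0)
    return [first_anchor] + list(zip(marks, frequencies)) + [last_anchor]


# Classes 0-10, 10-20, ..., 90-100 and the frequencies from the marks table.
classes = [(lo, lo + 10) for lo in range(0, 100, 10)]
frequencies = [5, 10, 4, 6, 7, 3, 2, 2, 3, 9]

points = polygon_points(classes, frequencies)
for x, y in points:
    print(x, y)
```

Joining these points in order with straight segments gives the closed polygon; the first and last points sit on the x-axis.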
5. Mean for Ungrouped Data
The arithmetic mean (usually called "the mean" or "average") is found by summing all observations and dividing by the total count.
$\bar{x} = \dfrac{x_1 + x_2 + \dots + x_n}{n} = \dfrac{\sum_{i=1}^{n} x_i}{n}$
Here $x_1, x_2, \dots, x_n$ are the $n$ observations and $\bar{x}$ (read "x-bar") is their mean.
The marks (out of 100) obtained by 5 students in a mathematics test are: 55, 60, 48, 72, 65. Find the mean.
Sum $= 55 + 60 + 48 + 72 + 65 = 300.$ Number of students $n = 5.$
$\bar{x} = \dfrac{300}{5} = \mathbf{60}.$
Interpretation: On average a student scored 60 marks. Here one student happened to score exactly 60, but in general the mean need not equal any observation in the data.
The mean of $6, 4, 7, p$ and $10$ is $8$. Find $p$.
$\dfrac{6 + 4 + 7 + p + 10}{5} = 8$
$27 + p = 40 \Rightarrow p = \mathbf{13}.$
Runs scored by Sachin Tendulkar in 10 innings: 52, 15, 11, 65, 0, 99, 8, 70, 29, 51. Find his mean score.
Sum $= 52 + 15 + 11 + 65 + 0 + 99 + 8 + 70 + 29 + 51 = 400.$
$\bar{x} = \dfrac{400}{10} = \mathbf{40}$ runs per innings.
Key properties of the mean:
- Unique — every dataset has exactly one mean.
- Uses every observation — the most "information-rich" measure.
- Sensitive to extreme values (outliers). A single very large value can pull the mean far above the "typical" value.
- The sum of deviations from the mean is always zero: $\displaystyle\sum_{i=1}^{n}(x_i - \bar{x}) = 0.$
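Two of these properties are easy to check numerically. The sketch below reuses the ten innings scores from the example above; the extra score of 400 is a made-up outlier added only to show how strongly one extreme value moves the mean.

```python
# Runs from the ten-innings example.
runs = [52, 15, 11, 65, 0, 99, 8, 70, 29, 51]

mean = sum(runs) / len(runs)

# Deviations from the mean always add up to zero.
deviations = [x - mean for x in runs]
total_deviation = sum(deviations)

# Hypothetical eleventh innings with an extreme score, to show outlier sensitivity.
with_outlier = runs + [400]
mean_with_outlier = sum(with_outlier) / len(with_outlier)

print(mean)
print(total_deviation)
print(round(mean_with_outlier, 1))
```

A single extra value raises the mean from 40 to above 70, even though ten of the eleven scores are unchanged.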
6. Mean for Grouped Data — Direct Method and Assumed Mean Method
When data is presented in a grouped frequency distribution, individual values are not known. We use the class mark as the representative value for all observations in that class.
Direct Method
$\bar{x} = \dfrac{\sum f_i x_i}{\sum f_i}$
where $x_i$ is the class mark of the $i$-th class and $f_i$ is its frequency. Multiply each class mark by its frequency, add up all products, then divide by the total frequency.
Assumed Mean Method (Shortcut)
When class marks are large numbers, the direct multiplications $f_i x_i$ become tedious. Choose an assumed mean $a$ (typically the class mark of the middle class or the class with the highest frequency). Calculate deviations $d_i = x_i - a$ for each class. Then:
$\bar{x} = a + \dfrac{\sum f_i d_i}{\sum f_i}$
This gives the same result with simpler arithmetic: the deviations $d_i$ are smaller numbers than the class marks, and the positive and negative products $f_i d_i$ partly cancel.
| Daily wages (Rs) | Frequency $f_i$ | Class mark $x_i$ | $f_i x_i$ |
|---|---|---|---|
| 100–120 | 12 | 110 | 1320 |
| 120–140 | 14 | 130 | 1820 |
| 140–160 | 8 | 150 | 1200 |
| 160–180 | 6 | 170 | 1020 |
| 180–200 | 10 | 190 | 1900 |
| Total | 50 | | 7260 |
$\bar{x} = \dfrac{7260}{50} = \mathbf{Rs\;145.20}$
Let assumed mean $a = 150$ (class mark of the middle class).
| Class interval | $f_i$ | $x_i$ | $d_i = x_i - 150$ | $f_i d_i$ |
|---|---|---|---|---|
| 100–120 | 12 | 110 | -40 | -480 |
| 120–140 | 14 | 130 | -20 | -280 |
| 140–160 | 8 | 150 | 0 | 0 |
| 160–180 | 6 | 170 | +20 | +120 |
| 180–200 | 10 | 190 | +40 | +400 |
| Total | 50 | | | -240 |
$\bar{x} = 150 + \dfrac{-240}{50} = 150 - 4.8 = \mathbf{Rs\;145.20}$ — same answer, far less arithmetic.
Tip: The choice of assumed mean $a$ does not affect the final answer. Choose it to make $d_i$ values as small as possible.
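Both methods, and the claim that the choice of $a$ does not matter, can be checked with a small Python sketch. The function names are hypothetical; the class marks and frequencies are those of the daily-wages table.

```python
def grouped_mean_direct(class_marks, frequencies):
    """Direct method: sum of f_i * x_i divided by sum of f_i."""
    total_fx = sum(f * x for f, x in zip(frequencies, class_marks))
    return total_fx / sum(frequencies)


def grouped_mean_assumed(class_marks, frequencies, a):
    """Assumed mean method: a + (sum of f_i * d_i) / (sum of f_i), with d_i = x_i - a."""
    total_fd = sum(f * (x - a) for f, x in zip(frequencies, class_marks))
    return a + total_fd / sum(frequencies)


class_marks = [110, 130, 150, 170, 190]
frequencies = [12, 14, 8, 6, 10]

direct = grouped_mean_direct(class_marks, frequencies)
assumed_150 = grouped_mean_assumed(class_marks, frequencies, a=150)
assumed_130 = grouped_mean_assumed(class_marks, frequencies, a=130)

print(round(direct, 2), round(assumed_150, 2), round(assumed_130, 2))
```

All three values agree (up to floating-point rounding), whichever class mark is taken as the assumed mean.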
7. Median for Ungrouped Data
The median is the value that sits in the exact middle of the data when arranged in ascending order. It divides the distribution into two equal halves — half the values lie below and half above.
Formula
If $n$ is odd: $\text{Median} = \left(\dfrac{n+1}{2}\right)\text{-th observation}$
If $n$ is even: $\text{Median} = \dfrac{\left(\dfrac{n}{2}\right)\text{-th observation} + \left(\dfrac{n}{2}+1\right)\text{-th observation}}{2}$
The heights (in cm) of 9 students: 162, 155, 160, 148, 152, 170, 165, 158, 175.
Arrange in ascending order: 148, 152, 155, 158, 160, 162, 165, 170, 175.
$n = 9$ (odd). Median $= \left(\dfrac{9+1}{2}\right)\text{-th} = 5\text{-th value} = \mathbf{160}$ cm.
The weights (in kg) of 10 students: 55, 60, 65, 45, 50, 70, 45, 65, 55, 70.
Sorted: 45, 45, 50, 55, 55, 60, 65, 65, 70, 70.
$n = 10$ (even). 5th value $= 55$, 6th value $= 60$.
Median $= \dfrac{55 + 60}{2} = \dfrac{115}{2} = \mathbf{57.5}$ kg.
Why median beats mean when outliers exist: Consider salaries of 5 employees: Rs 10,000; 12,000; 11,000; 13,000; 1,00,000.
Mean $= \dfrac{1{,}46{,}000}{5} = \text{Rs }29{,}200$, which misrepresents 4 of the 5 employees.
Median (3rd value of sorted data) $= \text{Rs }12{,}000$, far more representative of a "typical" salary.
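The odd and even cases of the position formula can be sketched as one Python function. The name `median` here is our own helper, written out to mirror the formula; Python's standard `statistics.median` can serve as a cross-check.

```python
def median(values):
    """Median of a list of numbers: sort first, then pick the middle."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd n: the ((n + 1) / 2)-th observation, which is index n // 2 (0-based).
        return ordered[mid]
    # Even n: average of the (n / 2)-th and (n / 2 + 1)-th observations.
    return (ordered[mid - 1] + ordered[mid]) / 2


heights = [162, 155, 160, 148, 152, 170, 165, 158, 175]
weights = [55, 60, 65, 45, 50, 70, 45, 65, 55, 70]
salaries = [10000, 12000, 11000, 13000, 100000]

print(median(heights))                 # odd n
print(median(weights))                 # even n
print(median(salaries))                # barely affected by the outlier
print(sum(salaries) / len(salaries))   # the mean is pulled up by the outlier
```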
8. Mode for Ungrouped Data
The mode is the observation that appears most frequently in the dataset — the value with the highest frequency.
- Unimodal: One value appears most often (most common case).
- Bimodal: Two values tie for most frequent.
- Multimodal: More than two values tie for most frequent.
- No mode: All values appear equally often.
Marks of 15 students: 14, 25, 14, 28, 18, 17, 18, 14, 23, 22, 14, 18, 14, 13, 14.
Count: 14 appears 6 times, 18 appears 3 times, 25, 28, 17, 23, 22, 13 each once.
Mode $= \mathbf{14}.$
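Counting frequencies by hand is exactly what `collections.Counter` does. The sketch below uses a hypothetical helper `modes` that returns every value tied for the highest frequency, and an empty list when all values occur equally often; it assumes a non-empty list.

```python
from collections import Counter


def modes(values):
    """Return all most-frequent values, or [] if every value is equally frequent."""
    counts = Counter(values)
    highest = max(counts.values())
    if len(counts) > 1 and all(c == highest for c in counts.values()):
        return []  # no mode: all distinct values appear equally often
    return sorted(v for v, c in counts.items() if c == highest)


marks = [14, 25, 14, 28, 18, 17, 18, 14, 23, 22, 14, 18, 14, 13, 14]

print(Counter(marks)[14], Counter(marks)[18])  # frequencies of 14 and 18
print(modes(marks))                            # unimodal
print(modes([1, 1, 2, 2, 3]))                  # bimodal (made-up data)
print(modes([1, 2, 3]))                        # no mode (made-up data)
```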
A shoe manufacturer records daily sales. Size 7 sells 60 pairs/day; size 6 sells 40; size 8 sells 35; size 5 sells 20. The modal shoe size is 7. The manufacturer should produce most pairs of size 7.
The mean shoe size works out to about 6.7 ($1040 \div 155$), which is not a real shoe size and tells the manufacturer nothing actionable. Mode is the right measure for this manufacturing decision.
Mode for grouped data (Class 9 introduction): The class interval with the highest frequency is the modal class. The exact formula to find the precise mode within a class is studied in Class 10.
9. Choosing Which Measure to Use
Mean, median and mode each answer a subtly different question about "what is the central value?" Choosing the wrong one gives a misleading picture.
| Situation | Best measure | Reason |
|---|---|---|
| Exam scores, temperatures, no extreme outliers | Mean | Uses every value; most mathematically precise |
| Incomes, house prices, data with extreme outliers | Median | Not pulled by very high or very low values |
| Shoe size, shirt size, dress size (manufacturing) | Mode | Tells which size is most popular to produce/stock |
| Qualitative / categorical data (blood groups, colours) | Mode | Mean and median are undefined for non-numeric categories |
| Open-ended distributions ("Rs 10,000 and above") | Median | The open-ended class has no class mark, so the mean cannot be computed; the median can still be located |
Empirical relationship (for moderately skewed distributions):
$\text{Mode} \approx 3 \times \text{Median} - 2 \times \text{Mean}$
This is an approximate empirical formula, not a definition. It is useful in CBSE problems when one of the three measures is unknown and the other two are given.
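A worked illustration with made-up values (not taken from any exercise in this chapter): suppose a distribution has Mean $= 26.8$ and Median $= 27.9$. Taking the relation as Mode $\approx 3 \times \text{Median} - 2 \times \text{Mean}$:
$\text{Mode} \approx 3(27.9) - 2(26.8) = 83.7 - 53.6 = \mathbf{30.1}.$
Rearranging the same relation gives the median when the other two are known: $\text{Median} \approx \dfrac{\text{Mode} + 2 \times \text{Mean}}{3} = \dfrac{30.1 + 53.6}{3} = 27.9.$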
Summary comparison
- Mean: Unique; uses all values; affected by outliers; need not equal any actual observation.
- Median: Unique; unaffected by outliers; equals an actual observation (odd $n$) or average of two (even $n$).
- Mode: May not be unique; unaffected by extreme values; always an actual observation in the dataset.
10. Common Mistakes to Avoid
- Forgetting to sort data before finding the median — the position formula only works on ordered data.
- Leaving gaps between histogram bars — histograms show continuous data; bars must be adjacent with no spaces.
- Using class limits instead of class marks in the mean formula for grouped data — always compute $x_i = \dfrac{\text{lower} + \text{upper}}{2}$ first.
- Wrong median position for even n — for $n = 10$ the median is the average of the 5th and 6th values, NOT just the 5th.
- Not closing the frequency polygon — always bring the line to the x-axis using zero-frequency anchor points at both ends.
- Computing the mean for qualitative (categorical) data — never take the mean of blood groups, colours, or other non-numeric categories; use mode.
- Confusing bar graph and histogram — bar graphs have gaps (discrete categories), histograms have no gaps (continuous intervals).
11. Quick Revision Checklist
- Primary data = collected by investigator; Secondary data = from existing sources.
- Class mark $= \dfrac{\text{lower} + \text{upper}}{2}$; Class width $= \text{upper} - \text{lower}$.
- Histogram: no gaps, continuous data, bars touch each other.
- Bar graph: gaps between bars, discrete or categorical data.
- Frequency polygon: join class marks at heights = frequencies; close with zero-frequency anchor points.
- Mean (ungrouped) $= \dfrac{\sum x_i}{n}$; Mean (grouped) $= \dfrac{\sum f_i x_i}{\sum f_i}$.
- Assumed mean shortcut: $\bar{x} = a + \dfrac{\sum f_i d_i}{\sum f_i}$ where $d_i = x_i - a$.
- Median: sort first; odd $n$: middle value; even $n$: average of two middle values.
- Mode: most frequent value; use for categorical data or "most popular size" questions.
- Empirical relation: Mode $\approx 3 \times \text{Median} - 2 \times \text{Mean}$.
Sorted ($n=11$): $6, 8, 10, 10, 15, 15, 15, 50, 80, 100, 120.$ Median $= 6$-th value $= \mathbf{15}.$
Mode $= \mathbf{15}$ (appears 3 times).
Note: Mean (39) is much higher than median/mode because of outliers 100 and 120. Median better represents the typical score here.