Statistics Terms
In applied statistics, many terms are used to describe data and results. Below we will go through some of the terms that are used most frequently:
a) Population :
It includes all members or elements of a group that you are interested in studying. It can consist of people, objects, events, measurements, or any defined group.
Examples: i) All students in a university. ii) All households in a city.
b) Sample :
It’s a subset of the population that is selected for analysis. The sample should ideally represent the population accurately.
Examples: i) 500 students selected randomly from a university. ii) A survey of 1,000 households in a city.
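The relationship between a population and a sample can be sketched in code. This is a minimal illustration with made-up numbers: a hypothetical population of 20,000 student heights, from which 500 are drawn at random without replacement.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical population: heights (in cm) of all 20,000 students in a university
population = rng.normal(loc=170, scale=8, size=20_000)

# Sample: 500 students selected at random, without replacement
sample = rng.choice(population, size=500, replace=False)

print(f"Population size: {population.size}")  # 20000
print(f"Sample size:     {sample.size}")      # 500
```

Because the sample is drawn randomly, its summary values will be close to, but not exactly equal to, the population's.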
c) Parameter :
It’s a numerical value that describes a characteristic of an entire population. It is a fixed value, although it is often unknown because measuring the entire population is impractical.
Examples: i) The average height of all students in a university. ii) The proportion of voters in a country who support a particular candidate.
Notation:
- Population Mean: μ (mu)
- Population Variance: σ² (sigma squared)
- Population Standard Deviation: σ (sigma)
- Population Proportion: P
```python
import numpy as np
import scipy.stats as stats

# Sample data
data = np.random.normal(loc=100, scale=15, size=50)

# Calculate mean and standard error
sample_mean = np.mean(data)
std_error = stats.sem(data)

# Confidence interval (95%)
conf_interval = stats.t.interval(0.95, len(data) - 1, loc=sample_mean, scale=std_error)
print(f"95% Confidence Interval: {conf_interval}")
# Example output (varies with each random sample):
# 95% Confidence Interval: (95.6663181952571, 104.56751769318811)
```
d) Statistic :
It’s a numerical value that describes a characteristic of a sample. It is calculated directly from sample data and is used to estimate population parameters.
Examples: i) The average height of 100 students randomly selected from a university. ii) The proportion of 1,000 voters surveyed who support a particular candidate.
Notation:
- Sample Mean: x̄ (x-bar)
- Sample Variance: s²
- Sample Standard Deviation: s
- Sample Proportion: p̂ (p-hat)
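The notation above maps directly onto NumPy functions. This is a small sketch with a hypothetical sample of 100 student heights; note `ddof=1`, which gives the n−1 divisor used for sample (rather than population) variance.

```python
import numpy as np

rng = np.random.default_rng(0)
sample = rng.normal(loc=170, scale=8, size=100)  # hypothetical heights of 100 students

x_bar = np.mean(sample)                 # sample mean (x-bar)
s_squared = np.var(sample, ddof=1)      # sample variance (s²), n-1 divisor
s = np.std(sample, ddof=1)              # sample standard deviation (s)

# Sample proportion (p-hat): fraction of students taller than 175 cm
p_hat = np.mean(sample > 175)

print(f"mean = {x_bar:.2f}, var = {s_squared:.2f}, sd = {s:.2f}, proportion = {p_hat:.2f}")
```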
e) Sampling Error :
It’s a concept in statistics that refers to the difference between a sample statistic and the corresponding population parameter it is intended to estimate. It occurs because a sample is only a subset of the population, and thus it may not perfectly represent the entire population.
For Example:
- Population mean (μ) of student test scores is 75.
- A sample of 30 students has a mean (x̄) of 73.
- Sampling error = 73 − 75 = −2.
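The example above can be reproduced by simulation. This is a sketch with a hypothetical population of 10,000 test scores centered near 75; each random sample of 30 students yields a slightly different sampling error.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical population of test scores with mean near 75
population = rng.normal(loc=75, scale=10, size=10_000)
population_mean = population.mean()

# One random sample of 30 students
sample = rng.choice(population, size=30, replace=False)
sample_mean = sample.mean()

# Sampling error: sample statistic minus population parameter
sampling_error = sample_mean - population_mean
print(f"Population mean: {population_mean:.2f}")
print(f"Sample mean:     {sample_mean:.2f}")
print(f"Sampling error:  {sampling_error:.2f}")
```

Re-running with a different seed gives a different sampling error, which is exactly the point: the error is a property of the particular sample drawn.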
f) Confidence Intervals (CI) :
It’s a range of values, derived from sample data, that is likely to contain the true value of a population parameter (such as the mean or proportion) with a specified level of confidence. It provides a measure of uncertainty or precision about the sample estimate.
Below are the key terms used in Confidence Interval :
i) Point Estimate : The sample statistic used as the best estimate of the population parameter.
Examples: Sample mean (x̄), sample proportion (p̂).
ii) Confidence Level : The probability that the confidence interval contains the true population parameter.
Common confidence levels: 90%, 95%, 99%.
A 95% confidence level means that if we were to repeatedly draw random samples from the same population and construct a confidence interval from each sample, we would expect approximately 95% of those intervals to contain the true population parameter (e.g., the true mean or proportion).
iii) Margin of Error (MOE) : Represents the range of uncertainty around the point estimate. Depends on the variability in the data, sample size, and confidence level.
iv) Critical Value : A multiplier based on the confidence level and the sampling distribution.
Examples: 𝑧-scores for a normal distribution, or 𝑡-scores for smaller samples.
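Critical values and the margin of error can be computed directly from SciPy's distribution objects. This is a sketch using hypothetical numbers (a sample standard deviation of 8 and a sample size of 15); note that the 𝑡 critical value is larger than the 𝑧 value, reflecting the extra uncertainty of a small sample.

```python
import numpy as np
import scipy.stats as stats

confidence = 0.95
alpha = 1 - confidence

# z critical value (normal distribution, appropriate for large samples)
z_crit = stats.norm.ppf(1 - alpha / 2)      # ~1.960

# t critical value (smaller samples; here n = 15, so df = 14)
t_crit = stats.t.ppf(1 - alpha / 2, df=14)  # ~2.145

# Margin of error = critical value x standard error
s, n = 8.0, 15                  # hypothetical sample std dev and sample size
moe = t_crit * s / np.sqrt(n)

print(f"z* = {z_crit:.3f}, t* = {t_crit:.3f}, MOE = {moe:.3f}")
```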
```python
import numpy as np
import scipy.stats as stats

# Sample data
data = np.random.normal(loc=100, scale=15, size=50)

sample_mean = np.mean(data)     # point estimate
std_error = stats.sem(data)     # standard error of the mean
confidence = 0.95               # confidence level

# Confidence interval based on the t distribution
conf_interval = stats.t.interval(confidence, len(data) - 1, loc=sample_mean, scale=std_error)
print(f"95% Confidence Interval: {conf_interval}")
# Example output (varies with each random sample):
# 95% Confidence Interval: (94.50101347116151, 100.72483591745058)
```
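The repeated-sampling interpretation of the confidence level can be checked by simulation. This is a sketch with an assumed known population mean of 100: we draw many samples, build a 95% interval from each, and count how often the interval covers the true mean.

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(7)
true_mean = 100   # known population mean, for simulation purposes

covered = 0
n_trials = 1000
for _ in range(n_trials):
    data = rng.normal(loc=true_mean, scale=15, size=50)
    ci = stats.t.interval(0.95, len(data) - 1,
                          loc=np.mean(data), scale=stats.sem(data))
    if ci[0] <= true_mean <= ci[1]:
        covered += 1

print(f"{covered} of {n_trials} intervals contained the true mean")
# Expect roughly 950 of 1000
```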
g) Hypothesis Testing :
It’s a statistical method used to make decisions or inferences about a population parameter based on sample data. It involves testing an assumption (the hypothesis) about a population and determining whether there is enough evidence to support or reject that assumption.
Below are the key components of Hypothesis Testing :
i) Null Hypothesis (H0) : It represents the default or no-effect assumption. It is the hypothesis that is tested and assumed to be true unless there is strong evidence against it.
Example: H0 : 𝜇 = 50 (The population mean is 50).
ii) Alternative Hypothesis (Ha) : It represents what you want to prove or the effect you’re testing for.
Example: 𝐻a : 𝜇 > 50 (The population mean is greater than 50).
iii) Test Statistic : A value calculated from the sample data that is used to test the null hypothesis.
Common test statistics include 𝑧-scores, 𝑡-scores, and chi-square values.
iv) Significance Level (𝛼) : The threshold for rejecting the null hypothesis.
Commonly set at 𝛼 = 0.05, meaning there is a 5% risk of rejecting the null hypothesis (H0) when it is actually true.
v) P-value : The probability of obtaining a test statistic at least as extreme as the observed value, assuming the null hypothesis (H0) is true.
If P-value ≤ 𝛼, reject 𝐻0.
```python
import numpy as np
import scipy.stats as stats

# Sample data
data = np.random.normal(loc=100, scale=15, size=50)

# One-sample t-test: is the sample mean consistent with a population mean of 100?
t_stat, p_value = stats.ttest_1samp(data, 100)
print(f"T-statistic: {t_stat}, P-value: {p_value}")
# Example output (varies with each random sample):
# T-statistic: -1.5220470464769882, P-value: 0.1311851003166461

if p_value < 0.05:
    print("Reject the null hypothesis.")
else:
    print("Fail to reject the null hypothesis.")
# Example output: Fail to reject the null hypothesis.
```
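As a cross-check, the t-statistic that `ttest_1samp` reports can be computed by hand from the formula t = (x̄ − 𝜇₀) / (s / √n). This sketch uses a seeded random sample so the two values can be compared directly.

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(1)
data = rng.normal(loc=100, scale=15, size=50)
mu_0 = 100   # hypothesized population mean

# t = (sample mean - hypothesized mean) / (sample std dev / sqrt(n))
t_manual = (np.mean(data) - mu_0) / (np.std(data, ddof=1) / np.sqrt(len(data)))

t_scipy, p_value = stats.ttest_1samp(data, mu_0)
print(f"manual t = {t_manual:.6f}, scipy t = {t_scipy:.6f}")  # the two agree
```

Seeing that the manual formula matches the library result makes it clear what the "test statistic" in step (iii) actually measures: how many standard errors the sample mean lies from the hypothesized mean.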