Statistics Terms
In applied statistics, many terms are used to describe data and results. Below we will go through some of the terms that are used most frequently:
a) Population :
It includes all members or elements of a group that you are interested in studying. It can consist of people, objects, events, measurements, or any defined group.
Examples: i) All students in a university. ii) All households in a city.
b) Sample :
It’s a subset of the population that is selected for analysis. The sample should ideally represent the population accurately.
Examples: i) 500 students selected randomly from a university. ii) A survey of 1,000 households in a city.
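The relationship between a population and a sample can be sketched in code. This is a minimal illustration with made-up numbers: a hypothetical population of 20,000 student heights, from which 500 are drawn at random without replacement.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical population: heights (in cm) of all 20,000 students in a university
population = rng.normal(loc=170, scale=8, size=20_000)

# Sample: 500 students selected at random, without replacement
sample = rng.choice(population, size=500, replace=False)

print(f"Population size: {population.size}")  # 20000
print(f"Sample size:     {sample.size}")      # 500
```

Because the sample is drawn randomly, its summary values will be close to, but not exactly equal to, the population's.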
c) Parameter :
It’s a numerical value that describes a characteristic of an entire population. It is a fixed value, although it is often unknown because measuring the entire population is impractical.
Examples: i) The average height of all students in a university. ii) The proportion of voters in a country who support a particular candidate.
Notation:
- Population Mean: μ (mu)
- Population Variance: σ² (sigma squared)
- Population Standard Deviation: σ (sigma)
- Population Proportion: P
```python
import numpy as np
import scipy.stats as stats

# Sample data
data = np.random.normal(loc=100, scale=15, size=50)

# Calculate mean and standard error
sample_mean = np.mean(data)
std_error = stats.sem(data)

# Confidence interval (95%)
conf_interval = stats.t.interval(0.95, len(data) - 1, loc=sample_mean, scale=std_error)
print(f"95% Confidence Interval: {conf_interval}")
# Example output (varies with each random sample):
# 95% Confidence Interval: (95.6663181952571, 104.56751769318811)
```
d) Statistic :
It’s a numerical value that describes a characteristic of a sample. It is calculated directly from sample data and is used to estimate population parameters.
Examples: i) The average height of 100 students randomly selected from a university. ii) The proportion of 1,000 voters surveyed who support a particular candidate.
Notation:
- Sample Mean: x̄ (x-bar)
- Sample Variance: s²
- Sample Standard Deviation: s
- Sample Proportion: p̂ (p-hat)
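The notation above maps directly onto NumPy functions. This is a small sketch with a hypothetical sample of 100 student heights; note `ddof=1`, which gives the n−1 divisor used for sample (rather than population) variance.

```python
import numpy as np

rng = np.random.default_rng(0)
sample = rng.normal(loc=170, scale=8, size=100)  # hypothetical heights of 100 students

x_bar = np.mean(sample)                 # sample mean (x-bar)
s_squared = np.var(sample, ddof=1)      # sample variance (s²), n-1 divisor
s = np.std(sample, ddof=1)              # sample standard deviation (s)

# Sample proportion (p-hat): fraction of students taller than 175 cm
p_hat = np.mean(sample > 175)

print(f"mean = {x_bar:.2f}, var = {s_squared:.2f}, sd = {s:.2f}, proportion = {p_hat:.2f}")
```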
e) Sampling Error :
It’s a concept in statistics that refers to the difference between a sample statistic and the corresponding population parameter it is intended to estimate. It occurs because a sample is only a subset of the population, and thus it may not perfectly represent the entire population.
For Example:
- Population mean (μ) of student test scores is 75.
- A sample of 30 students has a mean (x̄) of 73.
- Sampling error = 73 − 75 = −2.
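The example above can be reproduced by simulation. This is a sketch with a hypothetical population of 10,000 test scores centered near 75; each random sample of 30 students yields a slightly different sampling error.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical population of test scores with mean near 75
population = rng.normal(loc=75, scale=10, size=10_000)
population_mean = population.mean()

# One random sample of 30 students
sample = rng.choice(population, size=30, replace=False)
sample_mean = sample.mean()

# Sampling error: sample statistic minus population parameter
sampling_error = sample_mean - population_mean
print(f"Population mean: {population_mean:.2f}")
print(f"Sample mean:     {sample_mean:.2f}")
print(f"Sampling error:  {sampling_error:.2f}")
```

Re-running with a different seed gives a different sampling error, which is exactly the point: the error is a property of the particular sample drawn.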
f) Confidence Intervals (CI) :
It’s a range of values, derived from sample data, that is likely to contain the true value of a population parameter (such as the mean or proportion) with a specified level of confidence. It provides a measure of uncertainty or precision about the sample estimate.
Below are the key terms used in Confidence Interval :
i) Point Estimate : The sample statistic used as the best estimate of the population parameter.
Examples: Sample mean (x̄), sample proportion (p̂).
ii) Confidence Level : The probability that the confidence interval contains the true population parameter.
Common confidence levels: 90%, 95%, 99%.
A 95% confidence level means that if we were to repeatedly draw random samples from the same population and construct a confidence interval from each sample, we would expect approximately 95% of those intervals to contain the true population parameter (e.g., the true mean or proportion).
iii) Margin of Error (MOE) : Represents the range of uncertainty around the point estimate. Depends on the variability in the data, sample size, and confidence level.
iv) Critical Value : A multiplier based on the confidence level and the sampling distribution.
Examples: 𝑧-scores for a normal distribution, or 𝑡-scores for smaller samples.
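Critical values and the margin of error can be computed directly from SciPy's distribution objects. This is a sketch using hypothetical numbers (a sample standard deviation of 8 and a sample size of 15); note that the 𝑡 critical value is larger than the 𝑧 value, reflecting the extra uncertainty of a small sample.

```python
import numpy as np
import scipy.stats as stats

confidence = 0.95
alpha = 1 - confidence

# z critical value (normal distribution, appropriate for large samples)
z_crit = stats.norm.ppf(1 - alpha / 2)      # ~1.960

# t critical value (smaller samples; here n = 15, so df = 14)
t_crit = stats.t.ppf(1 - alpha / 2, df=14)  # ~2.145

# Margin of error = critical value x standard error
s, n = 8.0, 15                  # hypothetical sample std dev and sample size
moe = t_crit * s / np.sqrt(n)

print(f"z* = {z_crit:.3f}, t* = {t_crit:.3f}, MOE = {moe:.3f}")
```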
```python
import numpy as np
import scipy.stats as stats

# Sample data
data = np.random.normal(loc=100, scale=15, size=50)

sample_mean = np.mean(data)     # point estimate
std_error = stats.sem(data)     # standard error of the mean
confidence = 0.95               # confidence level

# Confidence interval based on the t distribution
conf_interval = stats.t.interval(confidence, len(data) - 1, loc=sample_mean, scale=std_error)
print(f"95% Confidence Interval: {conf_interval}")
# Example output (varies with each random sample):
# 95% Confidence Interval: (94.50101347116151, 100.72483591745058)
```
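The repeated-sampling interpretation of the confidence level can be checked by simulation. This is a sketch with an assumed known population mean of 100: we draw many samples, build a 95% interval from each, and count how often the interval covers the true mean.

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(7)
true_mean = 100   # known population mean, for simulation purposes

covered = 0
n_trials = 1000
for _ in range(n_trials):
    data = rng.normal(loc=true_mean, scale=15, size=50)
    ci = stats.t.interval(0.95, len(data) - 1,
                          loc=np.mean(data), scale=stats.sem(data))
    if ci[0] <= true_mean <= ci[1]:
        covered += 1

print(f"{covered} of {n_trials} intervals contained the true mean")
# Expect roughly 950 of 1000
```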
g) Hypothesis Testing :
It’s a statistical method used to make decisions or inferences about a population parameter based on sample data. It involves testing an assumption (the hypothesis) about a population and determining whether there is enough evidence to support or reject that assumption.
Below are the key components of Hypothesis Testing :
i) Null Hypothesis (H0) : It represents the default or no-effect assumption. It is the hypothesis that is tested and assumed to be true unless there is strong evidence against it.
Example: H0 : 𝜇 = 50 (The population mean is 50).
ii) Alternative Hypothesis (Ha) : It represents what you want to prove or the effect you’re testing for.
Example: 𝐻a : 𝜇 > 50 (The population mean is greater than 50).
iii) Test Statistic : A value calculated from the sample data that is used to test the null hypothesis.
Common test statistics include 𝑧-scores, 𝑡-scores, and chi-square values.
iv) Significance Level (𝛼) : The threshold for rejecting the null hypothesis.
Commonly set at 𝛼 = 0.05, meaning there is a 5% risk of rejecting the null hypothesis (H0) when it is actually true.
v) P-value : The probability of obtaining a test statistic at least as extreme as the observed value, assuming the null hypothesis (H0) is true.
If P-value ≤ 𝛼, reject 𝐻0.
```python
import numpy as np
import scipy.stats as stats

# Sample data
data = np.random.normal(loc=100, scale=15, size=50)

# One-sample t-test: is the sample mean consistent with a population mean of 100?
t_stat, p_value = stats.ttest_1samp(data, 100)
print(f"T-statistic: {t_stat}, P-value: {p_value}")
# Example output (varies with each random sample):
# T-statistic: -1.5220470464769882, P-value: 0.1311851003166461

if p_value < 0.05:
    print("Reject the null hypothesis.")
else:
    print("Fail to reject the null hypothesis.")
# Example output: Fail to reject the null hypothesis.
```
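As a cross-check, the t-statistic that `ttest_1samp` reports can be computed by hand from the formula t = (x̄ − 𝜇₀) / (s / √n). This sketch uses a seeded random sample so the two values can be compared directly.

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(1)
data = rng.normal(loc=100, scale=15, size=50)
mu_0 = 100   # hypothesized population mean

# t = (sample mean - hypothesized mean) / (sample std dev / sqrt(n))
t_manual = (np.mean(data) - mu_0) / (np.std(data, ddof=1) / np.sqrt(len(data)))

t_scipy, p_value = stats.ttest_1samp(data, mu_0)
print(f"manual t = {t_manual:.6f}, scipy t = {t_scipy:.6f}")  # the two agree
```

Seeing that the manual formula matches the library result makes it clear what the "test statistic" in step (iii) actually measures: how many standard errors the sample mean lies from the hypothesized mean.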