🐍Python Programming

Machine Learning Basics (Stats & Data Distribution)

Updated 2026-05-15

30 min read

Understanding the statistical properties of your data is a prerequisite for most machine learning work, because it informs which algorithms suit the data and how to preprocess it. This tutorial covers the mean, median, mode, standard deviation, percentiles, the normal distribution, and scatter plots, and shows how to analyze data distributions using Python.

Introduction

Statistics provides the tools to understand and interpret data. Whether you're analyzing stock market trends or predicting customer behavior, a solid grasp of statistical measures is essential. In this tutorial, we'll work through fundamental statistics concepts and apply them using the Python libraries NumPy, SciPy, and Matplotlib.

Core Concepts

Mean, Median, Mode

The mean, median, and mode are basic measures of central tendency that help you understand the center of your data distribution.
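
The three measures can disagree noticeably on the same data, which is often the reason to look at more than one of them. The sketch below is a minimal illustration using Python's standard statistics module. The salaries list is made-up data, chosen so that a single outlier pulls the mean away from the median and the mode.

```python
import statistics

# Hypothetical monthly salaries; the last value is a deliberate outlier
salaries = [3000, 3200, 3200, 3500, 4000, 25000]

mean_value = statistics.mean(salaries)      # sum / count, pulled up by the outlier
median_value = statistics.median(salaries)  # average of the two middle values (even count)
mode_value = statistics.mode(salaries)      # most frequent value

print("Mean:", mean_value)
print("Median:", median_value)
print("Mode:", mode_value)
```

Here the median (3350.0) and the mode (3200) stay close to the typical salary, while the mean is roughly twice as large. This is why the median is often preferred for skewed data.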

Mean

The mean is the average value of a dataset. It's calculated by summing all the values and dividing by the number of observations.

mean.py

import numpy as np

data = [1, 2, 3, 4, 5]
# Arithmetic mean: sum of the values divided by their count
mean_value = np.mean(data)
print("Mean:", mean_value)

Output

Mean: 3.0

Mode

The mode is the value that appears most frequently in a dataset.

mode.py

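import numpy as np
from scipy import stats

data = [1, 2, 2, 3, 4]
mode_value = stats.mode(data)
# Newer SciPy releases return a scalar mode for 1-D input and older ones
# return a length-1 array, so normalize the shape before indexing
print("Mode:", np.atleast_1d(mode_value.mode)[0])

Output

Mode: 2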

Normal Distribution

A normal distribution, also known as a Gaussian distribution, is symmetric and bell-shaped, and is fully described by its mean and standard deviation. Many natural phenomena follow it at least approximately.

normal_distribution.py

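import numpy as np
import matplotlib.pyplot as plt

# Draw 1000 samples from a normal distribution with mean 0 and standard deviation 1
data = np.random.normal(loc=0, scale=1, size=1000)
# density=True scales the bar heights so the histogram area sums to 1
plt.hist(data, bins=30, density=True, alpha=0.6, color='g')
plt.title('Normal Distribution')
plt.xlabel('Value')
plt.ylabel('Density')
plt.show()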

Tip

Remember to run this code in an environment that supports plotting, such as Jupyter Notebook or a Python IDE with Matplotlib support.
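
The bell shape can also be checked numerically, without a plot. For a normal distribution, roughly 68%, 95%, and 99.7% of values fall within one, two, and three standard deviations of the mean. The sketch below counts those fractions on simulated data. The seed and the sample size are arbitrary choices for illustration, not part of the example above.

```python
import numpy as np

# Seeded generator so the run is repeatable; the seed value is arbitrary
rng = np.random.default_rng(42)
data = rng.normal(loc=0, scale=1, size=100_000)

mean = data.mean()
std = data.std()

# Fraction of samples within k standard deviations of the mean
within = {}
for k in (1, 2, 3):
    within[k] = np.mean(np.abs(data - mean) <= k * std)
    print(f"Within {k} std: {within[k]:.3f}")
```

If real data gives fractions far from these values, that is a hint that the data may not be well described by a normal distribution.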

Scatter Plots

Scatter plots are used to visualize the relationship between two variables. Each point on the plot represents an observation.

scatter_plot.py

import numpy as np
import matplotlib.pyplot as plt

# x: 50 uniform random values in [0, 1)
x = np.random.rand(50)
# y: a linear function of x plus a small amount of Gaussian noise
y = 2 * x + np.random.normal(loc=0, scale=0.1, size=50)

plt.scatter(x, y)
plt.title('Scatter Plot')
plt.xlabel('X')
plt.ylabel('Y')
plt.show()

Tip

Scatter plots are particularly useful for identifying patterns or correlations between variables.
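
A correlation that is visible in a scatter plot can be quantified with the Pearson correlation coefficient. The sketch below regenerates data of the same form as the scatter plot example and computes the coefficient with np.corrcoef. The seeded generator is an assumption added here so the result is repeatable.

```python
import numpy as np

# Seeded generator for repeatability; the seed value is arbitrary
rng = np.random.default_rng(0)
x = rng.random(50)
y = 2 * x + rng.normal(loc=0, scale=0.1, size=50)

# np.corrcoef returns a 2x2 correlation matrix; the off-diagonal entry is r(x, y)
r = np.corrcoef(x, y)[0, 1]
print(f"Pearson r: {r:.3f}")
```

A value close to 1 indicates a strong positive linear relationship, a value close to -1 a strong negative one, and a value near 0 little or no linear relationship.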

Practical Example

Let's analyze a synthetic dataset to understand its distribution.

practical_example.py

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Generate sample data: 1000 draws from a normal distribution (mean 50, std 10)
np.random.seed(0)
data = np.random.normal(loc=50, scale=10, size=1000)

# Calculate statistics
mean_value = np.mean(data)
median_value = np.median(data)
# Continuous values are almost all unique, so the mode is taken over values
# rounded to the nearest integer; np.atleast_1d handles both the scalar and
# the array result that different SciPy versions return
mode_value = np.atleast_1d(stats.mode(np.round(data)).mode)[0]
std_dev = np.std(data)  # population standard deviation (ddof=0)
percentile_25 = np.percentile(data, 25)

print(f"Mean: {mean_value}")
print(f"Median: {median_value}")
print(f"Mode: {mode_value}")
print(f"Standard Deviation: {std_dev}")
print(f"25th Percentile: {percentile_25}")

# Plot histogram
plt.hist(data, bins=30, density=True, alpha=0.6, color='g')
plt.title('Data Distribution')
plt.xlabel('Value')
plt.ylabel('Density')
plt.show()

Tip

Run this code to see the output statistics and the histogram of the data distribution.
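
The standard deviation and percentile calls in the example can be unpacked on a small list. The sketch below is illustrative only. The scores values are made up, and it relies on NumPy's defaults: np.std divides by n (ddof=0) and np.percentile interpolates linearly between neighboring values.

```python
import numpy as np

# Hypothetical scores, small enough to verify by hand
scores = np.array([2, 4, 4, 4, 5, 5, 7, 9])

# Population standard deviation by hand: root of the mean squared deviation
mean = scores.mean()
manual_std = np.sqrt(np.mean((scores - mean) ** 2))

population_std = np.std(scores)      # ddof=0, divides by n
sample_std = np.std(scores, ddof=1)  # divides by n - 1

# Percentiles: the value below which the given percentage of the data falls
q25, q50, q75 = np.percentile(scores, [25, 50, 75])

print("Manual std:", manual_std)
print("Population std:", population_std)
print("Sample std:", sample_std)
print("25th, 50th, 75th percentiles:", q25, q50, q75)
```

For this list the mean is 5 and the population standard deviation is 2.0. The sample standard deviation is slightly larger, and the 50th percentile is the same value as the median.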

Summary

Concept	Description
Mean	Average value of a dataset.
Median	Middle value when ordered from smallest to largest.
Mode	Most frequently occurring value in a dataset.
Standard Deviation	Measure of variation or dispersion in a dataset.
Percentiles	Values below which a given percentage of the data falls; together they divide the ordered data into 100 equal parts.
Normal Distribution	Symmetric, bell-shaped distribution common in nature.
Scatter Plot	Visual representation of the relationship between two variables.

What's Next?

In the next tutorial, we'll explore linear and polynomial regression, which are fundamental techniques for modeling relationships between variables. Understanding them will help you build models that predict outcomes from data.

Stay tuned!
