codingstuff.io

Machine Learning Basics (Stats & Data Distribution)

Updated 2026-05-15
30 min read

Understanding the statistical properties of your data helps you choose suitable algorithms and preprocess the data effectively. This tutorial covers the mean, median, mode, standard deviation, percentiles, the normal distribution, and scatter plots, and shows how to analyze data distributions in Python.

Introduction

Statistics provides the tools to summarize and interpret data. Whether you're analyzing stock market trends or predicting customer behavior, you need measures that describe where the data is centered and how widely it spreads. This tutorial introduces those measures and applies them with the Python libraries NumPy, SciPy, and Matplotlib.

Core Concepts

Mean, Median, Mode

The mean, median, and mode are basic measures of central tendency that help you understand the center of your data distribution.

Mean

The mean is the average value of a dataset. It's calculated by summing all the values and dividing by the number of observations.

mean.py
import numpy as np

data = [1, 2, 3, 4, 5]
mean_value = np.mean(data)  # (1 + 2 + 3 + 4 + 5) / 5
print("Mean:", mean_value)
Output
Mean: 3.0
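
Median

The median, named alongside the mean and mode above, is the middle value of the sorted data. The sketch below uses a made-up sample with one extreme value to suggest why the median is often preferred when outliers are present; the numbers are illustrative only.

```python
import numpy as np

# Hypothetical sample: four typical values and one extreme outlier.
data = [1, 2, 3, 4, 100]

mean_value = np.mean(data)      # pulled upward by the outlier
median_value = np.median(data)  # middle value of the sorted data

print("Mean:", mean_value)      # Mean: 22.0
print("Median:", median_value)  # Median: 3.0
```

The single outlier moves the mean far from the typical values, while the median stays at 3.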

Mode

The mode is the value that appears most frequently in a dataset.

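mode.py
import numpy as np
from scipy import stats

data = [1, 2, 2, 3, 4]
mode_result = stats.mode(data)
# Recent SciPy releases return a scalar .mode for 1-D input; older ones
# return a one-element array. np.atleast_1d handles both cases.
mode_value = np.atleast_1d(mode_result.mode)[0]
print("Mode:", mode_value)
Output
Mode: 2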
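
Standard Deviation and Percentiles

Standard deviation and percentiles are named in this tutorial's overview and summary. As a minimal sketch, assuming a small made-up sample, the following computes both with NumPy. The `ddof=1` variant (sample standard deviation) and the interquartile range are additions for illustration.

```python
import numpy as np

# Hypothetical sample chosen so the arithmetic is easy to follow.
data = np.array([2, 4, 4, 4, 5, 5, 7, 9])

# Population standard deviation: divides by n (NumPy's default, ddof=0).
pop_std = np.std(data)
# Sample standard deviation: divides by n - 1.
sample_std = np.std(data, ddof=1)

# Percentiles: values below which 25%, 50%, and 75% of the data fall
# (NumPy interpolates linearly between neighbouring points by default).
q1, q2, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1  # interquartile range: spread of the middle half

print("Population std:", pop_std)
print("Sample std:", round(sample_std, 3))
print("Quartiles:", q1, q2, q3)
print("IQR:", iqr)
```

The 50th percentile is the median, so `q2` matches `np.median(data)`. Use `ddof=1` when the data is a sample drawn from a larger population.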

Normal Distribution

A normal distribution, also known as a Gaussian distribution, is symmetric and bell-shaped. Many natural phenomena follow this distribution.

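normal_distribution.py
import numpy as np
import matplotlib.pyplot as plt

# Draw 1000 samples from a normal distribution with mean 0 and standard deviation 1
data = np.random.normal(loc=0, scale=1, size=1000)

# density=True scales the bars so the histogram's total area is 1
plt.hist(data, bins=30, density=True, alpha=0.6, color='g')
plt.title('Normal Distribution')
plt.xlabel('Value')
plt.ylabel('Density')
plt.show()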

Tip

Remember to run this code in an environment that supports plotting, such as Jupyter Notebook or a Python IDE with Matplotlib support.
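
One way to check whether data looks normal without a plot is to measure how much of it falls within one, two, and three standard deviations of the mean. For a normal distribution these fractions are about 0.683, 0.954, and 0.997. The sketch below is illustrative: the seed and sample size are arbitrary choices.

```python
import numpy as np

# Seed and sample size are arbitrary; they only make the run repeatable.
rng = np.random.default_rng(seed=0)
samples = rng.normal(loc=0, scale=1, size=10_000)

mean = samples.mean()
std = samples.std()

# Fraction of samples lying within k standard deviations of the mean.
fractions = {
    k: float(np.mean(np.abs(samples - mean) <= k * std))
    for k in (1, 2, 3)
}

for k, fraction in fractions.items():
    print(f"Within {k} standard deviation(s): {fraction:.3f}")
```

Fractions far from the theoretical values suggest the data is skewed or heavy-tailed.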

Scatter Plots

Scatter plots are used to visualize the relationship between two variables. Each point on the plot represents an observation.

scatter_plot.py
import numpy as np
import matplotlib.pyplot as plt

# y depends linearly on x (slope 2) plus a small amount of Gaussian noise
x = np.random.rand(50)
y = 2 * x + np.random.normal(loc=0, scale=0.1, size=50)

plt.scatter(x, y)
plt.title('Scatter Plot')
plt.xlabel('X')
plt.ylabel('Y')
plt.show()

Tip

Scatter plots are particularly useful for identifying patterns or correlations between variables.
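
A correlation that is visible in a scatter plot can also be expressed as a number. As a rough sketch, the code below regenerates data of the same shape as the scatter plot example (with an arbitrary seed for repeatability) and computes the Pearson correlation coefficient using `np.corrcoef`.

```python
import numpy as np

# Same construction as the scatter plot data; the seed is an arbitrary choice.
rng = np.random.default_rng(seed=0)
x = rng.random(50)
y = 2 * x + rng.normal(loc=0, scale=0.1, size=50)

# np.corrcoef returns a 2x2 correlation matrix; the off-diagonal
# entry is the Pearson correlation between x and y.
r = np.corrcoef(x, y)[0, 1]
print(f"Pearson r: {r:.3f}")
```

Values near 1 or -1 indicate a strong linear relationship, and values near 0 indicate little or none. Pearson's r only measures linear association, so a plot is still worth inspecting.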

Practical Example

Let's compute summary statistics for a sample dataset and plot its distribution.

practical_example.py
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Generate sample data
np.random.seed(0)
data = np.random.normal(loc=50, scale=10, size=1000)

# Calculate statistics
mean_value = np.mean(data)
median_value = np.median(data)
# Recent SciPy releases return a scalar .mode for 1-D input; older ones return
# a one-element array, so np.atleast_1d handles both. Continuous samples rarely
# repeat, so the mode here is typically just the smallest value.
mode_value = np.atleast_1d(stats.mode(data).mode)[0]
std_dev = np.std(data)
percentile_25 = np.percentile(data, 25)

print(f"Mean: {mean_value}")
print(f"Median: {median_value}")
print(f"Mode: {mode_value}")
print(f"Standard Deviation: {std_dev}")
print(f"25th Percentile: {percentile_25}")

# Plot histogram
plt.hist(data, bins=30, density=True, alpha=0.6, color='g')
plt.title('Data Distribution')
plt.xlabel('Value')
plt.ylabel('Density')
plt.show()

Tip

Run this code to see the output statistics and the histogram of the data distribution.
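
The mean and standard deviation are also the basis of a common preprocessing step: standardization, which rescales data to z-scores with mean 0 and standard deviation 1. The sketch below is an illustration, not part of the example above. It assumes data with the same shape (mean 50, standard deviation 10), and the 3-standard-deviation cutoff is a rule of thumb rather than a fixed standard.

```python
import numpy as np

# Data with the same parameters as the practical example; the seed is arbitrary.
rng = np.random.default_rng(seed=0)
data = rng.normal(loc=50, scale=10, size=1000)

# z-score: how many standard deviations each value lies from the mean.
z_scores = (data - data.mean()) / data.std()

print(f"Mean after scaling: {z_scores.mean():.3f}")
print(f"Std after scaling: {z_scores.std():.3f}")

# Values more than 3 standard deviations from the mean are often
# inspected as possible outliers.
outliers = data[np.abs(z_scores) > 3]
print("Values beyond 3 standard deviations:", outliers.size)
```

Many machine learning algorithms behave better when features are on a comparable scale, which is what this rescaling provides.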

Summary

| Concept | Description |
| --- | --- |
| Mean | Average value of a dataset. |
| Median | Middle value when ordered from smallest to largest. |
| Mode | Most frequently occurring value in a dataset. |
| Standard Deviation | Measure of variation or dispersion in a dataset. |
| Percentiles | Values below which a given percentage of the data falls (for example, 25% of values lie below the 25th percentile). |
| Normal Distribution | Symmetric, bell-shaped distribution common in nature. |
| Scatter Plot | Visual representation of the relationship between two variables. |

What's Next?

In the next tutorial, we'll explore linear and polynomial regression, two fundamental techniques for modeling relationships between variables. They are a starting point for building models that predict one variable from others.

Stay tuned!

