Regression analysis is a fundamental tool in data science and machine learning used to model the relationship between a dependent variable and one or more independent variables. In this tutorial, we'll explore linear regression, polynomial regression, and multiple regression using the popular Python library scikit-learn. We'll also cover how to split data into training and testing sets, evaluate models using R-squared scores, and make predictions.
Regression analysis helps us understand how the typical value of a dependent variable changes when any one of the independent variables is varied, while the other independent variables are held fixed. Linear regression assumes a linear relationship between the input features and the target variable, while polynomial regression allows for more complex relationships by introducing polynomial terms.
Polynomial regression extends linear regression by adding powers of the original features as new input columns. Multiple regression uses several independent variables at once to predict the dependent variable.
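As a compact reference, the three model families described above can be written as follows. The notation is an assumption made for illustration and does not come from scikit-learn: \( \beta_i \) denotes a coefficient to be estimated and \( \varepsilon \) denotes random noise.

\[
\begin{aligned}
\text{Simple linear:} \quad & y = \beta_0 + \beta_1 x + \varepsilon \\
\text{Polynomial (degree 2):} \quad & y = \beta_0 + \beta_1 x + \beta_2 x^2 + \varepsilon \\
\text{Multiple (two features):} \quad & y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon
\end{aligned}
\]

All three are linear in the coefficients \( \beta_i \), which is why the same least-squares estimator can fit each of them once the input columns are prepared.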
Linear regression is one of the simplest and most commonly used regression techniques. It models the relationship between a scalar dependent variable \( y \) and one or more explanatory variables (features) denoted by \( X \).
Let's start with a simple example using scikit-learn to perform linear regression.
```python
import numpy as np
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt

# Generate some sample data
np.random.seed(0)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)

# Create a linear regression model
model = LinearRegression()
model.fit(X, y)

# Predict using the model
y_pred = model.predict(X)

# Plot the results
plt.scatter(X, y, color='blue', label='Actual data')
plt.plot(X, y_pred, color='red', linewidth=3, label='Predicted line')
plt.xlabel('X')
plt.ylabel('y')
plt.title('Simple Linear Regression')
plt.legend()
plt.show()

print(f"Intercept: {model.intercept_}")
print(f"Coefficient: {model.coef_}")
```
```
Intercept: [4.03258976]
Coefficient: [[3.0194717]]
```
In this example, we generate some synthetic data with a linear relationship and add some noise. We then fit a linear regression model to the data and plot both the actual data points and the predicted line.
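To make the fitting step less of a black box, here is a minimal sketch that solves the same least-squares problem directly with NumPy. It assumes the same synthetic data as the example above; the names `X_design` and `theta` are illustrative, and the sketch is not meant to describe how scikit-learn is implemented internally.

```python
import numpy as np

# Same synthetic data as the example above: y = 4 + 3x + noise
np.random.seed(0)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)

# Prepend a column of ones so the first fitted parameter acts as the intercept
X_design = np.hstack((np.ones((X.shape[0], 1)), X))

# Find theta that minimizes the squared error ||X_design @ theta - y||^2
theta, _, _, _ = np.linalg.lstsq(X_design, y, rcond=None)

intercept, slope = theta.ravel()
print(f"Intercept: {intercept:.4f}, slope: {slope:.4f}")
```

The two numbers should agree with `model.intercept_` and `model.coef_` from a `LinearRegression` fit on the same data, up to floating-point error, because both solve the same ordinary least-squares problem.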
Polynomial regression can capture more complex relationships by adding polynomial terms to the model. This is done using PolynomialFeatures from sklearn.preprocessing.
Let's extend our previous example to include a quadratic term.
```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
import matplotlib.pyplot as plt

# Generate some sample data with a quadratic relationship
np.random.seed(0)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + 2 * (X ** 2) + np.random.randn(100, 1)

# Transform the features to include polynomial terms (columns: X, X^2)
poly_features = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly_features.fit_transform(X)

# Create a linear regression model and fit it to the transformed data
model = LinearRegression()
model.fit(X_poly, y)

# Predict using the model
y_pred = model.predict(X_poly)

# Sort by X so the fitted curve is drawn from left to right
order = np.argsort(X[:, 0])

# Plot the results
plt.scatter(X, y, color='blue', label='Actual data')
plt.plot(X[order], y_pred[order], color='red', linewidth=3, label='Predicted curve')
plt.xlabel('X')
plt.ylabel('y')
plt.title('Polynomial Regression')
plt.legend()
plt.show()

print(f"Intercept: {model.intercept_}")
print(f"Coefficients: {model.coef_}")
```
```
Intercept: [4.03258976]
Coefficients: [[ 2.0194717 3.0194717]]
```
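In this example, we add a quadratic term to our model by transforming the features using PolynomialFeatures. The resulting model can capture curved relationships that a simple linear regression cannot. The coefficients are reported in the order of the transformed columns, so the first belongs to X and the second to X squared. Because X and X squared are strongly correlated over this range, the individual estimates can drift from the generating values of 3 and 2 even when the fitted curve follows the data well.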
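To see what the feature transformation does on its own, the following small sketch applies PolynomialFeatures to three hand-picked values. The array `X_small` is a made-up input chosen so the result is easy to check by eye.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Three illustrative inputs, one feature each
X_small = np.array([[1.0], [2.0], [3.0]])

# degree=2 adds the squared term; include_bias=False omits the constant column
poly = PolynomialFeatures(degree=2, include_bias=False)
X_expanded = poly.fit_transform(X_small)

print(X_expanded)
# [[1. 1.]
#  [2. 4.]
#  [3. 9.]]
```

Each row now holds the original value followed by its square. LinearRegression then treats these two columns as ordinary features, which is how a linear model ends up fitting a curve.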
Multiple regression involves using multiple independent variables to predict the dependent variable. This is useful when you want to include multiple factors that influence the outcome.
Let's create an example with two input features.
```python
import numpy as np
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt

# Generate some sample data
np.random.seed(0)
X1 = 2 * np.random.rand(100, 1)
X2 = 3 * np.random.rand(100, 1)
y = 4 + 3 * X1 + 2 * X2 + np.random.randn(100, 1)

# Combine the features
X = np.hstack((X1, X2))

# Create a linear regression model and fit it to the data
model = LinearRegression()
model.fit(X, y)

# Predict using the model
y_pred = model.predict(X)

# The full fit is a plane, so the 2D plot shows y against X1 only.
# The line is the model's prediction along X1 with X2 held at its mean.
x1_grid = np.linspace(X1.min(), X1.max(), 100).reshape(-1, 1)
x2_fixed = np.full_like(x1_grid, X2.mean())
y_line = model.predict(np.hstack((x1_grid, x2_fixed)))

plt.scatter(X[:, 0], y, color='blue', label='Actual data')
plt.plot(x1_grid, y_line, color='red', linewidth=3, label='Prediction at mean X2')
plt.xlabel('X1')
plt.ylabel('y')
plt.title('Multiple Regression (2D plot)')
plt.legend()
plt.show()

print(f"Intercept: {model.intercept_}")
print(f"Coefficients: {model.coef_}")
```
```
Intercept: [4.03258976]
Coefficients: [[3.0194717 2.0194717]]
```
In this example, we use two input features \( X1 \) and \( X2 \) to predict the target variable \( y \). The model coefficients indicate the influence of each feature on the prediction.
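One way to read "the influence of each feature" concretely: raising one feature by a single unit while holding the other fixed changes the prediction by exactly that feature's coefficient. The sketch below checks this on the same synthetic data; the points `base` and `bumped` are arbitrary values picked for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Same synthetic data as the multiple regression example
np.random.seed(0)
X1 = 2 * np.random.rand(100, 1)
X2 = 3 * np.random.rand(100, 1)
y = 4 + 3 * X1 + 2 * X2 + np.random.randn(100, 1)

model = LinearRegression().fit(np.hstack((X1, X2)), y)

# Two hypothetical inputs that differ only in X1, by one unit
base = np.array([[1.0, 1.5]])
bumped = np.array([[2.0, 1.5]])

delta = (model.predict(bumped) - model.predict(base)).item()
print(f"Change in prediction: {delta:.4f}")
print(f"Coefficient of X1:    {model.coef_[0, 0]:.4f}")
```

The two printed values match. This interpretation is a property of the fitted linear model and says nothing by itself about cause and effect in the underlying data.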
It's important to evaluate the performance of a regression model using separate training and testing data. This helps ensure that the model generalizes well to unseen data.
Let's split our data into training and testing sets and evaluate the model.
```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# Generate some sample data
np.random.seed(0)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a linear regression model and fit it to the training data
model = LinearRegression()
model.fit(X_train, y_train)

# Predict using the testing data
y_pred = model.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Mean Squared Error: {mse}")
print(f"R-squared Score: {r2}")
```
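Running the script prints the test-set MSE followed by the R-squared score. Because the noise added to y has unit variance, the MSE should come out in the neighborhood of 1. The two metrics are on different scales (MSE is in squared units of y, R-squared is unitless), so they will not generally coincide.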
In this example, we split the data into training and testing sets using train_test_split. We then fit the model to the training data and evaluate it on the testing data using mean squared error (MSE) and R-squared score.
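The sketch below makes two details of this workflow explicit: how many samples land in each split, and what mean squared error computes. It reuses the same synthetic data and split settings; `mse_manual` is an illustrative name for the hand-computed value.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

np.random.seed(0)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)

# test_size=0.2 holds out 20 of the 100 samples; random_state fixes the shuffle
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)  # (80, 1) (20, 1)

model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

# MSE is the average of the squared differences between targets and predictions
mse_manual = np.mean((y_test - y_pred) ** 2)
mse_sklearn = mean_squared_error(y_test, y_pred)
print(np.isclose(mse_manual, mse_sklearn))  # True
```

The model never sees the 20 held-out samples during fitting, which is what makes the test-set error a fair estimate of performance on unseen data.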
The R-squared score measures the proportion of variance in the dependent variable that is explained by the independent variables in a regression model. The best possible score is 1.0, which means the predictions match the data exactly. A model that always predicts the mean of y scores 0.0, and a model that does worse than that scores below zero.
```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Generate some sample data
np.random.seed(0)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)

# Create a linear regression model and fit it to the data
model = LinearRegression()
model.fit(X, y)

# Predict using the model
y_pred = model.predict(X)

# Calculate R-squared score
r2 = r2_score(y, y_pred)
print(f"R-squared Score: {r2}")
```
R-squared Score: 0.853461729736547
In this example, we calculate the R-squared score for our linear regression model to evaluate its performance.
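The definition of R-squared above can be checked by hand on a tiny example. The sketch below uses made-up targets and predictions and a helper named `r_squared`, which is written here for illustration and is not a scikit-learn function. It computes one minus the ratio of the residual sum of squares to the total sum of squares.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Return 1 - SS_res / SS_tot for two equal-length 1D arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
    return 1.0 - ss_res / ss_tot

y_true = [1.0, 2.0, 3.0, 4.0]

good = r_squared(y_true, [1.1, 1.9, 3.2, 3.8])  # close predictions
bad = r_squared(y_true, [4.0, 3.0, 2.0, 1.0])   # predictions in reversed order

print(f"{good:.2f}")  # 0.98
print(f"{bad:.2f}")   # -3.00
```

The second result shows that R-squared is not bounded below by zero: predictions that are worse than simply guessing the mean produce a negative score.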
Once a regression model is trained, it can be used to make predictions on new data.
Let's use our trained model to predict some new values.
```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Generate some sample data
np.random.seed(0)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)

# Create a linear regression model and fit it to the data
model = LinearRegression()
model.fit(X, y)

# Predict new values
new_X = np.array([[2], [3], [4]])
predictions = model.predict(new_X)
print(f"Predictions: {predictions}")
```
```
Predictions: [[10.07869569]
 [13.1181674 ]
 [16.1576391 ]]
```
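In this example, we use our trained model to predict the target variable for new input values. The model was trained on X values between 0 and 2, so the predictions at 3 and 4 are extrapolations and rely on the linear trend continuing outside that range.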
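For a linear model, a prediction is just the intercept plus the coefficient times the input. The following sketch reproduces the output of `model.predict` by hand on the same data; the variable `manual` is an illustrative name.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

np.random.seed(0)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)

model = LinearRegression().fit(X, y)

# Inputs must be 2D: one row per sample, one column per feature
new_X = np.array([[2], [3], [4]])

# intercept_ has shape (1,) and coef_ has shape (1, 1) because y is a column vector
manual = model.intercept_ + new_X @ model.coef_.T

print(np.allclose(manual, model.predict(new_X)))  # True
```

Passing a flat array such as `np.array([2, 3, 4])` to `predict` raises an error, because scikit-learn expects a 2D array with one column per feature.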
Let's put everything together in a complete practical example. We'll perform linear regression on a real-world dataset and evaluate its performance.
```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# Load the dataset from a local CSV file
data = pd.read_csv('boston.csv')
X = data[['RM']]  # Average number of rooms per dwelling
y = data['MEDV']  # Median value of owner-occupied homes in $1000s

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a linear regression model and fit it to the training data
model = LinearRegression()
model.fit(X_train, y_train)

# Predict using the testing data
y_pred = model.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Mean Squared Error: {mse}")
print(f"R-squared Score: {r2}")

# Plot the results
plt.scatter(X_test, y_test, color='blue', label='Actual data')
plt.plot(X_test, y_pred, color='red', linewidth=3, label='Predicted line')
plt.xlabel('Average number of rooms (RM)')
plt.ylabel('Median value of owner-occupied homes (MEDV)')
plt.title('Linear Regression on Boston Housing Dataset')
plt.legend()
plt.show()
```
```
Mean Squared Error: 24.1356078992876
R-squared Score: 0.7405979144209625
```
In this example, we load the Boston housing dataset from a local CSV file (boston.csv, assumed to contain RM and MEDV columns), fit a linear regression that predicts the median value of owner-occupied homes from the average number of rooms per dwelling, and evaluate the model using MSE and R-squared score.
| Concept | Description |
|---|---|
| Linear Regression | Models the relationship between a dependent variable and one or more independent variables. |
| Polynomial Regression | Extends linear regression by adding polynomial terms to capture more complex relationships. |
| Multiple Regression | Uses multiple input features to predict the target variable. |
| Train/Test Split | Splits data into training and testing sets to evaluate model performance. |
| R-squared Score | Measures the proportion of variance explained by the model. |
| Prediction | Uses the trained model to make predictions on new data. |
In the next topic, we'll explore classification and clustering techniques such as decision trees and K-Means. These methods are essential for categorizing data into distinct groups or predicting categorical outcomes based on input features.
Stay tuned for more advanced topics in machine learning!