Learn Python – Linear Regression model

what is scikit learn?

Scikit-learn, also known as sklearn, is a popular open-source machine learning library for Python. It provides a wide range of tools and algorithms for various machine learning tasks, including classification, regression, clustering, dimensionality reduction, model selection, and preprocessing of data. Scikit-learn is built on top of other scientific Python libraries such as NumPy, SciPy, and Matplotlib, and it integrates well with the Python data science ecosystem.

Scikit-learn offers a wide range of machine learning models that you can use for different types of tasks. Here are some commonly used models provided by scikit-learn:

Linear Models:
- Linear Regression
- Logistic Regression
- Ridge Regression
- Lasso Regression
- ElasticNet Regression
Support Vector Machines (SVM):
- Support Vector Classifier (SVC)
- Support Vector Regression (SVR)
Decision Trees:
- Decision Tree Classifier
- Decision Tree Regression
Ensemble Methods:
- Random Forest Classifier/Regressor
- Gradient Boosting Classifier/Regressor
- AdaBoost Classifier/Regressor
- Extra Trees Classifier/Regressor
- Voting Classifier/Regressor
Naive Bayes:
- Gaussian Naive Bayes
- Multinomial Naive Bayes
- Bernoulli Naive Bayes
Nearest Neighbors:
- K-Nearest Neighbors (KNN) Classifier/Regressor
Neural Networks:
- Multi-Layer Perceptron Classifier/Regressor
Clustering:
- K-Means Clustering
- DBSCAN
- Hierarchical Clustering
Dimensionality Reduction:
- Principal Component Analysis (PCA)
- Linear Discriminant Analysis (LDA)
- t-SNE
Gaussian Processes:
- Gaussian Process Classifier/Regressor

what is linear regression?

Linear regression is a statistical technique used to model the relationship between a dependent variable and one or more independent variables. It assumes a linear relationship between the variables, meaning that the relationship can be represented by a straight line.

The goal of linear regression is to find the best-fitting line that minimizes the difference between the actual observed values of the dependent variable and the predicted values by the model. This line is determined by estimating the slope and intercept coefficients that define the equation of the line.

In simple linear regression, there is only one independent variable, and the relationship between the dependent variable and independent variable is represented by a straight line. The equation of the line can be written as:

y = b0 + b1 * x

Where:

y is the dependent variable (the variable being predicted or explained).
x is the independent variable (the variable used to predict or explain the dependent variable).
b0 is the intercept (the value of y when x is 0).
b1 is the slope (the change in y associated with a one-unit change in x).

The coefficients b0 and b1 are estimated using a method called least squares, which minimizes the sum of the squared differences between the observed and predicted values. Once the coefficients are estimated, the linear regression model can be used to predict the values of the dependent variable for new values of the independent variable.
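To make the least squares step concrete, here is a minimal sketch that estimates b0 and b1 by hand with NumPy. The five data points are made up purely for illustration; the formulas are the standard closed-form estimates for simple linear regression.

```python
import numpy as np

# Hypothetical toy data: five (x, y) observations, invented for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Closed-form least squares estimates for simple linear regression.
x_mean, y_mean = x.mean(), y.mean()
b1 = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)  # slope
b0 = y_mean - b1 * x_mean                                             # intercept

# The sum of squared residuals is the quantity least squares minimises.
residuals = y - (b0 + b1 * x)
sse = np.sum(residuals ** 2)

print(f"b0 = {b0:.3f}, b1 = {b1:.3f}, SSE = {sse:.3f}")
```

Any other choice of slope and intercept gives a larger sum of squared residuals on the same data, which is what "best-fitting" means here.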

Linear regression can also be extended to multiple independent variables, called multiple linear regression. The fit is then a plane or hyperplane rather than a line, and the equation becomes:

y = b0 + b1 * x1 + b2 * x2 + … + bn * xn

Where x1, x2, …, xn are the independent variables, and b1, b2, …, bn are the corresponding coefficients.
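As a quick sketch of the multiple regression case, the snippet below fits scikit-learn's LinearRegression to synthetic data with two independent variables. The data and the "true" coefficients (3.0, 2.0 and -1.5) are assumptions chosen for illustration; because the data has no noise, the model should recover them.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (an assumption for illustration): y = 3.0 + 2.0*x1 - 1.5*x2
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(50, 2))     # 50 samples, 2 features (x1, x2)
y = 3.0 + 2.0 * X[:, 0] - 1.5 * X[:, 1]  # noise-free target

model = LinearRegression().fit(X, y)

# intercept_ corresponds to b0, coef_ to [b1, b2]
print("b0:", model.intercept_)
print("b1, b2:", model.coef_)
```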

Linear regression is widely used in various fields for tasks such as predicting sales, analyzing the impact of variables on outcomes, and making forecasts. It provides a simple and interpretable model for understanding the relationship between variables, although it assumes linearity and has certain limitations when dealing with complex relationships or non-linear data.

how can I use it in gis?

Linear regression can be applied in Geographic Information Systems (GIS) to analyse spatial relationships and make predictions based on spatial data. Here are a few examples of how linear regression can be used in GIS:

Spatial Analysis: Linear regression can be used to analyse the relationship between spatial variables. For example, you can examine the relationship between population density and various socio-economic factors such as income, education, or crime rates. By performing a linear regression analysis, you can quantify the strength and direction of the relationship between these variables.
Spatial Prediction: Linear regression can be used to make spatial predictions. Suppose you have a set of spatial data with known attributes, such as temperature, rainfall, and elevation, and you want to predict another attribute, such as crop yield or disease incidence, at different locations. By training a linear regression model on locations where both the predictors and the target attribute are known, you can predict the target attribute at new locations from the predictor values recorded there.
Geostatistics: Linear regression is often combined with geostatistical interpolation techniques such as kriging. In regression kriging or universal kriging, a linear regression model captures the large-scale spatial trend in the observed data, and kriging then interpolates the residuals using a variogram that describes their spatial correlation structure. Together, the trend and the interpolated residuals give the estimates at unsampled locations.
Spatial Trend Analysis: Linear regression can be used to identify and analyse spatial trends in data. For example, you can investigate the relationship between land prices and distance to urban centres, or the relationship between air pollution levels and proximity to industrial areas. By fitting a linear regression model, you can quantify the trend and understand how the dependent variable changes spatially.

a very simple use case

Predicting Scotland’s future population

Here we use two public datasets:

Historic population data by local authority (1981 – 2021) – Link
Local authority boundaries shapefile – Link

Firstly, we need to import the required modules. For linear regression I’ll be using the scikit learn module.

import pandas as pd
from sklearn.linear_model import LinearRegression
import geopandas as gpd
import webbrowser
import os

Next the data sets (once you’ve extracted them into the correct folder(s)) are read by the pandas/geopandas modules.

    # Load the population data
    scot_pop = pd.read_csv("W:/Python/geospatial/ScotPop/scotland_pop_estimates.csv",
                           encoding="ISO-8859-1")
    scot_las = gpd.read_file("W:/Python/geospatial/pub_las.shp")

In this particular dataset, the “Sex” column splits the figures into “Persons” (males and females combined), “Males” and “Females”. I only want the “Persons” data, so a filtered DataFrame containing only those rows is created.

    # Filter data for persons only
    scot_pop_df = scot_pop[scot_pop['Sex'] == "Persons"]

An empty DataFrame is then created to collect the future population predictions.

    # Prepare the new DataFrame for future projections
    future_projections = pd.DataFrame()

The model then takes each local authority, trains on its data and produces a future population prediction based on that data. The predictions are then appended to the new DataFrame, “future_projections”.

First, the filtered DataFrame is grouped by local authority (“Area name”) and looped through one group at a time.

    # Train and predict for each area
    for local_authority, local_authority_df in scot_pop_df.groupby('Area name'):

The model then takes X_train (the independent variable, “Year”) and y_train (the dependent variable, “All Ages”, which holds the population data). A linear regression model is then created and trained on this X and y data.

        X_train = local_authority_df['Year'].values.reshape(-1, 1)
        y_train = local_authority_df['All Ages']

        # Train a linear regression model
        model = LinearRegression()
        model.fit(X_train, y_train)

A Quick Note on X_train, y_train and reshape(-1, 1)

In the context of machine learning and regression models, “X_train” and “y_train” are commonly used variable names to represent the training data that is used to train a model.

X_train: This typically represents the input features or independent variables used to predict the target variable. The “X_train” variable contains a subset of the dataset that includes the independent variables (or features) you want to use to train the model. It is usually represented as a matrix or a DataFrame, where each row corresponds to a data point and each column represents a different feature.
y_train: This represents the target variable or dependent variable that you want to predict using the input features. The “y_train” variable contains the corresponding values of the target variable for the data points in X_train. It is typically represented as a vector or a 1-dimensional array.

In the case of linear regression, you use X_train to fit the model by finding the best-fitting line that minimizes the difference between the actual observed values of y_train and the predicted values by the model. The relationship between X_train and y_train is used to estimate the coefficients (slope and intercept) of the linear regression equation. Once the model is trained, you can use it to predict the target variable for new data points by providing the corresponding features as input (X_test).

reshape(-1, 1) is used to modify the shape of the X_train array.

The reshape() function is a NumPy function that allows you to change the shape of an array without modifying its data. In this case, reshape(-1, 1) is used to convert the X_train array from a 1-dimensional array (vector) to a 2-dimensional array with a single column. The -1 in the reshape function indicates that NumPy should automatically infer the number of rows based on the size of the original array and the specified column size of 1.

Why is this necessary? In machine learning models, including linear regression, the input features are typically expected to be in a 2-dimensional array format. The first dimension represents the number of samples (rows), and the second dimension represents the number of features (columns). Reshaping the X_train array to have a shape of (n_samples, 1) ensures that it aligns with this expected format.

By reshaping X_train to (n_samples, 1), you are essentially converting a list of values into a column vector. This is important because scikit-learn, the library used for linear regression in the code snippet, expects the features to be in a 2-dimensional format.

In summary, reshape(-1, 1) is used to transform a 1-dimensional array into a 2-dimensional array with a single column, aligning the shape of X_train with the expected format for input features in scikit-learn’s linear regression model.
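Here is a small sketch of what reshape(-1, 1) does to the array's shape. The three years are made-up values, and the double-bracket selection at the end is a common alternative rather than something the script uses.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row dataset, used only to show the shapes involved.
df = pd.DataFrame({"Year": [2019, 2020, 2021]})

years_1d = df["Year"].values        # 1-dimensional: one value per row
print(years_1d.shape)               # (3,)

X_train = years_1d.reshape(-1, 1)   # 2-dimensional: 3 rows, 1 column
print(X_train.shape)                # (3, 1)

# Alternative: selecting with a list of column names keeps the data 2-dimensional.
X_train_df = df[["Year"]]
print(X_train_df.shape)             # (3, 1)
```

Both forms give scikit-learn the (n_samples, n_features) layout it expects.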

okay, let’s continue:

A new DataFrame of future years is created. It starts from the last year in the local authority’s data and adds 1 through 31 to it.

        # Build the years to predict: the last observed year + 1 through + 31
        # (range() excludes its end value, hence + 32)
        future_years = pd.DataFrame({'Year': range(max(local_authority_df['Year']) + 1,
                                                   max(local_authority_df['Year']) + 32)})

The model then predicts the population for the years 2022 through 2052 for each local authority, based on the data it was trained with. The predictions are then appended to the “future_projections” DataFrame we created earlier.

        # Make predictions for 2022 through 2052 (range() excludes its end value)
        future_years = pd.DataFrame({'Year': range(local_authority_df['Year'].max() + 1,
                                                   local_authority_df['Year'].max() + 32)})

        # Pass a NumPy array, matching the format the model was trained on
        predictions = model.predict(future_years.values)

        # Add the projections for the current area to the combined DataFrame
        # (DataFrame.append was removed in pandas 2.0, so pd.concat is used instead)
        future_projections = pd.concat([future_projections,
                                        pd.DataFrame({'Year': future_years['Year'],
                                                      'Population': predictions,
                                                      'Area name': local_authority})],
                                       ignore_index=True)

The ignore_index=True argument discards the index of each appended DataFrame, so the combined DataFrame ends up with a single continuous index (0, 1, 2, …).
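A common alternative to growing a DataFrame inside the loop is to collect the per-area DataFrames in a list and combine them once with pd.concat. This is only a sketch: the area names and population values are invented, and it also shows what ignore_index=True changes.

```python
import pandas as pd

# Hypothetical per-area results, standing in for the model's predictions.
frames = []
for area, start in [("Area A", 100.0), ("Area B", 200.0)]:
    frames.append(pd.DataFrame({"Year": [2022, 2023],
                                "Population": [start, start + 1.0],
                                "Area name": area}))

# Without ignore_index, each piece keeps its own 0..n-1 index.
kept_index = pd.concat(frames)
print(list(kept_index.index))   # [0, 1, 0, 1]

# With ignore_index=True, the result gets one continuous index.
combined = pd.concat(frames, ignore_index=True)
print(list(combined.index))     # [0, 1, 2, 3]
```

Concatenating once at the end also avoids copying the growing DataFrame on every iteration.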

A new column, “2022 to 2052 population difference”, is then created. It holds the difference between the 2022 and 2052 predictions for each local authority.

    # Calculate population change and additional statistics
    future_projections["2022 to 2052 population difference"] = future_projections.groupby("Area name")[
        "Population"].transform(lambda x: x.iloc[-1] - x.iloc[0])

The .iloc indexer in pandas is used to access data in a DataFrame by using integer-based indexing. It stands for “integer location.”

The .iloc indexer allows you to select rows and columns based on their integer positions, regardless of the row and column labels. It provides a way to access data in a DataFrame using zero-based indexing, similar to indexing in Python lists.
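The sketch below shows .iloc on its own and then inside the same groupby/transform pattern the script uses. The two areas and their values are made up for illustration.

```python
import pandas as pd

# .iloc selects by position, .loc by label; here the labels are 5, 6, 7.
s = pd.Series([10, 20, 30], index=[5, 6, 7])
print(s.iloc[0], s.iloc[-1])   # first and last values by position
print(s.loc[5])                # same first value, selected by its label

# Hypothetical projections for two areas, three years each.
df = pd.DataFrame({
    "Area name": ["A", "A", "A", "B", "B", "B"],
    "Population": [100.0, 110.0, 120.0, 50.0, 45.0, 40.0],
})

# Last value minus first value within each area, repeated on every row of that area.
df["difference"] = df.groupby("Area name")["Population"].transform(
    lambda x: x.iloc[-1] - x.iloc[0])
print(df)
```

Note that this relies on the rows within each area being in year order, since .iloc[0] and .iloc[-1] only look at position.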

As I only want one polygon for each area, a condensed DataFrame using only the first row of each local authority is created. The 2052 population prediction is added to it using the .last() function, and a percentage change column is then calculated.

    condensed = future_projections.groupby("Area name").first()
    condensed["2052 Population Prediction"] = future_projections.groupby("Area name")["Population"].last()
    # Percentage change relative to the first projected year ("Population" holds the 2022 value)
    condensed["Percentage Difference"] = ((condensed["2052 Population Prediction"] - condensed["Population"]) / condensed["Population"]) * 100

Using our GeoDataFrame with the local authority geometry within it, we merge our GeoDataFrame with the condensed DataFrame to create our final GeoDataFrame. An inner join is used. In an inner join, only the rows with matching values in the join columns of both tables are included in the result. The unmatched rows from either table are not included in the result.
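In the full script this step is the scot_las.merge(condensed, how="inner", left_on="local_auth", right_on="Area name") call. The sketch below shows the inner join behaviour on plain pandas DataFrames with invented area names and placeholder strings instead of real geometry. It also uses an outer join with indicator=True as one possible way to spot names that fail to match between the two datasets.

```python
import pandas as pd

# Hypothetical stand-ins for the boundary and projection tables.
boundaries = pd.DataFrame({"local_auth": ["Area A", "Area B", "Area C"],
                           "geometry": ["poly_a", "poly_b", "poly_c"]})
stats = pd.DataFrame({"Area name": ["Area A", "Area B", "Area D"],
                      "Population": [100.0, 200.0, 300.0]})

# Inner join: only areas present in both tables survive.
merged = boundaries.merge(stats, how="inner",
                          left_on="local_auth", right_on="Area name")
print(merged)

# Outer join with an indicator column to list the names that did not match.
check = boundaries.merge(stats, how="outer",
                         left_on="local_auth", right_on="Area name",
                         indicator=True)
unmatched = check[check["_merge"] != "both"]
print(unmatched[["local_auth", "Area name", "_merge"]])
```

A check like this is worth running once, because an inner join silently drops any local authority whose name is spelled differently in the two files.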

Finally, the values in the GeoDataFrame are rounded to 1 decimal place and a map is created using the .explore() function of the geopandas module. The map is then saved to a .html file and opened in our default browser.

    scot_las = scot_las.round({
        "Population": 1,
        "Change": 1,
        "2022 to 2052 population difference": 1,
        "2052 Population Prediction": 1,
        "Percentage Difference": 1
    })

    # Generate the map and save it as an HTML file
    m = scot_las.explore(column="2022 to 2052 population difference", cmap="RdYlGn", legend=False)
    m.save("W:/Python/geospatial/Edinburgh/Population.html")

    # Open the HTML file in the default web browser
    webbrowser.open("file://" + os.path.realpath("W:/Python/geospatial/Edinburgh/Population.html"))

Here is the entire script, put inside a function:

import pandas as pd
from sklearn.linear_model import LinearRegression
import geopandas as gpd
import webbrowser
import os

def scotland_pop_prediction():
    # Load the population data
    scot_pop = pd.read_csv("W:/Python/geospatial/ScotPop/scotland_pop_estimates.csv",
                           encoding="ISO-8859-1")
    scot_las = gpd.read_file("W:/Python/geospatial/pub_las.shp")

    # Filter data for persons only
    scot_pop_df = scot_pop[scot_pop['Sex'] == "Persons"]

    # Prepare the new DataFrame for future projections
    future_projections = pd.DataFrame()

    # Train and predict for each area
    for local_authority, local_authority_df in scot_pop_df.groupby('Area name'):
        X_train = local_authority_df['Year'].values.reshape(-1, 1)
        y_train = local_authority_df['All Ages']

        # Train a linear regression model
        model = LinearRegression()
        model.fit(X_train, y_train)

        # Make predictions for 2022 through 2052 (range() excludes its end value)
        future_years = pd.DataFrame({'Year': range(local_authority_df['Year'].max() + 1,
                                                   local_authority_df['Year'].max() + 32)})

        # Pass a NumPy array, matching the format the model was trained on
        predictions = model.predict(future_years.values)

        # Add the projections for the current area to the combined DataFrame
        # (DataFrame.append was removed in pandas 2.0, so pd.concat is used instead)
        future_projections = pd.concat([future_projections,
                                        pd.DataFrame({'Year': future_years['Year'],
                                                      'Population': predictions,
                                                      'Area name': local_authority})],
                                       ignore_index=True)

    print(future_projections)

    # Calculate population change and additional statistics
    future_projections["2022 to 2052 population difference"] = future_projections.groupby("Area name")[
        "Population"].transform(lambda x: x.iloc[-1] - x.iloc[0])

    print(future_projections)

    condensed = future_projections.groupby("Area name").first()
    condensed["2052 Population Prediction"] = future_projections.groupby("Area name")["Population"].last()
    # Percentage change relative to the first projected year ("Population" holds the 2022 value)
    condensed["Percentage Difference"] = ((condensed["2052 Population Prediction"] - condensed["Population"]) / condensed["Population"]) * 100

    scot_las = scot_las.merge(condensed, how="inner", left_on="local_auth", right_on="Area name")

    scot_las = scot_las.round({
        "Population": 1,
        "Change": 1,
        "2022 to 2052 population difference": 1,
        "2052 Population Prediction": 1,
        "Percentage Difference": 1
    })

    # Generate the map and save it as an HTML file
    m = scot_las.explore(column="2022 to 2052 population difference", cmap="RdYlGn", legend=False)
    m.save("W:/Python/geospatial/Edinburgh/Population.html")

    # Open the HTML file in the default web browser
    webbrowser.open("file://" + os.path.realpath("W:/Python/geospatial/Edinburgh/Population.html"))

scotland_pop_prediction()

You can use additional modules like topojson to simplify the geometry and improve map performance. Here’s a link to the output map from the script:

output map

There are numerous other models we could have used including:

Decision Tree Regression: Decision tree regression builds a decision tree model to predict the target variable based on a set of features. It can handle non-linear relationships and interactions between variables.
Random Forest Regression: Random forest regression is an ensemble method that combines multiple decision trees to make predictions. It can capture complex relationships and reduce overfitting.
Gradient Boosting Regression: Gradient boosting regression is another ensemble method that combines weak learners (usually decision trees) sequentially to make predictions. It is known for its high predictive accuracy.
Support Vector Regression: Support vector regression is a regression extension of support vector machines. It uses support vectors to find a hyperplane that best fits the data. It can handle non-linear relationships using kernel functions.
Neural Network Regression: Neural networks consist of interconnected layers of nodes (neurons) that can model complex relationships in the data. They are particularly useful for capturing non-linear patterns.

So if you have time, why not have a go with some other models!
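If you do try another model, one thing to watch for is how it behaves beyond the range of the training data. The sketch below compares LinearRegression with RandomForestRegressor on synthetic, steadily growing yearly data. The figures are invented and the forest settings are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

# Synthetic data (an assumption): steady growth of 500 people per year, 1981 to 2021.
years = np.arange(1981, 2022).reshape(-1, 1)
population = 100_000 + 500 * (years.ravel() - 1981)

# Years to project: 2022 through 2052.
future = np.arange(2022, 2053).reshape(-1, 1)

linear_pred = LinearRegression().fit(years, population).predict(future)
forest_pred = RandomForestRegressor(n_estimators=50, random_state=0).fit(
    years, population).predict(future)

# The linear model continues the trend; the forest cannot predict
# beyond the values it saw in training, so its projection is flat.
print("Linear, 2052:", round(linear_pred[-1]))
print("Forest, 2022 vs 2052:", round(forest_pred[0]), round(forest_pred[-1]))
```

Tree-based models predict from the values seen in training, so with Year as the only feature they give the same answer for every future year. That makes them a poor drop-in replacement for this particular extrapolation task, even though they handle non-linear relationships well within the range of the data.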

some final notes and considerations

Data Limitations: It’s important to acknowledge the limitations of using limited data for population projections. The accuracy and reliability of the predictions greatly depend on the quality, representativeness, and coverage of the available data. Having limited historical data may restrict the model’s ability to capture long-term trends accurately.
Linear Regression Assumptions: The script uses linear regression for population projection. It assumes a linear relationship between the predictor variable (Year) and the target variable (Population). However, population dynamics are often influenced by various complex factors, and the linear relationship may not hold in all cases. Consider exploring other regression models that can capture non-linear relationships if necessary.
Extrapolation Risks: The script extrapolates population trends by making predictions for the next 30 years based on historical data. Extrapolation involves assuming that past trends will continue into the future, which may not always be valid. It’s important to be cautious when making long-term projections, as unexpected events or changes in underlying factors can significantly impact population dynamics.
Lack of External Factors: The script does not incorporate external factors such as social or environmental variables that can influence population growth or decline. Population projections that consider external factors may provide more accurate and realistic predictions. Consider incorporating relevant socio-economic, environmental, or policy-related variables to enhance the predictive power of the model.
Model Evaluation: It’s essential to assess the performance and reliability of the population projection model. Consider evaluating the model using appropriate evaluation metrics and techniques, such as cross-validation or out-of-sample testing, to estimate the model’s generalization capability and potential sources of error.
Interpretation and Communication: When presenting population projections, clearly communicate the limitations and uncertainties associated with the predictions based on the available data and modeling approach. It’s crucial to emphasize that population projections are estimates and subject to change as new data or external factors emerge.
Regular Updates: Population dynamics can change over time due to various factors, such as migration patterns, birth rates, or government policies. It’s advisable to regularly update the population projection model with the latest data and review the model’s performance to ensure the projections remain relevant and accurate.
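As a rough sketch of the out-of-sample testing mentioned under Model Evaluation, the snippet below holds back the last ten years of a synthetic series, trains on the earlier years, and measures the mean absolute error on the held-back years. The data, the noise level and the 2011 cut-off are all assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

# Synthetic yearly population with some noise (invented for illustration).
rng = np.random.default_rng(42)
years = np.arange(1981, 2022)
population = 100_000 + 500 * (years - 1981) + rng.normal(0, 300, size=years.size)

# Train on 1981 to 2011 and hold back 2012 to 2021 for testing.
train = years <= 2011
X_train, y_train = years[train].reshape(-1, 1), population[train]
X_test, y_test = years[~train].reshape(-1, 1), population[~train]

model = LinearRegression().fit(X_train, y_train)
mae = mean_absolute_error(y_test, model.predict(X_test))

print(f"Mean absolute error on the held-back years: {mae:.0f} people")
```

Splitting by time rather than at random matters here, because the script's real task is predicting years that come after the training data.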

overall

Scikit-learn is a powerful machine learning library that provides a wide range of models and tools for building predictive models. Linear regression is a fundamental technique for modeling relationships between variables and making predictions. However, it’s crucial to remember that no single model fits all scenarios, and selecting the right model requires careful consideration. Researchers and practitioners should explore and compare different models, experiment with different parameters, and evaluate the performance of each model using appropriate metrics. It’s important not to take results at face value and to critically analyse the findings. Conducting thorough research and selecting the most suitable model can significantly impact the accuracy and effectiveness of predictions, ultimately leading to better decision-making and outcomes.

The Spatial Space