
This article is divided into 5 parts, each explaining Linear Regression and its analysis in further detail:

Medical cost is an important variable in one’s life. According to Taloba et al. (2022) and Jödicke et al. (2019), the rapid growth of healthcare costs is an important problem to tackle in developed countries. Taloba et al. (2022) also state that medical costs are among the most common recurring expenses in one’s life. Being able to simulate and predict one’s healthcare costs would significantly help two parties: patients and medical institutions. Patients could input their data and receive a calculated prediction of their healthcare costs (assuming the model is precise), which they could use for insurance or other related purposes. Medical institutions, on the other hand, could predict which patient groups fit comfortably within their business model.

This dataset is a simulation of medical cost with regards to insurance made by Brett Lantz in his Machine Learning with R tutorial.

Machine learning is divided into two types: Supervised Learning and Unsupervised Learning. In supervised learning, the agent is given a labeled dataset from which it learns to predict an output for a given input; in unsupervised learning, the agent is given an unlabeled dataset and learns to find correlations within it.

To predict one’s medical cost, linear regression is one of several methods used in practice for this dataset; it is supervised learning because the dataset is labeled. The model’s prediction is continuous, and among the machine learning algorithms that can be applied to produce a continuous output, linear regression is one of the simplest (Bharti & Malik, 2022).

**Linear regression** is one of many machine learning algorithms; it models the relationship between a response and one or more explanatory variables, which in machine learning terms are known as the dependent and independent variables. For cases involving only one independent variable, it is called **simple linear regression**. If more than one independent variable is used, the process is called **multiple linear regression**.

Alongside linear regression, some other well-known regression algorithms that can output continuous predicted values are decision trees and random forests.

Although these algorithms are all capable of outputting continuous predicted values, there are some advantages and disadvantages that differentiate them from each other.

To simplify, linear regression is chosen for this project because it works well across a wide range of datasets, is simple, and gives relevant information about its features. On the other hand, it forces the independent variables into the linear equation of the regression.

**Setting up the dataset**

Create a new GitHub repository where we will put the dataset. After creating the repository, download the dataset from Kaggle, then upload the downloaded file, “insurance.csv”, to the repository.

To confirm that the data is valid, check each row and column by viewing the raw output of the dataset file.

**Importing to .ipynb**

We will be using Google Colab to do our analysis. After entering Google Colab, create a new notebook.

**Importing libraries and verifying data**

Import linear regression related libraries as below.
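The original code cells are not reproduced here, but a minimal set of imports for this kind of walkthrough would look like the following (assuming the usual data-science stack available in Google Colab):

```python
# Data handling
import numpy as np
import pandas as pd

# Visualization (pairplot and heatmap come from seaborn)
import matplotlib.pyplot as plt
import seaborn as sns

# Linear regression model and train/test splitting
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
```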

Check and verify if dataset is imported while also knowing the column names and the type of the values, discrete or continuous.
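A sketch of the loading-and-verifying step. In the notebook this would be `pd.read_csv` pointed at the raw GitHub URL of the uploaded file; here a tiny inline sample stands in for insurance.csv so the snippet runs anywhere:

```python
import io
import pandas as pd

# In the notebook: df = pd.read_csv("<raw GitHub URL of insurance.csv>")
# A small hypothetical sample stands in for the real file here.
sample_csv = """age,sex,bmi,children,smoker,region,charges
19,female,27.9,0,yes,southwest,16884.92
33,male,22.7,1,no,southeast,4449.46
50,male,30.9,3,no,northwest,10602.39
20,female,24.6,0,yes,southwest,17085.27
45,male,25.1,2,no,northeast,7222.79
62,female,29.8,1,no,southeast,13607.37
"""
df = pd.read_csv(io.StringIO(sample_csv))

print(df.head())    # first rows: confirms the file loaded
print(df.dtypes)    # numeric (continuous) columns vs object (discrete) columns
```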

**Verifying and removing null values**

Verify datatype and number of rows of each column by using the info() function from pandas.

The dataset consists of 7 columns.

From the code above we found that none of the 7 columns have missing values; therefore we do not have to drop any rows.
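The two checks described above are one-liners in pandas; a minimal sketch on a hypothetical stand-in sample:

```python
import pandas as pd

# Small stand-in for the insurance.csv DataFrame loaded earlier.
df = pd.DataFrame({
    "age": [19, 33, 50], "sex": ["female", "male", "male"],
    "bmi": [27.9, 22.7, 30.9], "children": [0, 1, 3],
    "smoker": ["yes", "no", "no"], "region": ["southwest", "southeast", "northwest"],
    "charges": [16884.92, 4449.46, 10602.39],
})

df.info()                  # dtype and non-null count of every column
print(df.isnull().sum())   # missing values per column; all zeros means nothing to drop
```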

**Correlation before mapping**

The reason we compute correlation before mapping is that the correlation between the independent variables and the dependent variable is easier to interpret with the discrete values intact than after they have been converted to numeric values.

**Univariate EDA and Multivariate EDA**

EDA, or Exploratory Data Analysis, is a term in statistics for analyzing data groups to summarize their characteristics. What differentiates univariate from multivariate EDA is that univariate EDA looks at one variable whereas multivariate EDA looks at two or more variables, both to explore relationships in the dataset.

**Correlation between age and charges (Univariate EDA)**

Grouping the mean cost by age makes it evident that as age increases, medical costs tend to rise.

The correlation data with regards to age and medical costs is depicted in the line graph below.
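A sketch of how such a grouping and line graph could be produced, using a hypothetical stand-in sample in place of the full dataset:

```python
import pandas as pd

# Stand-in sample; in the notebook, df holds the full insurance data.
df = pd.DataFrame({"age": [19, 19, 33, 45, 45, 62],
                   "charges": [1600.0, 2200.0, 4400.0, 7200.0, 8100.0, 13600.0]})

mean_by_age = df.groupby("age")["charges"].mean()   # mean cost per age
print(mean_by_age)
mean_by_age.plot(kind="line", xlabel="age", ylabel="mean charges")
```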

**Smoking and Price (Multivariate EDA)**

A bar graph is used to visualize the difference in data distribution between smokers and non-smokers.

Hence, sex is taken into account when grouping non-smokers and smokers by their means and standard deviations.

From the plot above, smokers, regardless of sex, incur higher healthcare costs.
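A minimal sketch of this grouped comparison, again on hypothetical stand-in data:

```python
import pandas as pd

# Stand-in sample; "df" would be the full dataset in the notebook.
df = pd.DataFrame({
    "smoker": ["yes", "yes", "no", "no", "yes", "no"],
    "sex": ["male", "female", "male", "female", "female", "male"],
    "charges": [20500.0, 19800.0, 7100.0, 6900.0, 21300.0, 8000.0],
})

# Mean and standard deviation of charges per (smoker, sex) group
stats = df.groupby(["smoker", "sex"])["charges"].agg(["mean", "std"])
print(stats)

# Grouped bar chart: one bar per combination, error bars = standard deviation
stats["mean"].unstack().plot(kind="bar", yerr=stats["std"].unstack(), ylabel="charges")
```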

**Region and price (Multivariate EDA)**

A bar graph is used to count the individuals in each region.

From the plot above it can be concluded that the southwest region has the highest medical cost charges among the regions.
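Counting per region is a single `value_counts` call; a sketch on stand-in data:

```python
import pandas as pd

# Stand-in sample region column
df = pd.DataFrame({"region": ["southwest", "southeast", "southwest",
                              "northwest", "northeast", "southwest"]})

counts = df["region"].value_counts()   # individuals per region
print(counts)
counts.plot(kind="bar", ylabel="count")
```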

**Children and price (Multivariate EDA)**

A bar graph is used to visualize the differences between individuals with varying numbers of children.

Hence, sex is also taken into account when grouping individuals by their number of children, using means and standard deviations.

**Mapping of discrete-valued columns to continuous-valued columns**

The purpose of mapping is not only to display the correlation in the heatmap and pairplot, but also to include the columns that were discrete as independent variables in the linear regression model.

Check the unique values of each column to reference in the dictionaries later on.

Make dictionary references to convert the discrete values to numeric values, so that these columns can also contribute to the correlation analysis.
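A sketch of the mapping step. The specific numeric codes below are an assumption; any consistent assignment works for the model:

```python
import pandas as pd

# Stand-in sample of the three discrete columns
df = pd.DataFrame({
    "sex": ["female", "male", "male"],
    "smoker": ["yes", "no", "no"],
    "region": ["southwest", "southeast", "northwest"],
})

for col in ["sex", "smoker", "region"]:
    print(col, df[col].unique())   # the values the dictionaries must cover

# Hypothetical numeric codes for each discrete value
df["sex"] = df["sex"].map({"female": 0, "male": 1})
df["smoker"] = df["smoker"].map({"no": 0, "yes": 1})
df["region"] = df["region"].map({"southwest": 0, "southeast": 1,
                                 "northwest": 2, "northeast": 3})
print(df)
```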

**Displaying Multivariate EDA of the dataset**

Display Multivariate EDA using pairplot

This plot shows us the correlations between the numerical columns in the dataset.

Display Multivariate EDA using heatmap

The heatmap shows us the correlation strength of columns in the dataset. Blue depicts strong positive correlation whereas red depicts strong negative correlation.

From the heatmap above, a strong positive correlation is shown between age and charges. Along with age and charges, smoker and charges also depict a strong positive correlation.
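A sketch of the heatmap step. The colormap below is an assumption chosen so that blue marks positive correlation, matching the description in the text:

```python
import pandas as pd
import seaborn as sns

# Stand-in, post-mapping numeric sample
df = pd.DataFrame({
    "age": [19, 33, 50, 20, 45, 62],
    "smoker": [1, 0, 0, 1, 0, 0],
    "charges": [16884.9, 4449.5, 10602.4, 17085.3, 7222.8, 13607.4],
})

corr = df.corr()   # pairwise Pearson correlation matrix
# "coolwarm_r" is an assumption: reversed so blue = positive, red = negative
sns.heatmap(corr, annot=True, cmap="coolwarm_r", vmin=-1, vmax=1)
print(corr["charges"])
```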

**Displaying Univariate EDA of the dataset**

Display Univariate EDA using the describe function

The table above shows the count, mean, standard deviation, and all quartiles of all columns.
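The table in question comes from a single `describe` call; a sketch on stand-in data:

```python
import pandas as pd

# Stand-in numeric sample
df = pd.DataFrame({"age": [19, 33, 50, 20, 45, 62],
                   "charges": [16884.9, 4449.5, 10602.4, 17085.3, 7222.8, 13607.4]})

# count, mean, std, min, quartiles (25% / 50% / 75%), and max per numeric column
summary = df.describe()
print(summary)
```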

Display Univariate EDA using a boxplot

The boxplot complements the describe function by visualizing the median, quartiles, and outliers of each column. In addition, we do not need to display boxplots for the columns “sex”, “smoker”, and “region” because their data is discrete. Lastly, the predictor and predicted boxplots are separated to better view the plot of each value.
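A sketch of the separated boxplots, with the predictors on one axis and the target on another because of its much larger scale:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Stand-in numeric sample
df = pd.DataFrame({"age": [19, 33, 50, 20, 45, 62],
                   "bmi": [27.9, 22.7, 30.9, 24.6, 25.1, 29.8],
                   "children": [0, 1, 3, 0, 2, 1],
                   "charges": [16884.9, 4449.5, 10602.4, 17085.3, 7222.8, 13607.4]})

fig, (ax1, ax2) = plt.subplots(1, 2)
df[["age", "bmi", "children"]].plot(kind="box", ax=ax1)   # predictors
df[["charges"]].plot(kind="box", ax=ax2)                  # target, on its own scale
```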

**Assign Independent and Dependent Variable**

Since medical cost charges need to be predicted, they will be the dependent variable, whereas age, sex, bmi, children, smoker, and region are factors that affect medical cost charges, so they will be the independent variables.

**Split the dataset into data for training and data for testing**

The dataset is split into both testing data and training data. Testing data will consist of 20% of the overall data and 80% is reserved for training.
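A sketch of the variable assignment and the 80/20 split. The `random_state` is an assumption added for reproducibility, and the stand-in rows are hypothetical:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in, post-mapping numeric dataset (the notebook uses the full data)
df = pd.DataFrame({
    "age": [19, 33, 50, 20, 45, 62, 28, 36, 41, 55],
    "sex": [0, 1, 1, 0, 1, 0, 1, 0, 0, 1],
    "bmi": [27.9, 22.7, 30.9, 24.6, 25.1, 29.8, 33.0, 26.3, 21.6, 31.2],
    "children": [0, 1, 3, 0, 2, 1, 0, 2, 1, 3],
    "smoker": [1, 0, 0, 1, 0, 0, 0, 1, 0, 0],
    "region": [0, 1, 2, 0, 3, 1, 2, 3, 0, 1],
    "charges": [16884.9, 4449.5, 10602.4, 17085.3, 7222.8,
                13607.4, 4340.4, 21344.8, 6079.7, 11830.6],
})

X = df[["age", "sex", "bmi", "children", "smoker", "region"]]  # independent variables
y = df["charges"]                                              # dependent variable

# 20% of the rows go to testing, 80% to training
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(len(X_train), len(X_test))
```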

**Building the model for regression equation based on the dataset**

Create a linear regression model, fit it with the values from the training data, and find the intercept and the regression coefficient of each independent variable.
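A minimal sketch of fitting the model and reading off its parameters, using hypothetical stand-in training data:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Stand-in training data; the notebook uses the real training split
X_train = pd.DataFrame({
    "age": [19, 33, 50, 20, 45, 62, 28, 36],
    "sex": [0, 1, 1, 0, 1, 0, 1, 0],
    "bmi": [27.9, 22.7, 30.9, 24.6, 25.1, 29.8, 33.0, 26.3],
    "children": [0, 1, 3, 0, 2, 1, 0, 2],
    "smoker": [1, 0, 0, 1, 0, 0, 0, 1],
    "region": [0, 1, 2, 0, 3, 1, 2, 3],
})
y_train = pd.Series([16884.9, 4449.5, 10602.4, 17085.3, 7222.8, 13607.4, 4340.4, 21344.8])

model = LinearRegression().fit(X_train, y_train)
print("intercept:", model.intercept_)                       # b0 of the equation
print("coefficients:", dict(zip(X_train.columns, model.coef_)))  # one per variable
```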

Theory: what the model is based on

Y-hat is the predicted continuous value from the linear regression equation that builds the model above: the intercept is added to the sum of each coefficient multiplied by its corresponding independent variable.
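Written out in symbols (the standard form of multiple linear regression, with the six predictors of this dataset):

```latex
\hat{Y} = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_6 X_6
```

Here \(\beta_0\) is the intercept, \(\beta_1, \dots, \beta_6\) are the regression coefficients, and \(X_1, \dots, X_6\) are the independent variables (age, sex, bmi, children, smoker, region).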

**Predict the medical cost of someone with 6 known values of the independent variables (age, sex, bmi, amount of children, smoker or not, and their region)**

We predict the medical insurance costs of 3 simulated people with different independent variables. In the first instance, a 50-year-old male with a healthy BMI of 20, 2 children, a non-smoker living in the southeast, is predicted to incur around 7910.65935803 in medical charges (assuming no currency). In the second instance, a 20-year-old female with a low (underweight) BMI of 16, no children, a smoker living in the southwest, is predicted to incur around 21400.98785959. In the last instance, a 50-year-old male with a BMI of 23, no children, a non-smoker living in the northwest, is predicted to incur around 8339.10518512.
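A sketch of how such predictions are made. The model here is fitted on hypothetical stand-in data (so its outputs will not match the figures quoted above), and the rows use assumed numeric codes for the discrete columns (sex male=1/female=0, smoker yes=1/no=0, region southwest=0, southeast=1, northwest=2):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Stand-in training data; the notebook uses the real training split
X_train = pd.DataFrame({
    "age": [19, 33, 50, 20, 45, 62, 28, 36],
    "sex": [0, 1, 1, 0, 1, 0, 1, 0],
    "bmi": [27.9, 22.7, 30.9, 24.6, 25.1, 29.8, 33.0, 26.3],
    "children": [0, 1, 3, 0, 2, 1, 0, 2],
    "smoker": [1, 0, 0, 1, 0, 0, 0, 1],
    "region": [0, 1, 2, 0, 3, 1, 2, 3],
})
y_train = pd.Series([16884.9, 4449.5, 10602.4, 17085.3, 7222.8, 13607.4, 4340.4, 21344.8])
model = LinearRegression().fit(X_train, y_train)

# Three simulated people; column order must match the training columns
people = pd.DataFrame(
    [[50, 1, 20.0, 2, 0, 1],   # 50 y.o. male, BMI 20, 2 children, non-smoker, southeast
     [20, 0, 16.0, 0, 1, 0],   # 20 y.o. female, BMI 16, no children, smoker, southwest
     [50, 1, 23.0, 0, 0, 2]],  # 50 y.o. male, BMI 23, no children, non-smoker, northwest
    columns=X_train.columns)

preds = model.predict(people)   # one predicted charge per person
print(preds)
```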

**Summary of the model**

The model summary indicates whether each independent variable affects the cost of medical charges. Based on the summary, it can be concluded that age, bmi, children, and smoker have an impact on the cost of medical charges, because the p-value of each of these variables is below 0.05, whereas sex and region do not have an impact, because their p-values are above 0.05.

**Calculating Root Mean Squared Error (RMSE)**

Based on an article we found on the web, acceptable values lie between 0.25 (1st quartile) and 0.5 (median/2nd quartile) of the data, whereas values at 0.75 (3rd quartile) and above are considered very good for showing accuracy.

We can refer to the univariate EDA that was displayed using the describe function: the 1st quartile of the medical cost charges is 4740.287150, the 2nd quartile is 9362.033000, the 3rd quartile is 16639.912515, and the max is 63770.428010. The result shown above gives an RMSE value of 5799.587091438356. Therefore, we can conclude that the RMSE value is acceptable and the data can be predicted, because it lies between 4740.287150 (1st quartile) and 63770.428010 (max) of the medical cost charges, even though the accuracy is not relatively high.
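RMSE is the square root of the mean squared difference between actual and predicted values; a sketch using hypothetical stand-in arrays in place of the real test split and model predictions:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Stand-ins; the notebook compares y_test against model.predict(X_test)
y_test = np.array([11830.6, 6079.7])   # actual charges
y_pred = np.array([9000.0, 7500.0])    # predicted charges

rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(rmse)
```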

**Calculating Mean Absolute Percentage Error (MAPE)**

Mean Absolute Percentage Error (MAPE) is an error measure used to evaluate the performance of machine learning models.

Based on the table above, a MAPE score of less than 10% is considered very good, whereas any MAPE score above 50% is considered bad.

The result shown above gives a MAPE value of 47.09302952729565 % for our model, which is considered acceptable based on the metric above.
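MAPE averages the absolute errors as percentages of the actual values; a sketch with the same kind of stand-in arrays (note scikit-learn returns a fraction, so multiply by 100 for a percentage):

```python
import numpy as np
from sklearn.metrics import mean_absolute_percentage_error

# Stand-ins for the real test split and model predictions
y_test = np.array([11830.6, 6079.7])
y_pred = np.array([9000.0, 7500.0])

mape = mean_absolute_percentage_error(y_test, y_pred) * 100
print(f"{mape:.2f} %")
```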

**References**

Bharti, A., & Malik, L. (2022). Regression analysis and prediction of medical insurance cost. *IJCRT*, 10(3), 1–6.

