Unveiling the Impact of Seasonal Trends in Time Series Data

Seasonality trends are recurring patterns or fluctuations that occur at regular intervals within a time series dataset. These patterns usually correspond to specific time frames, like seasons, months, days of the week, or even hours within a day, exhibiting predictable and repetitive behavior. Exploring seasonality trends involves analyzing how these regular fluctuations impact the data over time.

Identifying seasonality begins with visually inspecting the data, looking for repetitive patterns that occur at consistent intervals. For instance, in retail sales data, there might be a noticeable increase in purchases during the holiday season each year. Analytical tools like time series decomposition can further help separate the data into its trend, seasonal, and residual components, making it easier to pinpoint and understand these periodic fluctuations.
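
To make the decomposition step concrete, here is a minimal sketch using statsmodels’ `seasonal_decompose`; the monthly sales series is synthetic and the yearly period of 12 is an assumption chosen for illustration.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose

# Hypothetical monthly retail sales with an upward trend and a yearly cycle
idx = pd.date_range("2016-01-01", periods=96, freq="MS")
rng = np.random.default_rng(0)
t = np.arange(96)
sales = pd.Series(200 + 2 * t + 30 * np.sin(2 * np.pi * t / 12) + rng.normal(0, 5, 96), index=idx)

# period=12 reflects the assumed yearly cycle in monthly data
result = seasonal_decompose(sales, model="additive", period=12)
result.plot()                      # trend, seasonal, and residual panels
plt.show()
seasonal_component = result.seasonal
```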

Understanding seasonality trends is crucial in several domains. In economics, it helps predict consumer behavior during specific times of the year. In finance, recognizing seasonal trends can aid in investment strategies. For instance, certain stocks might exhibit recurring patterns during certain months.

Moreover, handling seasonality in forecasting or modeling is essential. Models like SARIMA (Seasonal Autoregressive Integrated Moving Average) or seasonal adjustments in regression analysis can account for these patterns, allowing for more accurate predictions and assessments by factoring in these regular cyclical changes within the data.

The Power of RandomForestClassifier in Machine Learning

Within the realm of machine learning algorithms, the RandomForestClassifier reigns supreme, celebrated for its versatility and robustness across various tasks. As a stalwart member of ensemble learning, it harnesses the collective might of multiple models to elevate predictive performance. During its training regimen, this classifier orchestrates a symphony of decision trees, amalgamating their insights through majority voting for classification (its regression counterpart averages the trees’ outputs instead). But what sets it apart is its ingenious embrace of randomness during this process.

This randomness isn’t happenstance; it’s a deliberate strategy. By cherry-picking random subsets of features for each tree and training them on bootstrapped data samples—known affectionately as bagging—the RandomForestClassifier weaves a shield against overfitting. This infusion of randomness fosters diversity among individual trees, paving the way for the model’s exceptional ability to generalize.

The RandomForestClassifier isn’t a rigid framework; it’s a canvas of possibilities. Its hyperparameters offer a spectrum of customization options. Users can fine-tune elements like the number of trees (n_estimators), the depth of each tree (max_depth), and the number of features considered at every split (max_features) to craft a model perfectly attuned to their dataset.
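
As a rough illustration of tuning those hyperparameters in scikit-learn, here is a minimal sketch; the synthetic data and the specific values chosen are purely for demonstration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

clf = RandomForestClassifier(
    n_estimators=200,      # number of trees
    max_depth=10,          # depth of each tree
    max_features="sqrt",   # features considered at every split
    random_state=42,
)
clf.fit(X_train, y_train)
print("Test accuracy:", clf.score(X_test, y_test))
```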

In the real world, the RandomForestClassifier is the go-to for classification tasks. Its prowess lies in deciphering intricate data relationships, warding off overfitting perils, and furnishing robust predictions. Its allure—versatility, simplicity, and unwavering effectiveness—propels it to the forefront of diverse machine-learning applications.

Mastering Data Detective Skills: Navigating Outliers in Property Evaluation

In today’s project journey, I developed a critical skill: detecting outliers within the property evaluation dataset. I used sophisticated statistical tools such as box plots and Z-scores to identify these anomalies—those peculiar data points that could potentially throw a wrench into our findings. It’s like discovering an oddly shaped puzzle piece that doesn’t quite fit the picture.
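
A minimal sketch of those two checks (box plot and Z-scores) on made-up assessed values; the column name and the threshold of 3 standard deviations are illustrative assumptions, not choices from the actual dataset.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats

# Hypothetical assessed values: mostly standard homes plus a few mansion-like outliers
rng = np.random.default_rng(0)
values = np.concatenate([rng.normal(450_000, 60_000, 500), [2_500_000, 3_100_000]])
df = pd.DataFrame({"assessed_value": values})

# Box plot: quick visual check for extreme values
df.boxplot(column="assessed_value")
plt.show()

# Z-scores: flag points more than 3 standard deviations from the mean
z = np.abs(stats.zscore(df["assessed_value"]))
outliers = df[z > 3]
print(f"{len(outliers)} potential outliers found")
```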

Why is this significant?

These outliers have the potential to skew our predictions and compromise the accuracy of our models. Consider forecasting real estate costs and suddenly encountering a mansion within a dataset of standard residences. The presence of such an outlier could significantly disrupt our projections, given its stark contrast to the norm. Understanding and addressing these anomalies is vital for refining the precision and trustworthiness of our forecasts.

Spotting outliers is essential, but understanding their impact is equally crucial. It’s akin to gauging how much that peculiar jigsaw piece alters the overall image. While some outliers might have minimal impact, others could entirely reshape our interpretation of the data.

Consequently, today’s lesson isn’t solely about pinpointing anomalies; it’s about ensuring they don’t derail our analysis and forecasts. Managing these outliers resembles being a data detective—detecting and effectively handling them to augment the accuracy and reliability of our project.

Essential Steps in SARIMA Model Estimation and Forecasting

The estimation and forecasting process for SARIMA (Seasonal Autoregressive Integrated Moving Average) models involves several fundamental steps. Initially, it requires a comprehensive understanding of the time series data under consideration. This involves examining its inherent patterns, trends, and any seasonal fluctuations. These observations are pivotal in determining the specific parameters required for the SARIMA model.

Once the parameters are specified—encompassing seasonal and non-seasonal autoregressive, differencing, and moving average terms—the next step involves estimating these values. Techniques like maximum likelihood estimation (MLE) or iterative methods are typically employed for this purpose. With the parameters estimated, the SARIMA model is constructed and fitted to historical data.

A critical aspect of this process involves analyzing the residuals—the differences between observed and predicted values—to ensure the model captures the underlying patterns effectively and doesn’t exhibit significant systematic errors. Following this validation, the model becomes a powerful tool for forecasting future values of the time series, projecting ahead based on the identified patterns and historical data.

Subsequent to generating forecasts, it’s essential to assess their accuracy. Comparing forecasted values against actual observations helps refine the model, enabling adjustments to parameters or considering alternative models to enhance predictive performance. Additionally, computing prediction intervals or confidence intervals around the forecasted values quantifies the uncertainty associated with these predictions, providing a clearer understanding of the forecast’s reliability and potential variability. These steps collectively form a robust methodology for SARIMA model estimation and forecasting, facilitating the analysis, modeling, and prediction of time series data while accounting for both seasonal and non-seasonal variations.
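
A minimal sketch of these steps with statsmodels’ SARIMAX, fitted to a synthetic monthly series; the (1, 1, 1)×(1, 1, 1, 12) order is a placeholder rather than a tuned choice.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

# Hypothetical monthly series with trend and yearly seasonality, in place of real data
idx = pd.date_range("2015-01-01", periods=96, freq="MS")
rng = np.random.default_rng(0)
t = np.arange(96)
y = pd.Series(10 + 0.1 * t + 5 * np.sin(2 * np.pi * t / 12) + rng.normal(0, 1, 96), index=idx)

model = SARIMAX(y, order=(1, 1, 1), seasonal_order=(1, 1, 1, 12))
fit = model.fit(disp=False)                 # maximum likelihood estimation

print(fit.summary())                        # inspect estimated coefficients
resid = fit.resid                           # residuals for adequacy checks

forecast = fit.get_forecast(steps=12)
print(forecast.predicted_mean)              # point forecasts
print(forecast.conf_int())                  # prediction intervals
```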

Decision Trees Unveiled: Crafting Insightful Predictive Models

A decision tree, a supervised machine learning model, employs a tree-like structure to depict decisions and their potential outcomes. The algorithm recursively partitions the dataset based on specific attribute values, starting from the root node containing all the data. Each decision node signifies an attribute test, while each leaf node represents the ultimate choice or result. The objective of the splitting criterion is to produce groups that are as homogeneous as possible internally and as distinct as possible from one another. The transparency and interpretability of decision trees enhance the understanding of the model’s decision-making process. Pruning is commonly used to address the tendency toward overfitting. Popular techniques for building decision trees include CART, C4.5, and ID3, with applications in both regression and classification tasks. Decision trees also serve as the foundation for ensemble techniques like Gradient Boosting and Random Forests, improving predictive performance.

Here’s an overview of how a decision tree operates:

  • Root Node: Represents the entire dataset, divided into subgroups based on the selected attribute’s value.
  • Decision Nodes (Internal Nodes): Nodes representing decisions based on attribute values, featuring branches leading to child nodes with various attribute values.
  • Leaf Nodes: Terminal nodes signifying the ultimate choices or results. In classification tasks, each leaf node is associated with a specific class label; in regression tasks, it corresponds to a numerical value.
  • Splitting Criteria: The algorithm selects the feature that optimally divides the data at each decision node, aiming to make the resulting subsets as homogeneous as possible (for example, by minimizing Gini impurity or entropy).
  • Recursive Process: The splitting process is applied recursively to each subset to create a tree structure. This continues until a specified point is reached, such as a certain depth, a minimum sample requirement in a node, or when homogeneity cannot be further improved.
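
A minimal sketch of fitting and inspecting such a tree with scikit-learn’s DecisionTreeClassifier, using the bundled iris data purely as an example; the stopping rules shown are arbitrary illustrative values.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

tree = DecisionTreeClassifier(
    criterion="gini",        # impurity-based splitting criterion
    max_depth=3,             # stopping rule: maximum depth
    min_samples_leaf=5,      # stopping rule: minimum samples per leaf
    random_state=0,
)
tree.fit(iris.data, iris.target)

# Text view of the root, decision nodes, and leaf nodes described above
print(export_text(tree, feature_names=iris.feature_names))
```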

Sculpting Simplicity: The Art of Decision Tree Pruning

Pruning a decision tree is a method employed to simplify its structure, preventing it from becoming overly complex and thus ensuring optimal performance on new, unseen data. The primary goal of pruning is to streamline the tree by eliminating unnecessary branches while preserving its predictive capabilities. Two main pruning approaches are utilized: pre-pruning and post-pruning.

Pre-pruning, also referred to as early stopping, involves imposing constraints during the tree-building process. This may include setting limits on the tree’s maximum depth, specifying the minimum number of samples needed to split a node, or establishing a threshold for the minimum number of samples permitted in a leaf node. These constraints serve as safeguards to prevent the tree from growing excessively intricate or becoming too tailored to the training data.

On the other hand, post-pruning, also known as cost-complexity pruning, follows a process where the full tree is initially constructed, and later, branches that contribute minimally to improving predictive performance are removed. The decision tree is allowed to grow without restrictions initially, and then nodes are pruned based on a cost-complexity measure considering both the accuracy of the tree and its size. Nodes that do not significantly improve accuracy are pruned, resulting in a simplified overall model.
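
A minimal sketch of both approaches in scikit-learn: pre-pruning through constructor constraints and post-pruning through cost-complexity pruning (`ccp_alpha`); the alpha chosen below is illustrative, not tuned.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Pre-pruning: stop growth early with explicit limits
pre_pruned = DecisionTreeClassifier(max_depth=4, min_samples_leaf=10, random_state=0)
pre_pruned.fit(X_train, y_train)

# Post-pruning: compute the cost-complexity path, then refit with a penalty
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)
alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]   # illustrative choice, not tuned
post_pruned = DecisionTreeClassifier(ccp_alpha=alpha, random_state=0)
post_pruned.fit(X_train, y_train)

print("pre-pruned :", pre_pruned.score(X_test, y_test))
print("post-pruned:", post_pruned.score(X_test, y_test))
```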

Exploring ARIMA Models: Estimation, Fitting, and Forecasting in Time Series Analysis

The process of estimating and forecasting with ARIMA models encompasses several essential steps. After identifying and analyzing a time series, the next phase involves determining suitable values for the model parameters (p, d, q). This often entails scrutinizing autocorrelation and partial autocorrelation plots to guide the selection of autoregressive and moving average orders. To achieve stationarity, differencing is applied, and the order of differencing (d) is determined accordingly.

The estimation of ARIMA parameters typically employs maximum likelihood estimation (MLE) methods. Subsequently, the model is fitted to historical data, and the residuals (differences between observed and predicted values) undergo examination to ensure the absence of significant patterns, indicating a well-fitted model.

Once the ARIMA model is successfully estimated and validated, it becomes a valuable tool for forecasting future values of the time series. Forecasting involves advancing the model forward in time, generating predicted values based on the estimated autoregressive and moving average parameters. Additionally, confidence intervals can be computed to offer a measure of uncertainty around the point forecasts.
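
A minimal sketch of this workflow with statsmodels’ ARIMA on a synthetic random-walk series; the (1, 1, 1) order stands in for values that would normally come from ACF/PACF inspection.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Hypothetical non-stationary series (random walk with drift) in place of real data
rng = np.random.default_rng(1)
y = pd.Series(np.cumsum(0.2 + rng.normal(0, 1, 200)),
              index=pd.date_range("2020-01-01", periods=200, freq="D"))

model = ARIMA(y, order=(1, 1, 1))     # p, d, q guided by ACF/PACF plots and differencing
fit = model.fit()                     # maximum likelihood estimation

print(fit.summary())
residuals = fit.resid                 # inspect for remaining structure

forecast = fit.get_forecast(steps=10)
print(forecast.predicted_mean)
print(forecast.conf_int(alpha=0.05))  # 95% intervals around the point forecasts
```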

Despite the widespread utilization of ARIMA models, they have limitations, such as assuming linearity and stationarity. In practical applications, other advanced time series models like SARIMA (Seasonal ARIMA) or machine learning approaches may be employed to address these limitations and enhance forecasting accuracy. Nevertheless, ARIMA models retain their value as an accessible and valuable tool for time series analysis and forecasting.

Versatile Applications of SARIMA Models in Time Series Forecasting Across Industries

Seasonal AutoRegressive Integrated Moving Average (SARIMA) models are an extension of the ARIMA model that incorporates seasonality. SARIMA models find applications in various fields where time series data exhibits recurring patterns and seasonal fluctuations. Here are some notable applications:

1. Retail Sales Forecasting:
SARIMA models are used to forecast retail sales by capturing the seasonality associated with holidays, promotions, and other recurring patterns. Retailers can optimize inventory and staffing based on accurate sales predictions.

2. Demand Forecasting in Supply Chain:
In supply chain management, SARIMA models help forecast demand for products, considering seasonal variations. This is crucial for optimizing production schedules, inventory levels, and distribution plans.

3. Energy Consumption Prediction:
SARIMA models are applied in the energy sector to predict electricity consumption. Utilities use these forecasts for efficient resource allocation, managing demand peaks, and planning maintenance activities.

4. Tourism and Hospitality:
SARIMA models are employed in predicting tourist arrivals, hotel bookings, and other tourism-related activities. This aids in optimizing staffing levels, pricing strategies, and marketing efforts.

5. Financial Time Series Analysis:
SARIMA models are used in finance for modeling and forecasting financial time series with recurring patterns, such as stock prices or currency exchange rates. This helps investors and financial institutions make informed decisions.

6. Economic Indicators Forecasting:
SARIMA models are applied to forecast economic indicators, such as quarterly GDP, unemployment rates, and consumer spending. Governments and policymakers use these forecasts for economic planning and decision-making.

7. Weather and Climate Modeling:
SARIMA models can be used in meteorology to forecast climate variables with a strong seasonal component, such as temperature, precipitation, or humidity. These forecasts are essential for agricultural planning and disaster preparedness.

8. Public Health:
SARIMA models are employed in public health for predicting the seasonal patterns of diseases. For example, forecasting the spread of flu or other infectious diseases helps healthcare providers allocate resources effectively.

9. Traffic and Transportation Planning:
SARIMA models can be utilized to forecast traffic patterns and transportation demand, considering daily or weekly variations. This aids in optimizing traffic signal timings, public transportation schedules, and infrastructure planning.

10. Manufacturing Production Planning:
SARIMA models are applied in manufacturing to forecast production levels, considering seasonality and cyclic patterns. This assists in optimizing inventory levels and production schedules.

SARIMA models are versatile and effective tools for time series forecasting, especially when the data exhibits both trend and seasonality. Their applications span various industries, providing valuable insights for decision-making, resource optimization, and planning.

Analyzing Time-Series Data:

Data Preprocessing:
Data preprocessing is a critical step in preparing time-series data for analysis. It involves several key tasks:

1. Cleaning Data:
Address missing values by imputation or removal, ensuring a complete dataset.
Handle outliers to prevent them from disproportionately influencing analysis and model performance.

2. Ensuring Stationarity:
Confirm or achieve stationarity by examining mean and variance over time. If necessary, apply differencing to stabilize the data.

3. Handling Time Stamps:
Ensure consistent and accurate time stamps. This involves sorting data chronologically and handling irregular time intervals.

4. Resampling:
Adjust the frequency of observations if needed, such as aggregating or interpolating data to a common time interval.

5. Scaling:
Normalize or scale the data if there are significant differences in magnitudes between variables.
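
A minimal sketch of these preprocessing steps with pandas, applied to a synthetic, irregular series; the daily resampling frequency is an assumption made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical shuffled hourly readings with gaps, in place of a real file
rng = np.random.default_rng(2)
ts = pd.date_range("2023-01-01", periods=500, freq="h")
df = pd.DataFrame({"timestamp": rng.permutation(ts), "value": rng.normal(20, 3, 500)})
df.loc[rng.choice(500, 25, replace=False), "value"] = np.nan    # inject missing values

df = df.sort_values("timestamp").set_index("timestamp")         # consistent, ordered time stamps
df["value"] = df["value"].interpolate()                         # fill missing values

daily = df["value"].resample("D").mean()                        # resample to a common interval
daily_diff = daily.diff().dropna()                              # differencing toward stationarity
scaled = (daily_diff - daily_diff.mean()) / daily_diff.std()    # scale to zero mean, unit variance
```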

Autocorrelation Analysis:
Autocorrelation analysis is crucial for understanding the temporal dependencies within a time series. Key steps include:

1. Autocorrelation Function (ACF):
Plot the ACF to visualize the correlation between a time series and its lagged values. Peaks in the ACF indicate potential lag values for autoregressive components.

2. Partial Autocorrelation Function (PACF):
The PACF isolates the direct relationship between a point and its lag, helping to identify the optimal lag order for autoregressive terms.

3. Interpretation:
Analyze the decay of correlation values in ACF and PACF plots to determine the presence of seasonality and the appropriate lag values for model components.
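
A minimal sketch of this ACF/PACF inspection using statsmodels’ plotting helpers on a simulated AR(1) series; the series is synthetic and used only to illustrate the plots.

```python
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

# Hypothetical AR(1) process: y[t] = 0.6 * y[t-1] + noise
rng = np.random.default_rng(0)
y = np.zeros(300)
for t in range(1, 300):
    y[t] = 0.6 * y[t - 1] + rng.normal()

fig, axes = plt.subplots(2, 1, figsize=(8, 6))
plot_acf(y, lags=40, ax=axes[0])     # gradual decay suggests an AR component
plot_pacf(y, lags=40, ax=axes[1])    # sharp cut-off after lag 1 suggests AR order p = 1
plt.tight_layout()
plt.show()
```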

Model Selection and Validation:
Selecting an appropriate model and validating its performance are crucial for accurate predictions. Key steps include:

1. Choosing a Model:
Consider ARIMA, SARIMA, or machine learning models like LSTM based on the data’s characteristics and temporal patterns.

2. Training and Testing Sets:
Split the data into training and testing sets, reserving a portion for model validation.

3. Model Fitting:
Train the selected model on the training set using appropriate parameters.

4. Evaluation Metrics:
Validate the model using metrics such as Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and Mean Absolute Error (MAE).

5. Iterative Adjustment:
Adjust the model parameters iteratively based on performance evaluation, ensuring optimal accuracy.

Visualize the Time Series:
Visualizing the time series aids in understanding its patterns and structure:

1. Time Series Plot:
Plot the raw time series data to identify overall trends, seasonality, and potential outliers.

2. Decomposition:
Decompose the time series into trend, seasonality, and residual components to better understand underlying patterns.

3. Component Plots:
Plot individual components (trend, seasonality, residuals) to analyze their contribution to the overall time series behavior.

4. Forecasting Visualization:
Plot actual vs. predicted values to assess the model’s performance in capturing the observed patterns.

Effective data preprocessing, autocorrelation analysis, model selection, and visualization collectively contribute to a robust time series analysis, enabling accurate forecasting and insightful interpretation of temporal patterns.

Exploring Boston’s Economic Landscape: A Deep Dive into the 2013 Dataset

Today, I delved into a comprehensive dataset centered on Boston’s economic indicators for the year 2013. This trove of information provides a detailed panorama of key facets shaping the city’s economic tapestry.

One of the pivotal segments of this dataset revolves around tourism, spotlighting metrics such as passenger traffic and international flight activities at Logan Airport. These insights serve as a barometer of Boston’s connectivity and allure to visitors, offering vital clues about the city’s tourism industry dynamics.

Shifting gears, the dataset delves into the realms of the hotel market and labor sector, presenting a meticulous analysis of hotel occupancy rates, average daily rates, total employment figures, and unemployment rates. These granular metrics paint a vivid picture of Boston’s hospitality landscape and labor market, providing invaluable insights into the factors influencing employment trends and economic resilience.

Moreover, the dataset delves deeper into the real estate domain, unearthing details about approved development projects, foreclosure rates, housing sales, and construction permits. This segment unveils a multifaceted view of Boston’s real estate dynamics, capturing trends in housing demand, affordability, and the pulse of development activities across the city.

In essence, this dataset stands as a treasure trove for anyone seeking a comprehensive understanding of Boston’s economic ecosystem in the year 2013. Its nuanced insights into tourism, labor, and real estate paint a rich portrait of the city’s economic vitality and underlying trends.

Understanding the Framework of Generalized Linear Models (GLMs)

Generalized Linear Models (GLMs) are a versatile class of statistical models that extend the framework of traditional linear regression. The “model” in a GLM context refers to the relationship between a response variable and one or more predictor variables.

Key elements of a GLM:

  1. Link Function: GLMs accommodate various types of response variables (e.g., binary, count, continuous) by introducing a link function that connects the linear predictor to the expected value of the response. This link function accounts for the non-normal distribution of the response variable.
  2. Linear Predictor: Similar to linear regression, GLMs involve a linear combination of predictor variables, each weighted by its corresponding coefficient. However, the link function transforms this linear predictor to suit the distributional properties of the response variable.
  3. Family of Distributions: GLMs can handle a wide array of distributions for the response variable, such as Gaussian (normal), binomial, Poisson, and gamma distributions, among others. Each distribution within the GLM family has its own set of link functions.
  4. Estimation of Parameters: The parameters in a GLM, including coefficients for predictors and dispersion parameters, are typically estimated using maximum likelihood estimation or iteratively reweighted least squares, depending on the specific distributional assumptions.

Overall, GLMs offer a flexible framework for modeling relationships between variables in diverse settings where traditional linear regression might not be appropriate due to non-normality, heteroscedasticity, or other distributional issues in the response variable. They find extensive applications in fields such as healthcare, economics, biology, and social sciences.
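
For concreteness, here is a minimal sketch of a Poisson GLM in statsmodels; the synthetic count data and the hypothetical `n_claims ~ age + region` formula are chosen only to illustrate the elements above.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Hypothetical count data (e.g., insurance claims) used only for illustration
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.integers(18, 70, 300),
    "region": rng.choice(["north", "south"], 300),
})
df["n_claims"] = rng.poisson(lam=np.exp(0.02 * df["age"]), size=300)

# Family = Poisson (its canonical log link is the default); linear predictor = age + region
model = smf.glm("n_claims ~ age + region", data=df, family=sm.families.Poisson())
result = model.fit()      # maximum likelihood via iteratively reweighted least squares
print(result.summary())
```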

Key Aspects

In machine learning and statistical analysis, decision-making involves using algorithms to analyze data and make predictions or classifications. Decision-making is crucial in various applications, from identifying patterns in data to making informed predictions about future outcomes.

Key Aspects:

  1. Decision Trees:

– Decision trees are a common tool for decision-making in machine learning. They involve creating a tree-like structure where decisions are made at each node based on specific features.

  2. Classification and Regression:

– Decision-making is often categorized into classification (assigning labels to data) and regression (predicting numeric values). Decision trees can be used for both tasks.

  3. Training and Testing:

– Models are trained on a subset of data to learn patterns and relationships. The performance is then evaluated on a separate test set to ensure the model generalizes well to new, unseen data.

  4. Performance Metrics:

– The performance of decision-making models is assessed using metrics such as accuracy, precision, recall, F1 score (for classification), and mean squared error (for regression). These metrics quantify how well the model aligns with the actual outcomes.

  5. Overfitting and Underfitting:

– Overfitting occurs when a model is too complex and performs well on training data but poorly on new data. Underfitting happens when a model is too simple and cannot capture the underlying patterns. Balancing these extremes is crucial for optimal performance.

  6. Cross-Validation:

– Cross-validation is a technique where the dataset is split into multiple subsets, and the model is trained and tested multiple times. This helps provide a more robust evaluation of performance.

  7. Hyperparameter Tuning:

– Adjusting hyperparameters, such as the depth of a decision tree, is essential for optimizing model performance. Grid search and random search are common techniques for hyperparameter tuning.

  8. Ensemble Methods:

– Ensemble methods, like Random Forests, combine multiple decision-making models to improve overall performance and reduce overfitting.

Overall, effective decision-making in machine learning involves designing models that can generalize well to new data, optimizing hyperparameters, and utilizing performance metrics to assess the model’s accuracy and reliability.

Chi-Square Test

The Chi-Square test is a statistical method used to determine if there is a significant association or dependence between two categorical variables. It is particularly valuable in analyzing data that is organized into categories and is often employed in various fields such as statistics, biology, sociology, and market research.

The test assesses whether the observed distribution of data in a contingency table (a table that displays the frequency of occurrences for various combinations of two categorical variables) is significantly different from what would be expected under the assumption that the variables are independent. In other words, the Chi-Square test helps researchers and analysts understand if there is a relationship between the variables beyond what would occur by chance.

There are different versions of the Chi-Square test, each designed for specific purposes:

Chi-Square Test for Independence (or χ² Test for Independence):

Determines if there is a significant association between two categorical variables. It is often used to explore the dependency of one variable on another in research studies.

Chi-Square Goodness-of-Fit Test:

Examines whether observed data follows a particular distribution, like the normal or uniform distribution. It is commonly used to assess how well a model or hypothesis fits the observed data.

Chi-Square Test for Homogeneity:

Assesses whether the distribution of a categorical variable remains consistent across different groups or populations. This version is useful when comparing the distribution of a variable in multiple categories.

The Chi-Square test is a powerful tool for detecting patterns and relationships in categorical data, providing insights into the underlying structure of the variables being studied.
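
A minimal sketch of the test for independence with SciPy’s `chi2_contingency`; the contingency table counts are made up for illustration.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical contingency table: rows = gender, columns = product preference
observed = np.array([[30, 10, 20],
                     [35, 15, 10]])

chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"chi2 = {chi2:.3f}, p = {p_value:.4f}, dof = {dof}")
# A small p-value (e.g. < 0.05) suggests the two variables are not independent
```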

Clustering Techniques

Other clustering techniques-

Hierarchical Clustering:

Agglomerative Hierarchical Clustering: This method starts with individual data points as separate clusters and merges them based on similarity until one cluster is formed. The result is a tree-like structure or dendrogram.

Divisive Hierarchical Clustering: The opposite of agglomerative clustering, divisive hierarchical clustering starts with one cluster that includes all data points and recursively divides it into smaller clusters.

K-Medoids:

K-Medoids is similar to K-means but instead of using the mean as a center, it uses the medoid, which is the most centrally located point in a cluster. This makes K-medoids less sensitive to outliers than K-means.

Gaussian Mixture Model (GMM):

GMM assumes that the data is generated by a mixture of several Gaussian distributions. It is a probabilistic model that assigns a probability to each point belonging to a certain cluster, allowing for soft assignments.

OPTICS (Ordering Points To Identify the Clustering Structure):

OPTICS is a density-based clustering algorithm similar to DBSCAN but with a different approach to ordering points. It creates a reachability plot, which helps in identifying clusters of varying shapes and densities.
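
A minimal sketch of three of these alternatives (agglomerative hierarchical clustering, GMM, and OPTICS) in scikit-learn, on synthetic blob data used purely for illustration.

```python
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture
from sklearn.cluster import AgglomerativeClustering, OPTICS

X, _ = make_blobs(n_samples=300, centers=3, random_state=7)

# Agglomerative (bottom-up) hierarchical clustering
hier_labels = AgglomerativeClustering(n_clusters=3).fit_predict(X)

# GMM: soft assignments via per-component membership probabilities
gmm = GaussianMixture(n_components=3, random_state=7).fit(X)
probs = gmm.predict_proba(X)          # probability of each point under each component
gmm_labels = gmm.predict(X)

# OPTICS: density-based, no fixed cluster count; label -1 marks noise
optics_labels = OPTICS(min_samples=10).fit_predict(X)
```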

Sensitivity to Parameter Choices

Sensitivity to Parameter Choices:

DBSCAN requires configuring hyperparameters like ε (the maximum distance defining a point’s neighborhood) and the minimum number of points needed to establish a dense region. These choices significantly influence the resulting clusters. K-means requires only the number of clusters (K), which is generally easier to determine because it directly reflects the desired cluster count. DBSCAN’s more abstract parameters make it sensitive to the values selected.

Boundary Points and Noise:

DBSCAN explicitly identifies noise points (those not belonging to any cluster) and handles outliers well. However, the delineation of boundary points within DBSCAN can be arbitrary. In K-means, points at cluster boundaries may be assigned to neighboring clusters, potentially causing instability when a point is close to the boundary shared by two clusters.

K-means Clustering:

K-means Clustering:

K-means is a popular clustering algorithm that aims to partition a dataset into K clusters, where K is a user-defined parameter. It works by iteratively assigning data points to the nearest cluster center and updating the cluster centers to minimize the within-cluster sum of squares. K-means is efficient and works well when clusters are spherical and have roughly equal sizes. It’s widely used for data segmentation, customer segmentation, image compression, and more.

 

K-medoids Clustering:

K-medoids, a variation of K-means, is a clustering algorithm that selects data points as cluster representatives (medoids) rather than the mean of the data in each cluster. K-medoids aims to minimize the total dissimilarity between data points and their respective medoids. This makes K-medoids more robust to outliers and noise compared to K-means. It’s used in scenarios where the mean might not be a suitable representative, such as when working with non-Euclidean distances or categorical data.

 

DBSCAN (Density-Based Spatial Clustering of Applications with Noise):

DBSCAN is a density-based clustering algorithm that identifies clusters as dense regions of data points separated by areas of lower point density. Unlike K-means, DBSCAN does not require the user to specify the number of clusters beforehand. It works by defining core points, which have a minimum number of data points within a specified radius, and connecting these core points to form clusters. Data points that are not part of any cluster are considered outliers. DBSCAN is effective at identifying clusters of arbitrary shapes and is robust to noise. It’s used in applications like anomaly detection, image segmentation, and geographic data analysis.

 

These three clustering algorithms offer different approaches to partitioning data into clusters and are suited to various types of data and applications. K-means and K-medoids are partitional clustering methods, while DBSCAN is a density-based method. The choice of which algorithm to use depends on the data, the desired number of clusters, the shape of the clusters, and the presence of noise or outliers in the dataset
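
A minimal sketch contrasting K-means and DBSCAN on the same synthetic data; K-medoids is omitted here only because it is provided by the separate scikit-learn-extra package rather than core scikit-learn.

```python
from sklearn.datasets import make_moons
from sklearn.cluster import KMeans, DBSCAN

# Two interleaving half-moons: non-spherical clusters
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# K-means: requires K up front; struggles with non-spherical cluster shapes
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# DBSCAN: no K needed; eps and min_samples define density, label -1 marks noise
dbscan_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)
```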

Logistic & Multinomial regression

Binary Logistic Regression, Ordinal Logistic Regression, and Multinomial Logistic Regression:

 

Binary Logistic Regression, Ordinal Logistic Regression, and Multinomial Logistic Regression are three distinct types of logistic regression models, each tailored to specific scenarios and types of dependent variables:

 

Binary Logistic Regression:

– *Dependent Variable:* Binary Logistic Regression is used when the dependent variable is binary, meaning it has only two possible categories or outcomes (e.g., yes/no, 0/1, true/false).

– *Examples:* Predicting whether a customer will make a purchase (yes/no), determining if a patient has a particular medical condition (positive/negative), or forecasting whether a student will pass an exam (pass/fail).

– *Number of Outcomes:* It deals with a binary (two-category) dependent variable.

– *Model Type:* Binary Logistic Regression models the log-odds of one category relative to the other and utilizes a logistic function to transform these log-odds into probabilities.

 

Ordinal Logistic Regression:

– *Dependent Variable:* Ordinal Logistic Regression is employed when the dependent variable is ordinal, meaning it has ordered categories with a clear sequence but not necessarily equally spaced intervals.

– *Examples:* Predicting student performance categories (e.g., poor, average, good), analyzing customer satisfaction levels (e.g., low, medium, high), or assessing patient pain levels (e.g., mild, moderate, severe).

– *Number of Outcomes:* It is suitable for dependent variables with multiple ordered categories.

– *Model Type:* Ordinal Logistic Regression models the cumulative probabilities of the ordinal categories using a proportional odds or cumulative logit model.

 

Multinomial Logistic Regression:

– *Dependent Variable:* Multinomial Logistic Regression is used when the dependent variable is nominal, meaning it has multiple categories with no inherent order or ranking.

– *Examples:* Predicting a person’s job type (e.g., teacher, engineer, doctor), analyzing the preferred mode of transportation (e.g., car, bus, bicycle), or evaluating product color choices (e.g., red, blue, green).

– *Number of Outcomes:* It is suitable for dependent variables with more than two non-ordered categories.

– *Model Type:* Multinomial Logistic Regression models the probability of each category relative to a reference category, often using dummy variables.

 

In summary, the choice of logistic regression type depends on the nature of the dependent variable. If it has two categories and no inherent order, Binary Logistic Regression is appropriate. If the categories are ordered, Ordinal Logistic Regression is the choice. When the categories are nominal and have no order, Multinomial Logistic Regression is the suitable model. Each of these regression types serves as a valuable tool for analyzing and making predictions based on categorical outcomes in various fields.

Logistic regression:

Logistic regression is a statistical method used for analyzing datasets in which there are one or more independent variables that determine an outcome. It is specifically designed for situations where the dependent variable is binary, meaning it has only two possible outcomes, often denoted as 0 and 1, or as “success” and “failure.” Logistic regression is widely used for various applications, including predicting the probability of an event happening based on certain factors.

The key elements of logistic regression and its analysis include:

 

  1. Binary Outcome: Logistic regression is employed when the dependent variable is categorical with two levels (binary). For example, it can be used to predict whether a customer will make a purchase (1) or not (0) based on factors like age, income, and past behavior.

 

  2. Log-Odds Transformation: Logistic regression models the relationship between the independent variables and the log-odds of the binary outcome. The log-odds are then transformed into a probability using the logistic function, which produces an S-shaped curve.

 

  3. Coefficients: Logistic regression estimates coefficients for each independent variable, which determine the direction and strength of the relationship with the binary outcome. These coefficients can be used to assess the impact of the independent variables on the probability of the event occurring.

 

  4. Odds Ratio: The exponentiation of the coefficient for an independent variable gives the odds ratio. It quantifies how a one-unit change in the independent variable affects the odds of the binary outcome. An odds ratio greater than 1 indicates an increase in the odds of the event, while an odds ratio less than 1 suggests a decrease.

 

  5. Model Evaluation: The performance of a logistic regression model is typically assessed using various metrics, such as accuracy, precision, recall, and the receiver operating characteristic (ROC) curve. These metrics help determine how well the model predicts the binary outcome.

Logistic regression analysis involves fitting the model to the data, estimating coefficients, and using the model to make predictions. It is a valuable tool in fields like healthcare, marketing, finance, and social sciences for understanding and predicting binary outcomes and making informed decisions based on data.
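
A minimal sketch of binary logistic regression in scikit-learn, including the odds-ratio interpretation described above; the synthetic data is illustrative, and note that scikit-learn applies L2 regularization by default, which slightly shrinks the coefficients.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, roc_auc_score

X, y = make_classification(n_samples=500, n_features=4, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

clf = LogisticRegression().fit(X_train, y_train)

odds_ratios = np.exp(clf.coef_[0])        # exponentiated coefficients
print("Odds ratios:", odds_ratios)

pred = clf.predict(X_test)
proba = clf.predict_proba(X_test)[:, 1]
print("Accuracy:", accuracy_score(y_test, pred))
print("ROC AUC :", roc_auc_score(y_test, proba))
```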

Key features of GLMMs

Generalized Linear Mixed Models (GLMMs) combine aspects of Generalized Linear Models (GLMs) and mixed effects models, offering a versatile statistical approach. They are particularly valuable for handling non-normally distributed data with correlations and hierarchical structures, making them ideal for analyzing complex datasets like police fatal shootings. In this context, GLMMs serve to uncover demographic disparities, identify temporal patterns, analyze geographic distributions, quantify risk factors, and evaluate the impact of policy changes, contributing to a deeper understanding of this critical issue in law enforcement.

  1. Generalization of GLMs: GLMMs extend the capabilities of GLMs, which are used for modeling relationships between a response variable and predictor variables, by allowing for the modeling of non-Gaussian distributions, like binomial or Poisson distributions, and by incorporating random effects.
  2. Random Effects: GLMMs include random effects to model the variability between groups or clusters in the data. These random effects account for the correlation and non-independence of observations within the same group.
  3. Fixed Effects: Like in GLMs, GLMMs also include fixed effects, which model the relationships between predictor variables and the response variable. Fixed effects are often of primary interest in statistical analysis.
  4. Link Function: Similar to GLMs, GLMMs use a link function to relate the linear combination of predictor variables and the response variable. Common link functions include the logit, probit, and log for binomial, Poisson, and Gaussian distributions, respectively.

  5. Likelihood Estimation: GLMMs typically use maximum likelihood estimation to estimate model parameters, including fixed and random effects.

Understanding Bias in Data

In my Python-based analysis of the ‘fatal-police-shootings-data’ dataset, I’ve examined various variables and their distributions. Notably, ‘age’ provides insights into the ages of individuals involved in these tragic incidents, and the dataset includes precise latitude and longitude information. During this preliminary assessment, I’ve identified an insignificant ‘id’ column and explored missing values and duplicate records. As we proceed, the next phase of our analysis will focus on the distribution of ‘age.’

 

In a recent classroom session, we learned how to compute geospatial distances, enabling us to create GeoHistograms for visualizing geographic data trends, identifying hotspots, and uncovering clusters within location-related datasets. This newfound expertise enhances our understanding of underlying phenomena within the data.

Clustering and diff between K-Means and Hierarchical Clustering:

A cluster, in the context of data analysis and machine learning, refers to a group of data points or objects that are similar to each other in some way. The goal of cluster analysis is to group data points into clusters based on their similarities or dissimilarities. These clusters can help reveal patterns, structures, or natural groupings within the data that might not be apparent through other means.

 

Cluster Analysis:

Cluster analysis, also known as clustering, is a technique used to discover and group similar data points or objects into clusters. It is a form of unsupervised learning, as it doesn’t require predefined labels or categories for the data. Cluster analysis is employed in various fields, such as marketing for customer segmentation, biology for species classification, and image processing for object recognition, among others.

 

Differences Between K-Means and Hierarchical Clustering:

K-Means and Hierarchical Clustering are two common approaches in cluster analysis, and they differ in several key ways:

 

  1. Number of Clusters:

– K-Means: Requires specifying the number of clusters (K) in advance. It aims to partition the data into exactly K clusters.

– Hierarchical Clustering: Does not require specifying the number of clusters in advance. It produces a hierarchy of clusters, and you can choose the number of clusters at a later stage by cutting the dendrogram at an appropriate level.

 

  2. Hierarchy:

– K-Means: Doesn’t produce a hierarchy. It assigns data points to fixed clusters, and the process is not inherently nested.

– Hierarchical Clustering: Creates a hierarchy of clusters, which allows for exploring clusters at different levels of granularity within the data.

 

  3. Initialization:

– K-Means: Requires initial cluster centroids, which can affect the final clustering result. Multiple runs with different initializations are often performed to mitigate this.

– Hierarchical Clustering: Doesn’t require initializations as it builds clusters incrementally through merging or dividing.

 

  4. Robustness to Outliers:

– K-Means: Sensitive to outliers, as a single outlier can significantly impact the position of the cluster centroids.

– Hierarchical Clustering: Tends to be more robust to outliers, as the impact of a single outlier is diluted when forming clusters hierarchically.

 

  5. Complexity:

– K-Means: Generally computationally more efficient and is preferred for larger datasets.

– Hierarchical Clustering: Can be computationally expensive, especially for very large datasets.

In summary, K-Means clustering requires specifying the number of clusters in advance and assigns data points to fixed clusters, while Hierarchical Clustering creates a hierarchy of clusters without needing the number of clusters predetermined. The choice between them depends on the nature of the data, the objectives of the analysis, and computational considerations.
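
A minimal sketch of that contrast: K must be fixed up front for K-means, while a hierarchical dendrogram can be cut at any level afterwards; the blob data is synthetic.

```python
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=150, centers=3, random_state=3)

# K-means: the number of clusters is chosen in advance
km_labels = KMeans(n_clusters=3, n_init=10, random_state=3).fit_predict(X)

# Hierarchical: build the full merge tree, inspect it, then choose the cut afterwards
Z = linkage(X, method="ward")
dendrogram(Z, no_labels=True)
plt.show()
hier_labels = fcluster(Z, t=3, criterion="maxclust")
```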

Desc() ‘fatal-police-shootings-data’

At the outset of my work with the two CSV files, ‘fatal-police-shootings-data’ and ‘fatal-police-shootings-agencies,’ I commenced by importing them into Jupyter Notebook. Here’s a concise overview of the initial tasks and obstacles I faced:

ID: Each fatal police shooting incident is assigned a unique ID, ranging from 3 to 8696, indicating 8002 distinct incidents in the dataset without missing or duplicate IDs.

Date: The dataset covers incidents from January 2, 2015, to December 1, 2022, with the mean date around January 12, 2019. About 25% of incidents occurred before January 18, 2017, and 75% before January 21, 2021.

Age: Victim ages range from 2 to 92 years old, with an average age of 37.209. Approximately 25% of victims are 27 or younger, and 75% are 45 or younger. The standard deviation is around 12.979, indicating age variability.

Longitude: Longitude coordinates vary from about -160.007 to -67.867, with the mean around -97.041. Roughly 25% of incidents occurred west of -112.028, and 75% west of -83.152. The standard deviation is approximately 16.525, indicating location dispersion.

Latitude: Latitude values range from about 19.498 to 71.301, with the mean around 36.676. About 25% of incidents occurred south of 33.480, and 75% south of 40.027. The standard deviation is about 5.380, indicating location dispersion along the latitude axis.
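
A minimal sketch of the initial inspection described above; the exact CSV file name is assumed for illustration.

```python
import pandas as pd

# Hypothetical file name based on the dataset's title
shootings = pd.read_csv("fatal-police-shootings-data.csv", parse_dates=["date"])

print(shootings.describe(include="all"))   # summary statistics per column
print(shootings.isna().sum())              # missing values per column
print(shootings.duplicated().sum())        # duplicate records
```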

Overview and intro of Fatal Force Database

The “Fatal Force Database,” launched by The Washington Post in 2015, stands as a meticulous and all-encompassing endeavor, dedicated to vigilantly monitoring and cataloging incidents of civilians being shot and fatally wounded by law enforcement officers while on active duty in the United States. It exclusively zeroes in on such cases and furnishes indispensable information, encompassing the racial identity of the deceased, the circumstances surrounding the shootings, whether the victims were armed, and whether they were undergoing a mental health crisis. The process of collecting data involves aggregating information from various sources, ranging from local news reports and law enforcement websites to social media platforms and independent databases like Fatal Encounters.

 

Significantly, in 2022, the database underwent an upgrade to establish a standard protocol for releasing the names of the law enforcement agencies involved, thereby augmenting transparency and accountability at the departmental level. This dataset differs from federal sources such as the FBI and CDC and consistently registers more than twice the number of fatal police shootings since its inception in 2015, underscoring a conspicuous void in data collection and underscoring the imperative for comprehensive tracking. Continually refreshed, it remains an invaluable resource for researchers, policymakers, and the general populace. It provides a window into the realm of police-involved shootings, advocates for openness, and contributes to ongoing dialogues concerning police responsibility and reform.

Difference between Sk-learn & OLS model on Simple Linear Regression

Scikit-learn (sklearn) and Ordinary Least Squares (OLS) are two different approaches for implementing and fitting linear regression models, including simple linear regression (SLR). Here’s a short theory explaining the key differences between them:

  1. Scikit-Learn (sklearn):

– Machine Learning Library: Scikit-learn is a popular Python library for machine learning and data science. It provides a wide range of machine learning algorithms, including linear regression, in a unified and user-friendly API.

– Usage: Scikit-learn is a versatile tool for various machine learning tasks, not just linear regression. It’s suitable for building predictive models, classification, clustering, and more.

– Model Selection: Scikit-learn offers a straightforward way to select and fit models to data. For simple linear regression, you can use the `LinearRegression` class.

– Flexibility: Scikit-learn is designed to handle a variety of machine learning problems, so it’s a great choice when you need to explore different algorithms and techniques for your problem.

 

  2. Ordinary Least Squares (OLS):

– Statistical Technique: OLS is a statistical method used for estimating the coefficients in linear regression models. It’s a classical and fundamental approach in statistics.

– Usage: OLS is primarily used for linear regression and related statistical analyses. It focuses specifically on linear models and their interpretation.

– Model Fitting: In OLS, you typically use a statistical software or library (e.g., StatsModels in Python) that specializes in statistical analysis. OLS provides detailed statistics about the model, including coefficient estimates, standard errors, p-values, and more.

– Interpretation: OLS is often favored when the goal is not just prediction but also a deep understanding of the relationships between variables and the statistical significance of coefficients.

 

Key Differences:

– Purpose: Scikit-learn is a machine learning library with a broader range of applications, while OLS is a statistical technique primarily focused on linear regression.

– Flexibility vs. Specialization: Scikit-learn is more flexible and suitable for various machine learning tasks, whereas OLS is specialized for linear regression and related statistical analyses.

– Output: Scikit-learn typically provides fewer statistical details about the model but is more focused on predictive performance. OLS, on the other hand, offers extensive statistical summaries for deeper analysis and interpretation.

– Approach: Scikit-learn uses optimization techniques to fit linear regression models, aiming to minimize prediction error. OLS follows a statistical approach, estimating coefficients based on statistical principles, specifically minimizing the sum of squared residuals.

 

In summary, the choice between scikit-learn and OLS for simple linear regression depends on your goals. If you need a versatile tool for various machine learning tasks and prioritize prediction accuracy, scikit-learn is a good choice. If you require in-depth statistical analysis and interpretation of coefficients, especially in the context of linear regression, OLS is more suitable.
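
A minimal sketch fitting the same simple linear regression both ways, on synthetic data, to show the difference in output emphasized above.

```python
import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
y = 2.0 + 1.5 * x + rng.normal(0, 1, 100)

# scikit-learn: prediction-oriented, minimal statistical output
sk_model = LinearRegression().fit(x.reshape(-1, 1), y)
print("sklearn intercept/slope:", sk_model.intercept_, sk_model.coef_[0])

# statsmodels OLS: full statistical summary (standard errors, p-values, R-squared, ...)
ols_model = sm.OLS(y, sm.add_constant(x)).fit()
print(ols_model.summary())
```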

Multiple Linear Regression With SK-learn model

Multiple Linear Regression is a supervised machine learning algorithm used for predicting a continuous target variable based on two or more independent variables (features). In the context of scikit-learn (SK-Learn), a widely used machine learning library in Python, multiple linear regression can be implemented easily. Here’s a short description of Multiple Linear Regression with scikit-learn:

  1. Problem: Multiple Linear Regression is used when you have a dataset with multiple features, and you want to build a predictive model to understand the linear relationship between those features and a continuous target variable.

Implementation with scikit-learn:

To perform Multiple Linear Regression with scikit-learn, follow these steps:

– Import the Library: Import the necessary libraries, including scikit-learn.

– Load and Prepare Data: Load your dataset and split it into features (independent variables) and the target variable.

– Create a Linear Regression Model: Create an instance of the LinearRegression class from scikit-learn.

– Fit the Model: Use the fit() method to train the model on your training data. The model will estimate the coefficients based on the training data.

– Make Predictions: Once the model is trained, you can use it to make predictions on new or unseen data.

– Evaluate the Model: Use appropriate metrics to assess the performance of your model, such as Mean Squared Error (MSE) or R-squared (\(R^2\)).

  2. Use Cases:

Multiple Linear Regression is commonly used for various predictive tasks and analysis, including:

– Predicting house prices based on features like square footage, number of bedrooms, and location.

– Analyzing the impact of advertising spending on sales.

– Predicting a person’s income based on factors like education, experience, and location.

  3. Assumptions:

– Linearity: The relationship between the independent and dependent variables is assumed to be linear.

– Independence of Errors: The errors (residuals) are assumed to be independent of each other.

– Homoscedasticity: The variance of the errors is constant across all levels of the independent variables.

– No or Little Multicollinearity: The independent variables are not highly correlated with each other.

  4. Model Interpretation:

The coefficients indicate the strength and direction of the relationship between each independent variable and the target variable. For example, a positive coefficient \(\beta_1\) suggests that an increase in \(x_1\) is associated with an increase in the target variable \(y\).

 

In summary, Multiple Linear Regression in scikit-learn is a valuable tool for building predictive models when you have multiple features and want to understand the linear relationships between those features and a continuous target variable. It’s a fundamental technique in regression analysis and data science.
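
A minimal sketch of the steps listed above; the feature names echo the house-price example and the data is synthetic, generated only for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical house-price data with two features
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "sqft": rng.uniform(500, 3500, 200),
    "bedrooms": rng.integers(1, 6, 200),
})
df["price"] = 50_000 + 120 * df["sqft"] + 10_000 * df["bedrooms"] + rng.normal(0, 20_000, 200)

X = df[["sqft", "bedrooms"]]
y = df["price"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = LinearRegression().fit(X_train, y_train)   # estimate coefficients
pred = model.predict(X_test)                       # predict on unseen data

print("Coefficients:", dict(zip(X.columns, model.coef_)))
print("MSE:", mean_squared_error(y_test, pred))
print("R^2:", r2_score(y_test, pred))
```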

Residuals in Linear Regression

  1. Residuals in Linear Regression:

In linear regression, the goal is to model the relationship between independent variables (features) and a dependent variable (target) using a linear equation. The model makes predictions based on this equation. The difference between the actual observed values and the predicted values is called the “residual.”

  2. Purpose of Residual Plots:

Residual plots are used to assess the validity of several key assumptions made in linear regression:

– Linearity Assumption: Residual plots help visualize whether the relationship between the independent variables and the dependent variable is adequately captured by a linear model. In a good linear model, the residuals should exhibit a random scatter around zero.

– Homoscedasticity Assumption: Homoscedasticity means that the variance of the residuals is constant across all levels of the independent variables. Residual plots help check for homoscedasticity. In a homoscedastic model, the spread of residuals should be roughly uniform across the predicted values.

– Normality Assumption: The residuals should follow a normal distribution. Residual plots, particularly histograms of residuals, can help assess whether the residuals are approximately normally distributed.

  3. Components of the Code:

Here’s a breakdown of the code’s components and how they relate to the theory:

– Calculate Residuals: `residuals = model.resid` calculates the residuals for each observation by subtracting the predicted values from the actual observed values. These residuals represent the discrepancies between the model’s predictions and the actual data.

– Scatterplot of Residuals vs. Predicted Values: `plt.scatter(model.fittedvalues, residuals)` creates a scatterplot with the predicted values on the x-axis and residuals on the y-axis. This plot assesses the linearity assumption. A good linear model would have residuals scattered randomly around zero, indicating that the model captures the linear relationship well.

– Horizontal Red Line: `plt.axhline(0, color='red', linestyle='--')` adds a horizontal dashed line at y=0. This line represents the ideal situation where residuals are zero for all predicted values.

– Histogram of Residuals: `plt.hist(residuals, bins=30, edgecolor='k')` creates a histogram of residuals to assess the normality assumption. In a normally distributed model, the histogram should resemble a bell-shaped curve.

  4. Interpretation:

– In the scatterplot, a random scatter of residuals around the zero line suggests that the linearity assumption is reasonable.

– In the histogram, a roughly bell-shaped curve indicates that the normality assumption is met.

– For homoscedasticity, examine the scatter of residuals in the scatterplot. Ideally, there should be no discernible pattern or funnel shape in the residuals as you move along the x-axis.

Residual plots provide valuable diagnostic information to understand how well your linear regression model aligns with the underlying assumptions. If the plots reveal significant departures from the assumptions, it may be necessary to revisit the model or consider transformations or other adjustments to improve its performance.
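
A consolidated, runnable version of the snippets discussed above, assuming a statsmodels OLS fit (here trained on synthetic data) so that `model.resid` and `model.fittedvalues` are available.

```python
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

# Hypothetical fitted model standing in for the project's regression
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 200)
y = 3 + 2 * x + rng.normal(0, 1, 200)
model = sm.OLS(y, sm.add_constant(x)).fit()

residuals = model.resid

# Residuals vs. fitted values: checks linearity and homoscedasticity
plt.scatter(model.fittedvalues, residuals)
plt.axhline(0, color='red', linestyle='--')
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()

# Histogram of residuals: checks the normality assumption
plt.hist(residuals, bins=30, edgecolor='k')
plt.show()
```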

Principal Component Analysis (PCA)

Principal Component Analysis (PCA) is a dimensionality reduction technique used in data analysis and machine learning. It transforms a dataset into a new coordinate system, capturing the most important information in a smaller number of features called principal components. By retaining only the most significant components, PCA simplifies data analysis, aids visualization, and reduces the risk of overfitting in machine learning models. It’s a powerful tool for handling high-dimensional data and extracting meaningful patterns while reducing noise and complexity.

 

Principal Component Analysis (PCA) works by transforming a dataset into a new coordinate system, where the new axes, called principal components, capture the maximum variance in the data. Here’s how PCA works step by step:

 

  1. Data Standardization:

– PCA typically begins with standardizing the data. This involves subtracting the mean from each feature and dividing by the standard deviation. Standardization ensures that all features have a similar scale and prevents features with larger variances from dominating the analysis.

  2. Covariance Matrix:

– PCA calculates the covariance matrix of the standardized data. The covariance matrix describes the relationships between pairs of features. A positive covariance indicates that two features tend to increase or decrease together, while a negative covariance suggests an inverse relationship.

  3. Dimensionality Reduction:

– The selected principal components are used to transform the original data into a lower-dimensional space. This reduces the number of features while preserving the most essential information.

– The transformed data, known as the scores or loadings, can be used for subsequent analysis or visualization.

PCA is a powerful technique for dimensionality reduction, noise reduction, and data exploration. It helps simplify complex datasets while retaining essential information, making it a valuable tool in various fields, including statistics, machine learning, and data analysis.
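
A minimal sketch of these steps with scikit-learn, where StandardScaler handles the standardization and PCA performs the covariance/eigen decomposition internally; keeping two components is an arbitrary illustrative choice.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)

X_std = StandardScaler().fit_transform(X)     # step 1: standardize the features

pca = PCA(n_components=2)                     # keep the two leading components
scores = pca.fit_transform(X_std)             # step 3: project to the lower-dimensional space

print("Explained variance ratio:", pca.explained_variance_ratio_)
```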

K-fold cross-validation

K-fold cross-validation is a valuable technique for model assessment and selection, ensuring that a machine learning model generalizes well to new, unseen data and performs consistently across different data subsets. Here are the steps to validate and create a K-fold cross-validation:

 

  1. Data Preparation:

– Start with a dataset that is typically divided into two parts: a training set and a testing set. The training set is used to build and train the model, while the testing set is reserved for validation.

 

  2. Choose the Number of Folds (K):

– Decide on the number of folds (K) you want to use for cross-validation. Common choices are 5 and 10, but it can vary based on the dataset size and computational resources.

 

  3. Data Splitting:

– Split the training set into K roughly equal-sized subsets or folds. Each fold represents a different subset of the training data.

 

  4. Model Training and Evaluation:

– Perform K iterations, where in each iteration:

– One fold is used as the validation/test set.

– The remaining K-1 folds are used as the training set.

– Train the machine learning model on the training set.

– Evaluate the model’s performance on the validation/test set using an appropriate evaluation metric (e.g., accuracy, mean squared error, etc.).

– Record the performance metric for this iteration.

 

  5. Performance Aggregation:

– After completing all K iterations, you will have K performance metrics, one for each fold. Calculate the average (or other summary statistics) of these metrics to get an overall assessment of the model’s performance.

 

  6. Model Tuning:

– Based on the cross-validation results, you may decide to adjust hyperparameters or make other modifications to improve the model’s performance.

 

  7. Final Model Training:

– Once you are satisfied with the model’s performance, train the final model using the entire training set (without cross-validation) with the chosen hyperparameters and settings.

 

  8. Model Evaluation:

– Finally, evaluate the model’s performance on the held-out testing set to get an estimate of its performance on new, unseen data.

 

K-fold cross-validation helps ensure that your model’s performance assessment is robust and less dependent on the specific random splitting of the data. It provides a more reliable estimate of the model’s generalization performance compared to a single train-test split. This technique is essential for model selection, hyperparameter tuning, and assessing how well your machine learning model is likely to perform in real-world applications.
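The following sketch shows one way these steps might look in scikit-learn. The built-in iris data, the RandomForestClassifier, the fold count, and the accuracy metric are all stand-ins chosen for illustration.

```python
# Compact K-fold cross-validation sketch with scikit-learn.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# Step 1: hold out a test set that cross-validation never sees.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Steps 2-5: choose K, split the training data into folds, train and evaluate
# K times, then aggregate the fold scores.
model = RandomForestClassifier(n_estimators=100, random_state=42)
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X_train, y_train, cv=cv, scoring="accuracy")
print("Fold accuracies:", scores)
print("Mean CV accuracy:", scores.mean())

# Steps 7-8: refit on the full training set, then evaluate once on the held-out test set.
model.fit(X_train, y_train)
print("Test accuracy:", model.score(X_test, y_test))
```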

Task 4

Probability value:

The p-value is a key concept in hypothesis testing. It is the probability, assuming the null hypothesis is true, of obtaining a result at least as extreme as the one actually observed. This statistic quantifies the strength of evidence against the null hypothesis and is central to judging statistical significance.

Hypothesis Testing:

In hypothesis testing, we frequently use data analysis and visualization to glean insights from sample datasets. The p-value approach is central to this procedure: it measures the strength of evidence against a given null hypothesis, which is then rejected or not rejected depending on a predefined significance level.
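As a small illustration of this decision rule, the sketch below runs a two-sample t-test on simulated data with scipy and compares the resulting p-value to a significance level. The samples and the 0.05 threshold are assumptions made only for the example.

```python
# Hedged example of the p-value decision rule using a two-sample t-test.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample_a = rng.normal(10.0, 2.0, size=40)
sample_b = rng.normal(11.0, 2.0, size=40)

t_stat, p_value = stats.ttest_ind(sample_a, sample_b)

alpha = 0.05                      # predefined significance level
if p_value < alpha:
    print(f"p = {p_value:.4f} < {alpha}: reject the null hypothesis")
else:
    print(f"p = {p_value:.4f} >= {alpha}: fail to reject the null hypothesis")
```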

Linear Model:

A linear model is a statistical procedure that assumes a linear relationship between a dependent variable and one or more independent variables. It is used for modeling and forecasting; common applications include simple linear regression (with a single predictor) and multiple linear regression (with several predictors).

Monte Carlo Test:

The Monte Carlo test is a statistical method for testing hypotheses and quantifying uncertainty. It simulates a large number of random scenarios or samples to approximate the distribution of a test statistic under the null hypothesis. By comparing the observed statistic to this simulated distribution, researchers obtain a p-value: the probability of arriving at a result at least as extreme as the observed one by chance alone.
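Below is a hedged sketch of a Monte Carlo (permutation-style) test for a difference in group means. The simulated groups, sample sizes, and number of simulations are all illustrative assumptions.

```python
# Monte Carlo permutation test sketch: p-value for a difference in group means.
import numpy as np

rng = np.random.default_rng(7)
group_a = rng.normal(10.0, 2.0, size=50)
group_b = rng.normal(10.8, 2.0, size=50)

observed = group_b.mean() - group_a.mean()

pooled = np.concatenate([group_a, group_b])
n_sims = 10_000
count = 0
for _ in range(n_sims):
    rng.shuffle(pooled)                       # simulate the null: group labels don't matter
    sim_diff = pooled[50:].mean() - pooled[:50].mean()
    if abs(sim_diff) >= abs(observed):        # at least as extreme as the observed difference
        count += 1

p_value = count / n_sims
print("Observed difference:", observed)
print("Monte Carlo p-value:", p_value)
```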

Task 3

Multiple Linear Regression:

Multiple linear regression is a statistical modeling method that builds on simple linear regression to examine and forecast the relationship between a dependent variable (the outcome) and several independent variables (predictors). It is especially useful when the outcome is influenced by more than one variable. A succinct explanation of multiple linear regression follows:

Multiple Variables: Multiple linear regression takes several independent variables into account, enabling examination of how they jointly influence the dependent variable.

Linear Relationship: As in simple linear regression, the dependent variable is assumed to have a linear relationship with each independent variable; the difference is that several predictors are included at once.

Coefficient Interpretation: In this model, each independent variable has a unique coefficient that, while holding all other variables constant, indicates the change in the dependent variable caused by a one-unit change in that specific independent variable.

Intercept: The intercept term represents the value of the dependent variable when all independent variables are zero, just as in simple linear regression.

Model Fitting: The model adjusts the coefficients to find the best-fitting linear equation, minimizing the discrepancy between predicted and actual values.

Assumptions: Multiple linear regression assumes that there is little or no multicollinearity (strong correlation between independent variables) and that the residuals (the discrepancies between observed and predicted values) are normally distributed with constant variance.

Applications: This method is used across many sectors, including economics, finance, the social sciences, and engineering, to study complex relationships, make predictions, and gauge the relative importance of multiple factors on an outcome.

Model Evaluation: Common assessment tools include R-squared, which measures how well the model fits the data, and statistical tests of the significance of each coefficient and of the model as a whole.

Feature Selection: To ascertain which independent factors have the greatest influence on the dependent variable, researchers frequently use feature selection.

Limitations: Multiple linear regression assumes the relationships are linear, which may not hold in practice. In addition, outliers or violated assumptions can undermine the model’s reliability.

In conclusion, multiple linear regression is an effective statistical method for investigating and modeling the relationships between several independent variables and a dependent variable. It helps researchers gain insight, make forecasts, and understand the complex interplay of factors affecting an outcome.
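As a brief illustration, the sketch below fits a multiple linear regression with statsmodels on synthetic predictors. The variable names and true coefficients are invented for the example and are not taken from any dataset in this project.

```python
# Multiple linear regression sketch with statsmodels (synthetic data).
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 200
df = pd.DataFrame({
    "x1": rng.normal(size=n),
    "x2": rng.normal(size=n),
})
# Assumed true relationship: y = 2 + 1.5*x1 - 0.8*x2 + noise
df["y"] = 2 + 1.5 * df["x1"] - 0.8 * df["x2"] + rng.normal(scale=0.5, size=n)

X = sm.add_constant(df[["x1", "x2"]])   # adds the intercept term
model = sm.OLS(df["y"], X).fit()

# Each coefficient estimates the change in y for a one-unit change in that
# predictor, holding the others constant; the summary also reports R-squared
# and p-values for each coefficient.
print(model.summary())
```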

Task 2

CDC Dataset visualization:

 

Here we visualize the CDC diabetes data, using the seaborn library to plot pair plots of the dataset.

The figure above is the pair plot of the diabetes dataset. As the plot shows, there is no clear correlation between the columns, and the distribution of the diabetes variable appears left-skewed.

The image above shows that there are no null values in the diabetes dataset; I checked this with the isnull().sum() function.

 

Similarly, the image above shows that there are no NaN values in the diabetes dataset; I checked this with the isna().sum() function.
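A rough sketch of these checks is shown below; the file path, sheet name, and DataFrame name are placeholders rather than the actual CDC files used here.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical path and sheet name; substitute the actual CDC files.
diabetes_df = pd.read_excel("cdc_data.xlsx", sheet_name="Diabetes")

# Pair plot of the numeric columns to eyeball correlations and skewness.
sns.pairplot(diabetes_df.select_dtypes("number"))
plt.show()

# Missing-value checks used in this post.
print(diabetes_df.isnull().sum())
print(diabetes_df.isna().sum())
```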

CDC Dataset Task-1

The CDC dataset contains three Excel sheets. The diabetes sheet has four numerical (integer) columns and two object-type columns, and the Obesity and Inactivity sheets have the same data types. This post focuses on the diabetes data.

I used the info() function to get the data type of each column, as mentioned above; the screenshots below show these data types.

I used the describe() function to get summary statistics for the diabetes data.

 

As the output above shows, there are 3142 rows of data; the diabetes column has a mean of 8.719796, a standard deviation of 1.794854, a minimum of 3.8, and a maximum of 17.9. The interquartile values show how the data are distributed across the quartiles.

The 25th percentile is 7.3, the median (50th percentile) is 8.4, and the 75th percentile is 9.7.

 

As the screenshot above shows, there is no strong correlation between the columns within this sheet, but correlations may emerge once the three Excel sheets are merged.
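For reference, the exploration steps described in this task might look roughly like the sketch below; again, the file path and sheet name are placeholders, not the actual files.

```python
import pandas as pd

# Hypothetical path and sheet name; substitute the actual CDC files.
diabetes_df = pd.read_excel("cdc_data.xlsx", sheet_name="Diabetes")

diabetes_df.info()                                  # data types of each column
print(diabetes_df.describe())                       # count, mean, std, min, quartiles, max
print(diabetes_df.select_dtypes("number").corr())   # pairwise correlations within this sheet
```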