Essential Linear Algebra for Data Science and Machine Learning


Image by Benjamin O. Tayo.

Linear algebra is a branch of mathematics that is extremely useful in data science and machine learning, and it is the most important math skill in the field. Most machine learning models can be expressed in matrix form, and a dataset itself is often represented as a matrix. Linear algebra is used in data preprocessing, data transformation, and model evaluation. Here are the topics you need to be familiar with (a short NumPy warm-up illustrating them follows this list):

  • Vectors
  • Matrices
  • Transpose of a matrix
  • Inverse of a matrix
  • Determinant of a matrix
  • Trace of a matrix
  • The dot product
  • Eigenvalues
  • Eigenvectors
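As a quick warm-up, here is a minimal NumPy sketch (not part of the original article's code) that touches each of these objects and operations on a small 2 x 2 example:

import numpy as np

A = np.array([[4.0, 2.0],
              [1.0, 3.0]])            # a square matrix
v = np.array([1.0, 2.0])              # a vector

A_T   = A.T                           # transpose of a matrix
A_inv = np.linalg.inv(A)              # inverse of a matrix
det_A = np.linalg.det(A)              # determinant of a matrix
tr_A  = np.trace(A)                   # trace of a matrix
dp    = np.dot(v, A @ v)              # dot product
eigvals, eigvecs = np.linalg.eig(A)   # eigenvalues and eigenvectors

print(det_A, tr_A, dp)
print(eigvals)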

In this article, we illustrate the application of linear algebra in data science and machine learning using the tech stocks dataset, which can be found here.

1. Linear Algebra for Data Preprocessing

 We begin by illustrating how linear algebra is used in data preprocessing.

1.1 Import necessary libraries for linear algebra

import numpy as np
import pandas as pd
import pylab
import matplotlib.pyplot as plt
import seaborn as sns

 

1.2 Read dataset and display features

data = pd.read_csv("tech-stocks-04-2021.csv")
data.head()

Table 1. Stock prices for selected tech stocks for the first 16 days in April 2021.

print(data.shape)
output = (11,5)

The data.shape function enables us to know the size of our dataset. In this case, the dataset has 5 features (date, AAPL, TSLA, GOOGL, and AMZN), and each feature has 11 observations. date refers to the trading days in April 2021 (up to April 16). AAPL, TSLA, GOOGL, and AMZN are the closing stock prices for Apple, Tesla, Google, and Amazon, respectively.

1.3 Data Visualization

To perform data visualization, we would need to define column matrices for the features to be visualized:

x = data['date']
y = data['TSLA']
plt.plot(x, y)
plt.xticks(np.array([0,4,9]), ['Apr 1','Apr 8','Apr 15'])
plt.title('Tesla stock price (in dollars) for April 2021', size=14)
plt.show()

Figure 1. Tesla stock price for the first 16 days in April 2021.

2. Covariance Matrix

The covariance matrix is one of the most important matrices in data science and machine learning. It provides information about co-movement (correlation) between features. Suppose we have a features matrix with 4 features and n observations, as shown in Table 2:

Table 2. Features matrix with 4 variables and n observations.

To visualize the correlations between the features, we can generate a scatter pairplot:

cols = data.columns[1:5]
print(cols)
output = Index(['AAPL', 'TSLA', 'GOOGL', 'AMZN'], dtype='object')
sns.pairplot(data[cols], height=3.0)

Figure 2. Scatter pairplot for selected tech stocks.

To quantify the degree of correlation between features (multicollinearity), we can compute the covariance matrix using this equation:

$$\operatorname{cov}\left(x_j, x_k\right)=\frac{1}{n}\sum_{i=1}^{n}\left(\frac{x_j^{(i)}-\mu_j}{\sigma_j}\right)\left(\frac{x_k^{(i)}-\mu_k}{\sigma_k}\right)$$

where $\mu_j$ and $\sigma_j$ are the mean and standard deviation of feature $x_j$, respectively. This equation indicates that when features are standardized, the covariance matrix is simply the dot product between features.

In matrix form, the covariance matrix can be expressed as a 4 x 4 real and symmetric matrix:

$$\Sigma=\begin{pmatrix}\operatorname{cov}(x_1,x_1)&\operatorname{cov}(x_1,x_2)&\operatorname{cov}(x_1,x_3)&\operatorname{cov}(x_1,x_4)\\\operatorname{cov}(x_2,x_1)&\operatorname{cov}(x_2,x_2)&\operatorname{cov}(x_2,x_3)&\operatorname{cov}(x_2,x_4)\\\operatorname{cov}(x_3,x_1)&\operatorname{cov}(x_3,x_2)&\operatorname{cov}(x_3,x_3)&\operatorname{cov}(x_3,x_4)\\\operatorname{cov}(x_4,x_1)&\operatorname{cov}(x_4,x_2)&\operatorname{cov}(x_4,x_3)&\operatorname{cov}(x_4,x_4)\end{pmatrix}$$

This matrix can be diagonalized by performing a unitary transformation, also referred to as a Principal Component Analysis (PCA) transformation, to obtain the following:

$$\Sigma'=U^{\top}\Sigma\,U=\begin{pmatrix}\lambda_1&0&0&0\\0&\lambda_2&0&0\\0&0&\lambda_3&0\\0&0&0&\lambda_4\end{pmatrix}$$

Since the trace of a matrix remains invariant under a unitary transformation, we observe that the sum of the eigenvalues of the diagonal matrix is equal to the total variance contained in features X1, X2, X3, and X4.
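This invariance follows in one line from the cyclic property of the trace, with $U$ the orthogonal matrix of eigenvectors used in the diagonalization above:

$$\operatorname{tr}\left(U^{\top}\Sigma\,U\right)=\operatorname{tr}\left(\Sigma\,U U^{\top}\right)=\operatorname{tr}\left(\Sigma\right)=\lambda_1+\lambda_2+\lambda_3+\lambda_4$$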

2.1 Computing the covariance matrix for tech stocks

from sklearn.preprocessing import StandardScaler

stdsc = StandardScaler()
X_std = stdsc.fit_transform(data[cols].iloc[:, range(0, 4)].values)
cov_mat = np.cov(X_std.T, bias=True)

Note that the transpose of the standardized matrix is used for the computation of the covariance matrix.
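As a quick sanity check of the dot-product interpretation described above, the same matrix can be reproduced directly from the standardized features (this snippet reuses X_std and cov_mat from the cell above):

# For standardized features, each covariance entry is the averaged dot product
# of two feature columns, so the whole matrix is (X_std^T X_std) / n.
n_obs = X_std.shape[0]
cov_manual = (X_std.T @ X_std) / n_obs
print(np.allclose(cov_manual, cov_mat))   # expected: True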

2.2 Visualization of covariance matrix

plt.figure(figsize=(8,8))
sns.set(font_scale=1.2)
hm = sns.heatmap(cov_mat, cbar=True, annot=True, square=True, fmt='.2f',
                 annot_kws={'size': 12}, yticklabels=cols, xticklabels=cols)
plt.title('Covariance matrix showing correlation coefficients')
plt.tight_layout()
plt.show()

Figure 3. Covariance matrix plot for selected tech stocks.

We observe from Figure 3 that AAPL correlates strongly with GOOGL and AMZN, and weakly with TSLA. TSLA correlates weakly with AAPL, GOOGL, and AMZN, while AAPL, GOOGL, and AMZN correlate strongly with one another.

2.3 Compute eigenvalues of the covariance matrix

np.linalg.eigvals(cov_mat)
output = array([3.41582227, 0.4527295 , 0.02045092, 0.11099732])

np.sum(np.linalg.eigvals(cov_mat))
output = 4.000000000000006

np.trace(cov_mat)
output = 4.000000000000001

We observe that the trace of the covariance matrix is equal to the sum of the eigenvalues as expected.

2.4 Compute the cumulative variance

Since the trace of a matrix remains invariant under a unitary transformation, we observe that the sum of the eigenvalues of the diagonal matrix is equal to the total variance contained in features X1, X2, X3, and X4. Hence, we can define the following quantity:

$$\text{cumulative variance}=\frac{\sum_{i=1}^{p}\lambda_i}{\sum_{i=1}^{4}\lambda_i}$$

Note that when p = 4, the cumulative variance becomes equal to 1, as expected.

eigen = np.linalg.eigvals(cov_mat)
cum_var = eigen/np.sum(eigen)
print(cum_var)
output = [0.85395557 0.11318237 0.00511273 0.02774933]

print(np.sum(cum_var))
output = 1.0

We observe from the cumulative variance (cum_var) that 85% of the variance is contained in the first eigenvalue and 11% in the second. This means that when PCA is implemented, only the first two principal components could be used, as 97% of the total variance is contributed by these two components. This can essentially reduce the dimensionality of the feature space from 4 to 2 when PCA is implemented.
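To make the cumulative part explicit, the eigenvalues can be sorted in decreasing order and their variance fractions accumulated (a small extension of the cell above, reusing eigen):

# Sort eigenvalues from largest to smallest and accumulate their share of the total variance.
explained = np.sort(eigen)[::-1] / np.sum(eigen)
cum_var_sorted = np.cumsum(explained)
print(cum_var_sorted)   # approximately [0.85, 0.97, 0.99, 1.00]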

3. Linear Regression Matrix

Suppose we have a dataset that has 4 predictor features and n observations, as shown below.

ٹیبل 3. Features matrix with 4 variables and n observations. Column 5 is the target variable (y).

We would like to build a multiple regression model for predicting the y values (column 5). Our model can thus be expressed in the form

$$y=w_1x_1+w_2x_2+w_3x_3+w_4x_4$$

In matrix form, this equation can be written as

$$\mathbf{y}=\mathbf{X}\mathbf{w}$$

where X is the (n x 4) features matrix, w is the (4 x 1) matrix representing the regression coefficients to be determined, and y is the (n x 1) matrix containing the n observations of the target variable y.

Note that X is a rectangular matrix, so we can't solve the equation above by taking the inverse of X.

To convert X into a square matrix, we multiply both the left-hand and right-hand sides of our equation by the transpose of X, that is

$$\mathbf{X}^{\top}\mathbf{y}=\mathbf{X}^{\top}\mathbf{X}\,\mathbf{w}$$

This equation can also be expressed as

$$\mathbf{X}^{\top}\mathbf{y}=\mathbf{R}\,\mathbf{w}$$

where

$$\mathbf{R}=\mathbf{X}^{\top}\mathbf{X}$$

is the (4 x 4) regression matrix. Clearly, we observe that R is a real and symmetric matrix. Note that in linear algebra, the transpose of the product of two matrices obeys the following relationship:

$$\left(\mathbf{A}\mathbf{B}\right)^{\top}=\mathbf{B}^{\top}\mathbf{A}^{\top}$$

Now that we've reduced our regression problem and expressed it in terms of the (4 x 4) real, symmetric, and invertible regression matrix R, it is straightforward to show that the exact solution of the regression equation is

$$\mathbf{w}=\mathbf{R}^{-1}\mathbf{X}^{\top}\mathbf{y}=\left(\mathbf{X}^{\top}\mathbf{X}\right)^{-1}\mathbf{X}^{\top}\mathbf{y}$$
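A minimal NumPy sketch of this closed-form solution on a small synthetic dataset (the feature matrix, true coefficients, and noise level below are made up purely for illustration):

import numpy as np

rng = np.random.default_rng(0)
n = 50
X = rng.normal(size=(n, 4))                # (n x 4) features matrix
w_true = np.array([1.5, -2.0, 0.5, 3.0])   # hypothetical true coefficients
y = X @ w_true + 0.1 * rng.normal(size=n)  # target with a little noise

# The regression matrix R = X^T X is (4 x 4), real, symmetric, and (here) invertible,
# so the exact least-squares solution is w = R^{-1} X^T y.
R = X.T @ X
w_hat = np.linalg.inv(R) @ X.T @ y
print(w_hat)   # should be close to w_true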

Examples of regression analysis for predicting continuous and discrete variables are given in the following:

Linear Regression Basics for Absolute Beginners

Building a Perceptron Classifier Using the Least Squares Method

4. Linear Discriminant Analysis Matrix

Another example of a real and symmetric matrix in data science is the Linear Discriminant Analysis (LDA) matrix. This matrix can be expressed in the form:

$$\mathbf{L}=\mathbf{S}_W^{-1}\,\mathbf{S}_B$$

where SW is the within-feature scatter matrix and SB is the between-feature scatter matrix. Since both matrices SW and SB are real and symmetric, it follows that L is also real and symmetric. The diagonalization of L produces a feature subspace that optimizes class separability and reduces dimensionality. Note that LDA is a supervised algorithm, while PCA is not.
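As a rough sketch of how the two scatter matrices and the LDA matrix could be assembled with NumPy (the function and variable names here are hypothetical, not taken from the references below):

import numpy as np

def lda_matrix(X, y):
    """Build L = S_W^{-1} S_B for a features matrix X (n x d) and class labels y."""
    classes = np.unique(y)
    overall_mean = X.mean(axis=0)
    d = X.shape[1]
    S_W = np.zeros((d, d))   # within-class scatter matrix
    S_B = np.zeros((d, d))   # between-class scatter matrix
    for c in classes:
        Xc = X[y == c]
        mean_c = Xc.mean(axis=0)
        S_W += (Xc - mean_c).T @ (Xc - mean_c)
        diff = (mean_c - overall_mean).reshape(-1, 1)
        S_B += Xc.shape[0] * (diff @ diff.T)
    return np.linalg.inv(S_W) @ S_B

# The eigenvectors of L with the largest eigenvalues span the discriminant subspace:
# eigvals, eigvecs = np.linalg.eig(lda_matrix(X, y))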

For more details about the implementation of LDA, please see the following references:

Machine Learning: Dimensionality Reduction via Linear Discriminant Analysis

GitHub repository for LDA implementation using Iris dataset

Python Machine Learning by Sebastian Raschka, 3rd Edition (Chapter 5)

Summary

In summary, we’ve discussed several applications of linear algebra in data science and machine learning. Using the tech stocks dataset, we illustrated important concepts such as the size of a matrix, column matrices, square matrices, covariance matrix, transpose of a matrix, eigenvalues, dot products, etc. Linear algebra is an essential tool in data science and machine learning. Thus, beginners interested in data science must familiarize themselves with essential concepts in linear algebra.


Source: https://www.kdnuggets.com/2021/05/essential-linear-algebra-data-science-machine-learning.html
