Being a Data Scientist for one and a half years has made me realise just how much mathematics the job draws on: Calculus, Linear Algebra, Probability and Statistics, Discrete Math, Graph Theory, Information Theory, Functional Analysis, Combinatorics, Geometric Analysis, and Topological Data Analysis all show up in practice.
Although the Data Science lifecycle is quite vast and involves many sequential steps, the knowledge and skills needed vary a lot from step to step. Here is what a typical Data Scientist does day to day: Business Understanding, Data Collection, Data Preparation, Exploratory Data Analysis, Modelling, Model Evaluation, and Model Deployment.
Let’s see which mathematical concepts are used at each step of this vast Data Science lifecycle.
Well, having been in the industry for quite a few years, I can say this aspect of the Data Science lifecycle doesn’t involve maths at all. Rather, it’s the usual understanding of how a business operates, which service or product it offers to customers, and how that offering can be improved or what insight can be drawn from sales already made.
Typically, Business Understanding involves: understanding the business process, defining and framing the business problem, defining the business objective, and agreeing on success criteria, according to Srivatsan Srinivasan, Chief Data Scientist at Cognizant. It’s quite clear that none of these aspects of Business Understanding needs any maths; only the usual understanding of business is required.
Moreover, what I’ve experienced over my short career is that Business Analyst is close to being the best possible job role for doing Business Understanding, although that’s not entirely accurate.
Thus there is almost no use of maths in the Business Understanding aspect of the Data Science lifecycle.
This is one of the most crucial parts of the Data Science lifecycle, coming after identifying the problem to be solved or the product to be built. Nowadays most companies use data generated by real users themselves, rather than earlier data-collection methods such as surveys.
Now, as we are collecting data, which will mostly be numbers (it can be images as well, but let’s keep it simple), it needs to be defined how quantities are represented, e.g. whether something like weight is recorded in kg or g. That’s just a simple example; in real industry-level projects it’s more complex. But for doing this, only high-school or at most first-year university mathematics is required.
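As a small illustration of this kind of unit bookkeeping at collection time, here is a minimal sketch (the field names and readings below are made up) that normalises every weight reading to kilograms:

```python
# Hypothetical conversion table: canonical unit is kilograms.
UNIT_TO_KG = {"kg": 1.0, "g": 0.001, "lb": 0.453592}

def to_kg(value, unit):
    """Convert a weight reading to kilograms; reject unknown units."""
    try:
        return value * UNIT_TO_KG[unit]
    except KeyError:
        raise ValueError(f"unknown unit: {unit}")

# Raw readings as (value, unit) pairs, e.g. as they might arrive from users.
readings = [(70, "kg"), (500, "g"), (150, "lb")]
normalised = [to_kg(v, u) for v, u in readings]
print(normalised)
```

Only arithmetic is involved, which is exactly the high-school-level maths mentioned above; the hard part in real projects is agreeing on the canonical units in the first place.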
Moving on from Data Collection to preparing the data for use in the project, there is a lot that needs to be done. For example, the data collected may be in some text file format and need to be converted into tables.
That’s called Data Preparation. As this process just involves some formatting of the data, there is not much maths required except simple algebraic concepts.
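The “text file to table” step above can be sketched with the standard library alone — a minimal example, assuming comma-separated input (the column names and values are hypothetical):

```python
import csv
import io

# Raw text as it might arrive from a data-collection step.
raw_text = """date,region,sales
2019-11-01,North,1200
2019-11-02,South,950
"""

# Parse each line into a dict keyed by column name, i.e. a table row.
rows = list(csv.DictReader(io.StringIO(raw_text)))

# Cast numeric columns, since everything arrives as text.
for r in rows:
    r["sales"] = int(r["sales"])

print(rows[0])  # {'date': '2019-11-01', 'region': 'North', 'sales': 1200}
```

In practice a library such as pandas would handle this, but the underlying operation is the same: formatting, not mathematics.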
Exploratory Data Analysis
Well, here comes the most maths-heavy part of Data Science: a lot of mathematical machinery is used in this step. It is very statistics-heavy, specifically inferential statistics, which is used for seeing patterns in the data.
This involves using Probability, Statistics, and Graph Theory to draw meaningful insights from the data already collected. Let’s see how different parts of maths can be used for Exploratory Data Analysis, one at a time:
Making Distributions Using Graph Theory – Using the very basic stuff from Graph Theory, different kinds of visuals/graphs can be made to help explore the data. For example, an idea about outliers, effect size, and variance can easily be gained from the graphs drawn. This truly helps in the further steps of the Data Science lifecycle.
This is the most important part of the Data Science lifecycle, in which a model needs to be made using the data. The purpose of the model is to make predictions based on the data it has been trained on. For example, take a simple scenario like predicting the sales made by a company in Nov 2020 using data from Nov 2019. Our model will take the data from Nov 2019, consider some parameters on top of that, and then make predictions for the sales in Nov 2020.
Let’s see how much, and which parts, of mathematics is needed for Modelling, by taking an example from the well-known book Machine Learning Models and Algorithms for Big Data Classification:
A system produces responses y using y = 2x + 3 over the domain D = [0, 2] without errors. Then it is easy to find the responses for any data points in the data domain. For example, if we select two points x = 1.1 and x = 1.2 in D, then we have their responses y = 5.2 and y = 5.4, respectively. Similarly, we can calculate responses for all the points in the domain, and this will result in the response set C = [3, 7]. The data domain and the response set of this system are as shown in the picture below.
It shows the need for mapping between these two sets to establish models and algorithms for classification.
We can represent this relationship with a mathematical function y = f_{2,3}(x), showing its slope and y-intercept as indices. Therefore, if the relationship is defined as y = f_{−2,3.4}(x) with the same domain D, then the corresponding equation is y = −2x + 3.4 with the response set C = [−0.6, 3.4]. Following this mathematical process, we can define a set of straight lines by using slope and y-intercept parameters a and b with the equation y = f_{a,b}(x). Hence, its corresponding parametrized straight-line equation is y = ax + b. This linear equation becomes the parametrized model for this particular example.
In the first example, the domain D = [0, 2] and the parametrized mapping y = f_{a,b}(x) with a = 2 and b = 3 were given, and thus we were able to determine the system’s response set C = [3, 7]. However, if the domain D = [0, 2] and the response set C = [3, 7] are given, then deriving a model is not a straightforward task. We need two tasks to solve this problem: the first is to derive the parametrized model y = f_{a,b}(x), and the second is to develop an algorithm for searching for the optimal values of a (= 2) and b (= 3) from a large pool of parameter values. For simplicity, suppose that we have derived a model y = ax + b and developed a learning algorithm that provides a = 1.99 and b = 3.01. Then they are a reasonably well-defined model and algorithm, because they give the tuples (1.1, 5.199) and (1.2, 5.398), which are close to (1.1, 5.2) and (1.2, 5.4).
The process of deriving a parametrized model and developing a learning algorithm to find optimal values for the parameters is considered a machine-learning task; a simple interpretation of this definition is shown in the image below:
It first shows an input model y = f_{a,b}(x) to the modelling unit, which establishes the parametrized models f_{a1,b1}(x), f_{a2,b2}(x), …, f_{an,bn}(x) for classification. Then the next learning-algorithm unit takes them as inputs and provides measures that can help optimization. In the subsequent step, these measures (in this case, a simple distance measure is shown) are compared, and the parameters which give the minimum are selected as the optimal values. It also shows the propagation and utilization of the actual labels in this comparison, as well as the trained model as the final output f_{ai,bi}(x), which gives the minimum distance, assuming a_i and b_i are optimal.
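The search described above can be sketched in a few lines — a brute-force version, assuming squared distance as the measure and a small grid of candidate (a, b) pairs:

```python
# Points sampled from the true system y = 2x + 3 over the domain D = [0, 2].
xs = [i / 10 for i in range(21)]
ys = [2 * x + 3 for x in xs]  # responses, spanning C = [3, 7]

# A pool of candidate (a, b) parameter values around the true ones.
candidates = [(a / 100, b / 100)
              for a in range(150, 251)
              for b in range(250, 351)]

def total_distance(a, b):
    """Sum of squared distances between model responses and observations."""
    return sum((a * x + b - y) ** 2 for x, y in zip(xs, ys))

# Select the candidate that minimises the distance measure.
best_a, best_b = min(candidates, key=lambda ab: total_distance(*ab))
print(best_a, best_b)  # 2.0 3.0
```

A real learning algorithm would use gradient descent or a closed-form least-squares solution rather than an exhaustive grid, but the structure — parametrized models in, distance measures compared, minimising parameters out — matches the diagram described above.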
As is clear from the above example, calculus, functions, and graphs are used. Moreover, in this part of the Data Science lifecycle, tools like Fourier series can also be used.
For more information regarding Modelling, download this research paper.
In the last step a machine learning model was created, but we need to test it as well. It’s just like the marking criteria for an exam. If you’re not aware of evaluation metrics for models, first see 20 Popular Machine Learning Metrics. Part 1: Classification & Regression Evaluation Metrics.
In order to evaluate a model, different metrics are used; some of these are as follows:
| Evaluation Metric | Maths Required |
| --- | --- |
| Classification Metrics | High School Level Maths |
| Regression Metrics | High School Level Maths |
| Statistical Metrics | College Level Statistics |
| F1 Score | High School Level Maths |
For evaluating simple models, college or high-school level mathematics is more than enough. But if the model has many parameters, then some more complex evaluation metrics need to be used.
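To show that these metrics really are high-school maths, here is a minimal sketch computing an F1 score and an RMSE from first principles (the labels and values below are made up):

```python
import math

# Toy classification: true labels vs model predictions.
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 1]

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # true positives
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # false positives
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # false negatives

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(f1)  # 0.75

# Toy regression: root mean squared error.
actual = [3.0, 5.0, 7.0]
predicted = [2.5, 5.0, 7.5]
rmse = math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))
print(round(rmse, 3))  # 0.408
```

Libraries such as scikit-learn provide these metrics ready-made, but nothing beyond ratios and square roots is involved.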
That’s the final step in the Data Science lifecycle; which platform to use for deploying the model varies from organisation to organisation.
For example, some companies use Google Cloud while others use AWS. Deploying the model doesn’t require any maths; instead it requires programming skills and an understanding of the platform on which the model is to be deployed.