You will learn about the prerequisites for data science in this section.
Contrary to popular belief, data science is not a new concept. It has been around for many years; it was previously known as statistics, and data scientists were known as statisticians. Data science is one of the best tools for processing data and drawing inferences and predictions from it. Beneath the fancy name lies hardcore statistics. This foundation gives businesses the capacity to turn data into improved productivity. It is impossible to master data science without becoming comfortable with statistics.
There are multiple topics under statistics that a data scientist needs to be aware of. Statistics can be broadly divided into two groups, which are:
- Descriptive Statistics
- Inferential Statistics
Descriptive statistics is about presenting and expressing the data. It lets you read the data by summarizing it quantitatively with numbers and/or graphs. To understand descriptive statistics, you need to understand the following topics:
- Normal Distribution
- Central Tendency
- Variability
- Kurtosis and Skewness
Normal Distribution
The normal distribution is also known as the Gaussian distribution. It describes a large amount of data with a single graphical plot, in which a probability density function gives the relative likelihood of each value. A Gaussian distribution generally appears as a symmetrical bell-shaped curve. The peak of the curve sits at the center and represents the average. Data points become less frequent the farther they lie from the center, and they fall off evenly on both sides. Many inferential methods assume that the data is approximately normally distributed, so check this before applying them.
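The symmetry of the bell curve can be checked numerically. The following is a minimal sketch using Python's standard library; the exam-score scenario and its mean of 70 and standard deviation of 10 are assumptions made up for illustration.

```python
from statistics import NormalDist

# Hypothetical example: exam scores with mean 70 and standard deviation 10.
scores = NormalDist(mu=70, sigma=10)

# The density is symmetric: values equally far from the mean are equally likely.
left_density = scores.pdf(60)
right_density = scores.pdf(80)

# Share of values within one and two standard deviations of the mean.
within_one_sd = scores.cdf(80) - scores.cdf(60)
within_two_sd = scores.cdf(90) - scores.cdf(50)

print(f"pdf(60) = {left_density:.5f}, pdf(80) = {right_density:.5f}")
print(f"within 1 SD: {within_one_sd:.3f}")
print(f"within 2 SD: {within_two_sd:.3f}")
```

Roughly 68% of the values fall within one standard deviation of the mean and roughly 95% within two, which is why the curve looks so concentrated around its center.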
Central Tendency
A measure of central tendency identifies a single center point that summarizes the data. There are three such measures: the mean, the median, and the mode. In a perfectly normal distribution all three coincide at the center of the data. The mean is also known as the arithmetic average: the sum of all values divided by the number of values.
The second measure of central tendency is the median. It is the middle value of the data when the values are arranged in ascending order. With an odd number of values, the middle value is easy to find. With an even number of values there is no single middle value, so the median is the mean of the two middle values.
The mode is the third measure of central tendency. It is the value that appears most often in a data set.
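The three measures can be computed with Python's built-in statistics module. This is a small sketch; the data set is hypothetical and was chosen to have an even number of values so that the median is the mean of the two middle values.

```python
import statistics

# Hypothetical data set with an even number of values, already in ascending order.
values = [2, 3, 3, 5, 7, 10]

mean = statistics.mean(values)      # sum of the values divided by their count
median = statistics.median(values)  # mean of the two middle values (3 and 5)
mode = statistics.mode(values)      # the most frequent value

print("mean:", mean)
print("median:", median)
print("mode:", mode)
```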
Variability
Variability shows how far the data points lie from the mean value and how much they differ from one another. The indicators of variability are the range, the variance, and the standard deviation.
The range of a data set is the difference between its largest and smallest values. The variance is the average squared deviation from the mean, and the standard deviation is the square root of the variance.
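As a rough illustration, the indicators of variability can be computed as follows. The data set is hypothetical, and the sketch treats it as a whole population (hence pvariance and pstdev); for a sample, statistics.variance and statistics.stdev would be the usual choice.

```python
import statistics

# Hypothetical data set, treated here as the whole population.
values = [2, 4, 4, 4, 5, 5, 7, 9]

data_range = max(values) - min(values)      # largest value minus smallest value
variance = statistics.pvariance(values)     # average squared deviation from the mean
std_deviation = statistics.pstdev(values)   # square root of the variance

print("range:", data_range)
print("variance:", variance)
print("standard deviation:", std_deviation)
```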
Kurtosis and Skewness
Skewness measures the lack of symmetry in a data set. If the data is symmetrically distributed, as in a normal distribution, the curve has the same shape on both sides of the average and the skewness is zero. When the data is concentrated on the right side with a longer tail to the left, the skew is negative. When the data is concentrated on the left side with a longer tail to the right, the skew is positive.
Kurtosis measures the weight of the tails of a probability distribution. It indicates whether the data is light-tailed or heavy-tailed relative to a normal distribution. Data sets with high kurtosis are heavy-tailed, and data sets with low kurtosis have light tails.
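Both quantities can be sketched from their moment definitions. The helper names skewness and excess_kurtosis and the two data sets below are assumptions for illustration; the sketch uses the population formulas and the "excess" convention, under which a normal distribution has a kurtosis of zero.

```python
import statistics

def skewness(values):
    """Population skewness: the mean of the cubed standardized deviations."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((x - mean) / sd) ** 3 for x in values) / len(values)

def excess_kurtosis(values):
    """Population excess kurtosis: mean fourth standardized deviation minus 3."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((x - mean) / sd) ** 4 for x in values) / len(values) - 3

# Hypothetical data sets: one symmetric, one with a long tail to the right.
symmetric = [1, 2, 3, 4, 5]
right_tailed = [1, 1, 2, 2, 3, 10]

for name, data in (("symmetric", symmetric), ("right-tailed", right_tailed)):
    print(f"{name}: skewness={skewness(data):.3f}, "
          f"excess kurtosis={excess_kurtosis(data):.3f}")
```

The symmetric data set has a skewness of zero, while the single large value in the second data set pulls the skewness positive and makes the tails heavier.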
Descriptive statistics is about describing the data, whereas inferential statistics is about getting insights from it. Generally, inferential statistics is a method of drawing conclusions about a whole, larger population from a smaller data set, or sample.
For instance, suppose you have to count the number of Indians who have been vaccinated against polio. This could be done in two possible ways:
- You can either go around asking every Indian if they are vaccinated, or
- You can sample a small number of people in a city and then extrapolate the result to the larger population across India.
The first method seems impossible, as it is very difficult to reach and ask everyone in the country. The second method uses statistics to draw conclusions from a sample and apply them to the larger population. Here are a few inferential statistical tools:
Central Limit Theorem
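According to the central limit theorem, the means of repeated random samples from a population are approximately normally distributed around the population mean, regardless of the shape of the population itself, provided the samples are large enough. The sample mean therefore serves as an estimate of the population mean, and a sample property such as the standard deviation serves as an estimate of the corresponding population property. As the sample size increases, the error of the estimate shrinks and the distribution of sample means approaches a bell-shaped curve.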
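A small simulation can make the theorem more concrete. This is only a sketch: the population is a made-up exponential distribution (clearly not bell-shaped), the helper sample_means is hypothetical, and the sample sizes are arbitrary.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical skewed population: exponential values with a mean close to 1.
population = [random.expovariate(1.0) for _ in range(100_000)]
population_mean = statistics.fmean(population)
population_sd = statistics.pstdev(population)

def sample_means(sample_size, n_samples=2_000):
    """Draw many random samples and return the mean of each one."""
    return [
        statistics.fmean(random.choices(population, k=sample_size))
        for _ in range(n_samples)
    ]

for n in (5, 50, 500):
    means = sample_means(n)
    print(
        f"n={n:>3}: mean of sample means={statistics.fmean(means):.3f}, "
        f"spread={statistics.pstdev(means):.3f}, "
        f"sigma/sqrt(n)={population_sd / n ** 0.5:.3f}"
    )
```

The sample means stay centered on the population mean, and their spread should track the population standard deviation divided by the square root of the sample size, so larger samples give tighter estimates.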
One of the most important concepts related to the central limit theorem is the confidence interval. It gives a range of plausible values for the average of the population. To create the interval, the margin of error is added to and subtracted from the sample mean. The margin of error is calculated by multiplying the standard error of the mean by the z-score for the chosen confidence level.
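The calculation can be sketched in a few lines. The sample values and the 95% confidence level below are assumptions chosen for illustration.

```python
import math
import statistics
from statistics import NormalDist

# Hypothetical sample of 30 measurements.
sample = [
    52, 48, 50, 55, 47, 51, 49, 53, 50, 46,
    54, 52, 48, 51, 50, 49, 53, 47, 52, 50,
    51, 49, 50, 48, 52, 53, 47, 51, 50, 49,
]

confidence = 0.95  # assumed confidence level

# z-score that leaves (1 - confidence) / 2 in each tail of the standard normal.
z_score = NormalDist().inv_cdf(1 - (1 - confidence) / 2)

sample_mean = statistics.fmean(sample)
standard_error = statistics.stdev(sample) / math.sqrt(len(sample))
margin_of_error = z_score * standard_error

lower = sample_mean - margin_of_error
upper = sample_mean + margin_of_error
print(f"z-score: {z_score:.3f}")
print(f"{confidence:.0%} confidence interval: ({lower:.2f}, {upper:.2f})")
```

For small samples, a t-score is commonly used in place of the z-score; the z-score is used here only because it matches the description above.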
Hypothesis Testing
Hypothesis testing is a procedure for checking how well an assumption about a population is supported by data. The assumption is tested on data collected from a sample of the selected population.
The hypothesis that is to be tested is known as the ‘null hypothesis’. The competing hypothesis that it is tested against is known as the ‘alternative hypothesis’.
For instance, suppose there are two study groups: one consists of people who smoke and the other of people who do not. The study begins by assuming that the average number of patients who have cancer is the same in the smoking and the non-smoking group. This is our “null hypothesis,” which we have to check and decide whether to reject.
Our “alternative hypothesis” is that the average number of patients who have cancer is much higher in the group that smokes than in the group that avoids smoking.
Based on the data that has been provided and the actual evidence, we can test the null hypothesis against the alternative and conclude either that we reject the null hypothesis or that we fail to reject it.
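One possible way to run such a test is a one-sided two-proportion z-test, sketched below. The choice of test, the helper name two_proportion_z_test, the 5% significance level, and the counts are all assumptions for illustration; they are not real study data.

```python
import math
from statistics import NormalDist

def two_proportion_z_test(cases_a, n_a, cases_b, n_b):
    """One-sided z-test of H0: p_a == p_b against H1: p_a > p_b."""
    p_a = cases_a / n_a
    p_b = cases_b / n_b
    # Under the null hypothesis both groups share one pooled proportion.
    pooled = (cases_a + cases_b) / (n_a + n_b)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / standard_error
    p_value = 1 - NormalDist().cdf(z)  # probability of a z at least this large
    return z, p_value

# Hypothetical counts: 90 cancer patients among 1000 smokers,
# 50 cancer patients among 1000 non-smokers.
z, p_value = two_proportion_z_test(90, 1000, 50, 1000)

alpha = 0.05  # assumed significance level
decision = "reject" if p_value < alpha else "fail to reject"
print(f"z = {z:.2f}, p-value = {p_value:.5f}: {decision} the null hypothesis")
```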
ANOVA
ANOVA (analysis of variance) is a hypothesis-testing methodology used to compare multiple groups. It checks whether two or more study groups have the same average by comparing the variance between the groups with the variance within them. Testing all the groups at once keeps the chance of error lower than running many separate pairwise tests. ANOVA is calculated using the F-ratio.
The F-ratio is the ratio of the mean square between the groups to the mean square within the groups. There are many different ways of carrying out ANOVA.
As for the two hypotheses, the null hypothesis assumes that the averages of all the groups are the same. In contrast, the alternative hypothesis states that the averages are not all the same, that is, at least one group's average differs from the others.
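The F-ratio for a one-way ANOVA can be computed by hand, as in the following sketch. The helper name f_ratio and the three groups are hypothetical, and the sketch stops at the F-ratio; turning it into a p-value requires the F-distribution, which Python's standard library does not provide.

```python
import statistics

def f_ratio(groups):
    """One-way ANOVA F-ratio: mean square between groups / mean square within groups."""
    all_values = [x for group in groups for x in group]
    grand_mean = statistics.fmean(all_values)
    k = len(groups)            # number of groups
    n_total = len(all_values)  # total number of observations

    # Variation of the group means around the grand mean.
    ss_between = sum(
        len(group) * (statistics.fmean(group) - grand_mean) ** 2
        for group in groups
    )
    # Variation of the observations around their own group mean.
    ss_within = sum(
        (x - statistics.fmean(group)) ** 2
        for group in groups
        for x in group
    )

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return ms_between / ms_within

# Hypothetical measurements for three study groups.
groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
print(f"F-ratio: {f_ratio(groups):.2f}")
```

A large F-ratio means the group averages differ by more than the spread inside the groups would explain, which counts as evidence against the null hypothesis that all averages are the same.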