While a bar chart can give you some high-level insights into data, statistics lets you interrogate it. Simply put, statistics is the heart of data science. With it, you can carry out technical analysis and extract information in a targeted manner. The math behind that analysis gives us concrete results rather than guesses.
Since statistics lets us dig deeper into data and uncover fine-grained insights, we have listed the statistical concepts that every beginner in data science should learn.
5 Concepts to Know
1. Probability Distributions
Probability is generally defined as the chance that an event will occur. In data science it is usually expressed as a number between 0 and 1, where 1 means the event is certain to happen and 0 means it is certain not to. Values in between express how likely the event is.
Three distributions come up most often:
- The Uniform Distribution is the most basic form. It takes a single constant value across a given range, and anything outside that range is just 0. You can think of it as an on-or-off switch.
- The Gaussian (Normal) Distribution is defined by its mean and standard deviation. The mean shifts the distribution along the axis, while the standard deviation controls its spread.
- The Poisson Distribution, which models counts of events, resembles the Gaussian but adds skewness. If the skewness is low, the spread is close to symmetric, much like a Gaussian. If the skewness is high, the spread differs from one direction to the other.
There are many other distributions, but these three are the main ones to start with, and recognizing them in your data adds a lot of value to an analysis.
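To make the three shapes concrete, here is a minimal sketch that draws samples from each distribution using only Python's standard library. The parameter values, the sample size, and the helper names `poisson_sample` and `skewness` are assumptions chosen for illustration. The Poisson sampler uses Knuth's multiplication method, which is fine for small rates but is not how a production library would do it.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable
N = 20_000      # illustrative sample size

def poisson_sample(lam):
    """Draw one Poisson(lam) value using Knuth's multiplication method."""
    threshold = math.exp(-lam)
    count, product = 0, random.random()
    while product > threshold:
        count += 1
        product *= random.random()
    return count

def skewness(values):
    """Sample skewness: the mean of the cubed z-scores."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return sum(((v - mean) / std) ** 3 for v in values) / len(values)

# Uniform: constant density on [2, 6], zero outside that range.
uniform = [random.uniform(2.0, 6.0) for _ in range(N)]
# Gaussian: the mean (10) sets the position, the standard deviation (2) the spread.
gaussian = [random.gauss(10.0, 2.0) for _ in range(N)]
# Poisson: a small rate gives a strongly skewed shape, a large rate a nearly symmetric one.
poisson_low = [poisson_sample(1.0) for _ in range(N)]
poisson_high = [poisson_sample(50.0) for _ in range(N)]

for name, data in [("uniform", uniform), ("gaussian", gaussian),
                   ("poisson, rate 1", poisson_low), ("poisson, rate 50", poisson_high)]:
    print(f"{name:18s} mean={statistics.fmean(data):7.3f} "
          f"std={statistics.pstdev(data):6.3f} skew={skewness(data):6.3f}")
```

In theory the skewness of a Poisson distribution with rate λ is 1/√λ, so the rate-1 sample should come out clearly skewed while the rate-50 sample should look close to the Gaussian.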
2. Statistical Features
When you analyze a dataset, statistical features are the first thing you will look at. These include the mean, median, variance, percentiles, and similar summaries. They are considerably easier to understand with a box plot, and here’s what a box plot tells you:
- A short box means most data points are similar, because many values lie in a small range.
- A tall box means most data points differ from one another, because the values are spread over a wide range.
- A median closer to the bottom of the box means most data points have lower values.
- A median closer to the top of the box means most data points have higher values.
The last two points come down to one idea: if the median does not sit in the center of the box, the data is skewed.
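The numbers behind a box plot can be computed directly. The sketch below uses Python's `statistics` module on a small made-up sample; the data and the helper names `summarize` and `median_position` are assumptions for illustration only.

```python
import statistics

def summarize(values):
    """Return the summary statistics a box plot is drawn from."""
    q1, median, q3 = statistics.quantiles(values, n=4)  # the three quartile cut points
    return {
        "mean": statistics.fmean(values),
        "median": median,
        "variance": statistics.variance(values),  # sample variance
        "q1": q1,
        "q3": q3,
    }

def median_position(summary):
    """Say where the median sits inside the box (between q1 and q3)."""
    lower = summary["median"] - summary["q1"]
    upper = summary["q3"] - summary["median"]
    if lower < upper:
        return "closer to the bottom"
    if lower > upper:
        return "closer to the top"
    return "centered"

# Hypothetical sample: mostly small values with a couple of large ones.
sample = [1, 2, 2, 3, 3, 3, 4, 5, 9, 15]
summary = summarize(sample)
print(summary)
print("median is", median_position(summary))
```

For this sample the quartiles are 2, 3, and 6, so the median sits closer to the bottom of the box: most values are low, and a few large ones pull the mean (4.7) above the median.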
3. Dimensional Reduction
Dimensionality reduction simply means decreasing the number of dimensions in a dataset, that is, the number of feature variables.
Take a 3-dimensional cube with smaller colored cubes inside it, for example. Processing them in 3D is difficult. But if we project them onto a 2D plane, we can still separate the colors easily and reach the same outcome with less hassle.
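The cube example can be sketched with made-up data. In the snippet below, two color groups differ only in x and y, while z is pure noise, so dropping z loses nothing that matters. The data, the helper names, and the choice of which axis to drop are all assumptions made by hand here; a technique such as principal component analysis would pick the projection from the data instead.

```python
import random

random.seed(1)  # fixed seed so the sketch is repeatable

def make_points(color, center_x, center_y, n=50):
    """Hypothetical 3D points: the color groups differ in x and y; z is noise."""
    return [
        (random.gauss(center_x, 0.5), random.gauss(center_y, 0.5),
         random.uniform(0.0, 10.0), color)
        for _ in range(n)
    ]

points_3d = make_points("red", 0.0, 0.0) + make_points("blue", 5.0, 5.0)

# Project onto the x-y plane by dropping the z coordinate.
points_2d = [(x, y, color) for x, y, _z, color in points_3d]

def predict(x, y):
    """A straight line in 2D (x + y = 5) is enough to split the two colors."""
    return "blue" if x + y > 5.0 else "red"

correct = sum(predict(x, y) == color for x, y, color in points_2d)
accuracy = correct / len(points_2d)
print(f"accuracy after projecting to 2D: {accuracy:.2f}")
```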
Another method is feature pruning, in which unimportant features are eliminated. If you have 20 features but 3 of them have a low correlation with the final result, you can remove those three.
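One simple way to prune features is to compute each feature's Pearson correlation with the target and drop those below a cutoff. The sketch below assumes made-up data, an arbitrary threshold of 0.2, and hypothetical helper names; note that correlation only captures linear relationships, so this is a rough first filter rather than a complete method.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def prune_features(features, target, threshold=0.2):
    """Keep only features whose absolute correlation with the target reaches the threshold."""
    return {
        name: column
        for name, column in features.items()
        if abs(pearson(column, target)) >= threshold
    }

random.seed(2)  # fixed seed so the sketch is repeatable
n = 500
features = {
    "signal_a": [random.gauss(0, 1) for _ in range(n)],
    "signal_b": [random.gauss(0, 1) for _ in range(n)],
    "noise": [random.gauss(0, 1) for _ in range(n)],  # unrelated to the target
}
# Hypothetical target that depends on the two signal features only.
target = [2 * a - b + random.gauss(0, 0.5)
          for a, b in zip(features["signal_a"], features["signal_b"])]

kept = prune_features(features, target)
print("kept features:", sorted(kept))
```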
4. Over and Under Sampling
In a classification problem where the dataset is tilted heavily toward one class, we can use oversampling or undersampling. For instance, suppose you have 3000 examples of the first class and only 300 of the second. Here’s how each approach handles that:
- Oversampling makes copies of minority-class examples until their number matches the majority class. In the case above, the minority class is the second class. The copies are made in a way that maintains the minority class's distribution.
- Undersampling is the opposite. Only some examples are picked from the majority class, so that their number matches the minority class while the majority class's distribution is maintained.
Both sampling approaches balance the classes without collecting new data.
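Both approaches can be sketched with random sampling, using the 3000 versus 300 split from the example above. The function names and the placeholder records are assumptions for illustration. Drawing at random is what keeps each class's distribution roughly intact.

```python
import random

def oversample(minority, target_size, rng):
    """Add random copies of minority examples (with replacement) up to target_size."""
    extra = rng.choices(minority, k=target_size - len(minority))
    return minority + extra

def undersample(majority, target_size, rng):
    """Pick target_size majority examples at random, without replacement."""
    return rng.sample(majority, target_size)

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Placeholder records standing in for real examples.
first_class = [("first", i) for i in range(3000)]
second_class = [("second", i) for i in range(300)]

second_oversampled = oversample(second_class, len(first_class), rng)
first_undersampled = undersample(first_class, len(second_class), rng)

print("after oversampling: ", len(first_class), "vs", len(second_oversampled))
print("after undersampling:", len(first_undersampled), "vs", len(second_class))
```

Oversampling keeps every example but repeats minority records, which can encourage overfitting to them. Undersampling avoids duplicates but throws away most of the majority class.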
Things to Consider
Apart from knowing these basic concepts related to data science, here are a few things you need to consider for this profession:
- The math you learned in school, including derivatives, vectors, eigenvalues, and linear transformations, is essential for a data science career.
- Programming knowledge is necessary for data science. If it seems optional in some roles today, it will not be in the near future.
- Analytical and critical thinking skills are what make you stand out as a data scientist. Every problem calls for its own approach, and finding it takes both.
Many people dream of a career in data science as the field enters every industry. Who wouldn’t want to know whether a certain business decision will pay off? But to master data science and find correct answers to problems, you need to know the basic concepts. These concepts are not going anywhere, and knowing them will only help you get better results faster.