Data Science Process End to End

As you learn Data Science, it is important to understand the complete workflow. This post gives you a bird's-eye view of what goes on in a Data Science project end to end. Virtually every data science or data analytics project follows this methodology, no matter how big or small the project is.

There are 4 major steps. A Data Science or Machine Learning project goes through all 4 of them, while a Data Analytics project typically uses only the first 3.

Steps:

1. Data Extraction

2. Data Pre-processing

3. Exploratory Data Analysis (EDA)

4. Applying Machine Learning algorithms for regression, classification or clustering


Data Science Pipeline
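
To make the flow concrete, here is a minimal Python sketch of how these four steps might be chained together. Every function name and the file name below are placeholders for illustration, not part of any particular library.

```python
import pandas as pd

def extract_data() -> pd.DataFrame:
    # Step 1: pull raw records from a source (CSV, database, web scrape, ...)
    return pd.read_csv("raw_data.csv")  # placeholder file name

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    # Step 2: handle missing values, duplicates, data types, etc.
    return df.dropna().drop_duplicates()

def explore(df: pd.DataFrame) -> None:
    # Step 3: quick exploratory summaries (real EDA goes much further)
    print(df.describe())
    print(df.head())

def model(df: pd.DataFrame) -> None:
    # Step 4: fit a machine learning model (regression, classification or clustering)
    pass  # model training would go here

if __name__ == "__main__":
    data = preprocess(extract_data())
    explore(data)
    model(data)
```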


Now we will discuss each one of these in a little detail.


Step 1: Data Extraction

Importance:

Data Extraction is the first and a very important step. In most data science courses we start with a toy dataset handed to us (such as the iris dataset) and never get to see how complex the process of extracting that data file actually is. In the real world, while designing the end-to-end process, this step determines the success or failure of the project. As data scientists we must know how data is collected and from where. Ideally this step should not be skipped while learning data science, because it decides how the rest of the analysis will go. If data collection is flawed we end up with a useless dataset, and we can never get to a good prediction if the data we are using is unrealistic, flawed or biased.

How is it accomplished?

Data is everywhere, and there is no fixed set of sources from which it can be tapped. However, here are a few very common types of extraction to get started with:

(i) Web scraping

(ii) Tabular Data

(iii) ETL Tools

(iv) Media files




(i) Web scraping:

Data is available on websites in the form of reviews, opinions on social media, blogs, product descriptions, etc. Such data can be extracted by web scraping and stored for analysis. Web scraping with Python is in huge demand these days, as an enormous amount of content goes online every second and there is a need to capture that information.
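
As a small illustration, the sketch below scrapes paragraph text from a page using the requests and BeautifulSoup libraries. The URL and the CSS class are hypothetical placeholders; a real site would need its own selectors (and respect for its terms of use).

```python
import requests
from bs4 import BeautifulSoup

# Hypothetical page listing product reviews
url = "https://example.com/reviews"

response = requests.get(url, timeout=10)
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")

# Collect the text of every <p> element carrying a (hypothetical) "review" class
reviews = [tag.get_text(strip=True) for tag in soup.find_all("p", class_="review")]

print(f"Scraped {len(reviews)} reviews")
```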

(ii) Tabular Data:

This is a format we are very used to. Huge numbers of records are available in SQL databases. Insurance companies, banks, big retailers, governments, and many other organizations large and small have been keeping their transactional records in such databases for decades.

Excel (.xls and .xlsx) and CSV (Comma Separated Values) files are also very common.
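
For illustration, here is a minimal sketch of loading tabular data with pandas. The file names, sheet name, and table name are placeholders made up for this example.

```python
import sqlite3
import pandas as pd

# Read a CSV file into a DataFrame (file name is a placeholder)
csv_df = pd.read_csv("transactions.csv")

# Read an Excel sheet (needs the openpyxl package for .xlsx files)
excel_df = pd.read_excel("transactions.xlsx", sheet_name="Sheet1")

# Pull records straight from a SQL database (here a local SQLite file)
with sqlite3.connect("transactions.db") as conn:
    sql_df = pd.read_sql("SELECT * FROM transactions", conn)

print(csv_df.shape, excel_df.shape, sql_df.shape)
```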

(iii) ETL Tools:

ETL stands for Extract, Transform & Load. These are enterprise-level tools used to extract data from various sources, transform it, and load it into a warehouse. Some common ETL tools in use as of writing this post are Informatica, AWS Glue, Stitch, Xplenty, etc.
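
Those are managed enterprise products, but the extract-transform-load pattern itself can be sketched in a few lines of pandas. The file names, column names, and target table below are assumptions made purely for illustration.

```python
import sqlite3
import pandas as pd

# Extract: read raw records from a source file (placeholder name)
raw = pd.read_csv("raw_orders.csv")

# Transform: normalise column names, drop duplicates, derive a new field
raw.columns = [c.strip().lower() for c in raw.columns]
clean = raw.drop_duplicates().copy()
clean["total"] = clean["quantity"] * clean["unit_price"]  # hypothetical columns

# Load: write the cleaned records into a "warehouse" table (local SQLite file)
with sqlite3.connect("warehouse.db") as conn:
    clean.to_sql("orders", conn, if_exists="replace", index=False)
```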

(iv) Media:

So far we have discussed text data (structured or unstructured). Data is available in the form of media as well. Social platforms like YouTube, TikTok & Instagram are some of the largest sources of such data. This video and audio data is mined to extract information. We will dig deeper into its processing later.
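
As a tiny taste of working with media, the sketch below reads frames from a local video file using OpenCV (the opencv-python package); the file name is a placeholder.

```python
import cv2  # pip install opencv-python

# Open a local video file (placeholder name) and count its frames
capture = cv2.VideoCapture("sample_video.mp4")

frame_count = 0
while True:
    success, frame = capture.read()
    if not success:
        break
    frame_count += 1
    # Each `frame` is a NumPy array (height x width x 3) ready for further processing

capture.release()
print(f"Read {frame_count} frames")
```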


We have learnt some of the ways data is extracted. Please note that these are only a handful; there are several other sources and methods through which data is extracted. However, these are some of the more prominent ones and make for a solid base to start learning Data Science.

Once data extraction is done, we have a dataset (in whatever format it may be) to work with. The next step in a data science pipeline is Data Pre-processing.

Visit this page to learn about Data Pre-processing.
