Data Quality

May 8, 2021 00:00 · 183 words · 1 minute read

What is data quality?

Data quality is the process of ensuring that your raw data is accurate, consistent, timely and complete.

Missing values

Your data may contain missing values or incorrectly formatted values. Format problems are especially common in date columns.
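As a minimal sketch of how such problems can be detected with pandas, the example below counts missing values per column and parses a date column. The column names, the sample rows and the `%Y-%m-%d` date format are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical sample data: one malformed date, one missing date, one missing age.
df = pd.DataFrame({
    "signup_date": ["2021-05-08", "not a date", None],
    "age": [34, None, 29],
})

# Count missing values in each column.
missing_per_column = df.isna().sum()

# Parse the dates; values that do not match the assumed format become NaT
# instead of raising an error.
df["signup_date"] = pd.to_datetime(
    df["signup_date"], format="%Y-%m-%d", errors="coerce"
)

# Rows whose date was missing or could not be parsed.
bad_dates = df[df["signup_date"].isna()]
print(missing_per_column)
print(bad_dates)
```

Using `errors="coerce"` keeps the bad rows visible so you can decide whether to fix or drop them, rather than failing on the first malformed value.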

Unwanted characters in columns

For example, in a column of years you don’t want to see values such as “<2009”; you want “2009”.
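One possible way to do this in pandas is to strip every non-digit character and then convert the result to a number. The sample values below are hypothetical, and the sketch assumes that dropping the “<” is acceptable for your analysis, since it discards the "before 2009" meaning.

```python
import pandas as pd

# Hypothetical year column with stray characters and whitespace.
years = pd.Series(["<2009", "2010", " 2011 ", ">2015"])

# Remove everything that is not a digit, then convert to integers.
digits_only = years.str.replace(r"[^0-9]", "", regex=True)
cleaned_years = pd.to_numeric(digits_only)
print(cleaned_years.tolist())
```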

Examine categorical data

For example, a column of Yes/No values may contain missing values. What can you do with the “NaN” values? One option for categorical data is one-hot encoding, which represents each category as a 1/0 dummy variable.
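A minimal sketch of this idea with `pandas.get_dummies` follows. The `subscribed` column and its values are assumptions for illustration. Passing `dummy_na=True` adds a separate indicator column for missing values, which is one way to keep the “NaN” rows instead of dropping them.

```python
import numpy as np
import pandas as pd

# Hypothetical Yes/No column with one missing value.
df = pd.DataFrame({"subscribed": ["Yes", "No", np.nan, "Yes"]})

# One 1/0 column per category, plus one indicator column for missing values.
encoded = pd.get_dummies(df, columns=["subscribed"], dummy_na=True, dtype=int)
print(encoded)
```

Each row ends up with exactly one 1 across the new columns, so the missing value is treated as its own category.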

The usual tasks in the Data Quality part are

  • Resolve missing values
  • Convert date feature columns to datetime format
  • Rename feature columns
  • Remove or correct inappropriate values from feature columns
  • Create one-hot encodings (also called dummy variables)
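The tasks above can be chained into a single cleaning function. This is only a sketch under stated assumptions: the function name `clean`, the column names, the date format and the choice to drop rows with missing dates are all hypothetical, and a real dataset may need a different order or strategy.

```python
import pandas as pd

def clean(raw: pd.DataFrame) -> pd.DataFrame:
    # Rename feature columns (hypothetical names).
    df = raw.rename(
        columns={"Signup Date": "signup_date", "Yr": "year", "Sub": "subscribed"}
    )
    # Correct inappropriate values: keep only the digits of the year.
    df["year"] = pd.to_numeric(df["year"].str.replace(r"[^0-9]", "", regex=True))
    # Convert to datetime; unparseable values become NaT.
    df["signup_date"] = pd.to_datetime(
        df["signup_date"], format="%Y-%m-%d", errors="coerce"
    )
    # Resolve missing values, here by dropping rows without a date.
    df = df.dropna(subset=["signup_date"])
    # Create one-hot encodings for the categorical column.
    return pd.get_dummies(df, columns=["subscribed"], dtype=int)

raw = pd.DataFrame({
    "Signup Date": ["2021-05-08", None, "2021-05-10"],
    "Yr": ["<2009", "2010", "2011"],
    "Sub": ["Yes", "No", "No"],
})
cleaned = clean(raw)
print(cleaned)
```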

Importance of data quality

Data quality is super important because it is the ground floor for Machine Learning. As Google states in its course, Machine Learning is a way to use standard algorithms to derive predictive insights from data and make repeated decisions.