Data Validation for Machine Learning using TFDV

After deploying a Machine Learning model, we need some way to validate incoming datasets before we feed them into the ML pipeline. We can’t just rely on our sources and take it for granted that the data will be fine. There might be new columns, new values or even wrong data types, and most of the time the model will silently ignore them. That means we may end up using an outdated or biased model.

In this post, I will show you a simple and fast way to validate your data using TensorFlow Data Validation (TFDV). TFDV is a powerful library that can compute descriptive statistics, infer a schema and detect data anomalies at scale. It is used to analyse and validate petabytes of data at Google every day across thousands of different applications in production.
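
If you don’t already have the library, it is available on PyPI (note that each TFDV release is pinned to specific TensorFlow versions, so check the compatibility table if the install fails):

pip install tensorflow-data-validation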

But first, let’s create some dummy data.

import pandas as pd
import numpy as np
import tensorflow_data_validation as tfdv

df = pd.DataFrame({'Name': np.random.choice(['Billy', 'George'], 100),
                   'Number': np.random.randn(100),
                   'Feature': np.random.choice(['A', 'B'], 100)})

df.head()

The Schema

First, TFDV will compute statistics from our original data and infer a “schema” from them, which we can use later to validate the new data.

df_stats = tfdv.generate_statistics_from_dataframe(df)

schema = tfdv.infer_schema(df_stats)

schema
feature {
  name: "Name"
  type: BYTES
  domain: "Name"
  presence {
    min_fraction: 1.0
    min_count: 1
  }
  shape {
    dim {
      size: 1
    }
  }
}
feature {
  name: "Number"
  type: FLOAT
  presence {
    min_fraction: 1.0
    min_count: 1
  }
  shape {
    dim {
      size: 1
    }
  }
}
feature {
  name: "Feature"
  type: BYTES
  domain: "Feature"
  presence {
    min_fraction: 1.0
    min_count: 1
  }
  shape {
    dim {
      size: 1
    }
  }
}
string_domain {
  name: "Name"
  value: "Billy"
  value: "George"
}
string_domain {
  name: "Feature"
  value: "A"
  value: "B"
}

As you can see, the schema is a protocol-buffer message (printed above in text format) that captures the characteristics of the data: each feature’s type and expected presence, plus a domain of allowed values for the string features. We can display it in a nicer format as follows:

tfdv.display_schema(schema)
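
We can also inspect individual features programmatically. A minimal sketch using TFDV’s schema helpers:

# Look up the schema entries for single features
number_feature = tfdv.get_feature(schema, 'Number')
name_domain = tfdv.get_domain(schema, 'Name')

print(number_feature.type)  # 3, i.e. FLOAT in the FeatureType enum
print(name_domain.value)    # ['Billy', 'George']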

The schema can be saved and loaded using the following code.

from tensorflow_data_validation.utils.schema_util import write_schema_text, load_schema_text

#save
write_schema_text(schema, "my_schema")

#load
schema = load_schema_text("my_schema")

Let’s suppose that we trained a machine learning model with the data above. Now we will create some hypothetical new data that we want to validate.

test = pd.DataFrame({'Name': ['Guilia', 'Billy', 'George', 'Billy', 'Billy'],
                     'Number': [1, 2, 3, 5, 1],
                     'Feature': ['A', 'A', 'A', 'B', 'A']})

test['Feature2'] = 'YES'

test.head()

Data Validation

Now it’s time to validate the new data.

new_stats = tfdv.generate_statistics_from_dataframe(test)

anomalies = tfdv.validate_statistics(statistics=new_stats, schema=schema)

tfdv.display_anomalies(anomalies)
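
Outside a notebook, you can also read the result programmatically; a minimal sketch, assuming the standard Anomalies proto layout:

# Each anomalous feature gets an entry in the anomaly_info map
for feature_name, info in anomalies.anomaly_info.items():
    print(f'{feature_name}: {info.short_description}')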

We got all the “anomalies” found in the new data: there is a new column (Feature2), the Number column has the wrong data type (INT instead of the expected FLOAT), and the Name column contains a new value (Guilia). This may indicate data drift, so we may have to retrain our model or apply different data preprocessing. TFDV also gives us the option to update the schema or to ignore some of the anomalies.
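
For example, if Guilia turns out to be a legitimate new value rather than bad data, we can relax the schema and re-validate. A minimal sketch:

# Accept 'Guilia' as a valid value for the Name feature
tfdv.get_domain(schema, 'Name').value.append('Guilia')

# Re-validate: the Name anomaly should now be gone
anomalies = tfdv.validate_statistics(statistics=new_stats, schema=schema)
tfdv.display_anomalies(anomalies)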

Summing it up

Data validation using TFDV is a cost-effective way to validate newly incoming data. It parses the new data and reports any anomalies, such as missing values, new columns and new values. It can also help us determine whether there is data drift and prevent us from using an outdated model.
