After deploying a Machine Learning model, we need a way to validate incoming datasets before feeding them into the ML pipeline. We can't just rely on our sources and take it for granted that the data will be fine. There might be new columns, new values or even wrong data types, and most of the time the model will silently ignore them. That means we may end up using an outdated or biased model.
In this post, I will show you a simple and fast way to validate your data using TensorFlow Data Validation (TFDV). TFDV is a powerful library that can compute descriptive statistics, infer a schema and detect data anomalies at scale. It is used at Google every day to analyse and validate petabytes of data across thousands of different production applications.
But first, let’s create some dummy data.
import pandas as pd
import numpy as np
import tensorflow_data_validation as tfdv

df = pd.DataFrame({'Name': np.random.choice(['Billy', 'George'], 100),
                   'Number': np.random.randn(100),
                   'Feature': np.random.choice(['A', 'B'], 100)})
df.head()

The Schema
First, TFDV will infer a “schema” from our original data, which we can use later to validate the new data.
df_stats = tfdv.generate_statistics_from_dataframe(df)
schema = tfdv.infer_schema(df_stats)
schema
feature {
  name: "Name"
  type: BYTES
  domain: "Name"
  presence {
    min_fraction: 1.0
    min_count: 1
  }
  shape {
    dim {
      size: 1
    }
  }
}
feature {
  name: "Number"
  type: FLOAT
  presence {
    min_fraction: 1.0
    min_count: 1
  }
  shape {
    dim {
      size: 1
    }
  }
}
feature {
  name: "Feature"
  type: BYTES
  domain: "Feature"
  presence {
    min_fraction: 1.0
    min_count: 1
  }
  shape {
    dim {
      size: 1
    }
  }
}
string_domain {
  name: "Name"
  value: "Billy"
  value: "George"
}
string_domain {
  name: "Feature"
  value: "A"
  value: "B"
}
As you can see, the schema is a protocol buffer text output that describes the characteristics of the data. We can display it in a nicer format as follows:
tfdv.display_schema(schema)


The schema can be saved and loaded using the following code.
from tensorflow_data_validation.utils.schema_util import write_schema_text, load_schema_text

# save
write_schema_text(schema, "my_schema")

# load
schema = load_schema_text("my_schema")
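Depending on the TFDV version you have installed, the same helpers may also be exposed at the package top level; this is version-dependent, so check your install before relying on it:
# Version-dependent alternative: the schema utilities are re-exported
# at the top level in recent TFDV releases.
tfdv.write_schema_text(schema, "my_schema")
schema = tfdv.load_schema_text("my_schema")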
Let’s suppose that we trained a machine learning model on the data above. Now we will create the hypothetical new data that we want to validate.
test = pd.DataFrame({'Name': {0: 'Guilia', 1: 'Billy', 2: 'George', 3: 'Billy', 4: 'Billy'},
                     'Number': {0: 1, 1: 2, 2: 3, 3: 5, 4: 1},
                     'Feature': {0: 'A', 1: 'A', 2: 'A', 3: 'B', 4: 'A'}})
test['Feature2'] = 'YES'
test.head()

Data Validation
Now it is time to validate the new data.
new_stats = tfdv.generate_statistics_from_dataframe(test)
anomalies = tfdv.validate_statistics(statistics=new_stats, schema=schema)
tfdv.display_anomalies(anomalies)

We got all the “anomalies” in the new data: there is a new column (Feature2), the Number column has the wrong data type, and the Name column contains a previously unseen value. This may indicate data drift, so we may have to retrain our model or change our data preprocessing. TFDV also gives us the option to update the schema or to ignore some of the anomalies.
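As a minimal sketch of the schema-update path, here is how we could accept the new Name value and re-validate; which anomalies you choose to tolerate depends on your use case:
# Accept 'Guilia' as a valid value for the Name column by extending its string domain
tfdv.get_domain(schema, 'Name').value.append('Guilia')

# Re-validate: the unexpected string value anomaly for Name should now be gone
updated_anomalies = tfdv.validate_statistics(statistics=new_stats, schema=schema)
tfdv.display_anomalies(updated_anomalies)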
Summing it up
Data validation using TFDV is a cost-effective way to validate incoming data. It will parse the new data and report any anomalies they contain, such as missing values, new columns, and new values. It can also help us determine whether there is data drift and prevent us from using an outdated model.
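To close, here is a minimal sketch of how such a check could gate a pipeline; run_training is a hypothetical placeholder for whatever step should only run on data that passes validation:
# Hypothetical pipeline guard: stop before training if TFDV finds any anomaly.
def validate_or_stop(new_df, schema):
    stats = tfdv.generate_statistics_from_dataframe(new_df)
    anomalies = tfdv.validate_statistics(statistics=stats, schema=schema)
    if anomalies.anomaly_info:  # the Anomalies proto maps feature name -> anomaly
        raise ValueError("Data validation failed: %s" % list(anomalies.anomaly_info))
    return stats

# validate_or_stop(test, schema)  # would raise for the data above
# run_training(test)              # hypothetical step, reached only if validation passes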