Before we start, you should keep in mind that we can import a curated list of 3rd party packages from Anaconda. We can get a list of all available packages and their corresponding versions by running:
select * from information_schema.packages where language = 'python';
Also, keep in mind that for the time being, Snowflake supports Python 3.8 only and this should be specified within the UDF.
UDFs with In-Line Code
We can create a UDF function with an “in-line code” which means that we write directly the code or with a “code uploaded from a stage” where within the function we specify the location of an existing Python source code in a stage. In this tutorial, we will work with the “in-line code” UDFs.
We will build a function that applies a text cleansing by converting the text to lower case and by removing the HTM tags and markups, the digits, the punctuation, the extra spaces, and the leading/trailing whitespace.
create or replace function text_cleansing(s string) returns string language python runtime_version = '3.8' handler = 'cleansing' as $$ def cleansing(x): import re import string # lower case x = x.lower() # Remove HTML tags/markups x = re.sub('<.*?>', '', x) # remove digits x = re.sub('\d+', '', x) # remove punctuation x = x.translate(str.maketrans('', '', string.punctuation)) # remove etra space and tabs x = re.sub('\s+', ' ', x) # remove leading/trailing whitespace x = x.strip() return x $$;
Once we run the snippet code above, the UDF is ready to use. Make sure that you received a message that the function has been created successfully.
Notice that the handler function is within the double dollar signs ($$) and we start the function by specifying the name, the input and the input type, the type of the output, the programming language, the runtime and finally the handler that should be the same name with the function within the double dollar signs, in our case
cleansing. Note that if we wanted to load packages from Anaconda, we should have added a line mentioning the required packages like:
create or replace function py_udf() returns variant language python runtime_version = 3.8 packages = ('numpy','pandas') handler = 'udf' as $$ import numpy as np import pandas as pd import xgboost as xgb def udf(): return [np.__version__, pd.__version__] $$;
Call the UDF
Now the UDF is ready to use as a common built-in function, meaning that you can use it in any SQL statement. Let’s try to use the function on a sample text:
“This is an example (of) how to remove numbers 123 and tags <li> and punctuations !:#*”
select TEXT_CLEANSING('This is an example (of) how to remove numbers 123 and tags <li> and punctuations !:#*') as cleaned_text
And we get:
this is an example of how to remove numbers and tags and punctuations
How to Get a List of the UDF
We can get a detailed list of UDFs by running the command:
show user functions;
How to Delete a UDF
You can delete a UDF by running the command:
drop function text_cleansing(string);
Note that you need to pass the function name and the data type.