OpenAI is the talk of the town thanks to its impressive performance on many AI tasks. Although the existing models are very strong, there is always room for improvement, and in many cases it is necessary to fine-tune a model for a specific task. In this tutorial, we will show you how to fine-tune a custom NLP classification model with OpenAI.
Create a Conda Environment
We encourage you to create a new conda environment. In this tutorial, we have built a new conda environment called OpenAI. In addition, we have added the OpenAI API key as an environment variable as follows:
First, we activate the environment:
conda activate OpenAI
Then, we install the OpenAI library:
pip install --upgrade openai
Then, we pass the variable:
conda env config vars set OPENAI_API_KEY=<OPENAI_API_KEY>
Once you have set the environment variable, you will need to reactivate the environment by running:
conda activate OpenAI
In order to make sure that the variable exists, you can run:
conda env config vars list
and you will see the OPENAI_API_KEY environment variable with the corresponding value.
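As an extra sanity check, you can also confirm from Python that the key is visible to your scripts (a minimal sketch using only the standard library):

import os

# the variable should be available once the environment has been reactivated
print("OPENAI_API_KEY is set:", os.getenv("OPENAI_API_KEY") is not None)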
The Dataset
For demonstration purposes, we consider a vanilla case where we build a classification model that predicts whether an email is “ham” or “spam”. In other tutorials, we built an Email Spam Detector using Scikit-Learn and TF-IDF, and we fine-tuned an NLP classification model with transformers and HuggingFace. Feel free to have a look at those tutorials in order to get the data and compare the different approaches.
Let’s load the dataset and get the first rows:
import pandas as pd

df = pd.read_csv('spam.csv')
df
According to the documentation, the training data must be a JSONL document, where each line is a prompt-completion pair corresponding to a training example as follows:
{"prompt": "<prompt text>", "completion": "<ideal generated text>"} {"prompt": "<prompt text>", "completion": "<ideal generated text>"} {"prompt": "<prompt text>", "completion": "<ideal generated text>"} ...
Let’s see the different ways that we can generate the JSONL document.
Using Pandas
Pandas allows us to easily create this format file. For example:
# first rename the columns to prompt and completion
df.rename(columns={'text': 'prompt', 'target': 'completion'}, inplace=True)

# write one JSON object per line
df.to_json("spam_pandas.jsonl", orient='records', lines=True)
The first rows of the “spam_pandas.jsonl” file:
{"prompt":"Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...","completion":"ham"} {"prompt":"Ok lar... Joking wif u oni...","completion":"ham"} {"prompt":"Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's","completion":"spam"} {"prompt":"U dun say so early hor... U c already then say...","completion":"ham"}
Using List Comprehensions
If we do not want to use pandas, we can work with list comprehensions. Assuming that we have two lists, one for the prompts and one for the completions, we can apply a list comprehension to build the required dictionaries. Then, we can save the JSONL file by writing one line at a time. For example:
import json

# create separate lists for prompt and completion
prompt = df.prompt.tolist()
completion = df.completion.tolist()

# create a list of dictionaries, one per training example
input_dict = [{"prompt": p, "completion": c} for p, c in zip(prompt, completion)]

# https://stackoverflow.com/questions/38915183/python-conversion-from-json-to-jsonl
with open("spam_list_comprehension.jsonl", "w") as f:
    for entry in input_dict:
        f.write(json.dumps(entry))
        f.write('\n')
The “spam_list_comprehension.jsonl” file:
{"prompt": "Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...", "completion": "ham"} {"prompt": "Ok lar... Joking wif u oni...", "completion": "ham"} {"prompt": "Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's", "completion": "spam"} {"prompt": "U dun say so early hor... U c already then say...", "completion": "ham"}
Using the CLI data preparation tool
OpenAI has developed a tool that validates the data, gives suggestions, and reformats it:
openai tools fine_tunes.prepare_data -f <LOCAL_FILE>
The tool expects “prompt” and “completion” column names or keys, and supports CSV, TSV, XLSX, JSON or JSONL file formats. After guiding you through a series of suggested changes, the output will be a JSONL file ready for fine-tuning. Let’s see it in practice. We open the conda CLI and run:
openai tools fine_tunes.prepare_data -f spam_with_right_column_names.csv
Note that the column names of the “spam_with_right_column_names.csv” file are “prompt” and “completion”. The tool is smart enough to detect the task and returns the following output:
- Based on your file extension, your file is formatted as a CSV file
- Your file contains 5572 prompt-completion pairs
- Based on your data it seems like you're trying to fine-tune a model for classification
- For classification, we recommend you try one of the faster and cheaper models, such as `ada`
- For classification, you can estimate the expected model performance by keeping a held out dataset, which is not used for training
- There are 403 duplicated prompt-completion sets. These are rows: [102, 153, 206, 222, 325, 338, 356, 443, 532, 654, 657, 701, 767, 768, 774, 780, 789, 824, 849, 899, 962, 964, 1001, 1002, 1042, 1131, 1132, 1133, 1139, 1151, 1162, 1163, 1197, 1224, 1235, 1249, 1250, 1303, 1318, 1355, 1379, 1402, 1412, 1426, 1458, 1466, 1482, 1484, 1507, 1568, 1584, 1654, 1679, 1690, 1699, 1720, 1737, 1778, 1779, 1784, 1825, 1828, 1875, 1876, 1893, 1901, 1948, 1956, 1963, 1973, 1980, 1983, 1987, 1988, 1995, 2043, 2094, 2108, 2123, 2124, 2134, 2145, 2162, 2169, 2175, 2215, 2233, 2264, 2265, 2276, 2299, 2307, 2321, 2326, 2343, 2344, 2350, 2362, 2384, 2412, 2446, 2472, 2476, 2508, 2517, 2518, 2521, 2523, 2525, 2563, 2564, 2595, 2610, 2617, 2643, 2645, 2659, 2680, 2687, 2711, 2718, 2720, 2727, 2739, 2741, 2760, 2761, 2763, 2795, 2797, 2811, 2825, 2827, 2829, 2841, 2847, 2858, 2864, 2868, 2879, 2897, 2910, 2942, 2957, 2966, 2970, 2980, 2989, 2990, 3002, 3003, 3007, 3034, 3038, 3049, 3054, 3088, 3099, 3121, 3123, 3134, 3151, 3153, 3154, 3163, 3165, 3166, 3174, 3186, 3200, 3226, 3227, 3241, 3248, 3270, 3278, 3298, 3309, 3314, 3322, 3347, 3357, 3364, 3390, 3392, 3399, 3402, 3406, 3414, 3444, 3453, 3456, 3466, 3469, 3474, 3484, 3487, 3490, 3532, 3547, 3583, 3584, 3592, 3608, 3623, 3626, 3627, 3647, 3673, 3678, 3679, 3691, 3707, 3708, 3728, 3731, 3739, 3753, 3755, 3756, 3761, 3768, 3774, 3785, 3797, 3831, 3832, 3846, 3880, 3881, 3897, 3899, 3911, 3913, 3920, 3942, 3964, 3976, 3985, 3990, 4002, 4009, 4010, 4012, 4038, 4040, 4074, 4081, 4101, 4102, 4115, 4126, 4127, 4138, 4152, 4160, 4167, 4171, 4182, 4189, 4194, 4196, 4197, 4199, 4220, 4231, 4233, 4235, 4242, 4257, 4258, 4279, 4294, 4297, 4298, 4309, 4323, 4346, 4350, 4354, 4370, 4389, 4390, 4412, 4419, 4435, 4448, 4454, 4463, 4466, 4496, 4502, 4515, 4535, 4547, 4554, 4564, 4582, 4585, 4589, 4590, 4626, 4631, 4636, 4643, 4648, 4653, 4658, 4683, 4692, 4697, 4699, 4717, 4733, 4741, 4742, 4744, 4757, 4771, 4774, 4801, 4813, 4846, 4857, 4862, 4867, 4877, 4882, 4886, 4888, 4893, 4896, 4916, 4921, 4928, 4946, 4960, 4961, 4976, 5029, 5035, 5044, 5048, 5053, 5073, 5091, 5104, 5105, 5108, 5129, 5141, 5164, 5171, 5175, 5176, 5182, 5188, 5191, 5201, 5203, 5206, 5214, 5215, 5216, 5225, 5226, 5232, 5236, 5241, 5247, 5257, 5264, 5279, 5284, 5285, 5301, 5314, 5315, 5346, 5357, 5365, 5374, 5375, 5386, 5389, 5423, 5425, 5457, 5458, 5460, 5467, 5469, 5471, 5477, 5488, 5490, 5497, 5510, 5524, 5535, 5539, 5553, 5558]
- Your data does not contain a common separator at the end of your prompts. Having a separator string appended to the end of the prompt makes it clearer to the fine-tuned model where the completion should begin. See https://platform.openai.com/docs/guides/fine-tuning/preparing-your-dataset for more detail and examples. If you intend to do open-ended generation, then you should leave the prompts empty
- The completion should start with a whitespace character (` `). This tends to produce better results due to the tokenization we use.
See https://platform.openai.com/docs/guides/fine-tuning/preparing-your-dataset for more details

Based on the analysis we will perform the following actions:
- [Necessary] Your format `CSV` will be converted to `JSONL`
- [Recommended] Remove 403 duplicate rows [Y/n]: Y
- [Recommended] Add a suffix separator ` ->` to all prompts [Y/n]: Y
c:\users\gpipis\anaconda3\envs\openai\lib\site-packages\openai\validators.py:222: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  x["prompt"] += suffix
- [Recommended] Add a whitespace character to the beginning of the completion [Y/n]: Y
c:\users\gpipis\anaconda3\envs\openai\lib\site-packages\openai\validators.py:421: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  x["completion"] = x["completion"].apply(
- [Recommended] Would you like to split into training and validation set? [Y/n]: Y

Your data will be written to a new JSONL file. Proceed [Y/n]: Y

Wrote modified files to `spam_with_right_column_names_prepared_train.jsonl` and `spam_with_right_column_names_prepared_valid.jsonl`
Feel free to take a look!
We answered “Y” to every question in order to proceed, and the tool finally generated a train and a validation dataset, along with the suggested command for fine-tuning the model.
Now use that file when fine-tuning:
> openai api fine_tunes.create -t "spam_with_right_column_names_prepared_train.jsonl" -v "spam_with_right_column_names_prepared_valid.jsonl" --compute_classification_metrics --classification_positive_class " ham"

After you’ve fine-tuned a model, remember that your prompt has to end with the indicator string ` ->` for the model to start generating completions, rather than continuing with the prompt. Make sure to include `stop=["am"]` so that the generated texts ends at the expected place.
Once your model starts training, it'll approximately take 2.11 hours to train a `curie` model, and less for `ada` and `babbage`. Queue will approximately take half an hour per job ahead of you.
Important note
As you may have noticed, the CLI data preparation tool has modified our data by adding a suffix to each prompt (` ->`) and a prefix to each completion (a whitespace character). Generally, the best practices for the input dataset are the following:
To fine-tune a model, you’ll need a set of training examples that each consist of a single input (“prompt”) and its associated output (“completion”). This is notably different from using our base models, where you might input detailed instructions or multiple examples in a single prompt.
- Each prompt should end with a fixed separator to inform the model when the prompt ends and the completion begins. A simple separator which generally works well is \n\n###\n\n. The separator should not appear elsewhere in any prompt.
- Each completion should start with a whitespace due to our tokenization, which tokenizes most words with a preceding whitespace.
- Each completion should end with a fixed stop sequence to inform the model when the completion ends. A stop sequence could be \n, ###, or any other token that does not appear in any completion.
- For inference, you should format your prompts in the same way as you did when creating the training dataset, including the same separator. Also specify the same stop sequence to properly truncate the completion.
Having said that, if we did not want to work with the CLI preparation tool, we would have to modify our data ourselves by adding a suffix to each prompt and a prefix to each completion, as in the sketch below.
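For instance, here is a minimal sketch of how we could apply the same transformations ourselves with pandas (assuming the columns are already named "prompt" and "completion"; the output file name is our own choice):

# mirror the CLI tool: drop exact duplicates, append the " ->" separator
# to every prompt, and prepend a whitespace to every completion
df = df.drop_duplicates()
df['prompt'] = df['prompt'] + ' ->'
df['completion'] = ' ' + df['completion']
df.to_json("spam_prepared_manually.jsonl", orient='records', lines=True)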
Fine-Tuning the Model
We have created a train and a validation dataset that can be used to train and evaluate our model. Since we are dealing with a classification task, it makes sense to work with the ada model. Note that the other available models are davinci, curie, and babbage.
Using the openai CLI, we can fine-tune the “ada” model by running the following command:
openai api fine_tunes.create -t "spam_with_right_column_names_prepared_train.jsonl" -v "spam_with_right_column_names_prepared_valid.jsonl" --compute_classification_metrics --classification_positive_class " spam" -m ada
Where:
- -t is the path of the train dataset
- -v is the path of the validation dataset
- We set --compute_classification_metrics in order to get a classification report
- --classification_positive_class sets the “positive” class, which in our case is “ spam”, since we are building a spam detector
- -m is the model, which in our case is ada
There are other available options, such as n_epochs (set to 4 by default), batch_size, learning_rate_multiplier, and so on.
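For example, a run that overrides some of these defaults could look as follows (the flag values here are purely illustrative, not tuned recommendations):

openai api fine_tunes.create -t "spam_with_right_column_names_prepared_train.jsonl" -v "spam_with_right_column_names_prepared_valid.jsonl" --compute_classification_metrics --classification_positive_class " spam" -m ada --n_epochs 2 --batch_size 8 --learning_rate_multiplier 0.1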
When we run this command, we get:
Upload progress: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 485k/485k [00:00<00:00, 498Mit/s]
Uploaded file from spam_with_right_column_names_prepared_train.jsonl: file-1T5R0Rr4T562mhwXIU7prTvf
Upload progress: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 117k/117k [00:00<00:00, 58.9Mit/s]
Uploaded file from spam_with_right_column_names_prepared_valid.jsonl: file-ab2LnsDV6W4tTBUtq5EXEY41
Created fine-tune: ft-sEQZBANQSg4NTNEodwHH2hNz
Streaming events until fine-tuning is complete...

(Ctrl-C will interrupt the stream, but not cancel the fine-tune)
[2023-04-12 16:00:51] Created fine-tune: ft-sEQZBANQSg4NTNEodwHH2hNz

Stream interrupted (client disconnected).
To resume the stream, run:

openai api fine_tunes.follow -i ft-sEQZBANQSg4NTNEodwHH2hNz
Once we run the command above, we have to wait around an hour for the model to be trained. In order to check the status, we can run:
openai api fine_tunes.follow -i ft-sEQZBANQSg4NTNEodwHH2hNz
And we get:
[2023-04-12 16:00:51] Created fine-tune: ft-sEQZBANQSg4NTNEodwHH2hNz
[2023-04-12 16:02:30] Fine-tune costs $0.16
[2023-04-12 16:02:30] Fine-tune enqueued. Queue number: 12
[2023-04-12 16:06:10] Fine-tune is in the queue. Queue number: 11
[2023-04-12 16:06:44] Fine-tune is in the queue. Queue number: 10
[2023-04-12 16:06:51] Fine-tune is in the queue. Queue number: 9
[2023-04-12 16:06:52] Fine-tune is in the queue. Queue number: 8
[2023-04-12 16:10:12] Fine-tune is in the queue. Queue number: 7
[2023-04-12 16:11:00] Fine-tune is in the queue. Queue number: 6
[2023-04-12 16:11:16] Fine-tune is in the queue. Queue number: 5
[2023-04-12 16:17:02] Fine-tune is in the queue. Queue number: 4
[2023-04-12 16:21:27] Fine-tune is in the queue. Queue number: 3
[2023-04-12 16:21:55] Fine-tune is in the queue. Queue number: 2
[2023-04-12 16:22:01] Fine-tune is in the queue. Queue number: 1
[2023-04-12 16:31:35] Fine-tune is in the queue. Queue number: 0
[2023-04-12 16:31:37] Fine-tune started
[2023-04-12 16:35:35] Completed epoch 1/4
[2023-04-12 16:38:59] Completed epoch 2/4
[2023-04-12 16:42:39] Completed epoch 3/4
[2023-04-12 16:46:19] Completed epoch 4/4
[2023-04-12 16:46:58] Uploaded model: ada:ft-persadonlp-2023-04-12-13-46-58
[2023-04-12 16:47:00] Uploaded result file: file-DZcArmsGPse796GhTWZgf59Y
[2023-04-12 16:47:00] Fine-tune succeeded

Job complete! Status: succeeded 🎉
Try out your fine-tuned model:

openai api completions.create -m ada:ft-persadonlp-2023-04-12-13-46-58 -p <YOUR_PROMPT>
As we can see, it took around 57 minutes for the model to be trained, and it cost us $0.16. The name of the model is ada:ft-persadonlp-2023-04-12-13-46-58. Finally, we can make predictions by running the following command on the CLI:
openai api completions.create -m ada:ft-persadonlp-2023-04-12-13-46-58 -p <YOUR_PROMPT>
Evaluate the Model
We can evaluate the model by looking at the classification report. We can download the classification report and save it as a CSV file called “result.csv” by running:
openai api fine_tunes.results -i ft-sEQZBANQSg4NTNEodwHH2hNz > result.csv
Then, we can get the classification report by running the following Python commands in our Jupyter notebook:
results = pd.read_csv('result.csv')
results[results['classification/accuracy'].notnull()].tail(1)
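The results file contains one row per training step, so we can also inspect how the validation accuracy evolved during training. A small sketch, assuming matplotlib is installed and that the file exposes a step column (as the legacy results files do):

import matplotlib.pyplot as plt

# keep only the steps where classification metrics were computed
acc = results[results['classification/accuracy'].notnull()]
plt.plot(acc['step'], acc['classification/accuracy'])
plt.xlabel('step')
plt.ylabel('validation accuracy')
plt.show()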
Make Predictions
Let’s see how we can make predictions.
import pandas as pd
import openai
import os

openai.api_key = os.getenv('OPENAI_API_KEY')

# load the validation dataset
test = pd.read_json('spam_with_right_column_names_prepared_valid.jsonl', lines=True)
test.head()
Let’s make a prediction for the first prompt:
ft_model = 'ada:ft-persadonlp-2023-04-12-13-46-58'
res = openai.Completion.create(model=ft_model, prompt=test['prompt'][0] + ' ->', max_tokens=1, temperature=0)
res['choices'][0]['text']
We can also wrap the call in a function and apply it to every row of the pandas DataFrame:
ft_model = 'ada:ft-persadonlp-2023-04-12-13-46-58'

def ham_spam(text):
    # add the suffix ' ->' to the prompt
    input_prompt = text + ' ->'
    response = openai.Completion.create(model=ft_model, prompt=input_prompt, max_tokens=1, temperature=0)
    output = response['choices'][0]['text']
    return output

# get predictions for the test dataset
test['predictions'] = test['prompt'].apply(lambda x: ham_spam(x))
test
The accuracy on the test dataset is equal to 99.4%:
import numpy as np

np.mean(test.completion == test.predictions)
We get:
0.994
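Beyond plain accuracy, we can also look at per-class precision and recall, for example with scikit-learn (a sketch, assuming scikit-learn is installed):

from sklearn.metrics import classification_report

# both columns carry the leading whitespace added during preparation (" ham", " spam")
print(classification_report(test.completion, test.predictions))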
If we want to return the log probabilities, we can run:
ft_model = 'ada:ft-persadonlp-2023-04-12-13-46-58'
res = openai.Completion.create(model=ft_model, prompt=test['prompt'][0] + ' ->', max_tokens=1, temperature=0, logprobs=2)
res['choices'][0]['logprobs']['top_logprobs'][0]
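Since the API returns log probabilities, converting them to plain probabilities is a one-liner. A small sketch:

import numpy as np

# top_logprobs[0] maps each candidate token of the first completion position
# to its log probability; exponentiate to get probabilities
top_logprobs = res['choices'][0]['logprobs']['top_logprobs'][0]
probs = {token: np.exp(lp) for token, lp in top_logprobs.items()}
print(probs)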
How to get a list of all fine-tune jobs
We can get a list of all fine-tune jobs by running the following command on the openai CLI:
openai api fine_tunes.list
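The same list can also be retrieved from Python with the legacy 0.x openai library that this tutorial uses (a sketch; the printed fields are just the ones we found useful):

import openai

# list all fine-tune jobs of the organization
jobs = openai.FineTune.list()
for job in jobs['data']:
    print(job['id'], job['status'], job.get('fine_tuned_model'))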
How to delete a fine-tuned model
To delete a fine-tuned model, you must be designated an “owner” within your organization. If you have the necessary rights, you can delete the model as follows:
openai api models.delete -i <FINE_TUNED_MODEL>
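The equivalent call from Python with the legacy 0.x library should look like the following sketch (replace the placeholder with your own model name):

import openai

# e.g. 'ada:ft-persadonlp-2023-04-12-13-46-58'
openai.Model.delete("<FINE_TUNED_MODEL>")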
Closing Remarks
OpenAI is not only a powerful tool with advanced large language models; it also allows us to fine-tune the existing models according to our needs. In this tutorial, we presented a simple classification task. Similarly, we can build custom models for sentiment analysis and other classification tasks with more than two classes. Finally, we can apply the same logic to build other models for NLG, paraphrasing, question answering and so on. The requirement is always to have a prompt and a completion.