Predictive Hacks

How to Paraphrase Documents using Transformers


In this tutorial, we show how to paraphrase a whole document using Transformers. We work in Colab so that you can easily follow along. We will programmatically extract the blog post “How to get a Data Science Job without Experience” with the “newspaper3k” library and then generate a paraphrased version of it. Let’s start by installing the required libraries.

!pip install transformers
!pip install torch
!pip install sentencepiece
!pip install newspaper3k

Running the cell above in Colab prints the installation log for each package.

Extract the Web Document

We can easily extract the text of the document and store it in a variable called input_text.

from newspaper import Article
 
# enter the required URL
url = 'https://predictivehacks.com/how-to-get-a-data-science-job-without-experience/'
 
article = Article(url)
article.download()
article.parse()
 
# Get the article text:
input_text = article.text

T5 Model

For paraphrasing, we can use a T5 model, specifically “Vamsi/T5_Paraphrase_Paws”.

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("Vamsi/T5_Paraphrase_Paws")  
model = AutoModelForSeq2SeqLM.from_pretrained("Vamsi/T5_Paraphrase_Paws")

Paraphrase a Sentence

Let’s look at an example of a paraphrased sentence. Note that the model works better when the input starts with the prompt “paraphrase: ” and ends with the end-of-sequence token “</s>”.

We will generate 5 paraphrased sentences.

# input text
sentence = "Remote work may also enhance work-life balance – because employees have more control over their work schedule, it’s easier for them to take care of personal errands in the morning or during lunch hour."

# add the task prefix and the end-of-sequence token expected by this model
sentence = "paraphrase: " + sentence + " </s>"
encoding = tokenizer.encode_plus(sentence, padding=True, return_tensors="pt")
input_ids, attention_masks = encoding["input_ids"], encoding["attention_mask"]


outputs = model.generate(
    input_ids=input_ids, attention_mask=attention_masks,
    max_length=256,
    do_sample=True,
    top_k=120,
    top_p=0.95,
    early_stopping=True,
    num_return_sequences=5
)

for output in outputs:
    line = tokenizer.decode(output, skip_special_tokens=True, clean_up_tokenization_spaces=True)
    print(line)

Output:

Remote work can also improve the work-life balance – because employees have more control over their work schedule, it is easier for them to take care of their personal errands in the morning or during lunch hour.

Remote work may also enhance work-life balance – because employees have more control over their work schedule it’s easier for them to take care of personal errands in the morning or during lunch hour.

Remote work may also improve work-life balance – because employees have more control over their work schedule, it is easier for them to take care of personal chores in the morning or during lunch hour.

Remote work can also enhance work-life balance - because employees have more control over their work schedule, it's easier for them to take care of personal errands in the morning or during lunch hour.

Remote work may also improve work-life balance as employees have more control over their work schedule, it is easier for them to take care of personal errands in the morning or during lunch hour.
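Because the candidates are sampled, some of them can come back almost unchanged (the second one above only drops a comma). One option is to rank the candidates by how much they differ from the input and keep the most reworded one. The following is a minimal sketch using `difflib.SequenceMatcher` from the Python standard library; the helper names `similarity` and `pick_most_different` are hypothetical and not part of the original tutorial, and character-level similarity is only a rough proxy for how much a sentence was rephrased.

```python
from difflib import SequenceMatcher
from typing import List


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]; 1.0 means identical strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def pick_most_different(source: str, candidates: List[str]) -> str:
    """Return the candidate that is least similar to the source sentence."""
    if not candidates:
        raise ValueError("candidates must not be empty")
    return min(candidates, key=lambda c: similarity(source, c))


# Toy data for illustration (assumption: not produced by the model).
source = "Remote work may also enhance work-life balance."
candidates = [
    "Remote work may also enhance work-life balance.",
    "Working remotely can also improve the balance between work and life.",
]
best = pick_most_different(source, candidates)
print(best)
```

With the tutorial's code, the candidates would be the decoded `line` strings collected into a list instead of being printed one by one.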

Paraphrase a Document

We can paraphrase a whole document, but we need to split it into sentences first, since the model cannot take a long document as input. For that, we will use the NLTK sentence tokenizer.

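import nltk
from nltk.tokenize import sent_tokenize

# download the sentence tokenizer data; recent NLTK releases may ask for
# 'punkt_tab' instead (download it if sent_tokenize raises a LookupError)
nltk.download('punkt')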

# create a function for the paraphrase
def my_paraphrase(sentence):
    # add the task prefix and the end-of-sequence token expected by this model
    sentence = "paraphrase: " + sentence + " </s>"
    encoding = tokenizer.encode_plus(sentence, padding=True, return_tensors="pt")
    input_ids, attention_masks = encoding["input_ids"], encoding["attention_mask"]

    # sample a single paraphrase for the sentence
    outputs = model.generate(
        input_ids=input_ids, attention_mask=attention_masks,
        max_length=256,
        do_sample=True,
        top_k=120,
        top_p=0.95,
        early_stopping=True,
        num_return_sequences=1
    )
    output = tokenizer.decode(outputs[0], skip_special_tokens=True, clean_up_tokenization_spaces=True)

    return output


# join the paraphrased sentences

output = " ".join([my_paraphrase(sent) for sent in sent_tokenize(input_text)])

print(output)

Output:

What is the “working experience” in practice? Let’s suppose that you are a candidate who has recently graduated from the university with a Bsc/Msc degree in Mathematics, Computer Science, Engineering or other related fields and would like to start a career as a Data Scientist. The grades and degree show your theoretical background and your ability to learn new things. In the real world however, things are different than the University, not necessarily more difficult and somehow the hiring managers would like to know more about you and especially about how you work. A related “working experience” proves that you can do the job you are given to do. You have reliability to your science and you are good at your job. Similarly, a candidate with related work experience has a lower risk for the company. However, as we have said, you lack work experience so let’s see how you can prove to them that you are trustworthy and good at your job so that they can trust you, i.e.. You hire! We live in a era of ratings and reviews. Don’t look at the reviews on Booking.com and Airbnb when planning where to stay, don’t look at TripAdvisor and Google reviews when choosing a restaurant to eat, don’t look at the reviews of a film before you watch it, and so on and so forth. The same applies to companies. They want a review about you. It is common to ask for references before they hire someone, and also they prefer to hire someone who is a referral from an existing employee. This occurs because they want to learn more about their potential colleague before hiring them. Let us see how you can get “reviews” from the market and show your work in parallel. Stack Overflow Nobody can ignore a user with a high reputation in Stack Overflow. It clearly indicates that you are really passionate about your science, can be a good team player by providing solutions to your team, you are competent in your field and contribute to your community. 
In addition, you give everyone the opportunity to examine your profile and work. Blogging You will get the attention of the hiring manager if you write in a blog. Let’s assume you write in medium and you have many followers and shouts. Having many followers implies that some people have found your article helpful. It is like a positive review obtained from the market and the community. This is something that you can use as a work experience and can also be a show-case of what you can do. Also, your articles reflect somehow your working style and your area of interest and expertise. Freelancer in a Web platform There are many platforms such as Upwork, Fiverr and PeoplePerHour where you can undertake data science projects. This is a good start and can be used as a related working experience. In addition, in these platforms, you receive ratings and reviews from your clients. A strong profile in such platforms proves your ability to deliver in high quality. It also indicates your ability to communicate with the clients, to be reliable and to have self-knowledge. GitHub Account I would recommend having a GitHub account which is like the Data Scientist portfolio. By sharing your profile you give the possibility for them to see how you code, the projects you have done and so on. Kaggle Account Kaggle is the world’s largest data science community with powerful tools and resources to help you achieve your data science goals. If you participate in many competitions and have a relatively good score it indicates that you are good in data science, that you have dealt with real-life problems and have been faced with many different problems. If you lack working experience and see that in most positions for it, do not get disappointed. If you love your work and you are good at it, then it is a matter of time until you will join the industry. The most dangerous part of having gap in your career is having. 
For that reason, try to invest in the things that we mentioned above during the time you do not work and are looking for work. By doing this you will improve your skills such as writing, coding, etc., you will get hands-on experience and at the same time you can use them as a related working experience. Thus, a candidate with 1000 points reputation in Stack Overflow, labeled as the top freelancer in the web platforms with 500 followers on Medium and with an active Kaggle profile will stand out and will be asked for an interview where they also assess other things like cultural fit, interpersonal skills etc.
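Joining every sentence with a single space loses the paragraph structure, which is why section titles such as “Stack Overflow” and “Blogging” run into the following sentence in the output above. The following is a minimal sketch of one way to keep paragraph breaks, assuming the extracted text separates paragraphs with blank lines. The function `paraphrase_document` and the stand-in `naive_split` are hypothetical names introduced here for illustration; the paraphrasing and sentence-splitting functions are passed in as arguments so the sketch runs without the model or NLTK.

```python
import re
from typing import Callable, List


def paraphrase_document(
    text: str,
    paraphrase_fn: Callable[[str], str],
    split_fn: Callable[[str], List[str]],
) -> str:
    """Paraphrase text sentence by sentence while keeping paragraph breaks."""
    # Assumption: paragraphs are separated by one blank line.
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    rewritten = []
    for paragraph in paragraphs:
        sentences = split_fn(paragraph)
        rewritten.append(" ".join(paraphrase_fn(s) for s in sentences))
    return "\n\n".join(rewritten)


def naive_split(paragraph: str) -> List[str]:
    """Stand-in sentence splitter: break after ., ! or ? followed by whitespace."""
    return re.split(r"(?<=[.!?])\s+", paragraph)


# Demo with str.upper standing in for the model-based paraphrase function.
demo = "Stack Overflow\n\nFirst point. Second point."
result = paraphrase_document(demo, str.upper, naive_split)
print(result)
```

With the tutorial's own names, the call would be `paraphrase_document(input_text, my_paraphrase, sent_tokenize)`. Note that short heading lines are still passed to the model as if they were sentences, so they may be reworded too.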
