
Document Splitting with LangChain

In this tutorial, we will look at different ways to split loaded documents into smaller chunks using LangChain. This step is trickier than it sounds: if, say, a question appears in one chunk and its answer ends up in another, retrieval models will struggle. There is a lot of nuance and significance in how you split the chunks, because you want semantically related content to stay grouped together. The core principle behind all text splitters in LangChain is to divide the text into chunks of a certain size, with some overlap between adjacent chunks.

Chunk size refers to the size of a section of text, which can be measured in various ways, such as characters or tokens. Chunk overlap is a small shared region between two adjacent chunks, which helps preserve context across chunk boundaries. Text splitters in LangChain offer methods to create and split documents, with different interfaces for raw text and for lists of documents. The various types of splitters differ in how they split the chunks and in how they measure chunk length. Some splitters use smaller models to identify sentence endings and split there. Metadata consistency is also important when splitting chunks, and certain splitters focus on this aspect. Finally, the splitting method may depend on the document type, which is particularly evident when splitting code.
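To make the two interfaces concrete, here is a minimal sketch (the chunk sizes and sample text are arbitrary placeholders): split_text takes a plain string and returns a list of strings, while create_documents takes a list of strings and returns a list of Document objects, optionally attaching metadata.

from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=100, chunk_overlap=20)

# split_text: string in, list of strings out
chunks = splitter.split_text("some long text that we want to split ...")

# create_documents: list of strings in, list of Documents out
docs = splitter.create_documents(
    ["some long text that we want to split ..."],
    metadatas=[{"source": "example"}]
)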

The main types of splitters are:

  • RecursiveCharacterTextSplitter(): Splitting text by recursively trying a list of characters until the chunks are small enough
  • CharacterTextSplitter(): Splitting text that looks at a single character
  • MarkdownHeaderTextSplitter(): Splitting markdown files based on specified headers (a short sketch follows this list)
  • TokenTextSplitter(): Splitting text that looks at tokens
  • SentenceTransformersTokenTextSplitter(): Splitting text that looks at tokens, as counted by a sentence-transformers model
  • Language(): For programming languages like Python, Ruby, etc.
  • NLTKTextSplitter(): Splitting text that looks at sentences using NLTK
  • SpacyTextSplitter(): Splitting text that looks at sentences using SpaCy
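Since the markdown splitter works a bit differently from the character-based ones, here is a short hedged sketch of how it can be used (the headers and sample document are made up for illustration). It returns Document objects whose metadata records the headers each chunk was found under.

from langchain.text_splitter import MarkdownHeaderTextSplitter

markdown_document = "# Title\n\n## Chapter 1\n\nHi this is Jim\n\n## Chapter 2\n\nHi this is Bob"

# map each markdown header level to a metadata key
headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
]

markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
md_header_splits = markdown_splitter.split_text(markdown_document)
# each split is a Document, e.g. metadata={'Header 1': 'Title', 'Header 2': 'Chapter 1'}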

Let’s take a look at some splitters.

Recursive Character Text Splitter

At this point, we will introduce the recursive character text splitter, which is suitable for generic text.

from langchain.text_splitter import RecursiveCharacterTextSplitter

# define the chunk size and the chunk overlap
chunk_size = 5
chunk_overlap = 2


# initialize the splitter
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap
)
 

Let’s see an example to understand how it works.

text1 = '0123456789'

r_splitter.split_text(text1)
 

Output:

['01234', '34567', '6789']

As we can see, it creates chunks of 5 characters with an overlap of 2, just as we requested above. Keep in mind that the last chunk has only 4 characters instead of 5 because there were no more characters left.

Character Text Splitter

The character text splitter splits on a single character, and by default that character is the newline. Our toy example, however, contains no newlines. Let’s define a new text whose characters are separated by spaces, and set the separator to a space as well.

from langchain.text_splitter import CharacterTextSplitter

text2 = '0 1 2 3 4 5 6 7 8 9'

c_splitter = CharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap,
    separator = ' '
)
c_splitter.split_text(text2)

Output:

['0 1 2', '2 3 4', '4 5 6', '6 7 8', '8 9']
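For contrast, here is a quick sketch of what happens if we keep the default separator on the same text: there is no newline to split on, so the whole string comes back as a single oversized chunk (LangChain also logs a warning that the chunk exceeds the requested size).

# same chunk settings, but with the default separator
c_default = CharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap
)
c_default.split_text(text2)
# ['0 1 2 3 4 5 6 7 8 9']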

Let’s continue with more real-world examples.

some_text = """When writing documents, writers will use document structure to group content. \
This can convey to the reader, which idea's are related. For example, closely related ideas \
are in sentances. Similar ideas are in paragraphs. Paragraphs form a document. \n\n  \
Paragraphs are often delimited with a carriage return or two carriage returns. \
Carriage returns are the "backslash n" you see embedded in this string. \
Sentences have a period at the end, but also, have a space.\
and words are separated by space."""
 
len(some_text)

Output:

496

The length of the text above is 496. Let’s define our character splitters.

c_splitter = CharacterTextSplitter(
    chunk_size=450,
    chunk_overlap=0,
    separator = ' '
)
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=450,
    chunk_overlap=0, 
    separators=["\n\n", "\n", " ", ""]
)
 

As you can see, for the recursive character splitter we provide a list of separators; these are the default values, shown here explicitly for clarity. The list includes double newline, single newline, space, and the empty string. When splitting text, the splitter follows this sequence: it first attempts to split on double newlines, then on single newlines if necessary, then on spaces, and finally, if needed, it splits character by character.

Let’s see what output we get for each case:

c_splitter.split_text(some_text)
 

Output:

['When writing documents, writers will use document structure to group content. This can convey to the reader, which idea\'s are related. For example, closely related ideas are in sentances. Similar ideas are in paragraphs. Paragraphs form a document. \n\n Paragraphs are often delimited with a carriage return or two carriage returns. Carriage returns are the "backslash n" you see embedded in this string. Sentences have a period at the end, but also,',
 'have a space.and words are separated by space.']
And the recursive splitter:

r_splitter.split_text(some_text)
 

Output:

["When writing documents, writers will use document structure to group content. This can convey to the reader, which idea's are related. For example, closely related ideas are in sentances. Similar ideas are in paragraphs. Paragraphs form a document.",
 'Paragraphs are often delimited with a carriage return or two carriage returns. Carriage returns are the "backslash n" you see embedded in this string. Sentences have a period at the end, but also, have a space.and words are separated by space.']

Examining how these two methods perform on this text, we see that the character text splitter splits on spaces alone, producing an odd break in the middle of a sentence. The recursive text splitter, by contrast, first tries to split on double newlines, neatly producing two paragraphs. Even though the first paragraph is shorter than the specified 450 characters, this split is likely preferable: each paragraph stays intact in its own chunk rather than being cut mid-sentence.

Tip: How to add a period as a separator.

In order to add a period as a separator, you need to pass it as a positive lookbehind regular expression, like (?<=\. ). The next two examples will help you understand the difference.

Pass the period separator without the positive lookbehind:

r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=150,
    chunk_overlap=0,
    separators=["\n\n", "\n", "\. ", " ", ""]
)
r_splitter.split_text(some_text)
 

Output:

["When writing documents, writers will use document structure to group content. This can convey to the reader, which idea's are related",
 '. For example, closely related ideas are in sentances. Similar ideas are in paragraphs. Paragraphs form a document.',
 'Paragraphs are often delimited with a carriage return or two carriage returns',
 '. Carriage returns are the "backslash n" you see embedded in this string',
 '. Sentences have a period at the end, but also, have a space.and words are separated by space.']

Pass the period separator with the positive lookbehind:

r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=150,
    chunk_overlap=0,
    separators=["\n\n", "\n", "(?<=\. )", " ", ""]
)
r_splitter.split_text(some_text)
 

Output:

["When writing documents, writers will use document structure to group content. This can convey to the reader, which idea's are related.",
 'For example, closely related ideas are in sentances. Similar ideas are in paragraphs. Paragraphs form a document.',
 'Paragraphs are often delimited with a carriage return or two carriage returns.',
 'Carriage returns are the "backslash n" you see embedded in this string.',
 'Sentences have a period at the end, but also, have a space.and words are separated by space.']
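Note that in more recent LangChain releases, separators are treated as literal strings by default. If the lookbehind pattern has no effect in your version, you likely need the is_separator_regex flag (a hedged sketch; check that the parameter exists in your installed version):

r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=150,
    chunk_overlap=0,
    separators=["\n\n", "\n", "(?<=\. )", " ", ""],
    is_separator_regex=True  # treat the separators as regular expressions
)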

Split PDF Documents

Let’s see how splitting works when we are dealing with PDF documents. First, we load the PDF file.

from langchain.document_loaders import PyPDFLoader
loader = PyPDFLoader("my_file.pdf")
pages = loader.load()
 

Then, we define the splitter.

from langchain.text_splitter import CharacterTextSplitter
text_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=1000,
    chunk_overlap=150
)
 

Finally, we split the loaded pages into documents.

docs = text_splitter.split_documents(pages)
 

Since we are now working with documents rather than raw strings, we use the split_documents method and provide the list of documents as input. Comparing the number of resulting documents with the number of original pages, it’s evident that the splitting process generated significantly more documents. You can verify this by running the following commands:

print(len(pages))
# we get 20

print(len(docs))
# we get 80
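Each chunk produced by split_documents is still a Document and keeps the metadata of the page it came from, so you can always trace a chunk back to its source. A quick check (the printed metadata below is illustrative):

# inspect the metadata carried over from the source page
print(docs[0].metadata)
# e.g. {'source': 'my_file.pdf', 'page': 0}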

Token Splitting

There’s an alternative method of splitting text based on tokens, and for this purpose we’ll import the token text splitter. This approach is valuable because large language models (LLMs) often have context windows defined in terms of tokens, so understanding tokens and where they fall is crucial. By splitting on tokens, we get a slightly more accurate picture of how the LLM will see the text, which makes the distinction between tokens and characters important.

To gain a better understanding of the contrast between tokens and characters, let’s initialize the token text splitter with a chunk size of 1 and a chunk overlap of 0. This configuration will divide any text into a list of individual tokens.

Now, let’s make up a playful text. Upon splitting it, we’ll see that it breaks down into tokens of varying lengths and character counts: the first token is “foo”, followed by “ bar” (note the leading space), then “ b”, “az”, “zy”, and “foo” again. This demonstrates how token boundaries differ from character and word boundaries.

from langchain.text_splitter import TokenTextSplitter
text_splitter = TokenTextSplitter(chunk_size=1, chunk_overlap=0)

text1 = "foo bar bazzyfoo"
text_splitter.split_text(text1)
 

Output:

['foo', ' bar', ' b', 'az', 'zy', 'foo']
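In practice, you would use a much larger chunk size than 1. As a hedged sketch (the numbers are arbitrary), we could split the PDF pages loaded earlier on tokens instead of characters:

# token-based splitting of the pages loaded with PyPDFLoader above
token_splitter = TokenTextSplitter(chunk_size=256, chunk_overlap=20)
token_docs = token_splitter.split_documents(pages)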

