In this article, I will discuss how to extract text from a PDF in Python using a package called PDFPlumber.
Let's dive straight into how to use PDFPlumber for getting text out of a PDF.
PDFPlumber can be installed using pip, the package manager for Python. Open the Terminal on macOS or the Command Prompt on Windows and type pip install pdfplumber. This downloads and installs pdfplumber on your system. (If you have multiple versions of pip installed, I recommend using pip3 instead, that is, pip3 install pdfplumber.)
To make sure that the pdfplumber package has been installed, open the Python interpreter by typing python3 into the Terminal on macOS or the Command Prompt on Windows (on Windows the command may be python if python3 is not recognized). Then type import pdfplumber and press Enter. If no error appears, pdfplumber has been installed properly.
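The same check can also be run as a short script rather than typed into the interpreter. This is only a minimal sketch: the helper name is_installed is my own and not part of pdfplumber, and it relies solely on the standard library's importlib module.

```python
import importlib.util


def is_installed(package_name):
    """Return True if a top-level package with this name can be imported."""
    # find_spec() returns None when Python cannot locate the package
    return importlib.util.find_spec(package_name) is not None


if is_installed("pdfplumber"):
    print("pdfplumber is installed")
else:
    print("pdfplumber is not installed; try: pip install pdfplumber")
```

Unlike a bare import, this prints a hint instead of stopping with an error when the package is missing.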
Using PDFPlumber for Extracting Text Out of PDF
First, the pdfplumber package needs to be imported into the Python environment. Create a Python file (a file that ends with the .py extension) and put import pdfplumber as the first line of code in that file. This brings in all the functionality of the pdfplumber package, so the rest of the code can use the functions that pdfplumber offers.
So far, the Python file contains just one line: import pdfplumber.
After importing pdfplumber, the next step is to load the PDF from which text is to be extracted. For this, the pdfplumber package provides the pdfplumber.open(x) function, where x can be a path to a PDF file, a file object loaded as bytes, or a file-like object loaded as bytes.
The pdfplumber.open(x) function loads the PDF as an instance of the pdfplumber.PDF class, which is essentially an object representing the PDF loaded by pdfplumber.
This object has two main properties: metadata and pages. The metadata property contains information such as the creation date and modification date of the PDF, while the pages property is a list containing one instance of the pdfplumber.Page class per page. In simple words, pages is a list of all the pages in the PDF.
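To make the two properties concrete, here is a minimal sketch that opens a PDF and reports both of them. The names summarize_pdf and main are my own illustrative choices, not part of pdfplumber, and test.pdf is assumed to be a PDF sitting next to the script.

```python
def summarize_pdf(pdf):
    """Return the metadata and page count of an opened pdfplumber.PDF object."""
    return {
        "metadata": pdf.metadata,      # dictionary of metadata key/value pairs
        "page_count": len(pdf.pages),  # pages is a list, one entry per page
    }


def main(path="test.pdf"):
    # Imported here rather than at the top of the file so that summarize_pdf()
    # can be tried on any object that has .metadata and .pages
    import pdfplumber

    # The with-block closes the PDF once we are done with it
    with pdfplumber.open(path) as pdf:
        summary = summarize_pdf(pdf)

    print("Metadata:", summary["metadata"])
    print("Number of pages:", summary["page_count"])
```

Calling main() with the path to your own file prints the metadata dictionary followed by the number of pages.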
If the PDF is locked with a password, pdfplumber can still read it. Just pass the password parameter in addition to the file path when calling pdfplumber.open(x).
- pdfplumber.open(x) if the PDF is not locked with a password
- pdfplumber.open(x, password="password of PDF") if the PDF is locked with a password
Now that we have read in the PDF as an instance of the pdfplumber.PDF class, let's see how to get text out of each page. The pages property is a list with one entry per page, so individual pages can be picked out by index, starting from 0:
- first_page_of_pdf = pdf.pages[0]
- second_page_of_pdf = pdf.pages[1]
And so on until the end of the PDF. Do note that pdf above is the instance of the pdfplumber.PDF class returned by pdfplumber.open(x), and first_page_of_pdf is an instance of the pdfplumber.Page class.
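Because list indices start at 0 while people usually count pages from 1, an off-by-one slip is easy to make here. The following sketch wraps the lookup in a small helper. get_page is a hypothetical name of mine, not a pdfplumber function, and pdf is assumed to be the object returned by pdfplumber.open(x).

```python
def get_page(pdf, page_number):
    """Return the page with the given 1-based number from an opened PDF."""
    total = len(pdf.pages)
    if not 1 <= page_number <= total:
        raise IndexError(f"PDF has {total} pages, there is no page {page_number}")
    # List indices start at 0, so page 1 lives at index 0
    return pdf.pages[page_number - 1]
```

With this helper, get_page(pdf, 1) returns the same object as pdf.pages[0], and asking for a page beyond the end of the PDF raises an IndexError with a readable message.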
So first_page_of_pdf and second_page_of_pdf are instances of the pdfplumber.Page class, which has a method called extract_text() that returns the text of a page. Let's have a look at what the code for extracting text out of a PDF using PDFPlumber looks like.
import pdfplumber

with pdfplumber.open("test.pdf") as pdf:
    # pages is a list, so index 0 is the first page
    first_page_of_pdf = pdf.pages[0]
    # extract_text() returns the text of that page
    print(first_page_of_pdf.extract_text())
Similarly, text from the other pages in the PDF can be extracted by changing the index, for example pdf.pages[1] for the second page.
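Rather than indexing pages one at a time, you can loop over the whole pages list. The sketch below is one way to do that. The names extract_all_text and main are my own and not part of pdfplumber, and test.pdf is again an assumed file name. Depending on the pdfplumber version, extract_text() may give back None or an empty string for a page without extractable text, so the code treats both as an empty string.

```python
def extract_all_text(pdf):
    """Return a list with the text of every page of an opened PDF, in page order."""
    texts = []
    for page in pdf.pages:
        # Fall back to "" when a page yields no text (None or empty string)
        texts.append(page.extract_text() or "")
    return texts


def main(path="test.pdf"):
    # Imported here so extract_all_text() can be tried on any object whose
    # .pages entries offer an extract_text() method
    import pdfplumber

    with pdfplumber.open(path) as pdf:
        texts = extract_all_text(pdf)

    for number, text in enumerate(texts, start=1):
        print(f"--- Page {number} ---")
        print(text)
```

Calling main() prints the text of each page under a small header carrying its page number.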