How to Validate Text in PDF Files using Selenium

What is a PDF file?

Portable Document Format (PDF) is a file format developed by Adobe in 1992 to present documents, including text formatting and images, in a way that looks the same on any device. PDF is widely used for official and business-critical documents because a file can be restricted so that only its owner can modify it while anyone can still open and read it, unlike easily editable formats such as Word and plain-text files.

Table of Contents

Why is verifying PDF file content required?
What is Apache PDFBox?
How to integrate PDFBox with Selenium and Java
How to read content from PDF file using Apache PDFBox
How to validate contents of PDF file hosted on the web
How to assert PDF Text
How to set the start and end page of PDF for extraction?
How to validate contents of PDF opened in another browser tab
How to validate contents of already downloaded PDF in the Downloads folder

Why is verifying PDF file content required?

Almost every organization/business uses PDF files to save their official data. Consider a simple use case: many websites have links that, when clicked, either open a PDF in the browser's built-in viewer or download it to the local system, depending on how the browser is configured to handle PDF files.

When it comes to testing these PDF files, you can do that by manually opening the link or opening the PDF file from the local system and verifying whether particular information is available or not. However, verifying the contents of PDF files at scale becomes cumbersome; hence, automation is a must.

What is Apache PDFBox?

Selenium does not have any built-in functionality to test the content of PDF files, so it needs a third-party library such as Apache PDFBox.

It is an open-source Java tool and can be used with Selenium Java and TestNG to assert the content of PDF. Apache PDFBox allows the creation of new PDF documents, manipulation of existing documents, and the ability to extract content from documents.

This article explores content extraction from PDF with Selenium Automation using Apache PDFBox.

How to integrate PDFBox with Selenium and Java

The Apache PDFBox library can be downloaded and added as an external library in Eclipse or any other IDE of your choice. It can also be added as a Maven dependency in pom.xml.

Downloading jars and adding as an external jar:

Download the Apache PDFBox JAR

Note: This article uses version 2.0.26. You can browse all released versions at https://repo1.maven.org/maven2/org/apache/pdfbox/pdfbox/. The code in this article is written against the 2.0.x API; PDFBox 3.x replaces PDDocument.load() with Loader.loadPDF(), so the examples need adjusting if you pick a 3.x release.

Download the Apache FontBox JAR
Add the downloaded JARs using the steps below:
- Right-click the project in Eclipse, then select Build Path > Configure Build Path.
- Open the Libraries tab and click “Add External JARs”.
- Select the downloaded JARs and click the Apply and Close button.

Adding as a Maven dependency:

Step 1 – Create a Maven project in Eclipse or any Java IDE by selecting the “maven-archetype-quickstart” archetype, and add the Selenium Java, WebDriverManager, and TestNG dependencies to pom.xml.

Step 2 – Copy the PDFBox dependency from https://mvnrepository.com/artifact/org.apache.pdfbox/pdfbox and add it under the <dependencies> tag in pom.xml. The complete <dependencies> section then looks like this:

<dependencies>
    <dependency>
        <groupId>org.seleniumhq.selenium</groupId>
        <artifactId>selenium-java</artifactId>
        <version>4.3.0</version>
    </dependency>
    <dependency>
        <groupId>io.github.bonigarcia</groupId>
        <artifactId>webdrivermanager</artifactId>
        <version>5.2.1</version>
    </dependency>
    <dependency>
        <groupId>org.apache.pdfbox</groupId>
        <artifactId>pdfbox</artifactId>
        <version>2.0.26</version>
    </dependency>
    <dependency>
        <groupId>org.testng</groupId>
        <artifactId>testng</artifactId>
        <version>7.6.1</version>
    </dependency>
</dependencies>

Step 3 – Save the pom.xml file to download all the dependencies from the Maven repository (this requires an active internet connection). To double-check, expand the Maven Dependencies folder and verify that the required JARs were downloaded.

Now that the jars are configured, let us start writing the test case in Selenium using Java and TestNG to extract content from PDF files.

How to read content from PDF file using Apache PDFBox

The PDFTextStripper class of the PDFBox library extracts (strips out) the text from a PDF file, as shown in the statement below:

String pdfContent=new PDFTextStripper().getText(doc);

To get the “doc” reference that is passed to the getText() method, you need a few more lines of code, explained step by step below.

Let us take the Google Cloud security whitepaper PDF as an example. Store the PDF URL in a String using the statement below:

String pdfUrl = "https://cloud.google.com/docs/security/infrastructure/design/resources/google_infrastructure_whitepaper_fa.pdf";

Create a URL object (from the java.net package) and pass pdfUrl as the parameter.

URL url =new URL(pdfUrl);

Use the openStream() method of the URL class to open a connection to this URL which returns an InputStream for reading from that connection.

InputStream is= url.openStream();

After this, wrap the InputStream in a BufferedInputStream by passing the InputStream object to its constructor.

BufferedInputStream bis=new BufferedInputStream(is);

Finally, use the PDDocument class to represent the PDF document. The load() method of PDDocument takes an InputStream object as a parameter, parses the PDF, and returns a PDDocument object.

PDDocument doc=PDDocument.load(bis);

Now we are good to use “doc” as a reference to PDFTextStripper().getText()

String pdfContent=new PDFTextStripper().getText(doc);

getText() method of PDFTextStripper is used to get the text of the document passed as a parameter and returns a String value.

Lastly, don’t forget to close the document, as the PDDocument guidelines recommend, so that the resources it holds are released.

doc.close();
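A URL that is expected to serve a PDF can sometimes return something else, such as an HTML error or login page, and the parser then fails with a less obvious message. The sketch below is one possible pre-check using only the JDK: it downloads the bytes and tests for the "%PDF-" signature that PDF files begin with. The class name PdfBytes and its method names are hypothetical and not part of PDFBox or Selenium.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.nio.charset.StandardCharsets;

// Hypothetical helper (not part of PDFBox): download a resource and
// check that it starts with the PDF file signature before parsing it.
public class PdfBytes {

    // PDF files begin with the ASCII marker "%PDF-" followed by a version.
    private static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    // Reads the whole resource into memory; the stream is closed automatically.
    public static byte[] fetch(String url) throws IOException {
        try (InputStream in = URI.create(url).toURL().openStream()) {
            return in.readAllBytes();
        }
    }

    // Returns true only if the data starts with the "%PDF-" signature.
    public static boolean looksLikePdf(byte[] data) {
        if (data == null || data.length < PDF_MAGIC.length) {
            return false;
        }
        for (int i = 0; i < PDF_MAGIC.length; i++) {
            if (data[i] != PDF_MAGIC[i]) {
                return false;
            }
        }
        return true;
    }
}
```

If the check passes, the same byte array can be given to the PDDocument.load(byte[]) overload in PDFBox 2.0.x instead of a stream, so the file is downloaded only once.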

How to validate contents of PDF file hosted on the web

To validate a PDF hosted on the web, navigate directly to its URL and verify the content using the steps explained in the previous section. The ReadPDF test class below combines those steps into a single example:

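import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;
import org.testng.Assert;
import org.testng.annotations.AfterTest;
import org.testng.annotations.BeforeTest;
import org.testng.annotations.Test;

import io.github.bonigarcia.wdm.WebDriverManager;

public class ReadPDF {
    WebDriver driver;
    String pdfUrl = "https://cloud.google.com/docs/security/infrastructure/design/resources/google_infrastructure_whitepaper_fa.pdf";

    @BeforeTest
    public void setUp() {
        WebDriverManager.chromedriver().setup();
        driver = new ChromeDriver();
        driver.manage().window().maximize();
        // Open the PDF in the browser, as a user following the link would
        driver.get(pdfUrl);
    }

    // Declaring IOException makes a failed download fail the test
    // instead of being swallowed by a catch block
    @Test
    public void verifyTextFromPDF() throws IOException {
        String pdfContent = getPdfContent(pdfUrl);
        Assert.assertTrue(pdfContent.contains("Secure low-level infrastructure"));
        Assert.assertTrue(pdfContent.contains("Security of physical premises"));
    }

    @AfterTest
    public void tearDown() {
        driver.quit();
    }

    public static String getPdfContent(String url) throws IOException {
        URL pdfURL = new URL(url);
        // try-with-resources closes the streams and the document even if extraction fails
        try (InputStream is = pdfURL.openStream();
             BufferedInputStream bis = new BufferedInputStream(is);
             PDDocument doc = PDDocument.load(bis)) {
            int pages = doc.getNumberOfPages();
            System.out.println("The total number of pages " + pages);
            PDFTextStripper strip = new PDFTextStripper();
            // Extract only pages 1 and 2
            strip.setStartPage(1);
            strip.setEndPage(2);
            String stripText = strip.getText(doc);
            System.out.println(stripText);
            return stripText;
        }
    }
}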

How to assert PDF Text

Now that we have received all the content from the PDF file, which is stored in a String object pdfContent, let us see how to assert whether the expected text is present in pdfContent String. You can use TestNG assertions like below to assert that a given text is present in PDF.

Assert.assertTrue(pdfContent.contains("Google Infrastructure Security"));

You can add as many assertions as you need. You may also use TestNG SoftAssert, which does not throw an exception when the first assertion fails (in case of multiple asserts); instead, it records every failure and reports them all when you call the assertAll() method of the SoftAssert class at the end of the test.
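Extracted PDF text usually contains line breaks wherever the page layout wraps, so a phrase that spans two lines may not match a plain contains() check. The following is a minimal sketch, using only the JDK, of one way to handle that: it collapses whitespace before comparing and collects every missing phrase rather than stopping at the first. The class PdfTextChecks and its methods are hypothetical helpers, not TestNG or PDFBox APIs.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for checking several expected phrases at once.
public class PdfTextChecks {

    // Collapses every run of whitespace (spaces, tabs, line breaks) into one space.
    public static String normalize(String text) {
        return text.replaceAll("\\s+", " ").trim();
    }

    // Returns the expected phrases that do NOT occur in the extracted text.
    public static List<String> missingPhrases(String pdfContent, String... expected) {
        String haystack = normalize(pdfContent);
        List<String> missing = new ArrayList<>();
        for (String phrase : expected) {
            if (!haystack.contains(normalize(phrase))) {
                missing.add(phrase);
            }
        }
        return missing;
    }
}
```

In a test, the returned list can be asserted once, for example with Assert.assertTrue(missing.isEmpty(), "Missing from PDF: " + missing), so a single failure message names every phrase that was not found.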

How to set the start and end page of PDF for extraction?

PDFBox can parse large PDF files as well. For example, if you are testing a PDF file of 40 pages and you are interested in parsing only limited pages, you can achieve that with setStartPage(int startPageValue) and setEndPage(int endPageValue) methods of PDFTextStripper class.

strip.setStartPage(x);

strip.setEndPage(y);

Here x and y are the first and last pages of the PDF that you need to extract. Page numbers start at 1, and both values are inclusive.
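When the page range comes from test data, it can be worth validating it against the document's page count first, so that a range lying outside the document is reported clearly instead of producing little or no text. Below is a small sketch of such a check in plain Java. The PageRange class is hypothetical; the total page count is assumed to come from doc.getNumberOfPages().

```java
// Hypothetical helper: fit a requested 1-based, inclusive page range
// to the pages that actually exist in the document.
public class PageRange {

    // Returns {start, end} clamped to [1, totalPages].
    // Throws if no page of the document falls inside the requested range.
    public static int[] clamp(int requestedStart, int requestedEnd, int totalPages) {
        if (totalPages < 1) {
            throw new IllegalArgumentException("Document has no pages");
        }
        int start = Math.max(1, requestedStart);
        int end = Math.min(totalPages, requestedEnd);
        if (start > end) {
            throw new IllegalArgumentException(
                "Pages " + requestedStart + "-" + requestedEnd
                + " are outside a document of " + totalPages + " pages");
        }
        return new int[] { start, end };
    }
}
```

The two values returned can then be passed to strip.setStartPage() and strip.setEndPage() respectively.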

Code to read a PDF from a given start page to a given end page:

public static String getPdfContent(String url) throws IOException {
    URL pdfURL = new URL(url);
    // try-with-resources closes the streams and the document even if extraction fails
    try (InputStream is = pdfURL.openStream();
         BufferedInputStream bis = new BufferedInputStream(is);
         PDDocument doc = PDDocument.load(bis)) {
        int pages = doc.getNumberOfPages();
        System.out.println("The total number of pages " + pages);
        PDFTextStripper strip = new PDFTextStripper();
        // Extract only pages 1 and 2
        strip.setStartPage(1);
        strip.setEndPage(2);
        String stripText = strip.getText(doc);
        System.out.println(stripText);
        return stripText;
    }
}

How to validate contents of PDF opened in another browser tab

In this scenario, the test navigates to a webpage and clicks a link (found with a Selenium locator) that opens the PDF in the same or another browser tab. It then reads the PDF URL from the browser and uses it to parse and verify the content, as seen in the example below:

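import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;
import org.testng.Assert;
import org.testng.annotations.AfterTest;
import org.testng.annotations.BeforeTest;
import org.testng.annotations.Test;

import io.github.bonigarcia.wdm.WebDriverManager;

public class PDFInBrowser {
    WebDriver driver;
    String url = "http://www.princexml.com/samples/";
    String pdfUrl;

    @BeforeTest
    public void setUp() {
        WebDriverManager.chromedriver().setup();
        driver = new ChromeDriver();
        driver.manage().window().maximize();
        driver.get(url);
        String originalTab = driver.getWindowHandle();
        driver.findElement(By.xpath("(//a[contains(@href, 'drylab.pdf')])[2]")).click();
        // If the link opened a new tab, switch to it; otherwise
        // getCurrentUrl() would still return the URL of the original page
        for (String handle : driver.getWindowHandles()) {
            if (!handle.equals(originalTab)) {
                driver.switchTo().window(handle);
            }
        }
        pdfUrl = driver.getCurrentUrl();
    }

    // Declaring IOException makes a failed download fail the test
    @Test
    public void verifyTextFromPDF() throws IOException {
        String pdfContent = getPdfContent(pdfUrl);
        Assert.assertTrue(pdfContent.contains("New York, St. Louis, San Francisco"));
    }

    @AfterTest
    public void tearDown() {
        driver.quit();
    }

    public static String getPdfContent(String url) throws IOException {
        URL pdfURL = new URL(url);
        // try-with-resources closes the streams and the document even if extraction fails
        try (InputStream is = pdfURL.openStream();
             BufferedInputStream bis = new BufferedInputStream(is);
             PDDocument doc = PDDocument.load(bis)) {
            PDFTextStripper strip = new PDFTextStripper();
            String stripText = strip.getText(doc);
            System.out.println(stripText);
            return stripText;
        }
    }
}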

How to validate contents of already downloaded PDF in the Downloads folder

Download any PDF file to be tested. Go to the folder where the PDF file is present, right-click the PDF, and select “Open With” any browser. Copy the entire URL from the address bar and use it as the PDF URL in the test.

For example, to verify the contents of a PDF stored in the Downloads folder, the URL looks like this:

file:///C:/Users/<username>/Downloads/google_infrastructure_whitepaper_fa.pdf
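Instead of copying the URL from a browser by hand, the file: URL can also be built in code from a file system path, which avoids hard-coding a user name or drive letter. The sketch below uses only the JDK. The LocalPdfUrl class is hypothetical, and it assumes the browser saves files to a "Downloads" folder under the user's home directory, which may not hold on every machine.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper: build a file: URL for a locally stored PDF.
public class LocalPdfUrl {

    // Converts path segments into an absolute, normalized file: URL string.
    public static String toFileUrl(String first, String... more) {
        Path path = Paths.get(first, more).toAbsolutePath().normalize();
        return path.toUri().toString();
    }

    // Assumption: downloads land in <user home>/Downloads.
    public static String inDownloads(String fileName) {
        return toFileUrl(System.getProperty("user.home"), "Downloads", fileName);
    }
}
```

For instance, LocalPdfUrl.inDownloads("google_infrastructure_whitepaper_fa.pdf") yields a URL of the same form as the one shown above, which can be passed to a method that opens the PDF from a URL.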

Code to Verify the contents of the PDF at a given folder location

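import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import org.testng.Assert;
import org.testng.annotations.Test;

public class PDFDownload {
    // Replace "lenovo" with the user name on your machine
    String url = "file:///C:/Users/lenovo/Downloads/google_infrastructure_whitepaper_fa.pdf";

    // Declaring IOException makes a missing or unreadable file fail the test
    @Test
    public void verifyTextFromPDF() throws IOException {
        String pdfContent = getPdfContent(url);
        Assert.assertTrue(pdfContent.contains("Secure low-level infrastructure"));
    }

    public static String getPdfContent(String url) throws IOException {
        URL pdfURL = new URL(url);
        // try-with-resources closes the streams and the document even if extraction fails
        try (InputStream is = pdfURL.openStream();
             BufferedInputStream bis = new BufferedInputStream(is);
             PDDocument doc = PDDocument.load(bis)) {
            PDFTextStripper strip = new PDFTextStripper();
            String stripText = strip.getText(doc);
            System.out.println(stripText);
            return stripText;
        }
    }
}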

As per the PDFTextStripper class documentation, this class takes a PDF document and strips out all of the text, ignoring the formatting. Note that it is up to clients of this class to verify that a specific user has the correct permissions to extract text from the PDF document, so make sure you are testing a valid PDF file that you are permitted to read. PDFTextStripper also cannot extract text from scanned pages that are stored in the PDF as images.

As always, it is important to run Selenium tests on real browsers and devices. BrowserStack offers a cloud Selenium Grid of 3000+ real browsers and devices for testing purposes. Simply sign up, choose the required device-browser-OS combination from 3000+ Desktop and Mobile browsers for Selenium Testing, and start testing websites for free.
