Accessible PDFs: Good for humans, good for automation

If you’ve ever been to a workshop on accessibility – maybe you’re a professor who needs to create accessible materials for students, maybe you’re a business that needs to post accessible documents on your website or provide accessible documents internally – you’ve heard what sounds like a tired cliché: what’s good for people with disabilities is good for everyone. As it turns out, the cliché is right, and it’s not terribly difficult to achieve, provided you start with accessibility in mind instead of trying to make things accessible after the fact.

I’m not going to go through an exhaustive list of accessibility concerns. Instead, I want to tackle this issue from a data perspective. Specifically, I want to focus on how accessible PDFs can make your PDF-based data usable. Usable can mean a lot of things, but, in the world of data, I mean that you can use that data to assist you with your business goals, whether that’s using it to create your mailing list for a direct mail campaign or harnessing it for use in analytics.

If you’ve ever had to process a PDF programmatically, you know the pain of which I am about to speak. If you’ve ever thought about hiring someone to wrangle your data out of PDF files and into another format, this article might help you understand the concerns they raised and the quotes you received. If this topic is all new to you, don’t worry. I’m going to forgo the finger wagging and wallowing in our collective mistakes, and talk a bit about how we got to this place, what some standard accessibility and automation barriers are, and what we can do to prevent them and promote usable data going forward. After all, if you’ve decided to keep a file, I assume you must have a reason for doing so that’s not just because you love to fill hard drives and cloud storage.

How did I end up with an inaccessible PDF in the first place?

PDF stands for Portable Document Format, and the introduction of this file type allowed us to share documents containing images and formatted text with other folks regardless of what computer, operating system, or other software they were using to view it.

When we began the great push to free our offices from the clutter of paper files, and we began scanning in those old paper files, PDF was a natural choice for the output format. Client files? Internal policy and procedure manuals? Magazines? They could all be scanned, and the paper thrown out.

As part of scanning, you might have invoked OCR (optical character recognition – letting the computer try to figure out where the text is and what letters were used in an artificially intelligent pattern matching kind of way) so that the text would remain at least somewhat searchable and usable. Mostly, though, we all scanned to image, meaning that each page scanned became a picture. Multi-page documents became collections of images, just as (un)usable as photographs of text documents.

For all those things you just wanted to hold on to for posterity, image-based PDFs were great, but, for all of that information you wanted to use somewhere else, like mailing lists and business cards and data tables, that data was just as hard to use in other places as it had been in paper form.

Even more nefarious, when we wanted to make our already-digital, non-PDF documents more accessible to folks using different operating systems and other software, we all learned the great hack of printing our documents to PDF so that we could distribute the PDF version. While your text might remain readable and usable, some of the other elements of your digital document, such as alt tags for images, which might add value and information as data, can get lost when you convert to PDF this way.

Wrestling with accessibility and automation barriers in image-based PDFs

PDFs are never really the best way to get data. If you have that data in another format, or can get it from someone in another format, you can probably save yourself a lot of time and effort by using that other format. That said….

Suppose you open a PDF, and you have hopes and dreams of copying that data out and pasting it in somewhere else. Unfortunately, your PDF is image-based, and your mouse cursor doesn’t seem to recognize the text the same way your eyes do. You suspect, but don’t know for sure, that this PDF page is saved as an inaccessible image instead of usable text. If you’re in Adobe Reader, maybe you use the Read Out Loud option to see what Adobe Reader can “see.” You hear back, “Warning: Empty page.” If you’re using the full version of Adobe Acrobat, you might instead explore your document using the accessibility check under the Tools menu and find out that you’re looking at a document full of images.

So you decide to use OCR on your document to see if that will free the text from its image-based prison. In Adobe Acrobat, you can use the Recognize Text Tool. That gives you a document version that has selectable text, but there are some errors, such as the word “The” being copied and pasted as “Tlie” (they look similar, right?) and double-ts (tt) being copied and pasted as a capital H. The computer is good, but not perfect. You’ll have to clean the data manually; the automated OCR can’t get you clean data on its own, especially with things like names. Even worse, some of the information was in columns, and, on some pages, when you try to copy and paste, your mouse can’t select neatly down columns, but sometimes selects across the whole line (row) of data as if this were paragraph text instead of columnar text. That’s another roadblock for automation.
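To give a feel for what that cleanup involves, here is a minimal sketch in Python of flagging likely OCR mix-ups for a person to review. The confusion pairs come from the two examples above; the tiny vocabulary and the function names (suggest, review) are made up for illustration, and a real cleanup would need a much larger word list.

```python
import re

# Confusion pairs taken from the examples above: (what OCR produced, what was likely printed).
CONFUSIONS = [("li", "h"), ("H", "tt")]

# Tiny stand-in vocabulary (an assumption); a real one would come from a dictionary or your own data.
VOCABULARY = {"the", "letter", "better", "this", "list"}


def suggest(token):
    """Return a corrected spelling if a single confusion swap yields a known word, else None."""
    if token.lower() in VOCABULARY:
        return None  # already a known word, nothing to fix
    for wrong, right in CONFUSIONS:
        start = token.find(wrong)
        while start != -1:
            candidate = token[:start] + right + token[start + len(wrong):]
            if candidate.lower() in VOCABULARY:
                return candidate
            start = token.find(wrong, start + 1)
    return None


def review(text):
    """List (token, suggestion) pairs for a human to confirm; nothing is changed automatically."""
    flagged = []
    for token in re.findall(r"[A-Za-z]+", text):
        fix = suggest(token)
        if fix is not None:
            flagged.append((token, fix))
    return flagged


if __name__ == "__main__":
    for token, fix in review("Tlie leHer is on tlie list"):
        print(f"{token!r} might be {fix!r}")
```

Notice that this only works for words the vocabulary already knows. A misread surname has nothing to be checked against, which is why names in particular tend to end up in the manual cleanup pile.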

If posterity wasn’t the main reason you digitized your information, this won’t do. You want usable data to help you achieve your business goals.

Creating accessible documents: 4 things to keep in mind

I’ll quickly touch on some issues to be aware of, and then move on to how to make your PDF documents more accessible and automation friendly, and your data more usable.

Color Contrast

An adequate color contrast ratio between your text and your background colors helps folks actually read what you’ve written. Even if you have “normal” vision, sometimes things can be hard to read because the font is too light, there’s bad contrast with the background, or you scanned in a document with a coffee stain and it’s now practically impossible to read what’s under the stain even though you can figure it out well enough in context and by memory.

This is true for the computer, too. Remember that features like OCR are trying to match character patterns to figure out what the letters and numbers are. Choosing a more readable font (e.g., Calibri) can go a long way to making this possible, but the font choice doesn’t matter if your color contrast makes it hard to see the numbers and letters in the first place.
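If you’re curious what a contrast ratio actually measures, here is a small Python sketch based on the relative luminance and contrast ratio formulas published in the WCAG 2.x guidelines. The function names are my own, and the 4.5:1 figure in the comments is the WCAG AA minimum for normal-size text; treat this as an illustration rather than a replacement for a proper checker.

```python
def _linear_channel(value_0_255):
    """Convert one 0-255 sRGB channel to its linear value, per the WCAG 2.x definition."""
    c = value_0_255 / 255
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(hex_color):
    """Relative luminance of a color given as a HEX string such as '#767676'."""
    hex_color = hex_color.lstrip("#")
    r, g, b = (int(hex_color[i:i + 2], 16) for i in (0, 2, 4))
    return (0.2126 * _linear_channel(r)
            + 0.7152 * _linear_channel(g)
            + 0.0722 * _linear_channel(b))


def contrast_ratio(foreground, background):
    """Contrast ratio between two colors, from 1 (identical) up to 21 (black on white)."""
    l1 = relative_luminance(foreground)
    l2 = relative_luminance(background)
    lighter, darker = max(l1, l2), min(l1, l2)
    return (lighter + 0.05) / (darker + 0.05)


if __name__ == "__main__":
    # WCAG AA asks for at least 4.5:1 for normal-size text.
    for text_color in ("#000000", "#767676", "#777777"):
        ratio = contrast_ratio(text_color, "#FFFFFF")
        verdict = "passes" if ratio >= 4.5 else "fails"
        print(f"{text_color} on white: {ratio:.2f}:1 ({verdict} AA for normal text)")
```

Two grays that look nearly identical to the eye can land on opposite sides of the threshold, which is a good argument for measuring instead of eyeballing.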

Alt Tags for Images

An alt tag is something you might hear about a lot when dealing with websites, and not so much when dealing with other information formats. An alt tag, also known as alternative text, is a short description of an image, and takes the place of an image when the image can’t be rendered (e.g., you have images turned off in your email unless you click to download them). They’re also what a screen reader reads out loud when it gets to the image; how else could the computer tell someone what the picture is of, or what information that image is supposed to be conveying in the context of that document?

You can set alt tags for images in many authoring tools, such as MS Word and MS PowerPoint, as well as on your website, so that folks using a screen reader will know what the non-decorative image conveys. It also adds to your text-based information.

Structured Documents & Reading Order

Using structured elements in your documents, like the Heading 1 and Heading 2 Styles in MS Word, helps everyone understand how information is organized. Think about it – you check the table of contents of a book to see if it contains something that’s of interest to you, and you might do the same with the headings of an article or webpage so that you can skip to the part you care about. Using the official styles and structure tags instead of manually adjusting font sizes and styles means that folks using a screen reader can also skim through like this easily, and it preserves the reading order you intended for the document.

When you’re using automation to parse a document, these structures/styles can also help the automation tool figure out how to understand what it’s parsing.

To someone looking at a document, a whitespace gap between columns might indicate that you should read down one column and then move on to the next column of data, but, without the right structure, the computer might read one line at a time, combining the columns inappropriately and giving you garbage data.
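Here is a toy illustration of that column problem in Python. It assumes a page has already been broken into text fragments with positions; the Fragment type, the sample addresses, and the gutter_x value are all hypothetical, and real extraction tools report positions in their own ways.

```python
from typing import NamedTuple


class Fragment(NamedTuple):
    x: float   # distance of the fragment's left edge from the left of the page
    y: float   # distance of the fragment from the top of the page
    text: str


def read_by_line(fragments):
    """Naive order: top to bottom, left to right, ignoring columns entirely."""
    return [f.text for f in sorted(fragments, key=lambda f: (f.y, f.x))]


def read_by_column(fragments, gutter_x):
    """Intended order for a two-column page: all of the left column, then all of the right."""
    left = [f for f in fragments if f.x < gutter_x]
    right = [f for f in fragments if f.x >= gutter_x]
    return [f.text for column in (left, right) for f in sorted(column, key=lambda f: f.y)]


if __name__ == "__main__":
    # Made-up sample page: one address per column.
    page = [
        Fragment(50, 100, "Alice Smith"), Fragment(300, 100, "Bob Jones"),
        Fragment(50, 120, "12 Oak St"),   Fragment(300, 120, "34 Elm Ave"),
    ]
    print(read_by_line(page))         # names and streets from both columns interleaved
    print(read_by_column(page, 200))  # each address kept together
```

The catch is that the second function only works because someone told it where the gutter is. A properly structured document carries that reading order with it, so neither the screen reader nor your automation tool has to guess.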

Screen Readability

Having your document saved in a format that allows for a screen reader to be used, that is, having all of that great text-based information available as searchable, usable text for people with disabilities, also means that it’s available for other software to use, like your automation tools.

Creating better PDFs and other ways to go digital

If you’re printing to PDF, and your document isn’t fancy (e.g., doesn’t contain any text formatting or any images), you might luck out, but you can take some active steps to make sure your document is more accessible and more usable.

As a first step, before you try to fix something that might not be broken, you can run the accessibility checkers available in MS Word and Adobe Acrobat. If you’re editing a Word document, you can find the Check Accessibility tool in the Review menu of the ribbon. If you’re using Acrobat to view an existing PDF, look under the Tools menu for the Accessibility toolset, and choose Full Check. Both of these tools will notify you of errors.

If you’re using office software like MS Word or MS PowerPoint, and you’re ready to create a PDF version of that document, use File > Save As instead of File > Print. Select Browse to select the PDF’s destination folder, and use the dropdown to change the file format from, for example, Microsoft Word (.docx), to PDF, and Save. Your alt tags and other structural elements should be preserved; if they aren’t, check under Options in the Save As dialog that “Document structure tags for accessibility” is selected.

If you’re going beyond the standard black and white document, and want to experiment with color while still preserving the human readability of your documents, you can check out tools like the WebAIM Contrast Checker. It’s designed to help with websites, so the displayed colors are all in HEX, but it’s a good starting place for this topic. If you don’t know the HEX codes for colors you want to check, you can select the colored boxes to select colors using RGB values instead, or just use the sliders to pick colors that way. (These notes will make more sense if you follow the link.)

If you’re ready to move beyond PDFs, you might also want to investigate what would be involved in putting all of your information into a database. Some companies that scan paper documents for you might offer some sort of document management service as well. As an example you might have come across, there are business card scanning apps that do this, with varying degrees of success.
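To make the database idea a little more concrete, here is a minimal sketch using SQLite, which ships with Python. The table layout, column names, and sample contacts are all invented for illustration; your own data would dictate the real design.

```python
import sqlite3


def build_contacts_db(rows):
    """Create an in-memory database with a hypothetical contacts table and load the rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE contacts ("
        "name TEXT NOT NULL, street TEXT, city TEXT, postal_code TEXT)"
    )
    conn.executemany("INSERT INTO contacts VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return conn


def contacts_in_city(conn, city):
    """Return (name, street) for everyone in one city, sorted by name."""
    cursor = conn.execute(
        "SELECT name, street FROM contacts WHERE city = ? ORDER BY name", (city,)
    )
    return cursor.fetchall()


if __name__ == "__main__":
    # Made-up sample data standing in for whatever you extracted from your documents.
    sample = [
        ("Alice Smith", "12 Oak St", "Springfield", "00001"),
        ("Bob Jones", "34 Elm Ave", "Shelbyville", "00002"),
    ]
    conn = build_contacts_db(sample)
    print(contacts_in_city(conn, "Springfield"))
```

Once the information lives in named fields like this, pulling a mailing list for one city is a one-line question instead of a copy-and-paste project.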

However you decide to move forward, if you keep an eye on accessibility for humans and screen readers, you’re taking steps in the right direction to ensure that your data is usable for your business goals, too. Happy PDF-ing!!

Barbara Olsafsky

Owner and Data Wrangler/Strategist
