We take document processing on an individual scale for granted: double-click the file (or run a simple command) and its contents display. But it gets more complicated at scale. Imagine you're a recruiter searching resumes for keywords or a paralegal looking for names in thousands of pages of discovery documents. The formats, versions, and platforms that generated them could be wildly different. The challenge is even greater when it's time-sensitive; for example, when you have to scan all outgoing emails for personally identifiable information (PII) leaks, or give patients a single file that contains all of their disclosure agreements, scanned documents, and MRI/X-ray/test reports, regardless of the original file formats.
At Hyland we produce a document processing toolkit that independent software vendors can embed to identify files, extract text, render file content, convert formats, and annotate documents in over 550 formats. That toolkit is Document Filters, and it's built for any software that needs to interact with documents.
One library for 550 formats may seem like overkill, but imagine stringing together dozens of open source libraries and retesting every one of them each time a new release hits the wild. We give you one dependency, one point of contact if something goes wrong, and one library to deploy instead of dozens.
We started as a company that sold desktop search software called ISYS. The application was built in Pascal for MS-DOS and provided mainframe-level search on PCs. Eventually, other companies, such as Microsoft and Google, started providing desktop search applications for free, and it’s tough to compete with free.
This led us to realize that the sum of the parts was greater than the whole: getting text out of files and delivering its exact location is harder than it seems, and it's relevant to applications other than search. Our customers noticed our strength in text extraction and wanted it as something they could integrate or embed in their own software across multiple platforms.
Recognizing that Pascal was not going to meet our needs, we had our engineers spend the next year rebuilding the app in C++ for about half a dozen computing platforms. Since then, we've learned a lot about content processing at scale and how to make it work on any platform.
On any platform
When we rewrote our software, one of the key factors was platform support. At the time, Pascal only supported Windows, and while it now supports Mac and Linux, it was and still is a niche language. That wasn't going to work for us, as we wanted to support big backend processing servers running Solaris and HP-UX. We considered writing it in C, but we would have had to invent a lot of the boilerplate that C++ gave us for free.
We were able to port about 80 percent of the code from Pascal. The other 20 percent was new OS abstractions, primarily to stand in for the Windows API functions we lose on other platforms and to smooth over the various quirks of each platform. Each compiler makes different assumptions about how to implement C++ code, so we use multiple compilers to see what those assumptions are.
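As a flavor of what those abstractions look like, here is a minimal, hypothetical example (our illustration, not code from Document Filters): Windows and POSIX name even simple routines differently, so a port wraps the difference once and the rest of the codebase calls the wrapper.

```cpp
#if defined(_WIN32)
#  include <string.h>    // _stricmp lives here on Windows
#else
#  include <strings.h>   // strcasecmp lives here on POSIX systems
#endif

// Hypothetical shim: one portable name for a case-insensitive string compare,
// so no caller needs to know which platform-specific function exists locally.
int portable_stricmp(const char* a, const char* b)
{
#if defined(_WIN32)
    return _stricmp(a, b);
#else
    return strcasecmp(a, b);
#endif
}
```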
Complicating things further, we had to consider not only operating systems but CPUs as well. Different CPUs store bytes in different orders, called byte endianness. All Intel and ARM chips use little-endian, where the least significant byte is stored first. The SPARC chips historically used in Solaris machines used big-endian storage, where the most significant byte was stored first. When you're reading from a file, you need to know what chipset produced it; otherwise, you could read things backwards. We make sure to abstract this away so no one needs to figure out the originating chipset before processing a file.
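To make that concrete, here is a minimal sketch (not the Document Filters API) of the usual trick: read multi-byte fields one byte at a time and assemble them in the order the file format specifies, so the result is identical on little- and big-endian hosts.

```cpp
#include <cstdint>

// Reads a 32-bit value stored little-endian in the file (common in PC-era
// formats). Because we assemble it byte by byte, the host CPU's own byte
// order never matters.
uint32_t read_u32_le(const unsigned char* p)
{
    return  static_cast<uint32_t>(p[0])
          | (static_cast<uint32_t>(p[1]) << 8)
          | (static_cast<uint32_t>(p[2]) << 16)
          | (static_cast<uint32_t>(p[3]) << 24);
}

// The big-endian counterpart, for formats written most-significant byte first.
uint32_t read_u32_be(const unsigned char* p)
{
    return  (static_cast<uint32_t>(p[0]) << 24)
          | (static_cast<uint32_t>(p[1]) << 16)
          | (static_cast<uint32_t>(p[2]) << 8)
          |  static_cast<uint32_t>(p[3]);
}
```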
Ultimately, the goal is to have the software run exactly the same on all 27 platforms. Part of the solution is writing everything as generically as possible, without special code for each platform. The rest is testing. With the conversion to C++, we wrote a lot of new tests in order to exercise as much code as possible on all platforms. Today, we've expanded those tests and made error detection much stricter. Lots of files and formats pass through during tests, and they need to come through clean.
Search and extract text at scale
The first step to locating or extracting text from a file is finding out what format the file is in. If you're lucky enough to get a plaintext file, that's an easy one. Unfortunately, things are rarely easy. There aren't a lot of standards for how files are structured, and what exists may be incomplete or outdated. Things have changed a lot over the years, though; Microsoft is actually at the forefront of publishing standards and now publishes them for most of its file types, particularly the newer ones.
Many file types can be identified by an initial set of four bytes. Once you have that, you can quickly parse the file. Older MS Office files all shared the same four bytes, which presented complications, especially since so many files were in one of those four Office formats; you had to do a little extra detective work. Newer Office files all identify as ZIP files (they are all compressed XML), so once you extract the XML, you start applying known heuristics and following markers. Most XML is self-describing, so those markers can be easy to follow. Other formats don't offer much of a path at all.
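To show what the four-byte check looks like in practice, here is a minimal sniffing sketch against a handful of well-known signatures (our own illustration, not the Document Filters detection engine). A ZIP or legacy-Office hit is only the first step: you still have to look inside the container to tell, say, a .docx from an .xlsx.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Match the first four bytes of a file against a few well-known signatures.
// Real detection needs far more signatures plus deeper heuristics.
std::string sniff_format(const unsigned char* head, std::size_t len)
{
    if (len < 4) return "unknown";

    if (std::memcmp(head, "%PDF", 4) == 0)             return "pdf";
    if (std::memcmp(head, "PK\x03\x04", 4) == 0)       return "zip container";  // includes .docx/.xlsx/.pptx
    if (std::memcmp(head, "\xD0\xCF\x11\xE0", 4) == 0) return "ole container";  // legacy .doc/.xls/.ppt
    if (std::memcmp(head, "\x89PNG", 4) == 0)          return "png";

    return "unknown";
}
```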
Binary file types are harder. Some of the work here is reverse engineering and making sure you have enough files to form a representative sample set. But once you know the pattern, detecting the file is absolutely predictable, which is why we don't use any machine learning or artificial intelligence (AI) techniques to identify files. The challenge is working out what the pattern is and which pattern a given file fits.
Identifying files is the very first thing we do, so it has to be fast. One slow detection can impact everything and take us from sub-millisecond per document to 15 milliseconds per document. When you're trying to crank through 40,000 documents in a minute, which works out to an average budget of about 1.5 milliseconds each, that's a lot.
We gain a lot of speed from specializing in text search and extraction as a pure back-end system. Alternative approaches use something like LibreOffice running as a headless word processor to process documents, but end-user applications carry graphical elements and other features you don't care about on a server. In a high-traffic environment, that could mean 50 copies of LibreOffice running as separate processes across multiple machines, each eating up hundreds of MB, and if one crashes, it could bring down vital business processes with it. It's not uncommon to see server farms running LibreOffice for conversions that could be replaced with a single back-end process such as Document Filters. And that's before considering the workarounds needed to process all the other file types you might encounter, such as spreadsheets, images, and PDFs.
By focusing on processing text at high volume, we can help clients that need to scan emails, incoming and outgoing, for data loss and accidental PII leaks. These products need to scan everything going in or out. We call it deep inspection: we crack apart every level of an email that could contain text. Zipping something and renaming the extension isn't enough to trick it. Neither is attaching a PDF inside a Word document inside an Excel document. These are all files that contain text, and security needs to scan all of it without delaying the send. We won't try to crack an encrypted file, but we can flag it for human review. All of this happens so quickly that you won't notice a delay in the delivery of critical email.
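The recursion is what makes deep inspection work: a scanner that only looks at the top level misses the PDF tucked inside the Word document. Here is a minimal sketch of that walk over a hypothetical in-memory document model with a stand-in PII check (our illustration, not the Document Filters API).

```cpp
#include <string>
#include <vector>

// Hypothetical model for illustration: a document has its own text plus any
// embedded documents (attachments, archive entries, embedded objects).
struct Document {
    std::string text;
    std::vector<Document> children;
    bool encrypted = false;
};

// Stand-in for a real PII detector (regexes, dictionaries, checksum rules).
bool contains_pii(const std::string& text)
{
    return text.find("SSN:") != std::string::npos;
}

// Deep inspection as a recursive walk: scan this level's text, flag anything
// we can't open for human review, then descend into every embedded document
// so a PDF inside a Word file inside a ZIP attachment still gets scanned.
void deep_scan(const Document& doc, std::vector<const Document*>& flagged)
{
    if (doc.encrypted) {
        flagged.push_back(&doc);              // don't try to crack it; route to a human
        return;
    }
    if (contains_pii(doc.text))
        flagged.push_back(&doc);
    for (const Document& child : doc.children)
        deep_scan(child, flagged);
}
```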
We can process text so quickly because we built it in C++ and run natively on the hardware; targeting native binaries also gives us the greatest flexibility, since we can be embedded in applications written in a wide variety of languages. On top of that, all the work identifying file formats pays off. When scanning a file, we load as little as possible into memory, just enough to identify the format. Then we move to processing, where we ignore any information we don't need to spot text; we don't need to load Acrobat forms and crack that stuff apart. Plus, we let you throw as much hardware at the problem as you have. Say you're running a POWER8 machine with 200 cores: you can run 400 threads and it won't break a sweat. You do want a lot of memory if you're processing that many documents in parallel.
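As a rough sketch of what "throw as much hardware at it as you have" means in practice (our own illustration in standard C++, not Document Filters code), each worker thread below pulls the next document index off a shared atomic counter, so throughput scales with however many cores, and threads, you give it.

```cpp
#include <atomic>
#include <cstdio>
#include <string>
#include <thread>
#include <vector>

// Stand-in for the per-document work: identify the format, then extract text.
void process_document(const std::string& path)
{
    std::printf("processed %s\n", path.c_str());
}

int main()
{
    // Hypothetical inputs; in a real system these would stream in from a queue.
    std::vector<std::string> paths = {"a.docx", "b.pdf", "c.eml"};
    std::atomic<std::size_t> next{0};

    unsigned workers = std::thread::hardware_concurrency();
    if (workers == 0) workers = 4;            // fall back if the core count is unknown

    std::vector<std::thread> pool;
    for (unsigned i = 0; i < workers; ++i) {
        pool.emplace_back([&] {
            // Each thread claims the next unprocessed document until none remain.
            for (std::size_t idx; (idx = next.fetch_add(1)) < paths.size(); )
                process_document(paths[idx]);
        });
    }
    for (std::thread& t : pool) t.join();
}
```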