Decode/Encode Word Documents in Apex

On the Account object, I have attached a Word document in the Notes & Attachments related list. From Apex, I want to fetch the body text of this attached Word document. My Apex code is:

 

Account acc = [select Name from Account LIMIT 1];

Attachment att = [SELECT Name, ContentType, Body, ParentId, Description FROM Attachment WHERE ParentId = :acc.Id LIMIT 1];

 

// Get the body of the attachment as a Blob and convert it to a String

Blob bodyBlob = att.body;

String bodyStr = bodyBlob.toString();

 

But the above code throws a StringException that says: 'BLOB is not a valid UTF-8 string'

The above code works fine for attachments which have .rtf extension.


Is it possible to get the content of the body of any Word (.doc) attachment in Apex? Please suggest some workaround.

 

Thanks


Best Answer chosen by Admin (Salesforce Developers) 
sfdcfox

When you load data into a String variable, the data must be valid UTF-8. UTF-8 is a character-based format, not a byte-based format.

 

Byte-based files are those that do not directly represent textual information. These files are also commonly called "binary files." They have no consistent rule about how the bytes are laid out; each binary format has its own rules. For example, most binary files start with a header (known as a magic string or magic number) that tells the program reading the file that it is probably a valid file. A GIF file contains GIF as its first three bytes, while a JPEG file starts with the bytes 0xFFD8 and, in the common JFIF variant, carries the tag JFIF a few bytes later. One notable feature of binary files is that you cannot print their contents properly, because they contain "non-printable characters."
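The magic-string idea can be sketched in Apex as follows. This is only an illustration: the class name FileSniffer and the method sniff are hypothetical, and the signatures checked are the widely known ones for GIF, JPEG, legacy Office (OLE2) files, and Zip archives. Note that it converts the whole blob to hex, so a large attachment could run into heap limits.

```apex
public class FileSniffer {
    // Hypothetical helper: returns a rough label based on the file's leading bytes.
    public static String sniff(Blob data) {
        // convertToHex yields two hex digits per byte; normalise the case to be safe.
        String hex = EncodingUtil.convertToHex(data).toLowerCase();
        if (hex.startsWith('474946')) {
            return 'GIF';    // ASCII "GIF"
        }
        if (hex.startsWith('ffd8')) {
            return 'JPEG';   // JPEG start-of-image marker
        }
        if (hex.startsWith('d0cf11e0a1b11ae1')) {
            return 'OLE2';   // legacy binary .doc/.xls container
        }
        if (hex.startsWith('504b0304')) {
            return 'ZIP';    // Zip archive (also used by .docx/.xlsx)
        }
        return 'unknown';
    }
}
```

A call such as FileSniffer.sniff(att.Body) could then be used to decide whether it is even worth attempting toString() on an attachment.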

 

Conversely, character-based formats are files whose bytes represent text. They must follow a set of rules to be considered a character-based format. These formats typically do not have a "magic string", since they are identified by their contents. They have restrictions on which bytes may be placed in the file. For example, UTF-8 forbids "overlong" encodings such as 0xF880808081 (this should simply be 0x01), while UTF-16 requires surrogate code units to appear in matched pairs.
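A minimal sketch of how these rules surface in Apex, assuming Blob.toString() rejects any byte sequence that is not valid UTF-8 (which is what the exception in the question suggests). The class BlobText and the method tryToString are hypothetical names used only for illustration.

```apex
public class BlobText {
    // Hypothetical helper: returns the blob as text, or null when the bytes
    // are not valid UTF-8.
    public static String tryToString(Blob data) {
        try {
            return data.toString();
        } catch (StringException e) {
            // This is the 'BLOB is not a valid UTF-8 string' case.
            return null;
        }
    }
}
```

With this helper, BlobText.tryToString(Blob.valueOf('THIS IS A TEST')) should give back the text, while BlobText.tryToString(EncodingUtil.convertFromHex('d0cf11e0a1b11ae1')) should give null, because 0xD0 announces a two-byte character and 0xCF is not a legal continuation byte.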

 

You can freely load valid HTML, XML, RTF, TXT, ASM, CPP, C, JAVA, PHP, PL, CSV, TSV, and ASP files (as examples), provided they are saved as UTF-8 or plain ASCII. These are all character-based formats. They encode their data in a human-readable form, and you need only a program such as Notepad to read and write them. Because of UTF-8's restrictions on legal strings, you cannot load any type of file that may violate the rules for UTF-8.

 

Byte-based files (a.k.a. binary files) will likely break those rules, because they were designed for space efficiency rather than compatibility with text formats: a character-based format is generally less space-efficient than an equivalent pure binary format. Binary files include GIF, JPEG, MPEG, AVI, PNG, EXE, CLASS, COM, XLS, DOC, MP3, and O (object) files. They are designed to load faster at the expense of human readability. If poorly documented, they may even be unusable to anyone other than the developers who wrote the codec for that format.

 

As you've noticed, I included the XLS and DOC formats in my list of unloadable binary formats. Microsoft often used proprietary formats, especially in its older Microsoft Office software (anything before 2007 falls into this category). These formats are not even fully publicly documented, so you probably would not be able to read the text fully even if you could load the file into a String variable. Your only choice would be to retrieve the file's contents in Base64-encoded form and then attempt to decode the bytes yourself. The number of script lines required just to decode a modest-sized file from Base64 to valid UTF-8 would likely exceed the governor limits.
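For reference, getting binary data into a String as Base64 is the easy part. The sketch below uses the first eight bytes of a legacy .doc file as a stand-in for att.Body; it shows only the encoding round trip, not any interpretation of the Word format.

```apex
// Stand-in for att.Body: the first eight bytes of a legacy .doc file.
Blob binary = EncodingUtil.convertFromHex('d0cf11e0a1b11ae1');

// Base64 text is plain ASCII, so it always fits in a String.
String encoded = EncodingUtil.base64Encode(binary);

// Decoding restores the original bytes; it does not make them readable text.
Blob decoded = EncodingUtil.base64Decode(encoded);

System.debug(encoded);
System.debug(EncodingUtil.convertToHex(decoded));
```

The hard part is everything after this step: picking the document text out of the decoded bytes would still have to be done by hand.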

 

If you really need to process the file's contents, you'll probably need ActiveX, Silverlight, Java, or a remote server to do it. I apologize for the overly long post, but I wanted to make it clear why you cannot easily read a "text file" of a certain type.
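The remote-server route might look roughly like the sketch below. Everything specific here is an assumption: the endpoint URL is a placeholder for a service you would have to build and register as a remote site, the service is assumed to reply with plain text, and the names DocTextService, buildRequest, extractText, and ExtractionException are hypothetical.

```apex
public class DocTextService {
    // Hypothetical custom exception for a failed extraction call.
    public class ExtractionException extends Exception {}

    // Placeholder endpoint: stands in for a server that extracts text from .doc files.
    private static final String ENDPOINT = 'https://example.com/extract-text';

    // Builds a POST request carrying the raw attachment bytes.
    public static HttpRequest buildRequest(Blob body) {
        HttpRequest req = new HttpRequest();
        req.setEndpoint(ENDPOINT);
        req.setMethod('POST');
        req.setHeader('Content-Type', 'application/msword');
        req.setBodyAsBlob(body);
        return req;
    }

    // Sends the file and returns whatever text the server replies with.
    public static String extractText(Blob body) {
        HttpResponse res = new Http().send(buildRequest(body));
        if (res.getStatusCode() != 200) {
            throw new ExtractionException('Unexpected status: ' + res.getStatusCode());
        }
        return res.getBody();
    }
}
```

Keeping the request construction in its own method makes it possible to check the request without performing a callout.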

bp-dev

When I call toString() on an .rtf file, it returns the raw contents of the blob, which is OK, but is there any way to get the actual words in the document as plain English?

 

What I mean is, when I created a UTF-8 encoded .txt file containing the words "THIS IS A TEST", selected the document, and called d.body.toString(), it returned the very phrase I put in the document. Is there a way to do the same with a regular .rtf file? I'd like to be able to parse the text in Apex.

sfdcfox

RTF files are very similar to HTML files: RTF is a text-based encoding in which certain structures of ordinary text act as style-control information. So it's conceptually possible to get the "plain English" out of the file, but you're going to have to work for it. Just for kicks, I downloaded the specification posted by Microsoft (a KB article, actually) that fully outlines the latest RTF format, and there are a lot of considerations. Given the size of the spec and the intricacies of the format, it is unlikely that you could "perfectly" parse every conceivable type of RTF in pure Apex, especially since there is no native RTF parser available (as there is for JSON and XML). Take a look at the Wikipedia links for RTF and see whether you can come up with a reasonable approximation. I'd do it myself, but I don't have the time right now; if I get a moment, I'll revisit this thread with an update.
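As one possible "reasonable approximation", here is a very rough regex pass. It is a sketch under heavy assumptions, not a real RTF parser: it only handles simple documents, drops hex-escaped characters instead of decoding them, ignores escaped braces and backslashes, and removes header groups such as the font table only when they nest at most one level deep. The names RtfText and strip are hypothetical, and long files may hit Apex regex or CPU limits.

```apex
public class RtfText {
    // Hypothetical helper: crude RTF-to-text conversion for simple documents.
    public static String strip(String rtf) {
        String s = rtf;
        // Drop common header groups, e.g. {\fonttbl{\f0 Arial;}} and {\*\generator ...}.
        s = s.replaceAll(
            '\\{\\\\(?:fonttbl|colortbl|stylesheet|info|\\*)(?:[^{}]|\\{[^{}]*\\})*\\}', '');
        // Drop hex escapes such as \'e9 (the character itself is lost here).
        s = s.replaceAll('\\\\\'[0-9a-fA-F]{2}', '');
        // Turn paragraph marks into newlines before removing other control words.
        s = s.replaceAll('\\\\par\\b ?', '\n');
        // Remove remaining control words such as \rtf1, \b, \fs24 and one trailing space.
        s = s.replaceAll('\\\\[a-zA-Z]+-?[0-9]* ?', '');
        // Remove the group braces that are left over.
        s = s.replaceAll('[{}]', '');
        return s.trim();
    }
}
```

It would be called on the String obtained from the attachment, for example RtfText.strip(att.Body.toString()).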

Sai Ram A

Thanks for the info provided. This helps.

Pedro I Dal Col
Word documents in the .docx format (Word 2007 and later) are Zip archives. To extract the text, you need to uncompress a file named 'word/document.xml' that is embedded in every .docx document; this does not apply to the older binary .doc format. For that you can use the open-source Zippex library: https://github.com/pdalcol/Zippex

After you install Zippex, use this code to get the content of the Word file in plain text:
 
// Replace <Attachment Id> with the actual Id of a .docx attachment
Attachment att = [SELECT Body FROM Attachment WHERE Id = '<Attachment Id>'];
Blob bodyBlob = att.Body;
// Open the attachment as a Zip archive
Zippex myZip = new Zippex(bodyBlob);
// Uncompress the main document part
String docXml = myZip.getFile('word/document.xml').toString();
// Remove the XML tags, leaving only the text
String plainText = docXml.stripHtmlTags();

System.debug(plainText);