KSusan Cox

Fetch the content of a Word file as a String

Hi,

How can I fetch the content of a Word file as a String? Right now I get all the content as a Blob, but I am not able to convert it to a String.

I can fetch the full content of a text file, but the same code does not work for Word or Excel files. I tried code that converts a Blob to a String, but it does not work.

I also tried encoding and decoding the file content, but that does not work either: the decoded result does not exactly match the original content.

Basically, I want to read the content of all MS Office files posted on Chatter.

Thanks in advance.
Vinit_Kumar
Reading Word documents is not natively supported in Apex. A Word (.docx) file is a zip archive (a binary file containing XML files), and Apex has no native support for unzipping files to get at the contents inside.

However, there is a blog series which could help you out:

http://andyinthecloud.com/2012/11/04/handling-office-files-and-zip-files-in-apex-part-1/

http://andyinthecloud.com/2012/12/09/handling-office-files-and-zip-files-in-apex-part-2/
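
To see the zip container for yourself, you can check the first bytes of the Blob. The following is a minimal sketch, not taken from the blog posts above: the class name OfficeFileSniffer and the method looksLikeZip are hypothetical. It relies only on EncodingUtil.convertToHex and on the fact that zip archives begin with the bytes "PK" 03 04.

```apex
public class OfficeFileSniffer {
    // Hypothetical helper: returns true when the Blob starts with the
    // zip local-file-header signature "PK\x03\x04" (hex 504b0304).
    public static Boolean looksLikeZip(Blob input) {
        if (input == null || input.size() < 4) {
            return false;
        }
        // convertToHex yields two hex characters per byte; lower-case the
        // result so the comparison does not depend on the letter case.
        String hex = EncodingUtil.convertToHex(input).toLowerCase();
        return hex.startsWith('504b0304');
    }
}
```

Hex-encoding the whole Blob doubles its size on the heap, so this sketch is only suitable for small files.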

If this helps, please mark it as the best answer to help others :)
KSusan Cox
Hi Vinit

Thanks for the reply. 

Are there any alternatives for solving this issue, such as encoding and decoding, or anything else that could help me?

Thanks

Vinit_Kumar
Like I said, as of now this is not supported, so I am afraid there is nothing you can use to read the Word body directly.

The only other workaround I can think of is creating an Attachment or Document record inside Salesforce and then reading its Body.

Hope that helps!!
KSusan Cox
Hi,

I am close to a solution for reading the content of files such as doc, xls, ppt and pdf, and it works fine.
However, it consumes a lot of CPU time, so I am still working on that.

Here is the solution which works for me:

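public static String blobToString(Blob input, String inCharset) {
    // Hex-encode the Blob: two hex characters per byte.
    String hex = EncodingUtil.convertToHex(input);
    System.assertEquals(0, hex.length() & 1);
    final Integer bytesCount = hex.length() >> 1;
    // An empty Blob would otherwise yield a lone '%', which is not a valid percent escape.
    if (bytesCount == 0) {
        return '';
    }
    String[] bytes = new String[bytesCount];
    for (Integer i = 0; i < bytesCount; ++i) {
        bytes[i] = hex.mid(i << 1, 2);
    }
    // Percent-encode every byte, then let urlDecode interpret the bytes in inCharset.
    return EncodingUtil.urlDecode('%' + String.join(bytes, '%'), inCharset);
}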

Thanks.
Vinit_Kumar
Thanks for sharing, KSusan Cox!
Pedro I Dal Col
The content of Word documents is compressed using the Zip format. To extract the text, you need to uncompress the file 'word/document.xml', which is embedded in every Word (.docx) file. For that you can use the open source Zippex library: https://github.com/pdalcol/Zippex

After you install Zippex, use this code to get the content in plain text:
// wordFileBlob is a Blob that contains the Word document
Zippex myZip = new Zippex(wordFileBlob);
// Uncompress the main document part
String wordDoc = myZip.getFile('word/document.xml').toString();
// Remove XML tags
String plainText = wordDoc.stripHtmlTags();

System.debug(plainText);
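
Stripping the tags alone can run paragraphs together, because Word stores each paragraph in a w:p element with no line break between them. If you need to keep paragraph boundaries, here is a minimal sketch under these assumptions: the class name WordTextExtractor and the method xmlToPlainText are hypothetical and not part of Zippex, and a simple regular expression stands in for a real XML parser, so it may misbehave if an attribute value contains a '>' character.

```apex
public class WordTextExtractor {
    // Hypothetical helper: converts the XML of word/document.xml to plain
    // text, placing a newline after each paragraph (w:p) element.
    public static String xmlToPlainText(String documentXml) {
        if (String.isBlank(documentXml)) {
            return '';
        }
        // Mark the end of every paragraph before the tags are removed.
        String withBreaks = documentXml.replace('</w:p>', '</w:p>\n');
        // Remove every tag; this is a regex approximation, not an XML parse.
        String noTags = withBreaks.replaceAll('<[^>]+>', '');
        // Turn entities such as &amp; back into literal characters.
        return noTags.unescapeXml().trim();
    }
}
```

With the snippet above, you could call WordTextExtractor.xmlToPlainText(wordDoc) in place of wordDoc.stripHtmlTags().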