Convert a string to ASCII values?
In order to integrate with a legacy system from Salesforce, I need to generate a digest in a certain way. Luckily the methods in the Crypto class get me 99% there, but for one step of the process I need to convert the characters in a string into their ASCII code values in order to do some math with the results.
The problem is that I can't find a simple way to do this in Apex. Conversions between strings and integers don't appear to work the way they do in most other languages. I can use the convertToHex method in EncodingUtil to get the hex value for a string, but the result is yet another string, and Integer.valueOf doesn't accept a number base. In the worst case, I can write the conversion from a hex string to an integer myself, but I want to be sure there isn't a simpler answer.
I know this is a rather low-level task for Apex. Any suggestions?
Super old post, but I ran into the same issue and thought I'd post my solution in case anyone has to tackle this exact issue:
http://www.cloudywithachanceofcode.com/converting-string-to-decimal-character-array-in-apex/
Hope it helps someone.
-Richard
I found a way to see if a string is Unicode. The same technique could be used to figure out the ASCII value for characters if desired. The problem with Blob.valueOf is that it treats Extended ASCII (128-255) as Unicode: those characters come back as more than one byte, even though I wouldn't call them Unicode. Use this function and you will get back true if any part of your string is Unicode (detected here by an e3 in the hex encoding). If you want the ASCII values, you can loop over the hex string two digits at a time; if a pair is e3, the bytes that follow belong to a Unicode character in that set.
public static Boolean isUnicodeString(String strInput) {
    Boolean rtn = false;
    // Same byte count as character count means every character fit in one byte
    if (strInput == null || strInput == '' || Blob.valueOf(strInput).size() == strInput.length()) return rtn;
    String strHex = EncodingUtil.convertToHex(Blob.valueOf(strInput));
    // Only strings whose hex encoding contains 'e3' are reported as Unicode
    if (!strHex.contains('e3')) return rtn;
    return true;
}
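For comparison, here is a minimal sketch of the same kind of check done on character values instead of on the hex dump. It assumes the String.getChars() method that current Apex documents; the class and method names (UnicodeCheckSketch, hasCharAbove) are hypothetical, and passing 255 as the cutoff is just my reading of "Extended ASCII" in the post above.

```apex
public class UnicodeCheckSketch {
    // Returns true if any character value in the input is greater than maxCode.
    // Hypothetical helper, not part of the original poster's code.
    public static Boolean hasCharAbove(String strInput, Integer maxCode) {
        if (String.isEmpty(strInput)) return false;
        for (Integer code : strInput.getChars()) {
            if (code > maxCode) return true;
        }
        return false;
    }
}
```

Calling hasCharAbove(someText, 255) treats 128-255 as acceptable, while hasCharAbove(someText, 127) flags anything outside plain ASCII.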
I should have updated this. I completely modified the code to support the Unicode characters. This basically gives you UTF-8 code point numbers. Also, for my purposes I realized I don't have to convert them the way I was. I needed an equivalent of a byte in Java, which I was able to work out just fine by converting the string to hex and then converting the hex to integers. This code below is something I came up with while trying to figure it all out and might be useful to anyone who needs Unicode values for strings.
It absolutely boggles my mind that SFDC hasn't created an ASC function to return the code for a character (0-65535). Java, .NET, and virtually every other language allow it, but for some reason SFDC has limited the Integer conversion and not provided an alternative.
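For anyone landing here later: the Apex String class reference now lists charAt, codePointAt, getChars, and String.fromCharArray, which seem to cover the "ASC function" case. A short sketch, runnable as anonymous Apex, assuming those methods are available in your org's API version (the variable names are only for illustration):

```apex
String s = 'Az';

// One Integer per character in the string.
List<Integer> codes = s.getChars();
System.assertEquals(new List<Integer>{65, 122}, codes);

// Code value of a single character by position.
Integer first = s.charAt(0);
System.assertEquals(65, first);

// And back again from code values to a String.
System.assertEquals(s, String.fromCharArray(codes));
```

I have only shown plain ASCII input here; check how these methods treat characters above 65535 before relying on them for that range.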
Thank you rtuttle for that code snippet!
I have users entering Chinese characters into SFDC forms or copying and pasting in binary data from machine logs. Also, many web servers treat input values containing angle brackets as scripting attacks. When this data is sent via web callouts to an external web service, it never arrives. HTML or Base64 encoding is problematic because I don't control all the endpoints and they expect non-encoded strings. I need to strip all outbound non-printable characters that users can enter for reliable XML transport. Converting Chinese characters stumped me because I didn't understand how they can hex-encode into varying lengths.
Here's what I built that strips invalid XML and invalid Windows 1252 characters from a string using APEX. I'm fairly new to APEX programming and would love to see someone improve this.
... and btw, this for loop in C# is one line of code: "foreach (char c in xml)"
public with sharing class XmlTextCleaner {
    /// <summary>
    /// Remove illegal XML characters from a string.
    /// </summary>
    public static string SanitizeXmlString(string xml)
    {
        if ((null == xml) || (0 == xml.length()))
            return xml;
        String ret = '';
        (Apex code from above removed to fit within allowed posting size)
        for (Integer i=0; i < hex.length(); i += increment)
        {
            (Apex code from above removed to fit within allowed posting size)
            if (IsLegalXmlChar(out) && IsLegalWindows1252(out)) {
                // Angle brackets are swapped for square brackets
                if (60 == out) // "<"
                    charList.Add(91);
                else if (62 == out) // ">"
                    charList.Add(93);
                else
                    charList.Add(out);
            }
        }
        String s = String.fromCharArray(charList);
        System.debug('SanitizeXmlString: output=' + s);
        return s;
    }
    /// <summary>
    /// Whether a given character is allowed by XML 1.0.
    /// </summary>
    private static boolean IsLegalXmlChar(integer character)
    {
        return
        (
            character == 9 /* == '\t' == 9 */ ||
            character == 10 /* == '\n' == 10 */ ||
            character == 13 /* == '\r' == 13 */ ||
            (character >= 32 && character <= 55295) ||
            (character >= 57344 && character <= 65533) ||
            (character >= 65536 && character <= 1114111)
        );
    }
    // from http://msdn.microsoft.com/en-us/goglobal/cc305145.aspx
    private static boolean IsLegalWindows1252(integer character)
    {
        return
        (
            character == 9 /* == '\t' == 9 */ ||
            character == 10 /* == '\n' == 10 */ ||
            character == 13 /* == '\r' == 13 */ ||
            (character >= 32 && character <= 255) ||
            /* 0x01-- */
            character == 338 /* LATIN CAPITAL LIGATURE OE */ ||
            character == 339 /* LATIN SMALL LIGATURE OE */ ||
            character == 352 /* LATIN CAPITAL LETTER S WITH CARON */ ||
            character == 353 /* LATIN SMALL LETTER S WITH CARON */ ||
            character == 376 /* LATIN CAPITAL LETTER Y WITH DIAERESIS */ ||
            character == 381 /* LATIN CAPITAL LETTER Z WITH CARON */ ||
            character == 382 /* LATIN SMALL LETTER Z WITH CARON */ ||
            character == 402 /* LATIN SMALL LETTER F WITH HOOK */ ||
            /* 0x02-- */
            character == 710 /* MODIFIER LETTER CIRCUMFLEX ACCENT */ ||
            character == 732 /* SMALL TILDE */ ||
            /* 0x2--- */
            character == 8211 /* EN DASH */ ||
            character == 8212 /* EM DASH */ ||
            character == 8216 /* LEFT SINGLE QUOTATION MARK */ ||
            character == 8217 /* RIGHT SINGLE QUOTATION MARK */ ||
            character == 8218 /* SINGLE LOW-9 QUOTATION MARK */ ||
            character == 8220 /* LEFT DOUBLE QUOTATION MARK */ ||
            character == 8221 /* RIGHT DOUBLE QUOTATION MARK */ ||
            character == 8222 /* DOUBLE LOW-9 QUOTATION MARK */ ||
            character == 8224 /* DAGGER */ ||
            character == 8225 /* DOUBLE DAGGER */ ||
            character == 8226 /* BULLET */ ||
            character == 8230 /* HORIZONTAL ELLIPSIS */ ||
            character == 8240 /* PER MILLE SIGN */ ||
            character == 8249 /* SINGLE LEFT-POINTING ANGLE QUOTATION MARK */ ||
            character == 8250 /* SINGLE RIGHT-POINTING ANGLE QUOTATION MARK */ ||
            character == 8364 /* EURO SIGN */ ||
            character == 8482 /* TRADE MARK SIGN */
        );
    }
}
I have come up with 2 methods to assist in this requirement. One tells you simply if there is Unicode or not and what the Unicode characters are, and the other actually tells you the Unicode character number for each position. I keep these 2 in a Utility class that I can access from anywhere.
LOL thread confusion, ignore this message ;)
boBNunny, I like the change you made to the map, that'll save some space for sure!
-Richard
We should get a utility class together for code share. Anyone interested?
Thanks. I actually had a solution similar to yours, but you had a few efficiencies that I really liked, so (in the parlance of rap music) I "sampled" your code.
But I found I needed 2 different variations: one that just reports whether the string has Unicode or not, and one that returns all of the actual values.
But your code was definitely welcome and a great contribution.
I think a community share of Utility class methods is a good one.
Anyone else think so?
Hmm, from everything I read about Unicode while researching the code I wrote, Unicode is backwards compatible with standard ASCII. What cases were you running into where you had to specifically figure out whether it was Unicode?
The code should in theory take a standard ASCII character and just spit out the standard 1-byte value for it.
-Richard
Well, the problem arises if it's a character (Chinese, for example) that can't be represented in ASCII. Then it just shows as a ? if stripped. But also, we have a legacy system that can't accept Unicode, so if there is even 1 such character, the update to the legacy system will fail. Also, we have a requirement that Unicode be put into Localized fields mirroring those fields, so we can give a message to the user telling them to transfer the values first.
Ahh I could see how you would need that. I'm running into a similar problem so I might use your isUnicode method in an upcoming project if you don't mind.
That's the reason for these boards IMO. Great to have a virtual community to depend on. Glad I could help.
Unicode issues aside, I had a web service callout failure where the user copy/pasted text into SFDC that contained a simple backspace character (0x08). That is invalid according to the XML specification and the SOAP message could not be transported.
Had another time where the SOAP data contained a printable european character that was not supported by the Windows codepage. It made it through the SOAP transports and then threw an exception deep within a 3rd party DLL from IBM.
It's much easier to clean the string BEFORE transport than try to diagnose where it failed.
Agree. Sometimes you can have a hard space (ASCII 160) that can cause havoc with matching where trimming won't do it, so you need to strip those too. Or even the wide dash that looks like a regular dash. So those, you could fix during a matching call or even during storage. Another issue is when the Unicode version of an Alpha or Numeric character is used and it LOOKS like ASCII, but it's more than 255.
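A rough sketch of that kind of cleanup, assuming String.getChars() and String.fromCharArray() are available. The class name LookalikeCleanerSketch and the choice of replacements are mine, not something from this thread: 160 becomes a normal space, the en and em dashes (8211, 8212) become a plain hyphen, and the "fullwidth" copies of ASCII (65281 to 65374) are shifted down by 65248 to their ASCII equivalents.

```apex
public class LookalikeCleanerSketch {
    // Maps a few look-alike characters onto their plain ASCII counterparts.
    // Hypothetical helper; extend the branches for whatever your data contains.
    public static String normalize(String input) {
        if (String.isEmpty(input)) return input;
        List<Integer> cleaned = new List<Integer>();
        for (Integer code : input.getChars()) {
            if (code == 160) {
                cleaned.add(32);              // hard (no-break) space -> space
            } else if (code == 8211 || code == 8212) {
                cleaned.add(45);              // en dash / em dash -> hyphen
            } else if (code >= 65281 && code <= 65374) {
                cleaned.add(code - 65248);    // fullwidth form -> ASCII 33..126
            } else {
                cleaned.add(code);            // everything else passes through
            }
        }
        return String.fromCharArray(cleaned);
    }
}
```

Running this before a matching call (or before storage) means a later trim() or equality check sees ordinary spaces, dashes, and letters.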
I combined the ideas into this...
public static List<Integer> StringToIntegerList(String strInput) {
    List<Integer> charList = new List<Integer>();
    if (strInput == null || strInput == '') return charList;
    String strHex = EncodingUtil.convertToHex(Blob.valueOf(strInput));
    if (strHex == null || strHex == '') return charList;
    // Build map to convert hex to decimal
    Map<String,Integer> hexMAP = new Map<String,Integer>();
    for (Integer nLoop = 0; nLoop < 16; nLoop++)
        hexMAP.put('0123456789abcdef'.subString(nLoop, nLoop+1), nLoop);
    // Number of hex digits consumed per character (2 per byte)
    Integer increment = 2;
    for (Integer i = 0; i < strHex.length(); i += increment) {
        Integer out = 0;
        Integer c1 = (hexMAP.get(strHex.subString(i,i + 1)) * 16) + (hexMAP.get(strHex.subString(i + 1,i + 2)));
        Integer c2 = 0;
        Integer c3 = 0;
        Integer c4 = 0;
        if (c1 < 128) {
            // Single byte: plain ASCII
            out = c1;
            increment = 2;
        }
        else
        {
            if (c1 > 193 && c1 < 224) {
                // first of 2
                increment = 4;
            }
            if (c1 > 223 && c1 < 240) {
                // first of 3
                increment = 6;
            }
            if (c1 > 239 && c1 < 245) {
                // first of 4
                increment = 8;
            }
            c2 = (hexMAP.get(strHex.subString(i + 2,i + 3)) * 16) + (hexMAP.get(strHex.subString(i + 3,i + 4)));
            if (increment == 4) {
                out = (c1 - 192) * 64 + c2 - 128;
            }
            else if (increment == 6) {
                c3 = (hexMAP.get(strHex.subString(i + 4,i + 5)) * 16) + (hexMAP.get(strHex.subString(i + 5,i + 6)));
                out = (c1-224)*4096 + (c2-128)*64 + c3 - 128;
            }
            else if (increment == 8) {
                // The third byte has to be read here too, otherwise c3 is still 0
                c3 = (hexMAP.get(strHex.subString(i + 4,i + 5)) * 16) + (hexMAP.get(strHex.subString(i + 5,i + 6)));
                c4 = (hexMAP.get(strHex.subString(i + 6,i + 7)) * 16) + (hexMAP.get(strHex.subString(i + 7,i + 8)));
                out = (c1 - 240) * 262144 + (c2 - 128) * 4096 + (c3 - 128) * 64 + c4 - 128;
            }
        }
        // Keep only characters that pass both the XML and the Windows-1252 checks
        if ((out != 0) && IsLegalXmlChar(out) && IsLegalWindows1252(out))
            charList.add(out);
    }
    return charList;
}
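To make the arithmetic in that decoder concrete, here is a small worked example as anonymous Apex. It assumes, as the rest of this thread does, that Blob.valueOf produces UTF-8 and that convertToHex returns lowercase hex; the byte values are written out by hand rather than parsed so the formulas are easy to follow.

```apex
// Three-byte case: the euro sign (code 8364) encodes to e2 82 ac.
String euro = String.fromCharArray(new List<Integer>{8364});
System.assertEquals('e282ac', EncodingUtil.convertToHex(Blob.valueOf(euro)));
Integer e1 = 226; // 0xe2, first of 3
Integer e2 = 130; // 0x82
Integer e3 = 172; // 0xac
// (226-224)*4096 + (130-128)*64 + (172-128) = 8192 + 128 + 44
System.assertEquals(8364, (e1 - 224) * 4096 + (e2 - 128) * 64 + (e3 - 128));

// Two-byte case: e with acute accent (code 233) encodes to c3 a9.
String eAcute = String.fromCharArray(new List<Integer>{233});
System.assertEquals('c3a9', EncodingUtil.convertToHex(Blob.valueOf(eAcute)));
Integer a1 = 195; // 0xc3, first of 2
Integer a2 = 169; // 0xa9
// (195-192)*64 + (169-128) = 192 + 41
System.assertEquals(233, (a1 - 192) * 64 + (a2 - 128));
```

The result 8364 is the same EURO SIGN value that appears in the Windows-1252 list earlier in the thread.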
That works, but for our purposes, we needed to simply know if it was Unicode or not. Since this returns a list, we would have to traverse the list to find the numbers (my second method). My first returns a string of the illegal characters. So if it's not Unicode, the string length would be 0.
But, whatever works for anyone is great. There's really no one way to do anything.
I needed to make two minor tweaks to make the Map initialization loop work for me.
- The hex alpha values were lower case after convertToHex.
- The substring needs to span a string length of one, not zero.
Before:
string strHex = '0123456789ABCDEF';
Map<String,Integer> hexMAP = new Map<String,Integer>();
for (integer nLoop = 0; nLoop < 16; nLoop++) {
    hexMAP.put(strHex.substring(nLoop, nLoop), nLoop);
}
After:
Map<String,Integer> hexMAP = new Map<String,Integer>();
for (integer nLoop = 0; nLoop < 16; nLoop++)
    hexMAP.put('0123456789abcdef'.subString(nLoop, nLoop+1), nLoop);
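Here is a minimal sketch of that map in use, turning the whole hex string into a list of byte values (0-255). The class and method names (HexBytesSketch, toByteValues) are hypothetical; the toLowerCase() call is only there so that the case of the hex digits no longer matters.

```apex
public class HexBytesSketch {
    // Converts a hex string such as '417a' into byte values such as {65, 122}.
    public static List<Integer> toByteValues(String hexText) {
        // Single hex digit -> 0..15
        Map<String,Integer> hexMap = new Map<String,Integer>();
        for (Integer n = 0; n < 16; n++) {
            hexMap.put('0123456789abcdef'.substring(n, n + 1), n);
        }
        List<Integer> byteValues = new List<Integer>();
        String lower = hexText.toLowerCase();
        // Two hex digits per byte: high digit * 16 + low digit
        for (Integer i = 0; i + 1 < lower.length(); i += 2) {
            byteValues.add(hexMap.get(lower.substring(i, i + 1)) * 16
                         + hexMap.get(lower.substring(i + 1, i + 2)));
        }
        return byteValues;
    }
}
```

For example, HexBytesSketch.toByteValues(EncodingUtil.convertToHex(Blob.valueOf('Az'))) gives 65 and 122, since 'A' is 41 and 'z' is 7a in hex.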