Escape the literal prison.
When you have more backslashes than words in the strings you are constructing, perhaps it’s time to look at another way of handling special characters like quotes.
The problem
You’re generating JavaScript code from another language with C-like quoting and escaping syntax (e.g. Java).
As part of that code generation, you’re substituting strings that have both double quotes (") and single quotes ('). My team’s actual problem stems from using XPath locators which include quoted text and sending that to Selenium so we can obtain the particular element that contains the text*.
Let’s say we want to pop up an alert within which are some double quotes:
String original = "Hello \"stranger\", how are you?";
String quoted = original.replaceAll("\"", "\\\"");
String script = "alert(\"" + quoted + "\")";
eval(script);
We then expect the browser to have this JavaScript code to process:
alert("Hello \"stranger\", how are you?")
and we then expect the alert to display the string:
Hello "stranger", how are you?
But why doesn’t it work?
Well, as you can already see, there’s lots of escaping of quotes going on so let’s analyze each line of the code:
- Needs to escape the quotes around the word stranger in a string literal for the Java compiler.
- Replaces all double quotes with a backslash and double quotes (or does it??). The intent is to inject the same escaped quotes for the benefit of the JavaScript interpreter, when it’s a string literal in JavaScript code. To do that we inject the backslash character to escape the quote. But the backslash character also needs to be escaped for the Java compiler so we escape it too.
- Then constructs another string which is the JavaScript function to execute in the browser, containing our escaped-quotes string. Again, we’ve had to escape quotes in the string literals for the Java compiler.
- Sends the command to the JavaScript interpreter (e.g. a browser) for evaluation.
What does the JavaScript interpreter do with the command we’ve asked it to evaluate? Something like this:
ERROR: Threw an exception: missing ) after argument list
All that trouble we went to in order to escape the quotes, and the escape character has disappeared! In fact, the result is identical to not having tried to inject the escape characters at all.
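You can confirm this without a browser by printing the intermediate string. The sketch below repeats the replaceAll() call from the example above; the class and method names (ReplaceAllDemo, naiveQuote) are made up for illustration.

```java
class ReplaceAllDemo {
    // Repeats the first attempt: replace each double quote with backslash + double quote.
    static String naiveQuote(String original) {
        // The Java literal "\\\"" is the two characters \" at runtime.
        // The regex replacement engine then treats that backslash as its own
        // escape character and emits only the quote that follows it.
        return original.replaceAll("\"", "\\\"");
    }

    public static void main(String[] args) {
        String original = "Hello \"stranger\", how are you?";
        String quoted = naiveQuote(original);
        System.out.println(quoted);                  // same text as original
        System.out.println(quoted.equals(original)); // true
    }
}
```

The string comes out unchanged, which is why the generated JavaScript still has unescaped quotes inside its string literal.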
What’s going on? Let’s try escaping the backslash:
String original = "Hello \"stranger\", how are you?";
String quoted = original.replaceAll("\"", "\\\\\"");
String script = "alert(\"" + quoted + "\")";
eval(script);
This works. Why? We’re using String replaceAll(). It seems that not only is backslash an escape character for the Java compiler, it is also an escape character for the regex Matcher. From the JavaDoc of Matcher replaceAll():
Note that backslashes (\) and dollar signs ($) in the replacement string may cause the results to be different than if it were being treated as a literal replacement string. Dollar signs may be treated as references to captured subsequences as described above, and backslashes are used to escape literal characters in the replacement string.
Hmm, that vital piece of information might have been useful in the JavaDoc for String replaceAll() which delegates to Matcher replaceAll(). Obviously we can’t rely on the Java standard libraries for names being obvious (Principle Of Least Surprise does not hold with these APIs). Let’s try another method, String replace(). After looking at the JavaDoc we see:
Replaces each substring of this string that matches the literal target sequence with the specified literal replacement sequence…
The word “literal” is used a couple of times here, so we might interpret that to mean the special backslash and dollar sign handling doesn’t apply here. Indeed, looking at the JavaDoc of the functionality String replace() delegates to, Matcher.quoteReplacement():
Returns a literal replacement String for the specified String. This method produces a String that will work use as a literal replacement s in the appendReplacement method of the Matcher class. The String produced will match the sequence of characters in s treated as a literal sequence. Slashes (‘\’) and dollar signs (‘$’) will be given no special meaning.
Apart from the fact that the author meant “backslashes” not “slashes”, this makes it obvious. Again, the JavaDoc for String replace() could have been more explicit and actually useful.
String original = "Hello \"stranger\", how are you?";
String quoted = original.replace("\"", "\\\"");
String script = "alert(\"" + quoted + "\")";
eval(script);
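As a sanity check, the variants discussed so far can be compared side by side. This is only a sketch: the class and method names are hypothetical, and the third variant (passing the replacement through Matcher.quoteReplacement() yourself) is an alternative I am adding for comparison.

```java
import java.util.regex.Matcher;

class QuoteEscapingDemo {
    // replaceAll(): the backslash must be escaped for the Java compiler
    // AND for the regex replacement engine, hence four backslashes.
    static String viaReplaceAll(String s) {
        return s.replaceAll("\"", "\\\\\"");
    }

    // replace(): the replacement is literal, so only the Java compiler
    // needs its escaping.
    static String viaReplace(String s) {
        return s.replace("\"", "\\\"");
    }

    // replaceAll() with the replacement made literal explicitly.
    static String viaQuoteReplacement(String s) {
        return s.replaceAll("\"", Matcher.quoteReplacement("\\\""));
    }

    public static void main(String[] args) {
        String original = "Hello \"stranger\", how are you?";
        System.out.println(viaReplaceAll(original));
        System.out.println(viaReplace(original));
        System.out.println(viaQuoteReplacement(original));
        // Each line prints: Hello \"stranger\", how are you?
    }
}
```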
The strategic lesson is to not rely on the name of a library method to mean what you think it means, nor to rely on the API documentation to tell you what you really need to know about using the method. Although this is exactly what you should be able to do, you can’t.
& or < characters (btw, I had to HTML encode the ampersand and less-than sign for the blog) or %HH hex-encoded values in the text, or an invalid document that has double decoded characters that the renderer takes as markup characters rather than character data (have you ever seen something like this on a web page… Ben &amp; Jerry’s).
So we’ve found out that at least three parts of our infrastructure (Java compiler, JavaScript interpreter, and the standard Java regular expression matcher) treat backslash specially. Are there more? Probably, and typically they’ll change as time goes by.
Perhaps the following approach goes from the frying pan into the fire (in which case you might not want to use something like it), but it uses a blend of the C-style escaping with encoding†…
Different encodings
Instead, let’s escape the literal prison and use a different mechanism to embed quotes in the string.
String original = "Hello \"stranger\", how are you?";
String quoted = original.replace("\"", "%22");
quoted = "unescape(\"" + quoted + "\")";
String script = "alert(" + quoted + ")";
eval(script);
Now line 2 uses the percent-encoded form of the double quote (%22, where 22 is its ASCII/UTF-8 hexadecimal value).
Line 3 wraps the string in a call to unescape(), which tells JavaScript to decode the %22 back to a double quote at runtime.
The escape() and unescape() functions are NOT for encoding values in a URI. They are now merely a way for JavaScript code to safely represent certain characters within the JavaScript context. URI encoding may have been their original intent, but encodeURI() and encodeURIComponent() are the correct functions to use for URI purposes.
Here, we are generating JavaScript and using a naive implementation of escape() so we can use the inbuilt JavaScript unescape() to decode the quotes at runtime.
Given we’re dealing with JavaScript, we can also use single quotes for string literals so we’d include the encoded version of them in our substitution too:
String quoted = original.replace("\"", "%22").replace("'", "%27");
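Replacing only the two quote characters still leaves gaps: a literal % or a backslash in the original text would be misread by unescape() or by the JavaScript string literal. A slightly less naive sketch is to percent-encode everything except ASCII letters and digits. The class and method names below are hypothetical, and it assumes the target JavaScript engine provides unescape() with its %HH and %uHHHH forms.

```java
class JsLiteralEncoder {
    // Encodes every character except ASCII letters and digits in the
    // %HH (or %uHHHH for code units above 255) form that unescape() decodes.
    static String encodeForUnescape(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            boolean plain = (c >= 'a' && c <= 'z')
                    || (c >= 'A' && c <= 'Z')
                    || (c >= '0' && c <= '9');
            if (plain) {
                out.append(c);
            } else if (c < 256) {
                out.append(String.format("%%%02X", (int) c));
            } else {
                out.append(String.format("%%u%04X", (int) c));
            }
        }
        return out.toString();
    }

    // Builds the JavaScript to evaluate; the encoded text contains no quotes
    // or backslashes, so the string literal needs no further escaping.
    static String alertScript(String message) {
        return "alert(unescape(\"" + encodeForUnescape(message) + "\"))";
    }

    public static void main(String[] args) {
        System.out.println(alertScript("Hello \"stranger\", how are you?"));
        // alert(unescape("Hello%20%22stranger%22%2C%20how%20are%20you%3F"))
    }
}
```

The generated script is harder to read than the two-replacement version, so whether the extra safety is worth it depends on how much you trust the text being substituted.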
Footnotes
*Why are you trying to find an element based on the text within it?
That is a long story but let’s imagine a website where the HTML structure is very poor, you want to run some functional tests against it, you don’t have access to the source of the application that generates the HTML, and you have only marginal influence on the third-party who writes that application.
In some unfortunate circumstances, there is nothing unique about an element except the text within it, or even worse, the text within a nearby element.
†Is it encode or escape?
Probably through a naming accident during JavaScript’s rushed original implementation, the escape() and unescape() functions would be better named encode() and decode().
“Escaping” refers to using a special character to have the compiler temporarily “escape” from processing other special characters and instead treat them as normal, literal characters.
You use the same characters, but just prefix them with the escape character. In C-like languages, the escape character in string literals is backslash.
Encoding refers to using a different form to represent the data. So JavaScript encodes a double-quote character as the string %22. The URI standard also encodes them as %22 and the HTML standard encodes them as the HTML entity &quot; or &#34; (as 34 is the decimal encoding of double-quote), and in C we could use \x22.
In these encoding schemes, the %, & and \ characters are used as escape characters to indicate to the relevant parser that it needs to deviate from its normal processing and go into “encoding” mode.
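To make the comparison concrete, here is a small sketch that prints the same character in three of these encodings. The class and method names are made up for illustration; only the output formats come from the schemes described above.

```java
class QuoteEncodings {
    // Percent encoding, as used by unescape() and in URIs: % followed by two hex digits.
    static String percent(char c) {
        return String.format("%%%02X", (int) c);
    }

    // HTML numeric character reference using the decimal code.
    static String htmlDecimal(char c) {
        return "&#" + (int) c + ";";
    }

    // C-style hexadecimal escape for use inside a string literal.
    static String cHex(char c) {
        return String.format("\\x%02X", (int) c);
    }

    public static void main(String[] args) {
        char quote = '"';
        System.out.println(percent(quote));     // %22
        System.out.println(htmlDecimal(quote)); // &#34;
        System.out.println(cHex(quote));        // \x22
    }
}
```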
Josh!!! That’s one of the first questions on the 5.0 certification exam – difference between replace() and replaceAll()
Wouldn’t it have been easier to simply use single quotes in JavaScript?
I usually go with single quotes in JavaScript unless it becomes specifically necessary to use doubles. Saves me a lot of trouble.
Oh, and I just remembered http://java.sun.com/javase/6/docs/api/java/util/regex/Pattern.html#quote(java.lang.String)