Given how ubiquitous URLs are, they are surprisingly poorly understood by developers, as evidenced by the many questions on Stack Overflow about how to build a URL correctly. See this excellent post by Lunatech for more details about how URL syntax works.
Instead of going over URL syntax in detail (see RFC 3986, RFC 1738, the above-mentioned blog post, and W3 docs on HTML if you want the full story), I’m going to talk about how it’s been done wrong in commonly available libraries, and then finally how to do it right using url-builder, a Java library we’ve released for building correct URLs.
Sad tale #1: Java’s URLEncoder
This poorly-named class has a delightfully non sequitur first sentence of Javadoc.
Utility class for HTML form encoding.
One wonders why it’s named URLEncoder, then…
If you’ve read the Lunatech blog post, then you know by now that you cannot magically convert a URL string into a safe, properly encoded URL by running it through this class (or any other class), but just in case you haven’t done your homework, here’s a quick example.
Suppose you have an HTTP endpoint http://foo.com/search that takes a q query parameter whose value is the string to search for. If you search for the string You & I, then your first attempt at creating a URL to execute this search might result in http://foo.com/search?q=You & I. This won’t work because & is the token that separates query param name/value pairs. Furthermore, once you have this mangled URL string, there is nothing you can do to fix it since you cannot reliably parse it.
So, let’s use URLEncoder. The result of URLEncoder.encode("You & I", "UTF-8") is You+%26+I. The %26 will be decoded to a &, and a + in a query string is interpreted as a space, so that’ll work.
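If you’d like to check this yourself, here is a minimal sketch of the query parameter case. The class and helper names (FormEncodingDemo, formEncode, formDecode) are made up for illustration; only URLEncoder and URLDecoder come from the JDK.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

public class FormEncodingDemo {

    // Hypothetical helper: form-encodes a single query parameter value.
    static String formEncode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every JVM is required to support UTF-8, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    // Hypothetical helper: reverses formEncode, treating '+' as a space.
    static String formDecode(String encoded) {
        try {
            return URLDecoder.decode(encoded, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String encoded = formEncode("You & I");
        System.out.println(encoded);                              // You+%26+I
        System.out.println("http://foo.com/search?q=" + encoded); // usable as a query string
        System.out.println(formDecode(encoded));                  // You & I
    }
}
```

Note that only the parameter value is encoded, never the whole URL string.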
Now, suppose you want to assemble the path of the URL from your search string instead of putting it in the URL as a query parameter. http://foo.com/search/You & I is clearly invalid. Unfortunately, using the result of URLEncoder.encode() is also wrong. http://foo.com/search/You+%26+I will have a decoded path of /search/You+&+I since + is not interpreted as a space in the path of a URL.
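The path case can be checked the same way. This sketch uses java.net.URI only to parse an already-assembled URL and read back its decoded path; the class and method names (PathDecodingDemo, decodedPath) are invented for the example.

```java
import java.net.URI;

public class PathDecodingDemo {

    // Hypothetical helper: parses a URL string and returns its percent-decoded path.
    static String decodedPath(String url) {
        return URI.create(url).getPath();
    }

    public static void main(String[] args) {
        // The search string was form-encoded and then (incorrectly) used as a path segment.
        String url = "http://foo.com/search/You+%26+I";
        System.out.println(decodedPath(url)); // /search/You+&+I
    }
}
```

The %26 comes back as &, but the + signs survive as literal plus characters, so the server would see You+&+I rather than You & I.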
URLEncoder happens to work for some of the things you need to do. Unfortunately, its overly generic name invites developers to use it in inappropriate ways, so it is best to avoid it entirely lest future developers extend your usage of it incorrectly (unless, of course, you are specifically doing “HTML form encoding”).
Sad tale #2: Groovy HttpBuilder and Java’s URI
HTTP Builder is a Groovy HTTP client library.
Making a basic GET request is easy enough:
new HTTPBuilder('http://localhost:18080').request(Method.GET) { uri.path = '/foo' }
This sends GET /foo HTTP/1.1 over the wire, as it should. (You can verify this by running the code with nc -l -p 18080 running.)
Now let’s try a path that has a space in it.
new HTTPBuilder('http://localhost:18080').request(Method.GET) { uri.path = '/foo bar' }
This sends GET /foo%20bar HTTP/1.1; still looking good.
Now, let’s suppose we want to have a single path segment that is foo/bar. We can’t just send the path as foo/bar because that will be interpreted as a path containing two segments foo and bar, so let’s try foo%2Fbar (replacing the / with its percent-encoded equivalent).
new HTTPBuilder('http://localhost:18080').request(Method.GET) { uri.path = '/foo%2Fbar' }
This sends GET /foo%252Fbar HTTP/1.1. Not so good. The % in %2F has been re-encoded, so the decoded path will be foo%2Fbar, not foo/bar. It turns out that the blame here really lies with java.net.URI which is used in HTTP Builder’s URIBuilder class.
URIBuilder is the type of the uri property that’s exposed to the config closure in the above code samples. When you update the path of the uri via uri.path = ..., that ends up invoking a URI constructor which has this to say about the provided path:
If a path is given then it is appended. Any character not in the unreserved, punct, escaped, or other categories, and not equal to the slash character (‘/’) or the commercial-at character (‘@’), is quoted.
This is not very useful behavior since it effectively makes it impossible to provide a properly encoded path segment whose unencoded form contains reserved characters. In other words, it’s fallen prey to the fallacy of “I will just encode this string and then it will be correct”. Either the string is already correctly encoded, in which case there is nothing to be done, or it is not, in which case it is hopeless because it cannot be reliably parsed. The fact that the documentation says that it will not quote / means that it’s basically assuming the path string is simultaneously correctly encoded (uses / appropriately as a path segment delimiter) and also not correctly encoded (because other stuff needs to be encoded).
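The double-encoding can be reproduced with java.net.URI alone, without HTTP Builder in the picture. The sketch below uses the four-argument URI constructor (scheme, host, path, fragment) purely for illustration; exactly which multi-argument constructor URIBuilder calls is an assumption not verified here, and the class and method names (UriQuotingDemo, withPath) are made up.

```java
import java.net.URI;
import java.net.URISyntaxException;

public class UriQuotingDemo {

    // Hypothetical helper: builds a URI from components, letting the
    // multi-argument constructor quote the path as it sees fit.
    static URI withPath(String path) {
        try {
            return new URI("http", "localhost", path, null);
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        // A space is quoted, which is what we want.
        System.out.println(withPath("/foo bar").getRawPath());   // /foo%20bar

        // An already-encoded slash has its '%' quoted again.
        URI doubled = withPath("/foo%2Fbar");
        System.out.println(doubled.getRawPath());                // /foo%252Fbar
        System.out.println(doubled.getPath());                   // /foo%2Fbar

        // A literal slash is left alone, so it is always a segment delimiter.
        System.out.println(withPath("/foo/bar").getRawPath());   // /foo/bar
    }
}
```

None of the three inputs yields a single segment whose decoded value is foo/bar, which is the crux of the complaint.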
It would be nice if HTTP Builder didn’t use this broken part of URI, of course, but it would be even nicer if URI wasn’t broken to begin with.
Doing it right
We wrote url-builder to provide a simple way to make the sorts of URLs that developers typically need to assemble. It uses encoding rules from the references listed at the top of this article and a small fluent-style API. This usage example shows basically everything:
UrlBuilder.forHost("http", "foo.com")
    .pathSegment("with spaces")
    .pathSegments("path", "with", "varArgs")
    .pathSegment("&=?/")
    .queryParam("fancy + name", "fancy?=value")
    .matrixParam("matrix", "param?")
    .fragment("#?=")
    .toUrlString()
// produces:
// http://foo.com/with%20spaces/path/with/varArgs/&=%3F%2F;matrix=param%3F?fancy%20%2B%20name=fancy?%3Dvalue#%23?=
This example demonstrates the different encoding rules for different parts of the URL, like the fact that &= is allowed un-encoded in the path while ?/ are both encoded, yet = is encoded in the query param and ? is not since the query part has already started.
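To make the “different rules for different parts” idea concrete, here is a rough sketch of per-component percent-encoding. It is not url-builder’s implementation: the class, the method names, and the two sets of safe characters are assumptions based on one reading of RFC 3986, and the library’s real rules may differ in the details (matrix parameter delimiters, for instance).

```java
import java.nio.charset.StandardCharsets;

public class ComponentEncodingSketch {

    // Assumed safe punctuation for a path segment: unreserved, sub-delims, ':' and '@'.
    private static final String PATH_SEGMENT_SAFE = "-._~!$&'()*+,;=:@";

    // Assumed safe punctuation for a query parameter name or value: as above, plus
    // '/' and '?', minus '&' and '=' (pair delimiters) and '+' (form-encoded space).
    private static final String QUERY_PARAM_SAFE = "-._~!$'()*,;:@/?";

    private static final String HEX = "0123456789ABCDEF";

    static String encodePathSegment(String segment) {
        return percentEncode(segment, PATH_SEGMENT_SAFE);
    }

    static String encodeQueryParam(String nameOrValue) {
        return percentEncode(nameOrValue, QUERY_PARAM_SAFE);
    }

    // Percent-encodes every UTF-8 byte that is neither an ASCII letter or digit
    // nor listed in safePunctuation.
    private static String percentEncode(String raw, String safePunctuation) {
        StringBuilder out = new StringBuilder();
        for (byte b : raw.getBytes(StandardCharsets.UTF_8)) {
            int c = b & 0xFF;
            boolean alnum = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || (c >= '0' && c <= '9');
            if (alnum || safePunctuation.indexOf(c) >= 0) {
                out.append((char) c);
            } else {
                out.append('%').append(HEX.charAt(c >> 4)).append(HEX.charAt(c & 0x0F));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(encodePathSegment("&=?/"));         // &=%3F%2F
        System.out.println(encodePathSegment("foo/bar"));      // foo%2Fbar
        System.out.println(encodeQueryParam("fancy + name"));  // fancy%20%2B%20name
        System.out.println(encodeQueryParam("fancy?=value"));  // fancy?%3Dvalue
    }
}
```

The important design point is that each component is encoded separately, before assembly, with the rules for its position; the delimiters are added afterwards and never pass through an encoder.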
For more samples, see the tests and the UrlBuilder class.
Let us know if you find any improvements we can make to this library, or just to say that you find it useful!