Open-source Libraries From Our Java Web Stack

A while ago, I wrote about how to set up Guice, Jetty, Jersey, and Jackson and then how to calculate metrics about Jersey resource methods. We’ve subsequently open sourced some libraries to make it easy to use these (and other) techniques.

I’ll be describing each of our new libraries in turn, or you can skip to the end and look at a sample app to see them in action.

Jersey CORS Filter

CORS (Cross-Origin Resource Sharing) lets servers opt in to controlled cross-domain access to their resources from the browser. jersey-cors-filter eases the task of adding CORS headers to Jersey resource methods. In the simple case, you can annotate a resource method with @Cors and that’s all.

Having an @Cors annotation on the method or class will result in Access-Control-Allow-Origin and other CORS headers being set on the response:

@Path("foo")
public class FooResource {
    @GET
    @Cors
    public String get() {
        return "some data";
    }
}

If you watch the log output in the sample app, you’ll see that the request that should receive Access-Control-Allow-Origin logs the value it gets back for that header.

Jersey Metrics Filter

jersey-metrics-filter is a pre-packaged and improved version of the metrics-calculating technique described in an earlier blog post. In the default config, all Jersey resource methods will have timing and status code count metrics measured for them, but that can also be customized with the @ResourceMetrics annotation.

This method will have its timing measured, but its status codes will not be tallied:

@GET
@ResourceMetrics(statusCodeCounter = false, timer = true)
public String get() {
    return "stuff";
}

The sample app uses Metrics’ JmxReporter to make all metrics available via JMX, so be sure to open up jconsole to take a look.
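
For reference, wiring the reporter up yourself is only a few lines (a sketch assuming the Metrics 3.x API; registry stands in for whatever MetricRegistry jersey-metrics-filter is configured to write into):

import com.codahale.metrics.JmxReporter;
import com.codahale.metrics.MetricRegistry;

MetricRegistry registry = new MetricRegistry();
JmxReporter reporter = JmxReporter.forRegistry(registry).build();
reporter.start(); // timers and counters now appear in jconsole's MBeans tab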

Jersey Guice Dispatch Wrapper

jersey-guice-dispatch-wrapper is used by jersey-metrics-filter to directly wrap the invocation of Jersey resource methods so that timing information can be captured accurately. Most filtering needs can be accomplished with Jersey’s ContainerRequestFilter and ContainerResponseFilter, but if you need javax.servlet.Filter-style direct wrapping of request handling, this library will simplify the Jersey boilerplate.
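
For reference, here’s what that pair of Jersey 1.x filter interfaces looks like in use (a minimal logging sketch; these hooks run before and after the whole dispatch rather than wrapping the resource invocation itself, which is the gap this library fills):

import com.sun.jersey.spi.container.ContainerRequest;
import com.sun.jersey.spi.container.ContainerRequestFilter;
import com.sun.jersey.spi.container.ContainerResponse;
import com.sun.jersey.spi.container.ContainerResponseFilter;

public final class LoggingFilter implements ContainerRequestFilter, ContainerResponseFilter {
    @Override
    public ContainerRequest filter(ContainerRequest request) {
        // invoked before the resource method is dispatched
        System.out.println("request: " + request.getMethod() + " " + request.getPath());
        return request;
    }

    @Override
    public ContainerResponse filter(ContainerRequest request, ContainerResponse response) {
        // invoked after the resource method has produced a response
        System.out.println("response status: " + response.getStatus());
        return response;
    }
}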

Jetty HTTP Server Wrapper

jetty-http-server-wrapper provides a simple way to set up an embedded Jetty HTTP server with Guice Servlet. While you can certainly set up Jetty manually, this library lets you focus on configuration rather than the mechanics of wiring up a ServletContextHandler, etc. It also provides sane defaults for TLS ciphers and protocols.

Here’s a simple TLS HTTP server:

KeyStore keyStore = getServerKeystoreFromSomewhere();

HttpServerConnectorConfig httpsConfig = HttpServerConnectorConfig.forHttps("localhost", 8443)
    .withTlsKeystore(keyStore)
    .withTlsKeystorePassphrase("password");

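// httpServerFactory is assumed to be a Guice-injected HttpServerWrapperFactory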
httpServerFactory
    .getHttpServerWrapper(new HttpServerWrapperConfig().withHttpServerConnectorConfig(httpsConfig))
    .start();

URL Builder

url-builder provides a builder-style API for assembling correctly-encoded URLs. See this blog post for more details on why this isn’t easily done using the built-in Java libraries. A simple example:

UrlBuilder.forHost("http", "localhost", 8080)
    .pathSegment("foo")
    .queryParam("search", "some stuff")
    .toUrlString();

Jersey/New Relic Integration

jersey-new-relic lets Jersey resource requests have useful New Relic transaction names that include the value of the appropriate @Path annotations. Without this, all Jersey requests show up as being handled by ServletContainer (or GuiceContainer if you’re using Guice). It also informs New Relic of exceptions thrown during request processing.
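
Conceptually, the integration boils down to calls like these against the New Relic agent API (a sketch of the idea, not the library’s actual code):

import com.newrelic.api.agent.NewRelic;

// name the transaction after the matched @Path template instead of the servlet class
NewRelic.setTransactionName(null, "/foo/{id}");

// t: an exception caught during request processing
NewRelic.noticeError(t);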

Demo app

I’ve put together a runnable sample app that shows all of these libraries in use. Grab the code, execute gradle run, and open up jconsole to look at the generated metrics. If you have New Relic, follow the instructions in the README to have data recorded in New Relic as well.

It happens to be written in Groovy, but is easily translatable to Java. We use this style of service (embedded Jetty with simple main()-method startup) for all of our Java web services that do not have to be deployed as a .war for compatibility with existing systems.
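
A skeletal main() in that style might look like the following sketch. The Guice wiring is assumed: HttpServerWrapperModule and forHttp() are guesses by analogy with the snippet above, and MyServiceModule is a hypothetical module holding your resources and filters.

import com.google.inject.Guice;
import com.google.inject.Injector;

public class ServiceMain {
    public static void main(String[] args) throws Exception {
        Injector injector = Guice.createInjector(
            new HttpServerWrapperModule(), // assumed module name
            new MyServiceModule());        // hypothetical application module

        HttpServerConnectorConfig httpConfig =
            HttpServerConnectorConfig.forHttp("localhost", 8080); // forHttp assumed by analogy with forHttps

        injector.getInstance(HttpServerWrapperFactory.class)
            .getHttpServerWrapper(new HttpServerWrapperConfig().withHttpServerConnectorConfig(httpConfig))
            .start();
    }
}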

We’ve still got a few more things to release, but this is enough to get a useful service up and running. Let us know in the comments if you have suggestions for how any of these can be improved.

Creating URLs Correctly and Safely

Given how ubiquitous URLs are, they seem to be surprisingly poorly understood by developers, as evidenced by the plentiful questions on Stack Overflow about how to correctly build a URL. See this excellent post by Lunatech for more details about how URL syntax works.

Instead of going over URL syntax in detail (see RFC 3986, RFC 1738, the above-mentioned blog post, and W3 docs on HTML if you want the full story), I’m going to talk about how it’s been done wrong in commonly available libraries, and then finally how to do it right using url-builder, a Java library we’ve released for building correct URLs.

Sad tale #1: Java’s URLEncoder

This poorly-named class has a delightfully non sequitur first sentence of Javadoc.

Utility class for HTML form encoding.

One wonders why it’s named URLEncoder, then…

If you’ve read the Lunatech blog post, then you know by now that you cannot magically convert a URL string into a safe, properly encoded URL by running it through this class (or any other class), but just in case you haven’t done your homework, here’s a quick example.

Suppose you have an HTTP endpoint http://foo.com/search that takes a q query parameter whose value is the string to search for. If you search for the string You & I, then your first attempt at creating a URL to execute this search might result in http://foo.com/search?q=You & I. This won’t work because & is the token that separates query param name/value pairs. Furthermore, once you have this mangled URL string, there is nothing you can do to fix it since you cannot reliably parse it.

So, let’s use URLEncoder. The result of URLEncoder.encode("You & I", "UTF-8") is You+%26+I. The %26 will be decoded to a &, and a + in a query string is interpreted as a space, so that’ll work.

Now, suppose you want to assemble the path of the URL from your search string instead of putting it in the URL as a query parameter. http://foo.com/search/You & I is clearly invalid. Unfortunately, using the result of URLEncoder.encode() is also wrong. http://foo.com/search/You+%26+I will have a decoded path of /search/You+&+I since + is not interpreted as a space in the path of a URL.
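
The whole pitfall fits in a few lines (a quick demonstration; both outputs follow from the JDK’s documented encoding rules):

import java.net.URLEncoder;

String encoded = URLEncoder.encode("You & I", "UTF-8"); // throws the checked UnsupportedEncodingException
System.out.println(encoded);                            // prints: You+%26+I
String asQuery = "http://foo.com/search?q=" + encoded;  // OK: + decodes to a space in a query
String asPath = "http://foo.com/search/" + encoded;     // wrong: + is a literal plus in a path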

URLEncoder happens to work for some of the things you need to do. Unfortunately, its overly generic name makes developers likely to mistakenly use it in inappropriate ways, so unless you are specifically doing “HTML form encoding”, it is best to avoid it entirely lest future developers incorrectly extend your usage of it.

Sad tale #2: Groovy HttpBuilder and Java’s URI

HTTP Builder is a Groovy HTTP client library.

Making a basic GET request is easy enough:

new HTTPBuilder('http://localhost:18080').request(Method.GET) {
  uri.path = '/foo'
}

This sends GET /foo HTTP/1.1 over the wire, as it should. (You can verify this by running the code while nc -l -p 18080 is listening.)

Now let’s try a path that has a space in it.

new HTTPBuilder('http://localhost:18080').request(Method.GET) {
  uri.path = '/foo bar'
}

This sends GET /foo%20bar HTTP/1.1; still looking good.

Now, let’s suppose we want to have a single path segment that is foo/bar. We can’t just send the path as foo/bar because that will be interpreted as a path containing two segments foo and bar, so let’s try foo%2Fbar (replacing the / with its percent-encoded equivalent).

new HTTPBuilder('http://localhost:18080').request(Method.GET) {
  uri.path = '/foo%2Fbar'
}

This sends GET /foo%252Fbar HTTP/1.1. Not so good. The % in %2F has been re-encoded, so the decoded path will be foo%2Fbar, not foo/bar. It turns out that the blame here really lies with java.net.URI which is used in HTTP Builder’s URIBuilder class.

URIBuilder is the type of the uri property that’s exposed to the config closure in the above code samples. When you update the path of the uri via uri.path = ..., that ends up invoking a URI constructor which has this to say about the provided path:

If a path is given then it is appended. Any character not in the unreserved, punct, escaped, or other categories, and not equal to the slash character (‘/’) or the commercial-at character (‘@’), is quoted.

This is not very useful behavior since it effectively makes it impossible to provide a properly encoded path segment whose unencoded form contains reserved characters. In other words, it’s fallen prey to the fallacy of “I will just encode this string and then it will be correct”. Either the string is already correctly encoded, in which case there is nothing to be done, or it is not, in which case it is hopeless because it cannot be reliably parsed. The fact that the documentation says that it will not quote / means that it’s basically assuming the path string is simultaneously correctly encoded (uses / appropriately as a path segment delimiter) and also not correctly encoded (because other stuff needs to be encoded).
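
You can reproduce the problem with java.net.URI directly (a small demonstration; the multi-argument constructors treat the path as raw and quote the % themselves):

import java.net.URI;

// We want a single path segment whose decoded form is foo/bar, so we pass the
// pre-encoded path /foo%2Fbar to the (scheme, host, path, fragment) constructor...
URI uri = new URI("http", "localhost", "/foo%2Fbar", null); // throws the checked URISyntaxException
// ...but the constructor quotes the '%' itself:
System.out.println(uri.toASCIIString()); // prints: http://localhost/foo%252Fbar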

It would be nice if HTTP Builder didn’t use this broken part of URI, of course, but it would be even nicer if URI wasn’t broken to begin with.

Doing it right

We wrote url-builder to provide a simple way to make the sorts of URLs that developers typically need to assemble. It uses encoding rules from the references listed at the top of this article and a small fluent-style API. This usage example shows basically everything:

UrlBuilder.forHost("http", "foo.com")
    .pathSegment("with spaces")
    .pathSegments("path", "with", "varArgs")
    .pathSegment("&=?/")
    .queryParam("fancy + name", "fancy?=value")
    .matrixParam("matrix", "param?")
    .fragment("#?=")
    .toUrlString()

// produces:
// http://foo.com/with%20spaces/path/with/varArgs/&=%3F%2F;matrix=param%3F?fancy%20%2B%20name=fancy?%3Dvalue#%23?=

This example demonstrates the different encoding rules for different parts of the URL, like the fact that &= is allowed un-encoded in the path while ?/ are both encoded, yet = is encoded in the query param and ? is not since the query part has already started.

For more samples, see the tests and the UrlBuilder class.

Let us know if you find any improvements we can make to this library, or just to say that you find it useful!

Using whichinstance.com to Optimize EC2 Costs

We created whichinstance.com a while ago to make it easier to decide which EC2 pricing option to use. Since then, we’ve gotten some questions on how to use it most effectively. To clarify its usage, I’ll walk through a couple of examples.

First generation Standard Medium instance, 100% utilization, 8 months

Hypothetical scenario: continuous integration box for a medium-term project (8 months). Since it’s a CI box for your hardworking global team, it needs to be running 24/7.

In the controls at the top, configure your desired region, and then select “Standard 1st gen” (since this is an m1 instance) and Medium. We’ll leave days at 365 since 8 months is close enough to the default.

[Screenshot: region and instance type controls set to Standard 1st gen, Medium]

8 months is about 244 days, and surprisingly enough the 3 year light utilization reserved instance turns out to be the cheapest.

[Screenshot: cost comparison graph for the Standard Medium instance]

However, 1 year light, 1 year medium, and 3 year light are all almost exactly the same cost. 1 year light is the cheapest before 8 months, while 1 year medium is the cheapest after 8 months. If it were up to me, I’d pick the 1 year medium in case an 8 month project becomes a 10 month project, but the cost of all 3 is fairly close. Then again, if it turns out to be a 14 month project, the 3 year light would be the best choice.

Second generation Standard XL instance, 60% utilization, 1.5 years

Hypothetical scenario: server used to run reports during business hours.

I’ve set up this graph at 730 days and 60% utilization, and the cursor highlight is at 1.5 years (548 days).

[Screenshot: cost comparison graph for the Standard 2nd gen XL at 60% utilization]

This time, with fractional utilization taken into account, the heavy utilization reserved instances are nowhere near competitive price-wise, which makes sense. The 3 year light is the definite winner.

High CPU XL instance, 20% utilization, 2 months

Hypothetical scenario: short-term need for data processing once or twice a week.

[Screenshot: cost comparison graph for the High CPU XL at 20% utilization]

An on demand instance is dramatically cheaper than the other options. Lest the previous two examples make it appear that reserved instances are the way to go, this should serve as a reminder that a lot of the value that EC2 offers is in cost-effectively handling occasional workloads.

Practical XML Parsing With Java and StaxMate

Java has no shortage of XML libraries and APIs: common ones like DOM, SAX, StAX, and JAXB, plus more esoteric ones like XOM, VTD-XML, Castor, etc. Of the more low-level XML tools (as opposed to data binding or other high-level functionality), the most common ones are DOM, SAX, and StAX, and this article explains how to use StAX effectively with StaxMate.

Between DOM, SAX, and StAX, why would one use StAX? DOM loads the entire document into memory and constructs a tree. It’s easy to navigate the tree however you wish, but having the whole document be RAM-resident is impractical with larger documents. SAX is a streaming parser, so it doesn’t have the memory usage problems that DOM does, but it’s awkward to use for many XML parsing tasks. StAX is a newer API that provides a more convenient API than SAX while delivering competitive performance.

Though StAX is easier to use than SAX, it could be better, which is where StaxMate fits in. StaxMate is a library that uses a StAX parser under the hood to get closer to the goal of DOM-like ease of use.

Contrived Example

There’s a sample project on GitHub. We’ll walk through the XML and the code step by step to show what’s going on. Try running the unit tests (with mvn clean install) and make sure everything passes.

The XML this code parses describes animals and vegetables and the various ways in which one may eat them. The XML is rather strange, but this is intentional so that different types of parsing tasks can be demonstrated.

Initialization

private static final SMInputFactory FACTORY = new SMInputFactory(new WstxInputFactory());

SMInputFactory is the starting point to using StaxMate. We only ever want to have one SMInputFactory (and one StAX XMLInputFactory) since they are somewhat expensive to create, are threadsafe, and are intended to be re-used. Here, we’re using the Woodstox implementation of StAX. StAX, like SAX or JMS, is just a specification, so there are multiple different implementations. The JDK comes with one built in (SJSXP) but the Woodstox implementation is superior.

Starting to parse

Here’s the first bit of XML (see more):

<?xml version="1.0"?>
<deliciousFoods>
  <animals>
    <animal name="Pig">
      <meat>
        <name>Prosciutto</name>
      </meat>
      <meat>
        <name>Speck</name>
      </meat>
    </animal>
    ...

We’ve got a deliciousFoods root element and a list of animal elements inside a container animals element. The animal name is specified as an attribute, while the meat name is specified as a child element; the difference is simply so that we can see how to handle both usages of StaxMate.

The top level of the parsing code (from here):

public Food parse(InputStream xml) throws XMLStreamException {
    Food food = new Food();

    SMHierarchicCursor rootC = FACTORY.rootElementCursor(xml);

    try {
        rootC.advance();

        SMInputCursor rootChildCursor = rootC.childElementCursor();

        while (rootChildCursor.getNext() != null) {
            handleRootChildElement(food, rootChildCursor);
        }
    } finally {
        rootC.getStreamReader().closeCompletely();
    }

    return food;
}

First, we make a StaxMate cursor for the root element. Since StaxMate is, after all, based on StAX, it’s still a fundamentally streaming-style API, but, as you’ll see (especially if you’ve ever used SAX or StAX directly), it provides some nice helpers for common tasks.

Cursors start out before the first element, just like an iterator is before the first element in a collection until you call next() the first time. So, we advance() the root cursor, and now it should be at the root element (deliciousFoods).

The childElementCursor() call creates a cursor that is filtered to only expose the start of the first level of child elements, so in this case, the first thing it produces will be the start of the animals tag. We create a Food object that we’ll fill in as we parse, and loop over every child element of deliciousFoods. The result of getNext() doesn’t need to be inspected in our usage of it, since we know (because it’s a child element cursor) that the event is always START_ELEMENT. All we care about is that it’s not null. Once it’s null, the cursor has read all of its input.

Children of the root element

private void handleRootChildElement(Food food, SMInputCursor rootChildCursor) throws XMLStreamException {

    switch (rootChildCursor.getLocalName()) {
        case "animals":
            handleAnimals(food.getAnimals(), rootChildCursor.childElementCursor());
            break;
        case "vegetables":
            handleVegetables(food.getVegetables(), rootChildCursor.childElementCursor());
            break;
    }
}

In this case there are only two child elements, and we know what order they’re in, but to avoid being needlessly tied to a specific ordering of XML tags, we generally use a switch on the tag name (a handy Java 7 feature!) to make the code more flexible.

In each case, we simply get a child element cursor (which would iterate over each animal or vegetable tag) and proceed.

private void handleAnimals(List<Animal> animals, SMInputCursor animalsCursor) throws XMLStreamException {
    while (animalsCursor.getNext() != null) {
        animals.add(extractAnimal(animalsCursor));
    }
}

Here we use getNext() in a loop again to iterate over each animal and add each animal as it’s parsed to the list of animals.

Animals

private Animal extractAnimal(SMInputCursor animalsCursor) throws XMLStreamException {
    Animal a = new Animal();
    a.setName(animalsCursor.getAttrValue("name"));

    SMInputCursor meatsCursor = animalsCursor.childElementCursor();

    while (meatsCursor.getNext() != null) {
        Meat m = new Meat();
        SMInputCursor nameCursor = meatsCursor.childElementCursor().advance();
        m.setName(nameCursor.getElemStringValue());
        a.getMeats().add(m);
    }

    return a;
}

As a reminder, here’s the xml in question:

<animal name="Cow">
  <meat>
    <name>Clod</name>
  </meat>
  ...

Finally, the data we care about! We’re not worrying too much about making our business model (Animal in this case) be immutable, but we can at least make things like setName() be package-private so that the code that uses the result of our parsing can’t mess with the data.

Setting the name from the attribute is simple since the cursor is on the animal element already. Getting meat names is a bit more complicated since there’s an annoying extra level of XML hierarchy. The meat element isn’t doing anything other than holding the name element, so in a perfect world the XML wouldn’t look like this, but unfortunately this sort of XML is frequently encountered in the wild, hence its presence in our contrived example.

What we can do is create a child element cursor to iterate the children of animal (which will all be meat tags), and for each meat tag, we create another child element cursor. This last cursor will only traverse one element, the name element. Note that we still need to advance() that cursor so that we can get the element value of the name tag.

Another option (which I did not use in the sample code) is to use collectDescendantText() on the meatsCursor. If you’re sure that only one child element (name in this case) exists, you could call collectDescendantText to aggregate the text contents of the child nodes instead of creating a child element cursor.
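
That variant would look something like this (a sketch, assuming each meat element contains only the name element; trim() strips the indentation whitespace that also counts as descendant text):

while (meatsCursor.getNext() != null) {
    Meat m = new Meat();
    // aggregate all text beneath <meat>, i.e. the contents of <name>
    m.setName(meatsCursor.collectDescendantText(false).trim());
    a.getMeats().add(m);
}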

Vegetables

Animals are done; on to the vegetables.

The xml:

<vegetables>
  <vegetable>
    <name>Brussels sprouts</name>
    <preparations>
      <preparation>Sauteed</preparation>
      <preparation>Roasted</preparation>
      <preparation>Steamed</preparation>
    </preparations>
  </vegetable>

There’s a top-level loop just like the one for animals:

private void handleVegetables(List<Vegetable> vegetables, SMInputCursor vegetablesCursor) throws
    XMLStreamException {
    while (vegetablesCursor.getNext() != null) {
        vegetables.add(extractVegetable(vegetablesCursor));
    }
}

The structure of each vegetable is different from the animals. The name is a child tag, the ways to eat it are in a container element (preparations), and each preparation is just text in a node instead of being wrapped in a name or other extra element.

private Vegetable extractVegetable(SMInputCursor vegetableCursor) throws XMLStreamException {
    Vegetable v = new Vegetable();

    SMInputCursor vegChildCursor = vegetableCursor.childElementCursor();

    while (vegChildCursor.getNext() != null) {
        switch (vegChildCursor.getLocalName()) {
            case "name":
                v.setName(vegChildCursor.getElemStringValue());
                break;
            case "preparations":
                SMInputCursor preparationCursor = vegChildCursor.childElementCursor();
                while (preparationCursor.getNext() != null) {
                    v.getPreparations().add(preparationCursor.getElemStringValue());
                }
                break;
        }
    }

    return v;
}

Since name and preparations are at the same level, we need another switch. In the preparations case, we can use getElemStringValue() directly on the child element cursor without the extra layer of cursors that was used in the animal code.

Conclusion

While for a document this small DOM would have been perfectly acceptable, this code isn’t much different from what the DOM code would look like, and is still actually a little bit faster than DOM. There’s a very crude benchmark which at least on my laptop shows that running the StAX code above takes about 57 μs to parse the file, while simply loading the DOM Document takes 74 μs. In other words, parsing the document and populating the desired data model with StAX is faster than simply doing the parsing step alone with DOM, even for a document this small. I wouldn’t read too much into this, though — it’s a very crude benchmark (little other than making sure to warm up the JIT was done, so it suffers from all sorts of problems due to on-stack replacement of the for loops, etc.). Try using VisualVM’s Visual GC or jconsole while running the benchmark; the two parsers have somewhat different allocation patterns.

Custom Task Types With BenchPress

Time being the limited resource that it is, it took a little while to wrap up, but BenchPress is now open source.

BenchPress is intended to be able to represent many different types of payloads via simple JSON configuration, but the project is still new and it doesn’t (yet) have a lot of flexibility in terms of what users can do with the existing task definition language. Fortunately, it’s pretty straightforward to make your own custom task types, so in this post I’ll show how to make a “hello world” custom task type. You can also check out the sample code on GitHub.

BenchPress basics

The basic structure of the JSON you submit to the job controller is simple.

{
    "task": {
        "type": "HELLO-WORLD",
        "config": {
            # whatever you want
        }
    }
}

The config can be any JSON you want for your task type. The type is a semi-magical string that is used to identify the few classes that comprise a specific type of task; you’ll see how that string is used later.

TaskFactory and friends

I’ll go from the bottom up to explain the task execution structure. There are two types of nodes in BenchPress: worker and controller. Typically there is only one controller, but theoretically there could be many if you want. A job is submitted to the controller, which splits its sole task among the available workers. Each worker gets its own partition of the overall work.

Fundamentally, the work that a worker does is just a collection of Runnable instances. The Runnables are made on each worker by a TaskFactory instance. This is the relevant method of the TaskFactory interface:

Collection<Runnable> getRunnables(UUID jobId, int partitionId, UUID workerId,
    TaskProgressClient taskProgressClient, AtomicInteger reportSequenceCounter)
    throws IOException;

The method parameters represent the generic information available to every task — its parent job id, the id of the worker it’s running on, and some necessities for reporting progress back to the controller. In order to keep TaskFactory simple, the work of creating a TaskFactory has been pushed off to another interface, the TaskFactoryFactory (which is sure to drive Joel Spolsky nuts). The TaskFactoryFactory’s job is to create a TaskFactory given the JSON config, so its sole method is simply this:

TaskFactory getTaskFactory(ObjectReader objectReader, JsonNode configNode)
    throws IOException;

It’s up to you to read whatever you want out of the JSON and construct your flavor of TaskFactory.
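
For the hello-world task, that can be as small as the following sketch (the greeting config key and the HelloWorldTaskFactory constructor are illustrative, not part of BenchPress; we’ll see below how BenchPress finds this class):

final class HelloWorldTaskFactoryFactory implements TaskFactoryFactory {
    @Override
    public TaskFactory getTaskFactory(ObjectReader objectReader, JsonNode configNode) throws IOException {
        // pull whatever you need out of the task's "config" JSON
        String greeting = configNode.path("greeting").asText();
        return new HelloWorldTaskFactory(greeting); // hypothetical TaskFactory implementation
    }
}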

TaskPartitioner

The JSON handed to getTaskFactory pertains to one individual worker’s partition of the overall work. Since the task JSON is specific to each task type, the code to split up the original task into the per-worker partitions must necessarily be provided by the task type as well. So, we have the TaskPartitioner interface:

List<Partition> partition(UUID jobId, int workers, String progressUrl, String finishedUrl,
    ObjectReader objectReader, JsonNode configNode, ObjectWriter objectWriter) throws IOException;

The workers param is how many workers the task should be split for. The two URLs are needed to create a Partition, and the ObjectReader, JsonNode and ObjectWriter params let the implementation deserialize its configuration info, split as desired, and re-serialize.

Hooking up a custom task type

BenchPress needs to know which TaskFactoryFactory and TaskPartitioner to hand the config JSON to based on the contents of the type JSON field. The way this is done is with the com.palominolabs.benchpress.job.id.Id annotation and Guice multibindings. Annotate your TaskFactoryFactory and TaskPartitioner implementations (which might be just one class):

@Id("HELLO-WORLD")
final class HelloWorldTaskFactoryFactory implements TaskFactoryFactory {
...

@Id("HELLO-WORLD")
final class HelloWorldTaskPartitioner implements TaskPartitioner {
...

and add the Guice bindings:

public final class HelloWorldModule extends AbstractModule {
    @Override
    protected void configure() {
        Multibinder.newSetBinder(binder(), TaskFactoryFactory.class)
            .addBinding().to(HelloWorldTaskFactoryFactory.class);
        Multibinder.newSetBinder(binder(), TaskPartitioner.class)
            .addBinding().to(HelloWorldTaskPartitioner.class);
    }
}

Note that because your classes are instantiated by Guice, you are free to use @Inject on your TaskFactoryFactory and TaskPartitioner constructors if you need anything beyond the provided ObjectReader, etc.

Finally, you’ll need to tell BenchPress to use your custom module. You can do so by adding the jar for your custom code to the lib directories in the worker and controller tarballs and starting each service with an extra system property set to a comma-separated list of extra module names:

-Dbenchpress.plugin.module-names=com.foo.benchpress.helloworld.HelloWorldModule

Since both the controller and worker need the custom code (for the TaskPartitioner and TaskFactoryFactory, respectively), you’ll need to do this for both services.

Once that’s all done, you should be able to submit your job JSON to the controller and have it work. In the case of the sample “HELLO-WORLD” task type, you should see a logging message like this:

2012-08-17 14:32:43,111 [pool-5-thread-2] INFO  MDC[] c.p.b.e.h.HelloWorldTaskFactory - Greeting: Hello, world!

XPath, Selenium, and Safely Handling Strings

XPath Basics

XPath is a query language that operates on XML documents and offers a reasonably succinct way to find XML nodes. Unfortunately, XPath string literals have an unsophisticated syntax, so there’s a little extra work to be done to handle strings safely. I’ve released an xpath-utils library for Java to do this robustly.

The simple XPath //div will find a div anywhere in the document. You can also use attributes in your XPath. If you have XML like this

<foo>
    <bar attr="baz">asdf</bar>
    <bar attr="quux"></bar>
</foo>

then you could find the second bar tag with /foo/bar[@attr='quux'].

There are lots of XPath tutorials, so check those out if you’re curious.

Strings in XPath

XPath is useful when writing Selenium tests. Even though many Selenium selectors are better done with CSS selectors, XPath can express things that CSS cannot, so it’s useful to know how to use it. One common operation in Selenium is using text() to match against text node contents. You could match the first bar node in the above XML with //bar[text()='asdf']. However, to match against arbitrary strings, it’s important to be able to safely handle them, just like proper XML escaping is important when generating an XML document.

Strings are very limited in XPath. String literals can either be single-quoted strings that do not contain single quotes, or double-quoted strings that do not contain double quotes. To be more specific, the spec uses this grammar:

'"' [^"]* '"'
| "'" [^']* "'"

This means that we can represent It's-Its are delicious with the string literal "It's-Its are delicious" because the original string does not contain a ". Similarly, we can represent "To be or not to be" as the literal '"To be or not to be"'.

Unfortunately, when a string contains both ' and ", it gets a little messy. To represent "I'm hungry", we have to split the string on either ' or " and use concat() to stitch the string back together into an XPath expression that doesn’t need to use both types of quotes in each string literal. If we split on ', we get

concat('"I', "'", 'm hungry"')

and if we split on " we get

concat('"', "I'm hungry", '"')

You can use xpath-utils to do this concat-ification for you. The XPathUtils class in that library has some convenience methods, the most important of which is getXPathString. It takes a string input and returns an XPath string literal or expression as needed.

// XPath string literal: "foo ' bar"
String simpleCase = XPathUtils.getXPathString("foo ' bar");
// XPath expression: concat('""foo"', "'", '"bar""')
String complexCase = XPathUtils.getXPathString("\"\"foo\"'\"bar\"\"");
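
This is handy in Selenium, where locator strings often embed arbitrary text (a sketch; driver is assumed to be an initialized WebDriver):

import org.openqa.selenium.By;

String userText = "He said \"hi\" and didn't flinch"; // contains both quote types
driver.findElement(By.xpath("//bar[text()=" + XPathUtils.getXPathString(userText) + "]"));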

Another use of this technique beyond text() matching is matching CSS classes. It’s somewhat awkward to do so in XPath, but it exhibits how to correctly use safe string handling. If we have the following markup

<div id="content">
    <ul>
        <li class="thing"></li>
        <li class="thing selected"></li>
        <li class="thing"></li>
    </ul>
</div>

and we want to find the li with the selected class, we can’t just do //li[@class='selected'] because the class attribute isn’t an exact string match for the XPath string literal 'selected'. (Of course, the li.selected CSS selector would work fine here!) Instead, we can use concat() and friends to handle the case where the target class isn’t the only class on the node:

//li[contains(concat(' ', normalize-space(@class), ' '), ' selected ')]

It’d be nice if we could safely construct XPath even if we tried to use a CSS class that had quotes in it. A class like crazy"'class won’t match any HTML nodes, but that’s better than throwing an exception because our XPath statement didn’t parse! We can use another XPathUtils method, hasCssClass, to automatically generate the XPath boilerplate:

// contains(concat(' ', normalize-space(@class), ' '), concat(' crazy"', "'", 'class '))
String classXpath = XPathUtils.hasCssClass("crazy\"'class");

Of course, it also works fine on simple cases:

// contains(concat(' ', normalize-space(@class), ' '), ' selected ')
String classXpath = XPathUtils.hasCssClass("selected");

If you’re a Maven user, this is the dependency statement for the library.

<dependency>
    <groupId>com.palominolabs.xpath</groupId>
    <artifactId>xpath-utils</artifactId>
    <version>1.0.1</version>
</dependency>

Since this was just released, it may take a day or two for this to propagate to your Maven mirror.

OS X Keyboard Shortcut Issues With Chrome and IntelliJ

Ctrl-shift-space is a tremendously useful keyboard shortcut in IntelliJ IDEA, and I use it all the time. It provides “SmartType” completion. Sounds great, if a little vague… here’s an example.

List<String> strList = new

If you were to invoke SmartType completion, you would have ArrayList<String> suggested, which is probably exactly what you wanted: an implementation of the interface type. Great! Except that frustratingly often, the default shortcut (ctrl-shift-space) simply does not work on the Mac. There are several complaints about this issue with Apple, with Google, with Google again, and with JetBrains. It appears that there is some bad interaction between Chrome, the Chinese trackpad input method (presumably for drawing characters), and the default keyboard shortcut for that method, which is ctrl-shift-space.

Not running Chrome isn’t an option for me since I am frequently doing web-facing development work, but fortunately there’s a relatively easy workaround. Apparently, OS X still captures ctrl-shift-space even when the Chinese trackpad input method is disabled. As if that wasn’t weird enough, there’s no way to disable the keyboard shortcut either, but you can change it to something that you’ll never use for anything else.

  • Open System Preferences.
  • Open Language & Text. Note the disabled “Keyboard Shortcuts…” button.
  • Enable Chinese – Traditional. This will enable the button. If this isn’t enabled, the shortcut will not show up in the Keyboard pref pane, so you can’t edit it, even though it’s still active to a certain extent.
  • Click the button. It will take you to the Keyboard Shortcuts area of the Keyboard pref pane.
  • Change the “Show Hide Trackpad Hand…” shortcut to something improbable. I chose command-option-control-shift-backslash. For some reason, there’s no checkbox to disable that shortcut, even though all the other shortcuts have one. Also, if you have the default key bindings for Spotlight, you’ll see yellow shortcut conflict warnings. Don’t worry; those will go away once you disable Chinese – Traditional again.
  • You can now go back to Language & Text and disable Chinese – Traditional.

Ctrl-shift-space should then work fine in IntelliJ.

Introducing BenchPress: Distributed Load Testing for NoSQL Databases

Recently a client of ours posed an interesting question: they wanted to store many tens of thousands of objects per second, with each object needing several KiB of space, and they wanted to know which storage systems were capable of handling that much load. They were already using a SQL-based DBMS, but it was struggling to keep up with the load on high-end hardware. With the hardware at its limits and ever-increasing load looming, non-relational storage systems seemed like a good fit. The question was, what system could best handle their workload? To guide the search, we wanted a benchmarking tool that could efficiently and easily generate test workloads for a variety of different storage systems.

The target workload exceeded what can be pushed across a gigabit Ethernet interface, so the tool needed to be able to coordinate load generation across many nodes. We also wanted to be able to drive testing completely programmatically so that we could easily compare a variety of different workloads. Since we don’t have racks full of idle hardware lying around, we also wanted something that would be easy to spin up on EC2. And, of course, an easy setup procedure would be nice, especially when we want to get external teams up to speed quickly.

Current tool landscape

Grinder is one option for a benchmarking tool. It distributes work to multiple workers, and is mostly written in Java (user-defined workload scripts are written in Jython). Its GUI-based control structure is great for quickly putting together fixed test scenarios, but is awkward for the sort of vendor-neutral, programmatically-driven testing that we wanted to do. Using user-defined scripts to generate load is a flexible approach, but also requires a fair amount of user effort to build support for each type of storage system.

YCSB is another choice. It supports many different databases, and requires less setup work than Grinder, but it doesn’t deal with distributed workloads (aside from “run multiple clients at once and merge the results later”). Configuring the various clients is more labor intensive than we’d like, especially when such configuration needs to be manually re-done on each node. It can be controlled via shell commands and properties files.

Though YCSB was a closer match to our priorities than Grinder, we thought that there was room for improvement for the sort of testing that we wanted to do, so we decided to build a simple tool called BenchPress with a distributed-by-default approach, programmatic job control, and a focus on minimizing the time from downloading the tool to generating useful measurements. It’s not quite ready to be released to the world yet (though it will be soon!), but we’re nonetheless eager to get feedback on the approach we’re taking and what we can do to make the tool useful for other people’s performance investigations.

System structure

The goals we set out to achieve were these:

  1. Generate hundreds of MiB/s of load to a storage system.
  2. Work across a variety of storage systems.
  3. Capture basic performance statistics.
  4. Customize the generated data (what keys, values, etc. are used).
  5. Minimize the learning and setup required before useful data can be gathered.
  6. Build the simplest thing that could possibly work.

There are a few initial conclusions we can draw from these goals.

  • We need to distribute load across many worker nodes since we can’t assume anything faster than 1Gbit Ethernet.
  • Since there will be many worker nodes, we need to coordinate them. The simplest way to manage a cluster of nodes is to have just one centralized controller node. As long as we’re careful to limit the responsibilities of the controller, this isn’t a bottleneck, and HA isn’t really a concern for this task.
  • The controller would need to know which workers exist in the system, and though we could configure the controller with a list of workers, or vice versa, a service discovery approach would simplify setup.
  • We should avoid requiring any per-node setup beyond having a Java runtime, decompressing the BenchPress distribution, and running a startup shell script.
  • We’d like to allow common workloads to be defined declaratively so that users don’t have to write scripts or custom workload implementations.
  • The system should make performance data from all workers easily available to the user.

Given the above guidelines, we implemented the following structure.

  • Workers register their existence with ZooKeeper via Curator so that the controller can dynamically find workers at runtime. This means that the only setup that needs to be done to get BenchPress nodes up and running is to configure the workers and controller with the location of a ZooKeeper cluster. A single standalone ZooKeeper server is fine for this limited use, or you can just have the controller run an embedded ZooKeeper server.
  • Workers and the controller communicate via a simple REST API. There’s nothing performance critical happening between workers and the controller, so ease of use is a bigger priority than wire efficiency, and JSON over HTTP is as easy as it gets.
  • Workload job definitions are submitted to an endpoint in the controller’s REST API. A user can just use curl and a text editor, but programmatic job submission is also easy.
  • Workload job definitions contain all the configuration information needed to connect to the target storage system (e.g. for HBase, things like the ZooKeeper quorum). This means that no further setup needs to occur on each worker node to connect to HBase or MongoDB or any other system. It all just works out of the box.

Workload job flow

POSTing the following sample JSON to the controller’s API will start the job. In this case, the job will write 1,000,000 objects into MongoDB.

{
    "task": {
        "type": "MONGODB",
        "config": {
            "hostname": "127.0.0.1",
            "port": 27017,
            "dbName": "foo",
            "collectionName": "bar"
        },
        "op": "WRITE",
        "threads": 4,
        "quanta": 1000000,
        "batchSize": 1000,
        "keyGen": {
            "type": "WORKER_ID_THREAD_ID_COUNTER"
        },
        "valueGen": {
            "type": "ZERO_BYTE_ARRAY",
            "config": {
                "size": 10000
            }
        },
        "progressReportInterval": 10000
    }
}

Of note is that the user does not provide any implementation of how to write to MongoDB, and instead only provides (hopefully self-explanatory) configuration directives. This means that to perform the same test against an HBase cluster requires only small changes to the JSON (HBase needs a ZooKeeper quorum string, for instance). Though this declarative approach does constrain the user to the types of workload for which there are configuration options, we think that the user-friendliness is worth the tradeoff given the ease of implementing new storage systems and configuration options. Of course, custom job types are also possible.

The workload will be divided by the controller into partitions (one for each worker). Each worker will send performance data back to the controller as it proceeds through its partition of the work, and the controller aggregates all of this data and makes it available through the REST API.
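
Programmatic submission needs nothing more than an HTTP client. Here’s a sketch using only the JDK; the controller host, port, and /job path are hypothetical, so check the BenchPress docs for the actual endpoint:

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Paths;

public class SubmitJob {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://controller.example.com:7000/job"); // hypothetical endpoint
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "application/json");
        conn.setDoOutput(true);
        try (OutputStream os = conn.getOutputStream()) {
            os.write(Files.readAllBytes(Paths.get("job.json"))); // the job JSON shown above
        }
        System.out.println("HTTP " + conn.getResponseCode());
    }
}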

Future plans

We’ve got lots of ideas that we’d like to implement. A few of them are:

  • Support more storage systems (DynamoDB, Riak, …). Adding support for more systems is easy, so why not enable users to do more out of the box?
  • More workload configuration options. Some real-life workloads use lots of columns, for instance, so it would be good to be able to concisely configure a workload to generate large numbers of arbitrary columns.
  • Automated cluster setup via tools like Whirr. Though node autodiscovery is a good start, it would be even better to be able to start up a fully functional cluster with a Whirr recipe.
  • More performance analysis options. CSV export, permanent performance data storage, a graph-laden web UI, and streaming live job progress to external systems would all be useful.

If you have input on what features would be useful to you, let us know in the comments or via email. BenchPress is on GitHub for your browsing pleasure.
