About Marshall Pierce

Marshall specializes in highly tuned and immensely scalable web and mobile applications. Experienced in front-end web and iOS development, he constantly pushes the boundaries of the latest browsers and mobile platforms. He splits his time with back-end development, where he is considered a domain expert in Java concurrency, distributed systems, systems design, and network security. Prior to co-founding Palomino Labs, Marshall was director of software development at Ness Computing where he led their initial launch. Before Ness, Marshall was a senior software developer at Genius.com, where he built the best-in-class integration with Salesforce.com.

Open-source Libraries From Our Java Web Stack

A while ago, I wrote about how to set up Guice, Jetty, Jersey, and Jackson and then how to calculate metrics about Jersey resource methods. We’ve subsequently open sourced some libraries to make it easy to use these (and other) techniques.

I’ll be describing each of our new libraries in turn, or you can skip to the end and look at a sample app to see them in action.

Jersey CORS Filter

CORS allows better cross-domain sharing of Web resources. jersey-cors-filter eases the task of adding CORS headers to Jersey resource methods. In the simple case, you can annotate a resource method with @Cors and that’s all.

Read on

Creating URLs Correctly and Safely

Given how ubiquitous URLs are, they seem to be surprisingly poorly understood by developers as evidenced by the plentiful questions on Stack Overflow about how to correctly build a URL. See this excellent post by Lunatech for more details about how URL syntax works.

Instead of going over URL syntax in detail (see RFC 3986, RFC 1738, the above-mentioned blog post, and W3 docs on HTML if you want the full story), I’m going to talk about how it’s been done wrong in commonly available libraries, and then finally how to do it right using url-builder, a Java library we’ve released for building correct URLs.

Read on

Using whichinstance.com to Optimize EC2 Costs

We created whichinstance.com a while ago to make it easier to decide which EC2 pricing option to use. Since then, we’ve gotten some questions on how to use it most effectively. To clarify its usage, I’ll walk through a couple of examples.

First generation Standard Medium instance, 100% utilization, 8 months

Hypothetical scenario: continuous integration box for a medium-term project (8 months). Since it’s a CI box for your hardworking global team, it needs to be running 24/7.

Read on

Practical XML Parsing With Java and StaxMate

Java has no shortage of XML libraries and APIs: common ones like DOM, SAX, StAX, and JAXB, plus more esoteric ones like XOM, VTD-XML, Castor, etc. Of the more low-level XML tools (as opposed to data binding or other high-level functionality), the most common ones are DOM, SAX, and StAX, and this article explains how to use StAX effectively with StaxMate.

Between DOM, SAX, and StAX, why would one use StAX? DOM loads the entire document into memory and constructs a tree. It’s easy to navigate the tree however you wish, but having the whole document be RAM-resident is impractical with larger documents. SAX is a streaming parser, so it doesn’t have the memory usage problems that DOM does, but it’s awkward to use for many XML parsing tasks. StAX is a newer API that provides a more convenient API than SAX while delivering competitive performance.

Though StAX is easier to use than SAX, it could be better, which is where StaxMate fits in. StaxMate is a library that uses a StAX parser under the hood to get closer to the goal of DOM-like ease of use.

Contrived Example

There’s a sample project on GitHub. We’ll walk through the xml and the code step by step to show what’s going on. Try running the unit tests (with mvn clean install) and make sure everything passes.

The XML this code parses describes animals and vegetables and the various ways in which one may eat them. The XML is rather strange, but this is intentional so that different types of parsing tasks can be demonstrated.

Read on

Custom Task Types With BenchPress

Time being the limited resource that it is, it took a little while to wrap up, but BenchPress is now open source.

BenchPress is intended to be able to be able to represent many different types of payloads via simple JSON configuration, but the project is still new and it doesn’t (yet) have a lot of flexibility in terms of what users can do with the existing task definition language. Fortunately, it’s pretty straightforward to make your own custom task types, so in this post I’ll show how to make a “hello world” custom task type. You can also check out the sample code on GitHub.

BenchPress basics

The basic structure of the JSON you submit to the job controller is simple.

    "task": {
        "type": "HELLO-WORLD",
        "config": {
            # whatever you want

The config can be any JSON you wamt for your task type. The type is a semi-magical string that is used to identify a few classes that comprise a specific type of task; you’ll see how that string is used later.

TaskFactory and friends

I’ll go from the bottom up to explain the task execution structure. There are two types of nodes in BenchPress: worker and controller. Typically there is only one controller, but theoretically there could be many if you want. A job is submitted to the controller, which splits its sole task among the available workers. Each worker gets its own partition of the overall work.

Fundamentally, the work that a worker does is just a collection of Runnable instances. The Runnables are made on each worker by a TaskFactory instance. This is the relevant method of the TaskFactory interface:

Read on

XPath, Selenium, and Safely Handling Strings

XPath Basics

XPath is a query language that operates on XML documents and offers a reasonably succinct way to find XML nodes. Unfortunately, XPath string literals have an unsophisticated syntax, so there’s a little extra work to be done to handle strings safely. I’ve released an xpath-utils library for Java to do this robustly.

The simple XPath //div will find a div anywhere in the document. You can also use attributes in your XPath. If you have XML like this

    <bar attr="baz">asdf</bar>
    <bar attr="quux"></bar>

then you could find the second bar tag with /foo/bar[@attr='quux'].

XPath has lots of tutorials, so check those out if you’re curious.

Strings in XPath

XPath is useful when writing Selenium tests. Even though many Selenium selectors are better done with CSS selectors, XPath can express things that CSS cannot, so it’s useful to know how to use it. One common operation in Selenium is using text() to match against text node contents. You could match the first bar node in the above xml with //bar. However, to match against arbitrary strings, it’s important to be able to safely handle them, just like proper XML escaping is important when generating an XML document.

Read on

OS X Keyboard Shortcut Issues With Chrome and IntelliJ

Ctrl-shift-space is a tremendously useful keyboard shortcut in IntelliJ IDEA, and I use it all the time. It provides “SmartType” completion. Sounds great, if a little vague… here’s an example.

List<String> strList = new

If you were to invoke SmartType completion, you would have ArrayList<String> suggested, which is probably exactly what you wanted: an implementation of the interface type. Great! Except that frustratingly often, the default shortcut (ctrl-shift-space) simply does not work on the Mac. There are several complaints about this issue with Apple, with Google, with Google again and with Jetbrains. It appears that there is some bad interaction between Chrome, the Chinese trackpad input method (presumably for drawing characters), and the default keyboard shortcut for that method, which is ctrl-shift-space.

Not running Chrome isn’t an option for me since I am frequently doing web-facing development work, but fortunately there’s a relatively easy workaround. Apparently, OS X still captures ctrl-shift-space even when the Chinese trackpad input method is disabled. As if that wasn’t weird enough, there’s no way to disable the keyboard shortcut either, but you can change it to something that you’ll never use for anything else.

Read on

Introducing BenchPress: Distributed Load Testing for NoSQL Databases

Recently a client of ours posed an interesting question: they wanted to store many tens of thousands of objects per second, with each object needing several KiB of space, and they wanted to know which storage systems were capable of handling that much load. They were already using a SQL-based DBMS, but it was struggling to keep up with the load on high-end hardware. With the hardware at its limits and ever-increasing load looming, non-relational storage systems seemed like a good fit. The question was, what system could best handle their workload? To guide the search, we wanted a benchmarking tool that could efficiently and easily generate test workloads for a variety of different storage systems.

The target workload exceeded what can be pushed across a gigabit Ethernet interface, so the tool needed to be able to coordinate load generation across many nodes. We also wanted to be able to drive testing completely programmatically so that we could easily compare a variety of different workloads. Since we don’t have racks full of idle hardware lying around, we also wanted something that would be easy to spin up on EC2. And, of course, an easy setup procedure would be nice, especially when we want to get external teams up to speed quickly.

Current tool landscape

Grinder is one option for a benchmarking tool. It distributes work to multiple workers, and is mostly written in Java (user-defined workload scripts are written in Jython). Its gui-based control structure is great for quickly putting together fixed test scenarios, but is awkward for the sort of vendor-neutral, programmatically-driven testing that we wanted to do. Using user-defined scripts to generate load is a flexible approach, but also requires a fair amount of user effort to build support for each type of storage system.

YCSB is another choice. It supports many different databases, and requires less setup work than Grinder, but it doesn’t deal with distributed workloads (aside from “run multiple clients at once and merge the results later”). Configuring the various clients is more labor intensive than we’d like, especially when such configuration needs to be manually re-done on each node. It can be controlled via shell commands and properties files.

Read on