Create a Universally-Readable Dictionary

05 Jun 2013

You ever need to create a dictionary?

Of course you do! Heck, I do it all the time!

Oh, stop looking at me like that. I don’t just mean language dictionaries! Address books, blogrolls, bibliographies, calendar events grouped by date… anywhere you have a list of words/phrases/terms/names/etc. and their corresponding definitions/explanations/details/etc., you’ve got a dictionary.

For example, suppose you wanted to create a list of tourist attractions in your favorite city:

Cerro San Cristóbal: A really big hill on the north side of town.
La Vega:             Awesome place to buy veggies.
Patio Bellavista:    Great nightlife and dining.

I’m going to go ahead and take the liberty of assuming that your favorite city is Santiago de Chile, but if this is completely unacceptable to you, feel free to follow along with your own list.

Not too shabby. Now, let’s tweak it a little bit by putting everything in quotes:

"Cerro San Cristóbal": "A really big hill on the north side of Santiago."
"La Vega":             "Awesome place to buy veggies."
"Patio Bellavista":    "Great nightlife and dining."

And just so that it’s clear where each entry in the list ends, we’ll add a comma at the end:

"Cerro San Cristóbal": "A really big hill on the north side of Santiago.",
"La Vega":             "Awesome place to buy veggies.",
"Patio Bellavista":    "Great nightlife and dining."

And for good measure, let’s decorate it with some curly braces because they’re just so goshdarned pretty:

{
  "Cerro San Cristóbal": "A really big hill on the north side of Santiago.",
  "La Vega":             "Awesome place to buy veggies.",
  "Patio Bellavista":    "Great nightlife and dining."
}

Well, would you look at that — we now have ourselves a properly-formatted JSON object! Not only is it still perfectly readable by humans, but you can now feed it into a script in just about any language.

And since most JSON readers don’t mind a little whitespace, you can make it even prettier:

{
  "Cerro San Cristóbal":
    "A really big hill on the north side of Santiago.",

  "La Vega":
    "Awesome place to buy veggies.",

  "Patio Bellavista":
    "Great nightlife and dining."
}

With a little bit of JSON, you can easily create attractive definition lists that are readable both by humans and computers!

Keep Track of Progress

28 May 2013

Let’s pretend that you are tasked with writing a one-off script to process a massive amount of backlogged data (a few of you might not have to pretend very hard). Because this script is ultimately going to be run once and then discarded, it would be a waste to spend lots of time building error handling routines and making the process super robust — it would be far less expensive just to expect failure and come up with an efficient way to deal with it.

And if there’s one thing you can count on, it’s failure. Servers will go offline, files will disappear, hamsters will chew through the power cables… one way or another, you’re going to run into a situation where a script or job that you’ve written will not be able to complete its task successfully.

And then what?

Well, you’ve got two options:

Clear out all the partially-processed data (assuming that’s even possible!), run the job again and hope extra hard that it works this time!
Periodically save your progress to a file so that your program knows exactly where to leave off in the event of a crash.

I don’t know about you, but I am rather partial to option #2.

For the purposes of this example, let’s say that you manage a website with a huge number of hosted image files (cc-by-sa, of course). One day, you noticed that other website owners are using your images without properly crediting your site. I know that barely ever happens in real life, but bear with me here. p So you’ve gotta write a script that does two things:

Submit each photo to a reverse image search API to locate unauthorized uses of the image and log the offending websites to a file for later follow-up.
Apply a watermark to each image so that future content copiers will attribute the images correctly whether they want to or not.

Step 1 could potentially take awhile, and there are a lot of opportunities for something to get screwed up somewhere along the line. Network outages, connection timeouts, the remote server sends back a 503 — there are a lot of ways this process could fail.

Even the watermarking process could go boom. A corrupted or oddly-formatted image might cause the image processor to crash.

Since you can’t spend hours and hours making your script fail-proof, you’ll have to find a way to keep track of your script’s progress so that in the event that something does break, the program can recover gracefully (and something will break – usually right after you go home for the evening, wasting an entire night’s worth of processing time).

JSON to the rescue! Yeah, I bet you were wondering when I’d get to the point already.

Imagine you started off with a status.json file that looked something like this:

{
  "img_1856322": {
    "search":     false,
    "watermark":  false
  },
  "img_1856323": {
    "search":     false,
    "watermark":  false
  },
  ... and so on ...
}

How this file was generated is outside the scope of this article, but given that you’ve made it this far, I bet you’ve already thought of at least 3 devilishly clever ways to do it.

When your script starts up, it reads the JSON object from the status file and starts traversing it, looking for tasks that haven’t been completed yet. As it finishes each step, it updates the object with the results (and don’t forget to write the JSON back to status.json each time it is updated so that you don’t lose your status in the event that the entire script gets crash-y!).

After awhile, status.json starts to show signs of life (explanatory comments added):

{
  "img_1856322": {
    "search":     [],   // Oh good; nobody's using this image!  It's mine!  All mine!!
    "watermark":  true  // Watermark completed successfully (:
  },
  "img_1856323": {
    "search":     ["http://www.istealimages.com/buy_my_poster/totally_legit.jpg"],  // Somebody is abusing our image!  Release the hounds!
    "watermark":  ["DivideByZeroError", "Unexpected divide by zero in region 42."]  // Oops; something went wrong during the watermarking!  Wait... *unexpected* divide by zero?
  }
  ... you get the idea ...
}

Notice something else cool about status.json? It doubles as a log file once the script is finished running! Not only do you have a lean placemarker file for your script, but it’s also in a format that’s easy for a human to scan through.

For really complex jobs or periodic tasks, you will probably find that it is worth the time to integrate a fully-featured job manager and spend the extra effort tuning your script for robustness. But when you have a one-off task to crunch through, consider using a JSON-formatted status file so that both you and your script can keep a handle on things.

Get More Bang for Your Relational Buck

21 May 2013

From time to time, your data models might suffer from field creep — you have to keep adding more and more fields to your models to work around edge cases and unique business requirements.

In the world of relational databases, this usually manifests either as additional fields tacked onto an existing table, or a separate “detail” table with a one-to-one relation to its corresponding source model — and these are just the least complicated approaches; there are far more complex ways to address the situation!

Relational Schema 1

One approach to addressing the need for ever-more-specific-use-case fields: add more (increasingly ridiculously-named) fields to the table.

Relational Schema 2

An alternate approach: move all of those pesky “detail” fields into a separate table that you can hide away in the closet whenever company comes over.

In addition to cluttering up your ERD, you are also leveraging the power (and overhead!) of your relational database engine for things that don’t actually need it.

Your pet store application might never need to filter a select query by userstory183274_marker value, but thanks to user story 183274, you still have to store that information with each pet.

Perhaps there’s another approach, one that is more scalable and doesn’t clutter up your database tables with (almost) useless fields… and yet still makes this custom information accessible to your application.

Well, as it happens, there is! Combine all of those extra “detail” fields into a single detail field that uses – you guessed it – JSON to store key/value pairs!

Relational Schema 3

Much easier on the eyes!

Let’s peek at some values in our newly JSON-ified Pet table:

pet_id	owner_id	species_id	name	detail
1	1	1	‘Bowser’	`{"collar_manufacturer": "PetProtector", "rabies_flag": true}`
2	1	2	‘Oliver’	`{"has_name_tag": false, "music_preference": "Light Jazz"}`
3	2	1	‘Marcus’	`{"userstory183274_marker": "FILENOTFOUND"}`

Notice that in the time it took to draft up this table, we added a new feature: Musical Preference. Oliver likes Light Jazz.

Some of the benefits of storing “extra” data about your models this way:

You can add additional properties without having to alter your database schema.
Keeps your database schema tidy; only “essential” fields show up in your tables.

There are a few costs to this approach that you should be aware of, however:

It requires additional application logic to encode/decode the JSON in the detail field. This has implications both for performance and development time.
These fields are not indexable! If you need to search/filter by a property in a detail field, you will need to move it to its own column in the database table or use a system such as Lucene to index and search that value.
Extracting data from a detail field is an all-or-nothing operation; you have to pull the entire JSON string from the database and decode it in your application to find a specific property. This is usually not an issue for modern ORMs, since all columns are usually automatically pulled in every time they fetch a record from the database.

For storing one-off properties and flag values that you only need at the application level (i.e., you do not need to index/filter/search these values), think about whether you can store these values in a single JSON-formatted string instead of bloating your tables with extra fields.

And if you find that you’re using this feature a lot in your relational-database-powered application, you might want to consider whether a NoSQL solution such as CouchDB is a better fit for some or all of your data!

Write a Better Config File

20 May 2013

Anyone who’s been writing applications for awhile has run into the problem of making the application’s behavior configurable in different operating environments. From DSNs to administrator email addresses, it’s important to have a set of configuration values that live outside the codebase.

But how to store these values in a format that a computer can easily read and write?

A lot of solutions have been tried over the years, including INI, XML and YAML. Each of these formats has its strengths, but they all lack the combination of power and simplicity that JSON offers.

Consider a configuration file that stores database connection strings for different operating modes. Let’s start by seeing how you might write it with INI syntax:

[prod]
sql_username   = dbuser
sql_password   = dbpass
sql_hostname   = 192.168.113.172
sql_portnum    = 3306
sql_database   = app_database
redis_hostname = 192.168.113.173
redis_portnum  = 6379

[test]
sql_username   = dbuser
sql_password   = dbpass
sql_hostname   = 192.168.113.172
sql_portnum    = 3306
sql_database   = test_app_database
redis_hostname = 192.168.113.175
redis_portnum  = 6379

Well, it gets the job done, but it’s kind of… verbose. Plus, if that config file gets long enough, it will get really difficult to separate the different sections (prod, test, etc.).

What about an XML config file?

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE configurations SYSTEM "path/to/config.dtd">
<configurations>
  <configuration mode="prod">
    <option key="sql">
      <value key="username" type="string">dbuser</value>
      <value key="password" type="string">dbpass</value>
      ...
    </option>
    <option key="redis">
      <value key="hostname" type="inet_addr">192.168.113.173</value>
      <value key="portnum" type="integer">6379</value>
    </option>
  </configuration>
  <configuration mode="test">
  ...
  </configuration>
</configurations>

Yuck — and I thought INI was verbose! I didn’t even want to type the entire thing out, it got so huge! And by the way, good luck trying to store array values!

How about YAML?

---
prod:
  sql:
    username: dbuser
    password: dbpass
    ...
  redis:
    hostname:  192.168.113.173
    portnum:   6379

test:
  sql:
    username: dbuser
    password: dbpass
    ...
  redis:
    hostname:  192.168.113.175
    portnum:   6379

Not bad. Definitely easy for a human to read, and there are libraries in practically every programming language to read and write YAML files. But I’ll share with you a little secret: YAML is JSON-compatible!

That’s right; the above configuration file could also be written like this:

---
prod:
  sql:   {"username": "dbuser", "password": "dbpass", ...}
  redis: {"hostname": "192.168.113.173", "portnum": 6379}
test:
  sql:   {"username": "dbuser", "password": "dbpass", ...}
  redis: {"hostname": "192.168.113.175", "portnum": 6379}
p Well, if you're going to go and use a configuration file format that's compatible with JSON... why not use JSON?
pre.
{
  "prod": {
    "sql": {
      "username": "dbuser",
      "password": "dbpass",
      ...
    },
    "redis": {
      "hostname": "192.168.113.173",
      "portnum":  6379
    }
  }, 
  "test": {
    ...
  }
}

Not only do you get an easy to read syntax (for both humans and computers), but you also get the advantage of being able to explicitly supply a type for your configuration values.

Try to set a configuration value to null in an INI or YAML file (credit where it’s due, at least XML can handle this… if you don’t mind typing a dozen extra characters!).

What about true or false? In JSON, this is just part of the syntax:

{
  "prod": {
    "use_caching": true
  },
  "test": {
    "use_caching": false
  }
}

And with JSON libraries available in nearly every programming language (or supported natively if you are developing a Node.js app!), it’s incredibly easy to integrate a JSON configuration into your application!

Keep Function Parameters Tidy

19 May 2013

We’ve all been there before: You finished putting together an awesome class or function whose behavior can be configured down to the smallest degree, and now to wire up all the different configuration values:

function jsonapi(url, data, pre_execute_hook, success_hook, failure_hook, error_hook, post_execute_hook, async, method, jsonp) {
  ...
}

What a mess. Can you imagine trying to call that monstrosity?

jsonapi('http://www.example.com/rest/doSomethingAwesome', {"subscribe": ["peter", "henrietta", "xavier"]}, null, mySuccessHandler, myFailureHandler, myErrorHandler, null, null, null, true);

At least you have some job security — nobody is going to want to maintain that thing!

However, there’s a better way to organize your function parameters using JSON:

function jsonapi(options) {
  // Merge options with default values.
  options = $.extend(
    {
      // Standard Options
      'url':          '',
      // Hooks
      'pre_execute':  null,
      'success':      null,
      'failure':      null,
      'error':        null,
      'post_execute': null,
      // Advanced Options
      'async':        true,
      'method':       'post',
      'jsonp':        null,
      'data':         null
    },
    (options || {})
  );
}

Note that we are using the extend() function from the jQuery library to merge values in the options parameter.

When you are invoking the function, pass a JSON-encoded object as the value of the options parameter:

jsonapi({
  "url":      'http://www.example.com/rest/doSomethingAwesome',
  "data":     {"subscribe": ["peter", "henrietta", "xavier"]},
  "success":  mySuccessHandler,
  "failure":  myFailureHandler,
  "error":    myErrorHandler, 
  "jsonp":    true
});

You might have noticed several advantages to this method:

The values you are passing at invocation time are self-documenting — even without looking at the function signature, you can tell which option each value corresponds to.
You can omit any options that should use the default value — in the above example, the we did not have to pass a null value to the pre_execute, post_execute, method and async options; the function automatically uses default values if they are omitted from the JSON object passed to it.
Individual options can be documented in the function body itself. p One thing to be careful about though, is that the different options are no longer part of the function’s signature, so IDE tooltips and auto-completion won’t work. Make sure you document well your functions and the options they expect if you are using this technique!

Older Newer

JSON Community

Create a Universally-Readable Dictionary

Keep Track of Progress

Get More Bang for Your Relational Buck

Write a Better Config File

Keep Function Parameters Tidy