The technology that powers Cue

How Apple’s Garbage Collector Trashed Our iOS App

A Quick Disclaimer:
I have confirmed this issue to be fixed in [redacted]. Even as you continue to support iOS 6, you should think carefully about releasing this code into production, as we have received a small number of crash reports that appear to originate in CFURLCache (on the order of 4 in 1,000 launches).

Ever since we started formally profiling memory usage in our app, Leaks had been finding problems in our HTTP caching code. The traces always seemed to be coming from iOS system calls, and since the amount leaked was insignificant — a kilobyte here and there — it was never a priority to track it down.

Fast forward to our big December app update, in which our primary goal was to reduce network usage and to speed up load times. We forked our install of SDURLCache, tweaked storage capacities, tracked down failures in our NSURLProtocol logic, and created unit tests for all of our worst pathological cases. In the end, we cut our network usage in half and drastically increased the loading speed of all content detail pages!

Unfortunately, as our caching got better, our memory leak had gotten worse. A lot worse. Crash-your-app-in-five-minutes worse.

So we took another look; I mean we really took another look. We tried various patches to SDURLCache, we tried vanilla NSURLCache, we even tried manually parsing Cache.db. All these solutions were spotty, unstable, or leaky (or all of the above).

After a couple days of sadness and futility, we eventually found our way to a discussion over at https://github.com/steipete/SDURLCache/issues/7, where the inimitable Saurik had dropped in and disassembled CFCachedURLResponse, because that’s what Saurik does.

The source is interesting stuff, and I’d encourage you to read the whole thing, but for our purposes let’s focus on this line:

CFMakeCollectable(CFRetain(data)).

You could be forgiven for not recognizing CFMakeCollectable. Objective-C’s awkward Garbage Collection phase is something we’re all trying to forget, after all. This function transfers ownership to the Garbage Collector if one exists, which makes the above line equivalent to [[data retain] autorelease].

However, as Saurik eventually uncovered, a reference-counted environment such as iOS will convert CFMakeCollectable into a no-op, leaving you with an unmatched retain. A Memory Leak!

So now that we've discovered a bona fide bug in Apple's code, it’s time to get to work. If you look at Saurik’s entire disassembly of the accessor method, you’ll see that this is not a simple case of throwing in an extra release and calling it a day. Based on the internal state of the object, this method will do one of two things. It might return the incorrectly retained object, or it might return a properly autoreleased copy, which will crash your app if you try to release it. We have literally no way of knowing beforehand which it will do.

So here’s what we did!

Before we do anything else, we run a quick check to verify the leak is still present in this version of iOS. Now that Apple has fixed the bug, we don't want network access to insta-crash our app.

1
2
3
4
5
6
7
8
+ (void)cueInstallDataFixIfNeeded;
{
    if (needsWorkaround()) {
        Method oldMethod = class_getInstanceMethod([NSCachedURLResponse class], @selector(data));
        Method newMethod = class_getInstanceMethod([NSCachedURLResponse class], @selector(cueSwizzledData));
        method_exchangeImplementations(oldMethod, newMethod);
    }
}

We then create a wrapper around the accessor and call the original method twice. If we get the same pointer both times, we know that we’ve hit the bad branch and need to call release twice. If the pointers don’t match, we know that we’ve hit the good path, and we can go about our business. Memory leak fixed.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
- (NSData *)cueFixedData;
{
    @synchronized(self) {
        NSData *retval = [self cueSwizzledData];
        @synchronized (retval) {
            NSInteger count = [retval retainCount];
            if (retval == [self cueSwizzledData]) {
                if ([retval retainCount] > count) {
                    [retval release];
                    [retval release];
                }
            }

        }
        return retval;
    }
}

So there you have it! A very messy bug (mostly) neutralized thanks to the Objective-C runtime. You’re welcome to use it, but be aware that this kind of hacking is very touchy stuff, and you should think/test long and hard before you consider releasing it into production, as we won’t be responsible if your app crashes.

You can download a simple repro of this leak here.

Thanks for reading! If you’re excited about optimizing iOS, check out our jobs page or send us an email at jobs+ios@cueup.com. We’re hiring Python and Java hackers as well.

Acknowledgements:
@rs for creating SDURLCache
@saurik for being a true rockstar
@bogardon for first implementing the fix
@steipete for all his work and research
@mattt for pushing to get this post out the door

References:
Repro of memory leak
Full discussion of memory leak
Apple Developer Forums discussion (requires developer account)
Discussion of NSURLCache and why it doesn’t work
Apple’s GC Documentation

TheKitchenSync - a Tool Belt for iOS Concurrency

Today we're happy to open source TheKitchenSync: a set of utilities, locks, and thread-safe collections designed to make iOS better. Check out how to get started in under 5 minutes here.

Concurrency is hard.

Some would say Threads are Evil, but we'll just stick with Concurrency is hard.

Much like sharing your toothbrush, you shouldn't share application state without taking very extensive precautions and accounting for every edge case. Objective C is like an anonymous toothbrush exchange. The sharing part is easy, but the accounting part is very very hard. We'd like to make it easier.

CueSyncCollections

One piece of Java's API that we've always liked is Collections.synchronizedMap. If you know that a collection will be accessed in multiple threads and isn't in the critical path, why would you waste neurons on synchronization logic for it? And what if human error gets its way and you forget a synchronize call? Doesn't it make more sense to reduce the cognitive overhead and abstract all of this into the class itself?

We've taken that principle and applied it to Objective C with the creation of CueSyncArray, CueSyncDictionary and CueSyncSet. These classes present most of the interface of their unsafe counterparts (including the subscript operator!), but provide their own locking to ensure thread-safety. They will also retain and autorelease any returned value to guarantee no dangling pointers.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
CueSyncArray *syncArray = [@[@"some", @"items"] cueConcurrent];
CueSyncDictionary *syncDict = [@{ @"key" : @"val" } cueConcurrent];
[self cuePerformBlockInBackground:^{
  [syncArray insertObject:@"more" atIndex:1];
  syncDict[@"key2"] = @"val2";
}];

int i = 0;

// We cannot guarantee safety during normal for-loops, but foreach loops are safe.
for (id obj in syncArray) {
  // Since we copy the array before enumerating, you can even mutate mid-loop!
  syncArray[i++] = @"I've been mutated!";
}

Tasks

While Grand Central Dispatch is great, we found it too rigid and too opaque for some of our needs. With that in mind we created CueTaskQueue. A queue can be started with any number of worker threads, and it exclusively maintains those threads until it is stopped (naturally, if you call startWithThreadCount:1 then you don't have to worry about synchronization).

In addition to predictable threading behavior, the task queue also allows sophisticated deduping of tasks. In our case, we wanted to get rid of unnecessary render calls on our home screen, so we allowed each object a single queued render task at any given time, and "collapsed" the rest. This can be done either by specifying a key (usually a NSString or NSNumber), or by implementing CueTaskQueueDelegate.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
CueTaskQueue *queue = [[CueTaskQueue alloc] initWithName:@"PewPewQueue"];

// relatively low-priority queue.
queue.threadPriority = 0.3;

// Run 100 times, queuing tasks of key 0-9, but only allowing one of each.
for (int i = 0; i < 100; ++i) {
    int taskKey = arc4random() % 10;
    [queue addTask:[CueBlockTask taskWithKey:@(taskKey) priority:1.0f executionBlock:^(CueTask *task) {
        NSLog(@"Task %d reporting for duty!", taskKey);
    }]];
}

// Since the queue deduped its tasks by key, we shouldn't have more than 10 tasks in the queue when it starts.
[queue startWithThreadCount:1];

// Executes remaining tasks, cleans up, and removes the thread.
[queue finish];

Locking

One glaring omission in Objective C is the read/write lock. NSReadWriteLock seemed like a pretty obvious feature, but for whatever reason, it never came. To that end, we've created CueReadWriteLock with the assistance of CueFairLock. It is worth mentioning that we first tried POSIX's pthread_rwlock_t but opted instead for the convenient deadlock detection you get when you build on NSFoundation locks.

A quick note: In benchmarking, CueReadWriteLock proved to be far slower than normal or recursive locks. We are working on improving it, but in the meantime think hard about whether you want to incur this overhead.

Stacking the Odds

One awesome trick in C++ is the RAII pattern, which uses stack-based objects' destructors to enforce some behavior in the current scope, and then guarantee clean-up afterwards. Using the Chimaeric wonder that is Objective C++, We've created the CueStackLock system. An example:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
- (BOOL)unsafeMethod;
{
    // Guarantees unlock at end of method call, even if exception is thrown.
    CueStackLock(_lock);

    // Also works with CueReadWriteLock!
    CueStackLockWrite(_rwLock);

    if ([self needsToBreak]) {
        // Don't need to remember to unlock first!
        return NO;
    }
    return [self throwAnException];
}

If you're familiar with the C++11 std::lock_guard, then you should feel right at home with this. In fact we've even created a CueStackLockStd() wrapper macro that is compatible with std::mutex!

CueLoadingCache

One thing we realized in our usage of dictionaries was that they were often being used to cache file system reads, or some other synchronously derived value. To aid in this use-case, we created the CueLoadingCache. It shares most of the interface of an NSMutableDictionary, but allows the user to specify a custom transform, which it will use to dynamically and safely generate its contents. It's not unlike Guava's CacheBuilder.

1
2
3
4
5
6
7
8
9
10
CueLoadingCache *multiplyByTwo =
[[CueLoadingCache alloc] initWithLoader:^id(id key) {
    return @([key intValue] * 2);
} isMemorySensitive:YES];

// "0 * 2 = 0"
// "1 * 2 = 2"
for (int i = 0; i < 10; ++i) {
    NSLog(@"%d * 2 = %@", i, multiplyByTwo[@(i)]);
}

Tying up loose threads

Because Concurrency is hard, we'd love to hear your feedback. Feel free to send us a pull request if you spot a bug, or just drop us a line if you like what we're doing!

As a closing note: There are many languages that address these problems by not sharing state at all. If you're interested in how other languages tackle this problem, you should really check out Erlang and Go.

And of course, if you want to work on this stuff for a living: We are hiring.

Daemon Showdown: Upstart vs. Runit vs. Systemd vs. Circus vs. God

We write a lot of daemons: programs which run on servers in the background, like an HTTP server, or a database. Once we've written the programs, though, we have to run them, and running programs as daemons is surprisingly heavy on details; it's fraught with perils for the unwary. If you go the traditional Unix way, you do some magic with double-forking and pid files and init scripts, and it's horribly tedious. If you run in a detached session of screen or tmux, you end up doing everything manually. In both cases, there's nothing in place to restart your program if it crashes. Steve Huffman, one of Reddit's founders, will tell you from experience just how bad an idea this is:

In the early days of Reddit, we didn’t really have any crash protection. We had tons of errors, and Reddit would occasionally lock up, freeze, or get in an infinite loop, any number of ways of bringing down the site. I used to have to sleep with my laptop and I would wake up every couple of hours and see if Reddit was working, and restart it. It was the worst feeling in the world; totally mentally draining.

To reclaim their ability to sleep, they had the computer keep watch for them, resurrecting dead processes. This also let them automate health checks: when Reddit's front page became unresponsive, they could detect this automatically and simply kill the offending process, secure in the knowledge that the supervisor program would quickly bring it back up again. They could sleep right through this fleeting death and rebirth, and only a few people would notice.

There are some good tools for keeping programs running as daemons, and all of them make it easy for you. If you're on Ubuntu, Upstart is the default, and it's a good choice. If you want something less monolithic, Runit is featureful and appealingly Unixy. Several Linux distributions use systemd as their init, and for our purposes it's comparable to Upstart. Other interesting options include Supervisord, God, and Circus. Unfortunately there is no clear best answer here, so which one you want depends on your needs. Let's compare.

Upstart

Used as the default init replacement on Ubuntu, Upstart is very good at starting things up. The Upstart Manual can be kind of overwhelming, but the config files themselves are pretty simple. For example:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
# myserver - it's my server
#
# myserver listens on a port and, like, does things. Great things.

description     "My cool server"

start on runlevel [2345]
stop on runlevel [!2345]

respawn
respawn limit 10 5

setuid myuser
exec /usr/local/bin/myserver --port 8080

Let's walk through this. The comments at the top customarily include a one-line description and an optional longer description below – a lot like a git commit message. The description line is, again, for your own use. The start on runlevel [2345] part means to start the service when the machine is started. The respawn and respawn limit commands tell Upstart to restart the server if it dies, but stop trying to restart it after 10 failures in 5 seconds. Finally, Upstart sets the uid to myuser and runs the server. Make sure to setuid, because if you don't include it, your server will be run as root by default. This is a profoundly weird default – it's usually not what you want, and it's very dangerous. So, be careful with this.

Upstart expects myserver not to fork. Any output from myserver on stdout or stderr gets sent, by default, to a log file in /var/log/upstart/. Both of these things can be changed; Upstart can handle the classic forking style of daemons, and you can plug in other options for logging.

Upstart is one of the best options for running traditional forking daemons, since it actually uses ptrace to make sure it's watching the correct process ids. The system for this is very slick.

Runit

Upstart, while admirably simple to use, does have some shortcomings. Its default logging, for example, does not support log rotation and is only compatible with logrotate if you use the copytruncate hack. Upstart's options for running servers sometimes don't include all that you might want, and as far as I know there's no easy way to run something in that same environment from a regular shell. Runit is an interesting alternative, inspired by D. J. Bernstein's daemontools: it's a collection of programs which each do one thing, and together, they have a lot of features that Upstart lacks. If you want a gentle introduction to using Runit, the blog post Runit for Ruby (And Everything Else) is very well-written.

The two best things about Runit are the chpst and svlogd programs. With chpst you can start a program with a changed process state:

chpst -u myuser -/ /var/roots/mychroot -o 10000 -n 3 /usr/local/bin/myserver --port 8080

This runs the command /usr/local/bin/myserver --port 8080 as user myuser, chrooted to /var/roots/mychroot, with a maximum of 10000 open file descriptors, at a niceness level of 3. Some of these options are available with Upstart, and some aren't. The beauty of chpst, though, is that you can use it with anything. You can use chpst with Upstart, or Supervisord, or whatever else strikes your fancy. And because Runit relies entirely on chpst for this sort of thing, you end up with completely reproducible execution environments: if you run the Runit script for a service by hand, you get the same environment as you would get if Runit had run it. This lack of magic is invaluable when debugging.

Similar to chpst, Runit's svlogd can be used with anything to provide log-handling. It reads text-based logs on stdin, and either writes that to rotated log files, or sends it to syslog, or some combination thereof.

Runit is a decent way to run daemons, and it has useful pieces that work with just about anything. We use it for most of our services. The one caveat to be aware of is that Runit expects daemons not to fork; this is great for servers that you write yourself, but it can be inconvenient if you try to use Runit with some other programs out there.

systemd

The always-lowercase systemd is used as the default init on a fair number of Linuxes, most notably the Redhat/Fedora family of distributions. If you're using one of these and would like to go with the already-installed option, this blog post is a good introduction. Since it's pretty similar to Upstart, the easiest way to describe it is to talk about some of the differences. First, obviously, the configuration files have a different syntax, but that's more of a superficial difference – the concepts are the same. More noteworthy are the ways that systemd handles logging, resource limiting, and daemons that fork.

Systemd is serious about logging (though I suspect they may have over-engineered it a bit). By default, all processes have their stdout and stderr connected to systemd's logging facility (described here and compared with syslog here), but it can also go to syslog, the kernel log buffer, a socket, a tty, or any combination of these. Docs for the logging are here.

Systemd has a very clever solution to the problem of tracking daemons that fork, which coincidentally happens to handle resource limiting at the same time. Where Upstart uses ptrace to watch the forking, systemd runs each daemon in a control group (requires Linux 2.6.24 or newer) from which it can not escape with any amount of forking. This allows easy resource limiting, both for forking and non-forking daemons, since control groups were made for this sort of thing.

More documentation can be found on the systemd web site. Or, for a reference, the man pages are here.

Other contenders: Supervisord, God, Circus

Supervisord is pretty convenient to set up, and seems to get a fair amount of use in Pythonland. Its main advantage over Upstart, Runit, and systemd is that Supervisord can easily start pools of several instances of a program. This is useful if you want a pool of worker processes, e.g. as FastCGI workers. Supervisord expects programs to not daemonize, though it can handle daemonized programs with a pid file workaround. This workaround does not handle edge cases as well as Upstart or systemd, but it should work fine most of the time.

The hard-to-google God hails from the Ruby world. Its great virtue is programmability: you decide what programs to start, how to run them, etc., and God will monitor them and keep them up and running. Suppose you have three back-end web servers, which you want to run on three different ports, so that a reverse proxy can load-balance across all of them. The config file is just a Ruby program, so a simple loop can bring up all three servers. The God web site has examples with code. If you're comfortable writing Ruby and you have interesting needs, God might be what you're looking for.

Mozilla's Circus is perhaps the most interesting of the three. It's programmable in Python, it can handle pools of workers, and it can bind sockets for you. This last point is especially neat: Circus can act as an HTTP server container by binding to a socket, starting up a pool of worker processes, and passing them the file descriptor of the socket. It acts a lot like Gunicorn, but more powerful. Circus also handles logging and log rotation, has a web-based console for watching the supervised processes, and sends a stream of all events over a ZeroMQ PUB socket. It's more actively developed than Supervisord or God, but already stable enough for production – and it has the best docs. Here are some examples of how to use it. If I wanted to serve up a web site with a WSGI stack, Circus is the option I would look at first.

Any of these three can be run under, say, Upstart. They don't want to be an init replacement, so they play nice with everything else.

Pick something

Hopefully this infodump is a useful comparison, but the most important thing isn't which of these options you use; what's important is that you use something. Automating server administration a bit more is almost always a win. The more things you can solve once and then stop thinking about, the better.

It's also worth mentioning that a simple crash is far from the only reason why a daemon might need to be restarted. Several of our server health checks can kill unresponsive processes, if something has frozen up or gone into a weird state. This can also be a temporary stopgap against hard-to-debug intermittent memory leaks: if the server is using too much memory, kill it! We have an ever-growing number of health checks keeping our servers in working order, because it's easy to add more. We're not quite writing crash-only software, but it turns out that you can get a lot of its advantages by making crashes cheap.

(If you find this kind of thing interesting, we're hiring. curl -L cueup.com/jobs)

How Our Terminal-Friendly Jobs Page Works

Yesterday we submitted a jobs posting on Hacker News titled "curl -L cueup.com/jobs". Typing that command into your terminal leads to a scrolling terminal-friendly version of our jobs page.

A few people were curious how it it's done, so here's a quick explanation. The experience relies on the combination of two techniques: ANSI escaping and buffered output.

Most people use ANSI escape codes for simple coloring of their terminal, but (perhaps thankfully), its more advanced features are rarely used. Try this in your terminal:

1
printf "\x1b[4m\x1b[5m\x1b[44mHello World\x1b[0m\n"

(There are many libraries that help with formatting ANSI escape codes. We used the termcolor Python module.)

After the initial "ACCESS GRANTED" message is shown, we slowly buffer output to your screen to mimic the feel of downloading a page on a slow internet connection. We use Tornado to serve our front-end which makes the process of buffering output pleasant to write.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
class CurlJobsPageHandler(tornado.web.RequestHandler)
  """Renders the cURL jobs page"""


  TICK = 0.2


  @tornado.web.asynchronous
  @tornado.gen.engine
  def get(self):
    text = colorMarkdown(htmlToMarkdown('jobs.html'))

    textParts = ['%s\n' % x for x in text.splitlines()]

    for part in textParts:
      self.write(part)
      yield tornado.gen.Task(self.flush)
      yield tornado.gen.Task(ioloop.sleep, random.uniform(max(tick - 0.10, 0.01), self.TICK))
    self.finish()

We hope you enjoyed our jobs page! When we're not trying to find new ways to catch your attention, we work on some pretty interesting technical challenges as part of our job. If you're excited about things like entity extraction, distributed search, mobile performance or ANSI escaping (just kidding - mostly) send us an email at jobs+curl@cueup.com.

Colossal Cue Adventure: A Zork-style Text-based Programming Adventure Game

Check out our latest challenge:

Screenshot

We had so much fun with our last programming challenge and we enjoyed the results so much that we decided to build another one.

We wanted to make this one a bit different and a little more fun, so we wrote it in the style of some of our favorite old text-based adventure games (not to mention this great mostly-text-based adventure game ).

We hope you enjoy it!

(If you do, join in the discussion on Hacker News!)

Hookshot - Hacking the Objective C Runtime for Fun and Profiles

Today we're open sourcing hookshot, a library for iOS applications that lets you instrument your code to generate thread activity graphs:

and aggregated profiling stats:

The open source repo contains 5-minute installation instructions and an example project.

Instrumenting vs. Sampling

hookshot contains (among other things) an instrumenting profiler. Contrary to its name, Apple's profiler in Instruments is actually a sampling profiler. We find both useful for different occasions. Specifically, we find hookshot most useful for:

  • Thread activity graphs with drill-in
  • Seeing accurate counts of calls
  • More precise per-call timings of messages
  • Measurement of messages with highly variable performance

How it works

Hookshot works by transparently wrapping your classes with a custom NSProxy subclass that records profiling data. It replaces alloc for instrumented classes with code that activates this subclass. When it receives messages, it calls custom hooks before and after delegating to the original implementations.

Building hookshot was:

  1. Necessary. (because the quest for a highly performant app is never over!)
  2. A blast. (because getting under the hood of Objective C internals is something we’re strange enough to find fun)

Objective-C's flexible runtime

Objective-C has a very malleable run-time layer - there is a level or two of indirection between a message call and the code you wrote to handle said message. This level of indirection can be used to build interesting things through the Objective C Runtime API.

One very common use of the runtime API is method swizzling, in which you replace the implementation of a given message with your own implementation. One of our favorites is Mike Ash's implementation of zeroing weak references.

NSProxy for instrumentation

The Foundation framework provides NSProxy as a way to create objects that stand in for others and are aware of messages being passed to them.

So, first we built an NSProxy subclass for instrumentation. We have a global registry of callbacks to call before and after a given message, and override forwardInvocation to pass the call through to the original implementation.

1
2
3
4
5
6
7
8
9
10
11
12
13
- (void)forwardInvocation:(NSInvocation *)inv {
 Class targetClass = [self class];
 SEL selector = inv.selector;

 runCallbacks(beforeBlocks, self, targetClass, selector);
 @try {
 IMP imp = method_getImplementation(class_getInstanceMethod(targetClass, selector));
 // This is a private method and must not make it's way to production code!!
 [inv invokeUsingIMP:imp];
 } @finally {
 runCallbacks(afterBlocks, self, targetClass, selector);
 }
}

Hijacking alloc

The hookshot API allows us to instrument classes instead of individual objects. This means a user doesn't have to modify every single instantiation of an object they want instrumented.

Making this API possible requires us to override the class's allocWithZone.

This splits into two interesting cases - one where the class defines its own allocWithZone and one where it inherits it. Here's the core of the logic:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
+ (id)instrumentInstanceMessagesSwizzledAllocWithZone:(NSZone *)zone {
 id result = [self instrumentInstanceMessagesSwizzledAllocWithZone:zone];
 liveObjectClass[result] = self;
 object_setClass(result, classes[self]);

 return result;
}

+ (id)instrumentInstanceMessagesAddedAllocWithZone:(NSZone *)zone {
 Method method = GetSuperMethod(self, @selector(allocWithZone:));
 IMP imp = method_getImplementation(method);

 id result = imp(self, @selector(allocWithZone:), zone);
 runCallbacks(afterBlocks, result, self, @selector(allocWithZone:));
 liveObjectClass[result] = self;
 object_setClass(result, classes[self]);

 return result;
}

// Later...

if (HasOwnClassMethod(cls, @selector(allocWithZone:))) {
 ClassMethodSwizzle(
 cls, @selector(instrumentInstanceMessagesSwizzledAllocWithZone:),
 @selector(allocWithZone:));
} else {
 ClassMethodCopy(cls, @selector(instrumentInstanceMessagesAddedAllocWithZone:),
 @selector(allocWithZone:));
}

We also don't return an NSProxy that wraps the result, we actually change the class of the object itself. This is necessary because otherwise any references in the instrumented class to self would still reference the un-instrumented original object.

Preventing instrumentation

Some messages don't play nice with instrumentation. In some cases, it's just that a method is called a lot and the cost of instrumentation starts skewing the results. invokeUsingIMP also fails with an error message on selectors with certain C++ return types and with a crash on selectors that use varargs.

We allow prevention of instrumentation with a simple trick. When looking for message sampleMessage, an NSProxy subclass will first see if it has an implementation of that message, and if so, call it. It's only when that fails that the forwardInvocation happens.

So, when we want to prevent instrumentation for sampleMessage, we copy the original implementation in to the NSProxy subclass.

1
2
3
4
5
+ (void)preventInstrumentation:(Class)cls selector:(SEL)selector {
 Method method = class_getInstanceMethod(cls, selector);
 class_addMethod(classes[cls], method_getName(method), method_getImplementation(method),
 method_getTypeEncoding(method));
}

There is one catch - since we want to prevent instrumentation of different messages for different classes, we need a unique NSProxy subclass for each instrumented class. Fortunately, the runtime allows you to create dynamic subclasses! This is what classes[class] in the above snippet refers to:

1
2
3
NSString *newName = [NSString stringWithFormat:@"CCInstrumentedProxy_%s", class_getName(cls)];
Class subClass = objc_allocateClassPair([CCInstrumentedProxy class], [newName UTF8String], 0);
classes[cls] = subClass;

A few last notes

The code itself contains a few more details - specifically around detecting and handling a few edge cases like KVO and methods with complex return types.

We love pull requests and we'd love to hear if hookshot is useful to you!

Lastly - if you enjoy crazy iOS runtime hacking as much as we do, we are hiring.

Regular Expressions Will Stab You in the Back

Regular expressions are super-useful, but if you're processing enough unpredictable text, and your regular expressions aren't very carefully written, one day they will turn around and betray you – specifically, they will slam your CPU usage up to 100% and clog up everything until you wake up and fix it, invariably in the middle of the night. This has happened to us, several times, and it really sucks. We have to process text that may be nightmarishly malformed or actively malicious, so ignoring regexplosions isn't an option for us. Instead, we've come up with a few practical ways to prevent them without too much effort.

Why do regular expressions go berserk?

Most regular expression matchers (including the ones in Python and Java and Pearl) are backtracking: whenever there are multiple possibilities, they try one, and if it fails, try the next one until either something matches, or all the possibilities fail. For example, suppose you try to match the regular expression /foo|bar/ against the strings "foo", "bar", and "baz". Here are the steps that a backtracking regular expression engine could take:

1
2
3
/foo|bar/.match("foo");   // Looks for "foo", and finds it. Match!
/foo|bar/.match("bar");   // Looks for "foo", but doesn't find it. Looks for "bar", and finds it. Match!
/foo|bar/.match("baz");   // Looks for "foo", but doesn't find it. Looks for "bar", doesn't find it. No match.

Sounds reasonable, right? This kind of regular expression matching engine is easy to write, usually very fast, and can support all sorts of nice features like backreferences. There's a reason why this kind of algorithm is popular! But things are not always so nice. Let's look at another, seemingly innocent, example:

1
/a?a/.match("a");

This looks for an optional 'a', and finds it. Then it looks for the required 'a', but it's at the end of the string already, so it backtracks to the beginning, decides that the optional 'a' is skipped, looks for the required 'a', finds it, and declares a match. Success!

In other words, a backtracking matcher considers the possibilities "don't skip" and then "skip" for the /a?/ part of the regex. So far, no horrible explosions.

1
/a?a?aa/.match("aa");

This considers the following possibilities for the two optional 'a' characters:

don't skip, don't skip
don't skip, skip
skip, don't skip
skip, skip

Only the last of these – skipping both optional 'a' characters – will produce a match. The thing is, in order to find this match, we had to consider four possibilities. In general, the regular expression /a?{n}a{n}/ can require backtracking through O(2^n) possibilities. This can get very slow, very quickly. Here's a benchmark of Python's re module on that regular expression, for fairly small values of n:

That plot doesn't look too scary until you notice that the y-axis is logarithmic. If it were a linear scale, you would see essentially a flat line for most of the graph, followed by the line abruptly rocketing upward for large enough n. (For small n, the overhead of compiling the regular expression and calling the matcher is significant, so the nice smooth exponential breaks down a bit for n < 6 or so. This is real empirical data, so it doesn't look quite as clean as theoretical projections.)

So, there's the bug; it starts out taking mere microseconds, but every time we add another 'a' to that regular expression, the time needed to match it doubles. If we raise n all the way to 58, it would take about a thousand years to match this one little regular expression. Obviously, this sucks and is ridiculous. And it's not just contrived examples, either; this kind of thing can really sneak up on you. Something needs to be done, but what?

Option 1: Just be careful

With some regular expressions, you can easily verify up-front that they won't go crazy. For example, /[^:]*:/ will look for characters that aren't colons, then a colon, and if it doesn't find that colon, it will fail immediately. The worst case is that it looks through the whole string, sees no colon, and then fails – in other words, matching this regular expression takes O(n) time and constant space. It will not go crazy.

Of course, being careful is inconvenient and can sometimes fail if you get careless, which you probably will. I sure do. So, what else?

Option 2: Use RE2 as your regular expression engine

This whole problem is the reason why Google's RE2 regular expression engine exists. RE2 doesn't use backtracking, and guarantees linear-time matching. Remember how matching /a?{n}a{n}/ with n=58 would take about a thousand years? With RE2, it takes about 5 microseconds on my laptop. If we raise n to 68 – about a million years with Python's regular expression engine – RE2 takes about 5.7 microseconds. So, big improvement there. If I wanted a more linkbait headline, I could have called this article "How to make your program a quadrillion times faster," and it would have been technically correct. So, how do we use RE2 in practice?

First, get the RE2 library. If you're on a Mac with brew, you can just brew install re2. Otherwise, you can get the code and do the usual ./configure && make && sudo make install dance. Next, get bindings for your language of choice. In Python, the re2 module is a drop-in replacement for the standard re module, and you can get it with pip install re2.

Next, just use it. RE2 uses essentially the same syntax as everything else, so this part is simple. Let's look for valid Skype usernames in some text, using re and RE2, and compare the code:

1
2
3
4
5
6
7
8
>>> import re, re2
>>> r = re.compile(r'([a-zA-Z][a-zA-Z0-9_,.-]{4,20}[a-zA-Z0-9_-])')
>>> r2 = re2.compile(r'([a-zA-Z][a-zA-Z0-9_,.-]{4,20}[a-zA-Z0-9_-])')
>>> doc = "I am truculent.cactuar on Skype."
>>> r.findall(doc)
['truculent.cactuar']
>>> r2.findall(doc)
['truculent.cactuar']

Same regular expression, same API, same results. Easy. Do this, and your servers won't hang in the middle of the night because of a regular expression rebellion. We use this a lot, and it works great.

Option 3: Use Ragel to generate a DFA-based parser

The most heavyweight option I want to mention, but potentially the fastest and most flexible, is to abandon conventional regular expression engines entirely, and instead use Ragel for some of your more complex text-processing code. Ragel bills itself as a "state machine compiler"; you can think of it as a parser generator that lets you build regular expressions that execute arbitrary code when the text has matched to certain points. To get a sense for how this works, consider how you might implement capturing groups if you were building a regular expression engine. Let's say you want to extract the numbers in strings like "In 28 days". The natural regular expression way to do this would be to put a capturing group around the number:

1
2
3
>>> import re
>>> re.match(r'In (\d+) days', 'In 28 days').group(1)
'28'

You could imagine a regular expression matcher which calls a hypothetical startGroup(pos) function when it has matched everything up to a '(' character, and then calls an endGroup(pos) function when it has successfully matched up to a ')' character. If the regular expression as a whole matches successfully, then you just look at those positions recorded by the calls to startGroup() and endGroup(), and the text between them will be the matched text for that group. In our example above, the execution if the regular expression matcher would be something like

  1. Match "In ". We are now at position 3.
  2. Start a capturing group. Record position 3 as the start of the capturing group.
  3. Match "\d+" against "28". We are now at position 5.
  4. Record position 5 as the end of our capturing group.
  5. Match " days". The regular expression matches!
  6. Return the capturing group (3, 5).

Ragel takes this basic concept and runs with it, letting you embed any code you like at any point you like. It compiles down to some very fast C code (or C++, Java, Go, Ruby, D, JavaScript, and a few others), and is easier to maintain than tangles of regular expressions. To get the flavor of working with Ragel, have a look at this parser for the HTTP protocol used by the Mongrel web server. Most of the Ragel code looks like it was translated from the pseudocode in the HTTP spec, and because of that simplicity, the code hasn't been a significant source of bugs. It didn't need to be complicated to be fast.

Let's try matching "In \d+ days" with Ragel, using the approach mentioned above. Ragel assumes the existence of some local variables that you must provide, in this case start and end pointers p and pe, and a state variable cs to keep track of the current state of the parser. You also need an eof pointer to the end of the file; in this case, that's the same as the end of the current buffer, but if you make a streaming parser, there could be a difference. This example tries to do things the easiest way:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Write any global data -- tables, etc. -- needed by the parser. */
%%{
  machine example;
  write data;
}%%


static void getMyNumber(char *str) {
  /* Ragel needs start, end, and end-of-file pointers, plus a current state variable. */
  char *p = str, *pe = str + strlen(str); char *eof = pe; int cs;
  /* Start and end of capturing group. */
  char *gs = NULL, *ge = NULL;

  /* The finite state machine is defined here. The "number" pattern has actions to run
     when execution enters it and leaves it. In those actions, the p variable always
     points at the current position of our parser. */
  %%{
    number = digit+ >{ gs = p; } %{ ge = p; };
    main := "In " . number . " days" %{
      printf("Number: ");
      while (gs < ge) putchar(*gs++);
      printf("\n");
    };
    write init;
    write exec;
  }%%
}


int main(void) {
  getMyNumber("In 28 days");          /* Number: 28 */
  getMyNumber("In 4 days");           /* Number: 4 */
  getMyNumber("In 8 years");          /* No match. */
  getMyNumber("This won't match.");   /* No match. */
  getMyNumber("54 days from now");    /* No match. */
  return 0;
}

To compile this, you need to have Ragel generate C code, and compile that.

$ ragel -C -G2 simple.rl
$ gcc -O2 simple.c -o simple
$ ./simple
Number: 28
Number: 4

Ragel has a bit of a nasty learning curve, but it's ridiculously useful once you get the hang of it. Once you've got a Ragel file written and debugged, it will never take exponential time to match anything. Internally, it generates an NFA and compiles that into a DFA, which it minimizes before generating code in your target language (usually C). This guarantees matching time linear in the length of the input.

Another thing to keep in mind if you're considering Ragel is that, since it generates C code, there will be some extra effort involved in using it from another programming language. If you're using Python, probably the easiest way to connect C with Python is to write a bit of glue code in Cython, which we also use quite heavily when we need something written in Python to suddenly go 10-100x faster. (Come to think of it, that would make a good subject for another blog post. Who doesn't want easy orders-of-magnitude speed gains?)

Option 4: [fill in the blank]

These are just some of our favorite ways to solve the regular expression explosion problem, but there are lots of other options out there. Parsing things is a whole academic field, and there are a huge number of tools for automating it. (We use a few more of these ourselves.) The world is big, and you've probably heard of a lot of stuff I haven't.

Have I mentioned that we're hiring? If this sort of thing appeals to you, we have no shortage of interesting work. Take our programming challenge!

Cue: The 10 Hour Scramble to Launch Two Days Early

At 7:45pm Monday night, after a full day of interviews in New York about our upcoming launch, we got a pretty interesting email: "Subject: Embargo just got broken".

We had planned to launch and rebrand Cue on Thursday at 3am Pacific when newspapers hit doorsteps on the east coast. We still had work to do -the web site wasn't ready, bugs needed fixing, we were in the wrong city. Our new t-shirts hadn't even arrived yet.

We called the publication. An editorial assistant had accidentally swapped the print date of our story and another one. The online coverage was pulled right away. The print coverage, on the other hand, was already at press. We hated the idea of people reading about our product that wasn't yet available. We had about 10 hours. Everyone geared up for a long night.

It was a strange way to end a six month project that had started so much more traditionally. After our iPhone app launched last year, we started hearing a pretty consistent piece of feedback: people were using the service to enable real-world actions - when they were trying to get somewhere, trying to contact someone or trying to do something. While we were often the best or only way to get to the right information, we were making our users take steps that technology could and should take instead. We embarked on a major effort to evolve our service.

We needed code to analyze the text coming through our system to tag phone numbers (easy-ish, at least in the US), addresses (hard, e.g. “2 main drive” vs. "2 hour drive"), relative dates, signatures, and more -all in less than 1ms per document. We invented a DSL in Python to make this possible and, importantly, maintainable. We called it Pearl.

We also needed to understand that hugo@reyes.com is the same person as @hurley. We called this system Linus.

We needed a system that understood templated messages (Beemo, a new image format for iPhone (Let'em'tap), and hundreds of improvements to our core search technology (Hydra) and front end (Orchid). Not to mention stats and analytics (Faraday), text-based timezone inference (Whisky), some geodata (Delphiki), and close to a dozen more subsystems.

We even needed a new name. As search transitioned from the entire service to an important feature, having grep in our name wasn't quite right any more. We chose Cue as our new name because it means a hint as to what's next. It was also important to us to not lose our geeky flair. Luckily Cue has some great homophones - a data structure, a gadgets guru, and an all powerful life form.

Our insane bicoastal all-nighter was stressful but a strange kind of fun. We found a song called “Y Sin Embargo” and blasted it in both offices on turntable.fm. We got the app out 3 hours early. The temporary website arrived only 90 minutes late, and the one we had originally planned debuted yesterday. Updates to the web service are still on the way. Our biggest stumble was not having a web login link or maintenance notification, leading to confusion about just where search went. This was a mistake. We apologize and we promise to learn from it. Being Google Calendar fans we also missed adding iCal support - consider it coming very soon.

Despite being a bit bleary-eyed we kept our hundreds of AWS instances pretty happy. One scary moment came at 1am when we hit our AWS instance cap! By the end of Tuesday, things were looking a bit less crazy. Lots of users old and new liked the new features. And the press got it too.

It would be a massive understatement to say that we're proud of our team.

So, to Kevin, Aaron, Stefan, Mike, Peter, Kim, David, Matthaeus, Pratik, Julie, Shaneal and Brian, we say thank you, and cheers to the next challenges, which of course, we're already working on.

Additional thanks to Thrive (for loaning us emergency office space in NYC), to Greg McAdoo (who provides in-person input every week), to Vladimir (for helping with the app intro), to Melissa, Amie, Katherine, Mary, Isabelle, and Dave (who graciously loan us their significant others), to our families, and to our awesome users.

Greplin Zookeeper Utils

We at Greplin are huge fans of Apache's excellent Zookeeper proejct, which allows you to setup a cluster ('ensemble' in Zookeeper talk) of servers that can present a very fault-tolerant filesystem like abstraction to clients.

Zookeeper isn't meant to store significant amounts of data (in fact, the entire 'file system' must fit in memory and all writes are funneled through a single dynamically elected 'master') - but it's perfect for building distributed coordination, notification, locking, and configuration systems. The details are beyond the scope of this announcement, but I'd highly recommend Konar and Reed's introduction.

However, the filesystem abstraction provided by Zookeeper is a bit low level for many common tasks (like implementing leader elections or distributed transactions/locking). To that end, Apache has created a list of recipes (similar to 'patterns') for Zookeeper that explain common problems and document solutions.

At Greplin we have implemented some of these solutions (notably locking and leader elections), and are happy to open source them in our greplin-zookeeper-utils. The package also adds a simple way to programatically and quickly start/stop single node zookeeper ensembles - which is sometimes useful for testing.

This package is still new and growing - but we hope it's useful to others too!

A quick demonstration:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
final int port = 2182;
final EmbeddedZookeeperServer server = EmbeddedZookeeperServer.builder().clientPort(port).build();
final RobustZooKeeper client = new RobustZooKeeper("localhost:" + port);

client.withLock("lockName",
    new Runnable() {
      @Override
      public void run() {
        System.out.println("I've got the lock while I'm running!");
      }
    });

client.shutdown();
server.shutdown();

Code Independence Day - Nagios Utilities, OOM Diagnostics, and “Hop”

For July 4th this year, we celebrated as many people in America do: by overeating, enjoying fireworks, and watching an amazing movie.

We decided it would also be fun to grant independence to some of our code. Today we're open sourcing:

  • polarbear - a Java tool that helps diagnose OutOfMemoryError conditions.
  • greplin-nagios-utils - Python-based DSL + utilities for writing and running Nagios checks
  • hop - a ridiculously easy way to navigate around your favorite directories

We hope you find these useful - as always we look forward to comments, suggestions, and pull requests.