How to Read Source Code

This is based on a talk I gave at Oneshot Nodeconf Christchurch.

I almost didn’t write this post. It seems preposterous that there are any programmers who don’t read source code. Then I met a bunch of programmers who don’t, and I talked to some more who wouldn’t read anything but the examples and maybe check if there are tests. And most of all, I’ve met a lot of beginning programmers who have a hard time figuring out where to start.

What are we reading for? Comprehension. Reading to find bugs, to find interactions with other software in a system. We read spource code for review. We read to see the interfaces, to undersand and to find the boundaries between the parts. We read to learn!

Reading isn’t linear. We think we can read source code like a book. Crack the introduction or README, then read through from chapter one to chapter two, on toward the conclusion. It’s not like that. We can’t even prove that a great many programs have conclusions. We skip back and forth from chapter to chapter, module to module. We can read the module straight through but we won’t have the definitions of things from other modules. We can read in execution order, but we won’t know where we’re going more than one call site down.

Do you start at the entry point of a package? In a node module, the index.js or the main script?

How about in a browser? Even finding the entry point, which files get loaded and how are a key task. Figuring out how the files relate to each other is a great place to start.

Other places to start are to find the biggest source code file and read that first, or try setting a breakpoint early and tracing down through functions in a debugger, or try setting a breakpoint deep in something meaty or hard to understand and then read each function in the call stack.

We’re used to categorizing source code by the language it’s written in, be it Javascript, C++, ES6, Befunge, Forth, or LISP. We might tackle a familiar language more easily, but not look at the parts written in a language we’re less familiar with.

There is another way to think of kinds of source code, which is to look at the broad purpose of each part. Of course, many times, something does more than one thing. Figuring out what it’s trying to be can be one of the first tasks while reading. There are a lot of ways to describe categories, but here are some:

Glue has no purpose other than to adjust interfaces between parts and bind them together. Not all the interfaces we want to use play nice together, where the output of one function can be passed directly to the input of another. Programmers make different decisions about the styles of interface, or adepters between systems where there are no rich data types, such as fields from a web form all represented as strings are connected to functions and objects that expect them to be represented more specifically. The way errors are handled often vary, too.

Connecting a function that returns a promise to something that takes a callback involves glue; inflating arguments into objects, or breaking objects apart into variables are all glue.

This is from Ben Drucker’s stream-to-promise:

internals.writable = function (stream) {
return new Promise(function (resolve, reject) {
stream.once('finish', resolve);
stream.once('error', reject);
});
};

In this, we’re looking for how two interfaces are shaped differently, and what’s common between them. The two interfaces involved are are node streams and promises.

In common they have the fact that they do work until they have a definite finish. In streams, with the finish event, and with promises by calling the resolution function. One thing to notice while you read this is that promises can only be resolved once, but streams can emit the same event multiple times. They don’t usually, but as programmers we usually know the difference between can’t and shouldn’t.

Here’s more glue, the sort you find when dealing with input from web forms.

var record = {
name: (input.name || '').trim(),
age: isNaN(Number(input.age)) ? null : Number(input.age),
email: validateEmail(input.email.trim())
}

In cases like this, it’s good to read with how errors are handled in mind. Look for which things in this might throw an exception, and which handle errors by altering or deleting a value.

Are these appropriate choices for the place where this exists? Do some of these conversions lose information, or are they just cleaning up into a canonical form?

Interface-defining code is one of the most important kinds. It’s what makes the outside boundary of a module, the surface area that other programmers have to interact with.

From node’s events.js

exports.usingDomains = false;

function EventEmitter() { }
exports.EventEmitter = EventEmitter;

EventEmitter.prototype.setMaxListeners = function setMaxListeners(n) { };
EventEmitter.prototype.emit = function emit(type) { };
EventEmitter.prototype.addListener = function addListener(type, listener) { };
EventEmitter.prototype.on = EventEmitter.prototype.addListener;
EventEmitter.prototype.once = function once(type, listener) { };
EventEmitter.prototype.removeListener = function removeListener(type, listener) { };
EventEmitter.prototype.removeAllListeners = function removeAllListeners(type) {};
EventEmitter.prototype.listeners = function listeners(type) { };
EventEmitter.listenerCount = function(emitter, type) { };

We’re defining the interface for EventEmitter here.

Look for whether this is complete. Look for internal details being exposed – the usingDomains in this case is a flag that is exposed to the outside world, because node domains have an effect system-wide, and debugging that is very difficult, that detail is shown outside the module.

Look for what guarantees these functions make.

Look for how namespacing works. Will the user be adding their own functions, or does this stand on its own, and the user of this interface will keep their parts separate?

Like glue code, look for how errors are handled and exposed. Is that consistent? Does it distinguish errors due to internal bugs from errors because the user made a mistake?

If you have strong interface contracts or guards, this is where you should expect to find them.

Implementation once it’s separated from the interface and the glue is one of the more studied parts of source code, and where books on refactoring and source code style aim much of their advice.

From Ember.Router:

startRouting: function() {
this.router = this.router || this.constructor.map(K);

var router = this.router;
var location = get(this, 'location');
var container = this.container;
var self = this;
var initialURL = get(this, 'initialURL');
var initialTransition;

// Allow the Location class to cancel the router setup while it refreshes
// the page
if (get(location, 'cancelRouterSetup')) {
return;
}

this._setupRouter(router, location);

container.register('view:default', _MetamorphView);
container.register('view:toplevel', EmberView.extend());

location.onUpdateURL(function(url) {
self.handleURL(url);
});

if (typeof initialURL === "undefined") {
initialURL = location.getURL();
}
initialTransition = this.handleURL(initialURL);
if (initialTransition && initialTransition.error) {
throw initialTransition.error;
}
},

This is the sort always needs more documentation about why it is how it is, and not so much about what these parts do. Implementation source code is where the every-day decisions about how something is built live, the parts that make this module do what it does..

Look how this fits into its larger whole.

Look for what’s coming from the public interface to this module, look for what needs validation. Look for what other parts this touches – whether they share properties on an object or variables in a closure or call other functions.

Look at what would be likely to break if this gets changed, and look to the test suite to see that that is being tested.

Look for the lifetime of these variables. This particular case is an easy one: This looks really well designed and doesn’t store needless state with a long lifetime – though maybe we should look at _setupRouter next if we were reading this.

You can look to understand the process entailment of a method or function, the things that were required to set up the state, the process entailed in getting to executing this. Looking forward from potential call sites, we can ask “How much is required to use this thing correctly?”, and as we read the implementation, we can ask “If we’re here, what got us to this point? What was required to set this up so that it works right?”

Is that state explicit, passed in via parameters? Is it assumed to be there, as an instance variable or property? Is there a single path to get there, with an obvious place that state is set up, or is it diffuse?

Algorithms are a kind of special case of implementation. It’s not so exposed to the outside world, but it’s a meaty part of a program. Quite often it’s business logic or the core processes of the software, but just as often, it’s something that has to be controlled precisely to do its job with adequate speed. There’s a lot of study of algorithmic source code out there because that’s what academia produces as source code.

Here’s an example:

function Grammar(rules) {
// Processing The Grammar
//
// Here we begin defining a grammar given the raw rules, terminal
// symbols, and symbolic references to rules
//
// The input is a list of rules.
//
// The input grammar is amended with a final rule, the 'accept' rule,
// which if it spans the parse chart, means the entire grammar was
// accepted. This is needed in the case of a nulling start symbol.
rules.push(Rule('_accept', [Ref('start')]));
rules.acceptRule = rules.length - 1;

// Build a list of all the symbols used in the grammar so they can be numbered instead of referred to
// by name, and therefore their presence can be represented by a single bit in a set.
function censusSymbols() {
var out = [];
rules.forEach(function(r) {
if (!~out.indexOf(r.name)) out.push(r.name);

r.symbols.forEach(function(s, i) {
var symNo = out.indexOf(s.name);
if (!~out.indexOf(s.name)) {
symNo = out.length;
out.push(s.name);
}

r.symbols[i] = symNo;
});

r.sym = out.indexOf(r.name);
});

return out;
}

rules.symbols = censusSymbols();

This bit is from a parser engine I’ve been working on called lotsawa. Reads like a math paper, doesn’t it?

It’s been said a lot that good comments say why something is done or done that way, rather than what it’s doing. Algorithms usually need more explanation of what is going on since if they were trivial, they’d probably be built into our standard library. Quite often to get good performance out of something, the exactly what-and-how matters a lot.

One of the things that you usually need to see in algorithms is the actual data structures. This one is building a list of symbols and making sure there’s no duplicates.

Look also for hints as to the running time of the algorithm. You can see in this part, I’ve got two loops. In Big-O notation, that’s O(n * m), then you can see that there’s an indexOf inside that. That’s another loop in Javascript, so that actually adds another factor to the running time. (twice – looks like I could make this more optimal by re-using one of the values here)

Configuration The line between source code and configuration file is super thin. There’s a constant tension between having a configuration be expressive and readable and direct.

Here’s an example using Javascript for configuration.

app.configure('production', 'staging', function() {
app.enable('emails');
});

app.configure('test', function() {
app.disable('emails');
});

What we can run into here is combinatorial explosion of options. How many environments do we configure? Then, how many things do we configure for a specific instance of that environment. It’s really easy to go overboard and end up with all the possible permutations, and to have bugs that only show up in one of them. Keeping an eye out for how many degrees of freedom the configuration allows is super useful.

Here is bit of kraken config file.

"express": {
"env": "", // NOTE: `env` is managed by the framework. This value will be overwritten.
"x-powered-by": false,
"views": "path:./views",
"mountpath": "/"
},

"middleware": {

"compress": {
"enabled": false,
"priority": 10,
"module": "compression"
},

"favicon": {
"enabled": false,
"priority": 30,
"module": {
"name": "serve-favicon",
"arguments": [ "resolve:kraken-js/public/favicon.ico" ]
}
},

Kraken took a ‘low power language’ approach to configuration and chose JSON. A little more “configuration” and a little less “source code”. One of the goals was keeping that combinatorial explosion under control. There’s a reason a lot of tools use simple key-value pairs or ini-style files for configuration, even though they’re not terribly expressive. It’s possible to write config files for kraken that vary with a bunch of parameters, but it’s work and pretty obvious when you read it.

Configuration has some interesting and unique constraints that are worth looking for.

The lifetime of a configuration value is often determined by other groups of people. They usually vary somewhat independently of the rest of the source code – hence why they’re not built in as hard-coded values inline.

They often need machine writability, to support configuration-generation tools.

The responsible people are different than regular source code. Systems engineers, operations and other people can be involved in the creation.

Configuration values often have to fit in weird places like environment variables, where there are no types, just string values.

They also often store security-sensitive information, and so won’t be committed to version control because of this.

Batches are an interesting case as well. They need transactionality. Often, some piece of the system needs to happen exactly once, and not at all if there’s an error. A compiler that leaves bad build products around is a great source of bugs. Double charging customers is bad. Flooding someone’s inbox because of a retry cycle is terrible. Look for how transactions are started and finished – clean-up processes, commit to permanent storage processes, the error handling.

Batch processes often need resumabilty. A need to continue where they left off given the state of the system. Look for the places where perhaps unfinished state is picked up and continued from.

Batch processes are also often sequential. If they’re not strictly linear processes, there’s usually a very directed flow through the program. Loops tend to be big ones, around the whole process. Look for those.

Reading Messy Code

So how do you deal with this?

      DuplexCombination.prototype.on = function(ev, fn) {
switch (ev) {
case 'data':
case 'end':
case 'readable':
this.reader.on(ev, fn);
return this
case 'drain':
case 'finish':
this.writer.on(ev, fn);
return this
default:
return Duplex.prototype.on.call(this, ev, fn);
}
};

You are seeing that right. That’s reverse indendation. Blame Isaac.

Put on your rose tinted glasses!

Try installing a tool like standard or jsfmt. Here’s what standard -F dc.js does to that reverse-indented Javascript:

DuplexCombination.prototype.on = function (ev, fn) {
switch (ev) {
case 'data':
case 'end':
case 'readable':
this.reader.on(ev, fn)
return this
case 'drain':
case 'finish':
this.writer.on(ev, fn)
return this
default:
return Duplex.prototype.on.call(this, ev, fn)
}
}

It’s okay to use tools while reading! There’s no technique that’s “cheating”.

Here’s another case:

(function(t,e){if(typeof define==="function"&&define.amd){define(["underscore","
jquery","exports"],function(i,r,s){t.Backbone=e(t,s,i,r)})}else if(typeof export
s!=="undefined"){var i=require("underscore");e(t,exports,i)}else{t.Backbone=e(t,
{},t._,t.jQuery||t.Zepto||t.ender||t.$)}})(this,function(t,e,i,r){var s=t.Backbo
ne;var n=[];var a=n.push;var o=n.slice;var h=n.splice;e.VERSION="1.1.2";e.$=r;e.
noConflict=function(){t.Backbone=s;return this};e.emulateHTTP=false;e.emulateJSO
N=false;var u=e.Events={on:function(t,e,i){if(!c(this,"on",t,[e,i])||!e)return t
his;this._events||(this._events={});var r=this._events[t]||(this._events[t]=[]);
r.push({callback:e,context:i,ctx:i||this});return this},once:function(t,e,r){if(
!c(this,"once",t,[e,r])||!e)return this;var s=this;var n=i.once(function(){s.off
(t,n);e.apply(this,arguments)});n._callback=e;return this.on(t,n,r)},off:functio
n(t,e,r){var s,n,a,o,h,u,l,f;if(!this._events||!c(this,"off",t,[e,r]))return thi
s;if(!t&&!e&&!r){this._events=void 0;return this}o=t?[t]:i.keys(this._events);fo
r(h=0,u=o.length;h<u;h++){t=o[h];if(a=this._events[t]){this._events[t]=s=[];if(e
||r){for(l=0,f=a.length;l<f;l++){n=a[l];if(e&&e!==n.callback&&e!==n.callback._ca
llback||r&&r!==n.context){s.push(n)}}}if(!s.length)delete this._events[t]}}retur
n this},trigger:function(t){if(!this._events)return this;var e=o.call(arguments,
1);if(!c(this,"trigger",t,e))return this;var i=this._events[t];var r=this._event
s.all;if(i)f(i,e);if(r)f(r,arguments);return this},stopListening:function(t,e,r)
{var s=this._listeningTo;if(!s)return this;var n=!e&&!r;if(!r&&typeof e==="objec

Here’s the start of that after uglifyjs -b < backbone-min.js:

(function(t, e) {
if (typeof define === "function" && define.amd) {
define([ "underscore", "jquery", "exports" ], function(i, r, s) {
t.Backbone = e(t, s, i, r);
});
} else if (typeof exports !== "undefined") {
var i = require("underscore");
e(t, exports, i);
} else {
t.Backbone = e(t, {}, t._, t.jQuery || t.Zepto || t.ender || t.$);
}
})(this, function(t, e, i, r) {
var s = t.Backbone;
var n = [];
var a = n.push;
var o = n.slice;
var h = n.splice;
e.VERSION = "1.1.2";
e.$ = r;
e.noConflict = function() {

Human parts and guessing the intent of what you’re reading

There’s a lot of tricks for figuring out what the author of something meant.

Look for guards and coercions

if (typeof arg != 'number') throw new TypeError("arg must be a number");

Looks like the domain of whatever function we’re in is ‘numbers’.

arg = Number(arg)

This coerces its input to be numeric. Same domain as above, but doesn’t reject errors via exceptions. There might be NaNs though. Probably smart to read and check if there’s comparisons that will be false against those.

NaN behavior in javascript mostly comes from the behavior in the IEEE floating-point number spec, as a way to propagate errors out to the end of a computation so you don’t get an arbitrary bogus result, and instead get a known-bad value. In some cases, that’s exactly the technique you want.

Look for defaults

arg = arg || {}

Default to an empty object.

arg = (arg == null ? true : arg)

Default to true only if a value wasn’t explicitly passed. Comparison to null with the == operator in Javascript is only true when what’s being compared is null or undefined – the two things that mean “nothing to see here” – this particular check hints that the author meant that any value is acceptable, as long as it was intended to be a value. false and 0 are both things that would override the default.

arg = (typeof arg == 'function' ? arg : function () {});

In this case, the guard uses a typeof check, and chooses to ignore its argument if it’s not the right type. A silent ignoring of what the caller specified.

Look for layers

As an example, req and res from Express are tied to the web; how deep do they go? Are they passed down into every layer, or is there some glue that picks out specific values and calls functions with an interface directly related to its purpose?

Look for tracing

Are there inspection points?

Debug logs?

Do those form a complete narrative? Or are they ad-hoc leftovers from the last few bugs?

Look for reflexivity

Are identifiers being dynamically generated? If so, that means you won’t find them by searching the source code – you’ll have to think at a different level to understand parts of what’s going on.

Is there eval? Metaprogramming? New function creation?

func.toString() is your friend! You can print out the source of a callback argument and see what it looks like, you can insert all kinds of debugging to see what things do.

Look at lifetimes

The lifetime of variables is particularly good for figuring out how something is built (and how well it’s built). Look for who or what initializes a variable or property of an object. Look for when it changes, and how that relates to the flow or scope of the process that does it. Look for who changes it, and see how related they are to the part you’re reading.

Look to see if that information is also somewhere else in the system at the same time. If it is, look to see if it can ever be inconsistent, where two parts of the system disagree on what that value is if you were to compare them.

Somewhere, someone typed the value you see into a keyboard, generated it from a random number generator, or computed it and saved it.

Somewhere else, some time else, that value will affect some human or humans. Who are these people?

What or who chooses who they are? Is that value ever going to change? Who changes it?

Maybe it’s a ‘name’ field typed into a form, then saved in a database, then displayed to the user. Stored for a long time, and it’s a value that can be inconsistent with other state – the user can change their name, or use a different one in a new context.

Look for hidden state machines

Sometimes boolean variables get used together as a decomposed state machine

Maybe there’s a process with variables like this:

var isReadied = false;
var isFinished = false;

The variables isReadied and isFinished might show a state machine like so:

START -> READY -> FINISHED

If you were to lay out how those variables relate to the state of the process, you might find this:

isReadied | isFinished | state
----------|------------|------------
false | false | START
false | true | invalid
true | false | READY
true | true | FINISHED

Note that they can also express the state !isReadied && isFinished – which might be an interesting source of bugs, if something can end up at the finished state without first being ready.

Look for composition and inheritance Is this made of parts I can recognize? Do those parts have names?

Look for common operations

map, transforming a list of values into a different list of values.

reduce, taking a list of values and giving a single value. Even joining an array of strings with commas to make a string is a ‘reduce’ operation.

cross-join, where two lists are compared, possibly pairwise, or some variation on that.

It’s time to go read some programs and libraries.

Enjoy!