## Story

The left-pad module was removed from npm on the 22nd of March 2016 and, some hours later, restored. In this small time window many packages such as Babel could not be installed anymore due to their dependencies on this module.

This event surfaced a vast issue in the nodejs ecosystem: the over-reliance on external code. Introducing an external dependency may introduce vulnerabilities and weaken the robustness of your application, so be cautious.

## The website

This website was made to make people realize how many popular packages depend on smaller ones which could be easily compromised, therefore infecting a large segment of the ecosystem. There is no reason for over 40k packages to depend on an 8 line package.

The idea was simple, choose a package and recursively traverse npm to find all of the packages that depend upon it, after all the npm website offers a ‘Dependents’ tab on a package’s page. I had not anticipated the issues that made this an interesting technical problem.

### The first version (Typescript)

It took me some time but after reading the CouchDB docs and analyzing skimdb.npmjs.com I found the endpoint I needed: https://skimdb.npmjs.com/registry/_design/app/_view/dependedUpon?startkey=["is-odd"]&endkey=["is-odd",{}]&reduce=false

This url will return a JSON file which lists all of the packages that depend upon is-odd.

With this information I wrote the first version: a pretty straightforward typescript crawler, which will start from a package, download all of its dependents and then recursively download their dependents until it gets to a package that nobody depends upon. This is where the fist issue surfaced: circular dependencies.

This was quite easily resolved by adding a list of packages which have been downloaded already and, every time the recursive function get_dependents(package_name) is called, the presence of package_name in the list is checked. If a package is in the list it means that we have already downloaded it so there is no reason to download it and its dependents again. It is a sort of memoization.

Unfortunately this approach did not work for any decently sized package: after over 30k requests and 5 minutes of waiting I could not count the packages dependent on is-odd. Making an http request for each package was way too slow.

### The second version (C++)

While analyzing the npm registry I found that if startkey and endkey are not set in https://skimdb.npmjs.com/registry/_design/app/_view/dependedUpon?startkey=["is-odd"]&endkey=["is-odd",{}]&reduce=false then CouchDB will return ALL of the dependecies for all packages. Downloading the file will give you a 220MB JSON file with everything you need.

This however, while solving the download time issue, created another one: how do you parse and efficiently traverse a 1.8M line JSON file? RapidJSON it is.

So I downloaded RapidJSON, and wrote a C++ program which was almost a direct translation of the nodejs one, but instead of downloading the packages, it would query the JSON using rapidJSON’s DOM api.

This worked but it was veeery slow, it took about 20.7 seconds to count the dependecies of jayson (197 dependencies, if you were wondering).

#### Removing recursion

My first optimization was removing the recursion: counting the number of unique dependent packages boils down to a Depth First Search.

This is a recursive implementation of a depth first search but it is not very efficient, especially on languages without TCO (tail call optimization).[1]

To rewrite this without any form of recursion a stack is needed and you can rewrite the pseudocode like this.

Unfortunately this optimization, while precious for later, right now was almost useless: the average time needed to count the dependencies of jayson went from 20.5-21s to 20-20.5s. Almost unnoticeable.

#### Finding the bottleneck

After removing the recursion (which I naively thought was the main performance issue) I started looking for other problems and ways to speed it up more.

The main issue was that every time we need to get a pacakge’s dependencies, all of the lines in the JSON file need to be scanned to find the dependencies. This is $$O(n)$$, especially slow if that $$n$$ is around 1.8 Million.

For example: to get all of the packages that depend on is-odd you need to scan all of the lines and you will find something like this:

If this were a database then adding an index on the key column would be enough, but since we are not using one we need to implement the index ourserves. We will store the data in an HashMap.[2]

This led to the third and final rewrite of the program. Which language? R U S T obviously.

### The third and final version (Rust)

In the Rust version the hasmap is built like this:

For each line in the file, get the id of the package and its dependent, then insert it into the hashmap which maps a package’s id to its dependents id.

This map gets created in less than 3 seconds and querying for lodash, the most depended upon package takes less than 150ms.

I’ve then created using Rocket a simple frontend to this script so that you can try it too. Check it out at http://howmuchofnpmcanyoubreak.ml/

## Notes

1. This function will still need to be rewritten to benefit from TCO, in this form (recursive calls inside a for) it will not get optimised at all.

2. I believe that when an index is created in a database the data structure behind that is a B-Tree which has $$O(\log{n})$$ index access.