How much of npm can you break?
Story
The left-pad
module was removed from npm on the 22nd of March 2016 and, some
hours later, restored. During this short window many packages, such as Babel,
could no longer be installed because they depended on it.
This event exposed a major issue in the Node.js ecosystem: over-reliance on external code. Every external dependency can introduce vulnerabilities and weaken the robustness of your application, so add them with care.
The website
I built this website to show how many popular packages depend on tiny ones that could easily be compromised, infecting a large portion of the ecosystem. There is no reason for over 40k packages to depend on an 8-line package.
How I made it
The idea was simple: choose a package and recursively traverse npm to find all of the packages that depend upon it. After all, the npm website offers a ‘Dependents’ tab on every package’s page. I had not anticipated the issues that would make this an interesting technical problem.
The first version (TypeScript)
It took me some time but after reading the CouchDB docs and analyzing skimdb.npmjs.com
I found the endpoint I needed: https://skimdb.npmjs.com/registry/_design/app/_view/dependedUpon?startkey=["is-odd"]&endkey=["is-odd",{}]&reduce=false
This URL returns a JSON document listing all of the packages that depend upon is-odd.
With this information I wrote the first version: a pretty straightforward TypeScript crawler that starts from a package, downloads all of its dependents and then recursively downloads their dependents until it reaches a package that nobody depends upon. This is where the first issue surfaced: circular dependencies.
This was quite easily resolved by keeping a list of packages that have already been
downloaded: every time the recursive function get_dependents(package_name)
is called, it checks whether package_name is already in the list. If it is,
we have already downloaded that package, so there is no reason to
download it and its dependents again. It is a sort of memoization.
Unfortunately this approach did not work for any decently sized package:
after over 30k requests and 5 minutes of waiting I still had not finished counting the
packages that depend on is-odd. Making an HTTP request for each package
was way too slow.
The second version (C++)
While analyzing the npm registry I found that if startkey and endkey are omitted from
https://skimdb.npmjs.com/registry/_design/app/_view/dependedUpon?startkey=["is-odd"]&endkey=["is-odd",{}]&reduce=false
then CouchDB will return ALL of the dependencies for all packages.
Downloading it gives you a 220MB JSON file with everything you need.
This, however, while solving the download-time issue, created another one: how do you parse and efficiently traverse a 1.8M-line JSON file? RapidJSON it is.
So I downloaded RapidJSON and wrote a C++ program that was almost a direct translation of the Node.js one, except that instead of downloading the packages it queried the JSON using RapidJSON’s DOM API.
This worked but it was veeery slow: it took about 20.7 seconds to count the dependents
of jayson (197 dependents, if you were wondering).
Removing recursion
My first optimization was removing the recursion: counting the number of unique dependent packages boils down to a Depth First Search.
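As a minimal sketch of that recursive search (written here in Rust purely for illustration; the deps map and the names are mine, and it simply stands in for however dependents are looked up, which at this stage meant scanning the JSON on every call):

```rust
use std::collections::{HashMap, HashSet};

// Sketch: `deps` stands in for however a package's dependents are looked up
// (at this point in the story, a full scan of the JSON file on every call).
fn count_dependents(
    pkg: &str,
    deps: &HashMap<String, Vec<String>>,
    visited: &mut HashSet<String>,
) {
    if let Some(dependents) = deps.get(pkg) {
        // Recursive call inside a for loop: descend into every dependent not seen yet.
        for d in dependents {
            if visited.insert(d.clone()) {
                count_dependents(d, deps, visited);
            }
        }
    }
}
```

Once the call returns, the number of unique dependent packages is simply visited.len().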
This is a recursive implementation of a depth-first search, but it is not very efficient, especially in languages without TCO (tail call optimization).[1]
To rewrite this without any form of recursion, an explicit stack is needed.
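Rewritten iteratively, again only as a Rust sketch with illustrative names:

```rust
use std::collections::{HashMap, HashSet};

// Sketch: same traversal, but with an explicit stack instead of the call stack.
fn count_dependents_iterative(start: &str, deps: &HashMap<String, Vec<String>>) -> usize {
    let mut visited: HashSet<String> = HashSet::new();
    let mut stack: Vec<String> = vec![start.to_string()];

    while let Some(pkg) = stack.pop() {
        if let Some(dependents) = deps.get(&pkg) {
            for d in dependents {
                // Only count and push packages we have not visited yet.
                if visited.insert(d.clone()) {
                    stack.push(d.clone());
                }
            }
        }
    }
    visited.len()
}
```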
Unfortunately this optimization, while valuable later on, was almost useless at this point:
the average time needed to count the dependents of jayson
went from 20.5-21s to 20-20.5s. Barely noticeable.
Finding the bottleneck
After removing the recursion (which I naively thought was the main performance issue) I started looking for other problems and ways to speed it up more.
The main issue was that every time we need a package's dependents, all of the lines in the JSON file have to be scanned to find them. This is \(O(n)\), which is especially slow when \(n\) is around 1.8 million.
For example: to get all of the packages that depend on is-odd
you need to scan
all of the lines and you will find something like this:
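Each row of the view pairs a dependency with one of its dependents; schematically it looks roughly like this (the placeholders and exact field names are only an approximation of CouchDB's view output, not a verbatim excerpt):

```
{ "id": "<dependent>", "key": ["is-odd", "<dependent>"], "value": 1 },
{ "id": "<another-dependent>", "key": ["is-odd", "<another-dependent>"], "value": 1 },
...
```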
If this were a database, adding an index on the key column would be enough,
but since we are not using one we need to implement the index ourselves.
We will store the data in a HashMap.[2]
This led to the third and final rewrite of the program. Which language? R U S T obviously.
The third and final version (Rust)
In the Rust version the hashmap is built like this: for each line in the file, take the id of the package and the id of its dependent, then insert them into a hashmap that maps a package's id to the ids of its dependents.
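As a sketch of that construction, assuming the rows have already been parsed into (dependency, dependent) pairs (the parsing step and the names below are simplified, not the actual program):

```rust
use std::collections::HashMap;

// Sketch: build the index from a package's id to the ids of its dependents.
// `rows` is assumed to already contain (dependency, dependent) pairs parsed from the dump.
fn build_index(rows: Vec<(String, String)>) -> HashMap<String, Vec<String>> {
    let mut index: HashMap<String, Vec<String>> = HashMap::new();
    for (dependency, dependent) in rows {
        // entry() creates the Vec the first time a dependency is seen.
        index.entry(dependency).or_insert_with(Vec::new).push(dependent);
    }
    index
}
```

With this index in place, the iterative search above can look up a package's dependents in (amortized) constant time instead of rescanning the whole file.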
This map gets created in less than 3 seconds, and querying for lodash, the
most depended-upon package, takes less than 150ms.
I then used Rocket to build a simple frontend for this script so that you can try it too. Check it out at http://howmuchofnpmcanyoubreak.ml/
Notes
1. This function would still need to be rewritten to benefit from TCO; in this form (recursive calls inside a for loop) it will not get optimised at all.
2. I believe that when an index is created in a database, the underlying data structure is a B-tree, which gives \(O(\log{n})\) access.