Installing R with EasyBuild: Which path to insanity?
Submitted by lev_lafayette on Sat, 01/07/2017 - 06:21
There is a wonderful Spanish idiom, "Cada loco con su tema", which is sometimes massacred as the English idiom "To each their own". In Spanish, of course, it is more accurately translated as "Each madman with their topic", which in familiar conversation means the same but offers a slightly different and more illustrative angle on the subject. With that in mind, which path to insanity does one take with R libraries and EasyBuild? A similar question can also be raised with other languages that have extensions, e.g., Python and Perl.
One path is to simply do a basic installation and let users have private extensions. This removes the pressure on sysadmins, gives responsibility to the user, and is efficient for the budget of line managers. The problem, of course, is that on an aggregate level it is very inefficient for the research body itself to have private extensions. If more than one user is installing the same package, that is automatically a net loss compared with having a common package, even assuming everything goes well. There is also the opportunity cost of making researchers responsible for their own software package management. So this really isn't a viable option in the general case.
The usual method is to have an extension list, like the following snippet from an R EasyBuild recipe:
exts_list = [
    # default libraries, only here to sanity check their presence
    ('abind', '1.4-3', ext_options),
    ('magic', '1.5-6', ext_options),
    ('geometry', '0.3-5', ext_options),
    ('bit', '1.1-12', ext_options),
    ('filehash', '2.2-2', ext_options),
    ('ff', '2.2-13', ext_options),
This seems to work well enough in most cases and is probably the recommended method overall. There are, of course, a couple of issues to be aware of. The sysadmin needs to work through the list and its dependencies with some care, as order is important. Each library will be extracted and installed in order, and if a dependency is not present the installation of that particular library will fail. This can be especially frustrating if the library in question hasn't made its dependencies clear, or if a particular version of the library is needed and a different version has already been installed (which may, if reproducibility of results is taken to its fullest extent, require a separate installation of the base package and all dependencies as a different version). The extensions approach also doesn't allow for configuration options to be passed to the installation from within the recipe itself; a separate patch file would have to be applied.
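The ordering requirement can be checked before starting a long build. What follows is a minimal sketch, not part of EasyBuild: the helper `find_order_problems` and the dependency map `assumed_deps` are hypothetical names introduced here, and the dependencies listed are assumptions for illustration that should be verified against each package's DESCRIPTION file.

```python
# Illustrative only: the dependency map is an assumption made for this sketch,
# not something read from CRAN or produced by EasyBuild.
exts_order = ['abind', 'magic', 'geometry', 'bit', 'filehash', 'ff']

assumed_deps = {
    'magic': ['abind'],
    'geometry': ['magic'],
    'ff': ['bit'],
}


def find_order_problems(order, deps):
    """Return (library, dependency) pairs where the dependency is not listed earlier."""
    seen = set()
    problems = []
    for lib in order:
        for dep in deps.get(lib, []):
            if dep not in seen:
                problems.append((lib, dep))
        seen.add(lib)
    return problems


# An empty list means every assumed dependency precedes the library that needs it.
problems = find_order_problems(exts_order, assumed_deps)

# Listing 'magic' before 'abind' is reported as a problem.
swapped = find_order_problems(['magic', 'abind'], assumed_deps)
```

Running a check like this over the names in an extension list is cheap compared with discovering a misplaced dependency several hours into a build.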
Another common mistake is to reinstall the entire package, base software and libraries and all, from scratch when a new library is requested. This might be barely tolerable when the number of libraries is a dozen or so, but when it starts to reach the hundreds or more it becomes increasingly difficult, especially if a library fails at the end of the build. In that case, make use of the "-k" or "--skip" option, which will skip existing software; this has to be specified explicitly as it is set to false by default, e.g.,
module load EasyBuild
eb R-3.2.1-GCC-4.9.2.eb --skip
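Conceptually, skipping existing software means that only the extensions not yet present get built. The following is a rough sketch of that idea and makes no claim about how EasyBuild implements it; `pending_extensions` and the `already_there` set are hypothetical, made up for this illustration.

```python
def pending_extensions(exts_list, is_installed):
    """Keep only the (name, version) pairs whose library is not yet installed."""
    return [(name, version) for name, version in exts_list if not is_installed(name)]


exts = [
    ('abind', '1.4-3'),
    ('magic', '1.5-6'),
    ('geometry', '0.3-5'),
    ('bit', '1.1-12'),
    ('filehash', '2.2-2'),
    ('ff', '2.2-13'),
]

# Hypothetical state of an existing installation.
already_there = {'abind', 'magic', 'geometry'}

# Only the libraries missing from the existing installation remain to be built.
todo = pending_extensions(exts, lambda name: name in already_there)
```

With three of the six libraries already in place, only the remaining three would need to be built rather than the whole list.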
An alternative approach is to have a base package and then an additional configuration file for each library. This is appropriately described by Jack Perdue, its main designer, as R Madness. Still, at least one will have a description, dependencies, and versioning for each library. Certainly this approach is also evident in a number of research institutions which install popular Python packages such as SciPy and NumPy as separate EasyBuild scripts rather than as extensions. It is an interesting approach, but for the time being it would seem that careful tracking of extensions within an EasyBuild recipe and judicious use of the skip option is still preferable.
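For comparison, a per-library configuration file might look roughly like the sketch below. It is a hypothetical example written for this post rather than a recipe taken from any site: it assumes the generic `RPackage` easyblock, and the URLs, version suffix, sanity check directory, and module class are illustrative values that would need checking against a real installation.

```python
# Hypothetical easyconfig for a single R library; all values are illustrative.
easyblock = 'RPackage'

name = 'abind'
version = '1.4-3'
versionsuffix = '-R-3.2.1'

homepage = 'https://cran.r-project.org/package=abind'
description = "Combine multidimensional arrays into a single array."

toolchain = {'name': 'GCC', 'version': '4.9.2'}

source_urls = [
    'https://cran.r-project.org/src/contrib/',
    'https://cran.r-project.org/src/contrib/Archive/%(name)s/',
]
sources = ['%(name)s_%(version)s.tar.gz']

# The base R installation becomes an explicit, versioned dependency.
dependencies = [('R', '3.2.1')]

# Assumes the library is installed into a directory named after the package.
sanity_check_paths = {
    'files': [],
    'dirs': ['abind'],
}

moduleclass = 'math'
```

The gain is that each library carries its own description, dependencies, and version; the cost is one such file, and one module, per library.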