MADlib Installer Notes (2010 Oct)

agorajek edited this page Feb 25, 2011 · 1 revision

Gavin's excellent design document lays out a number of the issues we need to wrestle with. We should keep updating that document as we move forward. This wiki is for notes and discussion.

JMH Notes, 11/9

  • We have a basic fork in the road: an OS package manager, or a scripting-language (i.e. Python) package manager. The former will make it annoying to port to OS X and other platforms. The latter is annoying because Python has a weak ecosystem for package management (relative to, say, Perl and Ruby).

  • There is uncertainty in the Python package management world, which includes PyPI, pip, and a new commercial entrant: pypm. Looks like pypm is only available as part of ActivePython, so it's a non-starter. There is a plan to add some new distutils version in the upcoming version of Python. So pip is probably the best choice for now, but there's lots of churn.

  • As an aside, Python has a nice tool called virtualenv that allows you to set up an "installation sandbox" with your own versions of stuff. I think we should stick with this general notion if possible: allow for custom install dirs, schemas, etc. The main gotcha is the security of C extensions: they run inside the database server, so a bad one can compromise the entire DB installation no matter how the install is sandboxed.

  • In Linux package manager land, we pick two things: the package format (rpm vs. deb) and the repository tools (rpm, yum, apt, dpkg, up2date). Sounds like rpm is the winner for enterprise installations, and likely the right choice if we go that route.

  • The typical Linux package managers orchestrate shell scripts, and hence we can fire up make or Python at install/uninstall time. My general inclination is to go for rpm or apt packages, but with a stylized installer harness written in Python to do database access, testing, etc. Then that harness can be extended via Python inheritance to be customized to different platforms.

  • I sure would like to support OS X without having to support two package managers. It's only really important at the low end, so we could abandon it, but that would bum me out.

** Followup, end of 11/9 **

  • It's trivial to set up an RPM or whatnot. The issue is to write an installer framework once we're local -- something to replace the usual "configure;make install" commands. I.e. a simple standard for scripting or config-files that will connect to the DBMS, run SQL scripts, shovel files off to remote nodes, etc. Then we need to be able to detect what we've installed in the database. This requires something like Rails' config/database.yml to connect, and maybe something like Rails migrations to do install/uninstall. Maybe we look at Django?

  • Django is standardizing on a migrations package called South. That's fine, but more than we need. We don't really need migrations in full, in the sense that we don't need to roll back to arbitrary past states. We just want a couple of simple things:

    • The DBMS should be self-documenting w.r.t. the madlib version currently installed, so the installer can interrogate and see what's there.
    • Like migrations, we need a framework for uninstall scripts that complement install scripts. This can be a simple coding convention, I think, perhaps by subclassing in Python.
    • See this link for discussion/examples on using Django for raw DB connectivity.
  • Maybe the Python DB-API is sufficiently standard that we can count on people knowing how to configure their DBMS with it?
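
A minimal sketch of the two "simple things" above, under stated assumptions: a version table the installer can interrogate, and install/uninstall pairs expressed by subclassing. The class names, the `madlib_version` table, and the demo SQL are all hypothetical. The stdlib sqlite3 module stands in for any DB-API driver only so the example runs without a server; psycopg2 uses `%s` placeholders where sqlite3 uses `?`.

```python
import sqlite3


class Migration:
    """Base class: each subclass supplies a paired install/uninstall script."""

    name = None          # hypothetical identifier for the method
    forwards_sql = ""    # SQL that installs the method
    backwards_sql = ""   # SQL that is expected to undo forwards_sql

    def forwards(self, conn):
        cur = conn.cursor()
        cur.execute(self.forwards_sql)
        # Record the install so the database documents its own state.
        cur.execute("INSERT INTO madlib_version (method) VALUES (?)", (self.name,))
        conn.commit()

    def backwards(self, conn):
        cur = conn.cursor()
        cur.execute(self.backwards_sql)
        cur.execute("DELETE FROM madlib_version WHERE method = ?", (self.name,))
        conn.commit()


class SketchMigration(Migration):
    # Illustrative SQL only; not the real sketch module.
    name = "sketch"
    forwards_sql = "CREATE TABLE sketch_demo (id INTEGER)"
    backwards_sql = "DROP TABLE sketch_demo"


def installed_methods(conn):
    """Ask the database which methods it believes are installed."""
    cur = conn.cursor()
    cur.execute("SELECT method FROM madlib_version ORDER BY method")
    return [row[0] for row in cur.fetchall()]


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE madlib_version (method TEXT PRIMARY KEY)")
    migration = SketchMigration()
    migration.forwards(conn)
    after_install = installed_methods(conn)
    migration.backwards(conn)
    after_uninstall = installed_methods(conn)
    conn.close()
    return after_install, after_uninstall
```

Calling `demo()` installs and then uninstalls the illustrative method, returning what the version table reported at each step.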

** Short-term plan **

Using South as a syntax model, write a standalone template for invertible SQL installations. Once that works, try an rpm that builds, installs, and uninstalls profile and sketch. Then break it down into separate methods: do the same for just sketch, and then for just profile (but profile requires sketch and drags it in as a dependency).
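
The "profile drags in sketch" requirement amounts to a dependency ordering. A rough sketch of one way to compute it, assuming dependencies are available as a plain mapping; the function name and the mapping format are hypothetical, not something the plan above specifies.

```python
def install_order(requested, requires):
    """Return methods in install order, dependencies first.

    requested: list of method names the user asked for.
    requires:  dict mapping a method name to the methods it depends on.
    """
    order = []
    visiting = set()

    def visit(method):
        if method in order:
            return
        if method in visiting:
            raise ValueError("dependency cycle at %r" % method)
        visiting.add(method)
        for dep in requires.get(method, ()):
            visit(dep)          # install dependencies before the method itself
        visiting.discard(method)
        order.append(method)

    for method in requested:
        visit(method)
    return order
```

Uninstalling would walk the same list in reverse, so that profile is removed before the sketch code it relies on.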

11/23/2010

The discussion below is out of date. See PackMan notes.

We now have a functional Python-based installer/uninstaller for SQL scripts.

I don't have it set up as a proper installable Python package with dependencies yet, so you have to do some manual stuff. Also, while I have this working with Postgres on my Macs, I'm having trouble on CentOS with GP because I can't get psycopg2 to compile via "easy_install" (it is having trouble with pg_config and libpq-fe.h even though pg_config is in my path... help welcome).

Anyhow, if you're eager you can have a peek:

USAGE

  1. I rely on the following Python packages, which you can install via easy_install. I think that's the complete list.
  % sudo easy_install sqlparse argparse pyyaml psycopg2
  2. You need to manually add ${prefix_dir}/madlib_contrib/madpy to your PYTHONPATH, e.g. for bash:
  % export PYTHONPATH=${PYTHONPATH}:/Users/jmh/Dropbox/madlib/madlib_contrib/madpy
  3. Now cd into ...madlib_contrib/config. In there you'll find an example file "postgres.yml". Copy it (in the same directory) to "config.yml", and edit the connect_args appropriately.

  4. In that config directory, you can run "python config.py -p" and it will create a subdir called "scripts". If you look in there you'll see a Python "migration file" that contains methods to roll forward (install) or backward (uninstall) the scripts for the methods specified in your config.yml file.

  5. An additional call of "python config.py -i" will do the installation by calling the forwards methods of any uninstalled scripts in the scripts directory. (A common error case here is if you have yet to define a "madlib" schema in your DB. The current scripts require that schema. In future this will be configurable from the config.yml file.)

  6. Calling "python config.py -u" will uninstall by calling the backwards methods of any installed scripts in the scripts directory.

In future it will be possible to roll forward or back to a specific migration file number -- this is supported, it's just not surfaced at the command line right now.
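
For orientation, the three flags used above could be wired up with argparse roughly as follows. This is only a sketch, not the actual config.py: the action names, the help strings, and the choice to make the flags mutually exclusive are assumptions.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(prog="config.py")
    # Assumption: exactly one of -p / -i / -u is given per invocation.
    group = parser.add_mutually_exclusive_group(required=True)
    group.add_argument("-p", dest="action", action="store_const", const="prepare",
                       help="generate migration files in the scripts subdir")
    group.add_argument("-i", dest="action", action="store_const", const="install",
                       help="run the forwards methods of uninstalled scripts")
    group.add_argument("-u", dest="action", action="store_const", const="uninstall",
                       help="run the backwards methods of installed scripts")
    return parser


def dispatch(argv, handlers):
    """Parse argv and call the handler registered for the chosen action."""
    args = build_parser().parse_args(argv)
    return handlers[args.action]()
```

A roll-to-a-specific-number option would slot in as one more argument once it is surfaced at the command line.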

ADDING YOUR MODULE

In order to have your module included in the installation, you need to do the following:

  1. Add it to the "methods" in the config.yml file, with a name and a "port" directory under /src.

  2. Within that method's port directory, you need an Install.yml file that includes a key "fw" with the name of a SQL script to roll forward (i.e. install), and a key "bw" with the name of a SQL script to roll backward (i.e. uninstall). See methods/sketch/src/extended_sql/pg_gp/Install.yml for an example. It is your responsibility to make these true inverses of each other. (In future we can perhaps write a test to ensure this, though I'm not sure how to do that portably.)
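
One non-portable way such an inverse test might look: snapshot the catalog, run the forward script, run the backward script, and compare. The sketch below uses SQLite's `sqlite_master` catalog purely as a stand-in so it runs anywhere. A Postgres or Greenplum version would have to query that system's own catalogs, which is exactly the portability problem noted above. The function names are hypothetical.

```python
import sqlite3


def catalog_snapshot(conn):
    """Return every object the database currently knows about."""
    query = "SELECT type, name, sql FROM sqlite_master ORDER BY type, name"
    return conn.execute(query).fetchall()


def is_inverse_pair(fw_sql, bw_sql):
    """True if running fw_sql then bw_sql leaves the catalog unchanged."""
    conn = sqlite3.connect(":memory:")
    try:
        before = catalog_snapshot(conn)
        conn.executescript(fw_sql)   # roll forward (install)
        conn.executescript(bw_sql)   # roll backward (uninstall)
        after = catalog_snapshot(conn)
    finally:
        conn.close()
    return after == before
```

This only compares catalog entries, so it would not notice leftover rows in pre-existing tables or files left on disk.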

IMPLEMENTATION/FEATURES

This thing is inspired by the South migration package for Django though I wrote it from scratch. There's a South-like table called madlib.migrationhistory that tracks every script file that's been installed. You can roll forward or back based on the numeric prefix of the script file name. (This is also how Ruby on Rails migrations work.) Version information is taken from the Version.yml file in the config directory and embedded in the filenames for the migration scripts. Hence the script directory can have installer/uninstaller files for multiple madlib versions, and in principle roll forward or back at will.
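
To illustrate rolling forward or back by numeric prefix, here is a small sketch. The filename pattern (`0001_sketch.py` and so on) is a made-up example, since the real names also embed version information, and the `applied` set stands in for the contents of madlib.migrationhistory.

```python
import re

# Hypothetical naming convention: digits, an underscore, then anything.
PREFIX_RE = re.compile(r"^(\d+)_")


def migration_number(filename):
    match = PREFIX_RE.match(filename)
    if match is None:
        raise ValueError("no numeric prefix in %r" % filename)
    return int(match.group(1))


def plan(script_files, applied, target):
    """Work out which scripts to run to reach migration number `target`.

    Returns (forwards, backwards): scripts to install in ascending order,
    and scripts to uninstall in descending order.
    """
    ordered = sorted(script_files, key=migration_number)
    forwards = [f for f in ordered
                if migration_number(f) <= target and f not in applied]
    backwards = [f for f in reversed(ordered)
                 if migration_number(f) > target and f in applied]
    return forwards, backwards
```

Rolling back in descending order mirrors the install order, so later scripts are removed before the earlier ones they may depend on.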

I have a long list of features/protections I want to add, but this is a reasonable start if used with discipline.