Monday, February 13, 2017

PostgreSQL at 10TB and Beyond Recorded Talk

The PostgreSQL at 10 TB And Beyond talk has now been released on YouTube. Feel free to watch.  For the folks seeing this on Planet Perl Iron Man, there is a short Perl function that extends SQL inside PostgreSQL, shown in the final 10 minutes or so of the lecture.

This lecture discusses human and technical approaches to solving volume, velocity, and variety problems on PostgreSQL in the 10TB range on a single, non-sharded large server.

As a side but related note, I am teaching a course through Edument covering many of the technical aspects discussed here, called Advanced PostgreSQL for Programmers.  You can book the course for the end of this month.  It will be held in Malmo, Sweden.

Thursday, January 26, 2017

PL/Perl and Large PostgreSQL Databases

One of the topics discussed in the large database talk is the way we used PL/Perl to solve some data variety problems in terms of extracting data from structured text documents.

It is certainly possible to use other languages to do the same, but PL/Perl has an edge in a number of important ways.  PL/Perl is light-weight, flexible and fills this particular need better than any other language I have worked with.

While one of the considerations has often been knowledge of Perl in the team, PL/Perl has a number of specific reasons to recommend it:

  1. It is light-weight compared to PL/Java and many other languages
  2. It excels at processing text in general ways.
  3. It has extremely mature regular expression support
These features combine to create a procedural language for PostgreSQL which is particularly good at extracting data from structured text documents in the scientific space.  Structured text files are very common and being able to extract, for example, a publication date or other information from the file is very helpful.

Moreover, when you mark your functions as immutable, you can index their output, which is helpful when you want ordered records starting at a certain point.

So for example, suppose we want to be able to query on plasmid lines in UNIPROT documents but we did not set this up before loading the table.  We could easily create a PL/Perl function like:

CREATE OR REPLACE FUNCTION plasmid_lines(uniprot text) 
RETURNS text[]
LANGUAGE PLPERL IMMUTABLE AS
$$
use strict;
use warnings;
my ($uniprot) = @_;
my @lines = grep { /^OG\s+Plasmid/ } split /\n/, $uniprot;
return [ map {  my $l = $_; $l =~ s/^OG\s+Plasmid\s*//; $l } @lines ];
$$;


You could then create a GIN index on the array elements:

CREATE INDEX uniprot_doc_plasmids ON uniprot_docs USING gin (plasmid_lines(doc));
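
Queries that filter on array containment can then use this index.  For example (a sketch only; the plasmid value here is made up):

-- Find documents whose plasmid lines contain a given value.
SELECT * FROM uniprot_docs WHERE plasmid_lines(doc) @> ARRAY['pXYZ123'];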

Neat!

Tuesday, January 24, 2017

PostgreSQL at 10 TB and Above

I have been invited to give a talk on PostgreSQL at 10TB and above in Malmo, Sweden.  The seminar is free to attend.  I expect to be talking for about 45 minutes with some time for questions and answers.  I have also been invited to give the talk at PG Conf Russia in March.  I do not know whether either will be recorded.  But for those in the Copenhagen/Malmo area, you can register for the seminar at the Eventbrite page.

I thought it would be helpful to talk about what problems will be discussed in the talk.

We won't be talking about the ordinary issues that come with scaling up hardware, or the issues of backup, recovery, or upgrades. Those could be talks of their own.  But we will be talking about some deep, specific challenges we faced, about some of the controversies in database theory that often come up in these areas, and about the solutions we arrived at.

Two of these challenges concern a subsystem in the database which handled large amounts of data in high-throughput tables (lots of inserts and lots of deletes).   The other two address volume of data.

  1. Performance problems in work queue tables caused by large numbers of deletions off the head of indexes, with different workers deleting off different indexes.  This is an atypical case where table partitioning could be used to solve a number of underlying problems with autovacuum performance and query planning.
  2. Race conditions in stored procedures between MVCC snapshots and advisory locks in the work queue tables.  We will talk about how this race condition happens and how we solved it without using row locks: by rechecking results in a new snapshot, which we decided was the cheapest solution to this problem (see the sketch after this list).
  3. Slow access and poor plans when accessing data in large tables.  We will talk about what First Normal Form really means, why we opted to break the requirements in this case, what problems this caused, and how we solved them.
  4. Finally, we will look at how new requirements on semi-structured data were easily implemented using procedural languages, and how we made these perform well.
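As a quick illustration of item 2, here is a minimal sketch of the advisory-lock-plus-recheck pattern, written against a hypothetical work_queue table (the table and column names are made up):

-- Step 1: pick a candidate row and take a transaction-scoped advisory lock on it.
SELECT id, payload
  FROM work_queue
 WHERE pg_try_advisory_xact_lock(id)
 LIMIT 1;

-- Step 2: the lock was taken against a possibly stale snapshot, so recheck in a
-- new statement (and hence, under READ COMMITTED, a new snapshot) that the row
-- still exists before processing it.  If it is gone, another worker got there
-- first; skip it and pick another candidate.  ($1 is the id claimed in step 1.)
SELECT 1 FROM work_queue WHERE id = $1;
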
In the end there are a number of key lessons one can take away regarding monitoring and measuring performance in a database.  These include being willing to tackle low-level details, to measure, and even to simulate performance.

Please join me in Malmo or Moscow for this talk.

Sunday, December 11, 2016

What Git got Right and Wrong

Having used various VCS solutions for a while, I think it is worth noting that my least favorite from a user experience perspective is git.  To be sure, git has better handling of whitespace-only merge conflicts, and a few other major features, than any other VCS that I have used.

And the data structure and model are solid.

But even understanding what is going on behind the scenes, I find git to be an unintuitive mess at the CLI level.

Let me start by saying things that git got right:


  1. The DAG structure is nice
  2. Recursive merges are good
  3. The data models and what goes on behind the scenes are solid
  4. It is the most full-featured VCS I have worked with
However, I still have real complaints with the software.  These include fundamentally different concepts merged under the same label, and the fact that commands may do many different things depending on how you call them.  Because the concepts are not clear, this is worse than a learning curve issue: one cannot have a good grasp of what git is doing behind the scenes, because what it is doing is not always clear.  In particular:

  1. In what world is a fast-forward a kind of merge?
  2. Is there any command you can explain (besides clone) in three sentences or less?
  3. What does git checkout do?  Why does it depend on what you check out?  Can you expect an intermediate user to know what to expect if you have staged changes on a file and then try to check out a copy of that file from another revision?
  4. What does git reset do from a user perspective?  Is there any way a beginner can understand that from reading the documentation?
  5. In other words, git commands try to do too much at once and this is often very confusing.
  6. Submodules are an afterthought and not well supported across the entire tool chain (why does git-archive not have an option to recurse into submodules?)
These specific complaints come from what I see as a lack of clarity regarding what concepts mean, and they indicate that those who wrote the tools did so at a time when the concepts were still not entirely clear in their own minds.  In essence it is not that things are named in unclear ways, but that the concepts themselves have unclear boundaries.

Some of this could be fixed in git.  A fast-forward could be referred to as a shortcut that avoids a merge rather than as a kind of merge.  Some of it could be fixed with better documentation (we could describe git reset by what the user-facing changes are, rather than by the internal changes).  Some of it would require a very different layout of the commands.

Thursday, August 18, 2016

PostgreSQL vs Hadoop

So one of the clients I do work with is moving a large database from PostgreSQL to Hadoop.  The reasons are sound: volume and velocity are major issues for them, PostgreSQL is not going away in their data center, and in their industry there is a lot more Hadoop usage and tooling than there is PostgreSQL tooling for life science analytics (Hadoop is likely to replace both PostgreSQL and, hopefully, a massive amount of data on NFS).  However, this has provided an opportunity to think about big data problems and solutions and their implications.  At the same time I have seen as many people moving from Hadoop to PostgreSQL as the other way around.  No, LedgerSMB will likely never use Hadoop as a backend.  It is definitely not the right solution to any of our problems.

Big data problems tend to fall into three categories, namely managing ever increasing volume of data, managing increasing velocity of data, and dealing with greater variety of data structure.  It's worth noting that these are categories of problems, not specific problems themselves, and the problems within the categories are sufficiently varied that there is no solution for everyone.  Moreover these solutions are hardly without their own significant costs.  All too often I have seen programs like Hadoop pushed as a general solution without attention to these costs and the result is usually something that is overly complex and hard to maintain, may be slow, and doesn't work very well.

So the first point worth noting is that big data solutions are specialist solutions, while relational database solutions for OLTP and analytics are generalist solutions.  Usually those who are smart start with the generalist solutions and move to the specialist solutions later, unless they know from the outset that a specialist solution addresses a specific problem they have.  No, Hadoop does not make a great general ETL platform.....

One of the key things to note is that Hadoop is built to solve all three problems simultaneously.  This means that you effectively buy into a lot of other costs if you are trying to solve only one of the V problems with it.

The single largest cost comes from the solutions to the variety-of-data issues.  PostgreSQL and other relational data solutions provide very good guarantees on the data because they enforce a lack of variety.  You force a schema on write, and if that is violated, you throw an error.  Hadoop applies a schema on read, so you can store data, then try to read it and get a lot of null answers back because the data didn't fit your expectations.  Ouch.  But that is very helpful when trying to make sense of a lot of non-structured data.
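
As a small illustration of schema on write (the table and constraint here are made up), the database rejects the offending row up front rather than storing it and surprising you at read time:

CREATE TABLE readings (
    taken_at timestamptz NOT NULL,
    value    numeric NOT NULL CHECK (value >= 0)
);

-- This insert fails with a check constraint violation instead of being stored.
INSERT INTO readings (taken_at, value) VALUES (now(), -1);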

Now, solutions to check out first if you are faced with volume and velocity problems include Postgres-XL and similar sharding/clustering solutions, but these really require good data partitioning criteria.  If your data set is highly interrelated, it may not be a good solution because cross-node joins are expensive.  You also wouldn't use these for smallish data sets, certainly not if they are under a TB, since the complexity cost of these solutions is not lightly undertaken.

Premature optimization is the root of all evil, and big data solutions have their place.  However, don't use them just because they are cool, new, or resume-building.  They are specialist tools, and overuse creates more problems than underuse.

Sunday, August 14, 2016

Forthcoming new scalable job queue extension

So, as some of you know, I now spend most of my time doing more general PostgreSQL consulting and a fair bit of time still on LedgerSMB.  One of my major projects lately has been a large scientific computing platform currently run on PostgreSQL but, due to the volume and velocity of data, being moved to Hadoop (the client maintains other fairly large PostgreSQL instances with no intention of moving them, by the way).

With this client's permission I have decided to take a lot of the work I have done in optimizing their job queue system and create a PostgreSQL extension from it.  The job queue currently runs tens of millions of jobs per day (meaning twice that number of write queries, and a fair number of read queries too) and is one of the most heavily optimized parts of the system, so this will be based on a large number of lessons learned about what is a surprisingly hard problem.

It is worth contrasting this to pg_message_queue, of which I am also the author.  pg_message_queue is intended as a light-weight, easy-to-use message queue extension that one can plug into other programs to solve common problems where notification and message transfer are the main needs.  This project will be an industrial-scale job queuing system aimed at massive concurrency.  As a result, simplicity and ease of use take second place to raw power and performance under load.  In other words, here I am not afraid to assume that the DBA and programming teams know what they are doing and have the expertise to read the manual and implement appropriately.

The first version (1.x) will support all supported versions of PostgreSQL and make the following guarantees:


  1. Massively parallel, non-blocking performance (we currently run with 600+ connections to PostgreSQL from worker processes).
  2. Partitioning, coalescing, and cancelling of jobs, similar in some ways to TheSchwartz
  3. Exponential pushback based on the number of times a job has failed (see the sketch after this list)
  4. Jobs may be issued again after deletion, but this can always be detected and bad jobs pruned
  5. Optional job table partitioning.
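As a rough sketch of the exponential pushback in item 3 (the table and column names here are hypothetical), each failure doubles the delay before the job becomes eligible to run again:

-- Record a failure and push the job back exponentially.
-- The failure_count on the right-hand side is the pre-update value.
UPDATE job_queue
   SET failure_count = failure_count + 1,
       run_after     = now() + interval '1 minute' * power(2, failure_count)
 WHERE id = $1;
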
The first client written will rely on hand-coded SQL along with DBIx::Class's schema objects.  This client will guarantee that:

  1. Work done by job modules always succeeds or fails within a transaction
  2. A job notifier class will be shown
  3. Pruning of completed jobs will be provided via the Perl module and a second query.
The history here is that this came from a major client's use of TheSchwartz, which they outgrew for scalability reasons.  While the basic approach is thus compatible, the following changes have been made:

  1. Job arguments are in JSON format rather than in Storable format in bytea columns
  2. Highly optimized performance on PostgreSQL
  3. Coalesce is replaced by a single integer cancellation column
  4. Jobs may be requested in batches of various sizes
2.x will support 9.5+ and dispense with the need for both advisory locks and rechecking.  I would like to support some sort of graph management as well (i.e. a graph link that goes from one job type to another which specifies "for each x create a job for y" semantics).  That is still all in design.
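
For the curious, the 9.5 feature that makes this possible is presumably FOR UPDATE SKIP LOCKED.  Here is a minimal sketch of how a worker could claim a job with it, again against a hypothetical job_queue table:

-- Claim the next unclaimed job; rows locked by other workers are simply
-- skipped, so no advisory locks and no snapshot recheck are needed.
UPDATE job_queue
   SET claimed_at = now()
 WHERE id = (SELECT id
               FROM job_queue
              WHERE claimed_at IS NULL
              ORDER BY id
              LIMIT 1
                FOR UPDATE SKIP LOCKED)
RETURNING id, payload;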

Friday, August 5, 2016

use lib '.' considered harmful (exploits discussed)

With the discussion of CVE-2016-1238, a quick and easy fix that has been suggested for broken code is to add the following line to the top of broken Perl scripts.  Note this applies to Perl as run anywhere, whether PL/PerlU, plain perl, or something else.

use lib '.';

In some corners, this has become the goto solution for the problem (pun quite deliberate).  It works, gets the job done, and introduces subtle, hidden, and extremely dangerous problems in the process.

For those interested in the concerns specific to PostgreSQL, these will be discussed near the end of this article.

Security and the Garden Path


I am borrowing an idea here from linguistics, the idea of the garden path, as something that I think highlights a lot of subtle security problems.  Consider the newspaper headline "Number of Lothian patients made ill by drinking rockets."  It starts off simply enough, and when you get to the end you realize you must have misread it (and you probably did: the number of patients made ill by drinking shot up, not that some people got sick because they drank hydrazine).  The obvious reading and the logical reading diverge, and this leads to a lot of fun in linguistic circles.

The same basic problem occurs with regard to security.  Usually security problems occur for one of two reasons: either people do something obviously insecure (plain text authentication for FTP users where it matters), or they do something that looks on the surface like it is secure but behind the scenes does something unexpected.

Perl has a few surprises here because parsing and running a script is a multi-pass process, but we tend to read it as a single pass.  Normally this doesn't cause real problems, but in certain cases there are very subtle dangers lurking.  Here, with use lib '.', it is possible to inject code into a running program as long as an attacker can get a file placed in the current working directory of the program.

Relative paths, including the current working directory, do have legitimate use cases, but the problems and pitfalls must be understood before selecting this method.

What the lib pragma does


Perl looks for files to require or include based on a globally defined array of paths called @INC.  The lib pragma stores a copy of the original @INC at first use, and then ensures that the specified directory occurs at the start of the search order.  So directories specified with use lib are searched before the default library directories.  This becomes important as we look at how Perl programs get executed.

How Perl runs a program


Perl runs a program in two passes.  First it creates the parse tree, then it runs the program.  This is a recursive process, and because of how it works, it is possible for malicious code that gets accidentally run in this process to transparently inject code into this portion (and other portions) of the Perl process.

Keep in mind that this makes Perl a very dynamic language which, on one hand, has serious garden path issues, but on the other is amazingly flexible.

During the parse stage, Perl systematically works through the file, generating a parse tree and running any "BEGIN" blocks, "use" statements, and "no" statements.  This means that injected code can be run even if later syntax errors appear to prevent the bulk of the program from running at all, or if earlier errors cause a run-time exception.

After this process finishes, Perl executes the parse tree that results.  This means that Perl code can rewrite the parse tree before your code is ever really run, and that code can be inserted into that part of the process.

Transparent code injection during 'use'


Consider a simple Perl script:

#!/usr/bin/perl

use lib '.';
use Cwd;
use 5.010;
use strict;
use warnings;

say getcwd();

Looks straight-forward.  And in most cases it will do exactly what it looks like it does.  It loads the standard Cwd module and prints out the current working directory.

However, suppose I run it in a different directory, where I add two additional files:

Cwd.pm contains:

package Cwd;

use Injected;

1;


hmmmm  that doesn't look good. What does Injected.pm do?


package Injected;
use strict;

sub import {
   local @INC = @INC;
   my ($module) = caller;
   warn $module;
   delete $INC{'Injected.pm'};
   delete $INC{"$module.pm"};
   @INC = grep { $_ ne '.' } @INC;
   eval "require $module";
   warn "Got you!, via $module";

}

1;


So when the fake Cwd imports Injected, Injected deletes itself from the record of having been included, deletes its caller too, reloads the correct caller (not from the current working directory), and then executes some code (here a harmless warning).

The fake Cwd.pm then returns success, and test.pl runs Cwd->import(), which is now the correct one, but we have already run unintended code that could in theory do anything.

Any program capable of being run in arbitrary directories, written in Perl, which has this line in it (use lib '.') is subject to arbitrary code injection attacks using any module in the dependency tree, required or not.

Instead, do the opposite (where you can)


As a standard part of the boilerplate in any secure Perl program, I strongly recommend adding the following line to the top of any script, as long as modules don't add the current directory back in behind your back (it would be extremely rare that they would):

no lib '.';

Note that this strips out the current working directory even if it is supplied as a command-line argument, so it may not be appropriate in all cases.  Use common sense, do some testing, and document this as desired behavior.  Note that one can still invoke with perl -I./. in most cases, so it is possible to turn this safety off.  Additionally, if you put that at the start of your module, something you include could possibly put it back in.

Safer Alternatives


In a case where you need a path relative to the script being executed, FindBin is the ideal solution.  It gives you a fixed path relative to the script being run, which is usually sufficient for most cases of an application installed on a system as a third-party application.  So instead you would do:

use FindBin;
use lib $FindBin::Bin;

Then the script's directory will be in the include path.

PL/PerlU notes:


I always add the explicit rejection of the current working directory in my PL/PerlU functions.  However, if someone has a program that is broken by CVE-2016-1238-related fixes, it is possible that they would add a use lib '.' to a Perl module, which is a bad idea.  As discussed in the previous post, careful code review is required to be absolutely safe.  Additionally, it is a very good idea to periodically check the PostgreSQL data directory for Perl modules, which would indicate a compromised system.
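
For illustration, here is a hedged sketch of what that boilerplate can look like in a PL/PerlU function (the function itself is made up; the no lib '.' line is the point):

CREATE OR REPLACE FUNCTION doc_line_count(doc text)
RETURNS integer
LANGUAGE plperlu AS
$$
no lib '.';     -- reject the current working directory before anything else is loaded
use strict;
use warnings;
my ($doc) = @_;
my @lines = split /\n/, $doc;
return scalar @lines;
$$;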