Tuesday, January 26, 2016

On Contributor Codes of Conduct and Social Justice


The PostgreSQL, Ruby, and PHP communities have all been considering codes of conduct for contributors. The LedgerSMB community already uses the Ubuntu Code of Conduct.  Because this addresses many projects, I am syndicating this further than where there are current issues.  This is not a technical post and it covers a wide range of very divisive issues for a very diverse audience.  I can only hope that the nuance I am trying to communicate comes across.

Brief History


A proximal cause seems to be an event referred to as "Opalgate," where an Italian individual who claimed to be part of the Opal project made some unrelated tweets in an exchange about the politics of education and how gender should be presented, and some people took offense and demanded his resignation (at least that is my reading of the Twitter exchange, though I have been outside the US long enough to lose the context in which an American in the US would likely read it).  The details are linked below, but the core question -- how much major contributors to projects must refrain from saying anything at all about divisive issues -- is a recurring topic.  Moreover it is a legitimate one.

Like some of my blog posts, this goes into touchy territory.  I am discussing things which require a great deal of nuance.  Chances are, regardless of where you sit on some of these issues, you will be offended by things I say, but there are worse things than to be offended (one of them is never to be challenged by different viewpoints).

I write here as someone who has lived in a number of very different cultures and who can see perspectives on many of these issues which are not present in American political discourse.  For this reason, I think it is important for me to share the concerns I see, because open source software maintainers often lack a perspective from outside Western countries, or even outside the US.

Of course as open source software maintainers we want everyone to feel safe and valued as members of the community.  But cultural tensions and ways of life do crop up and taking a position on these as a community will always do more harm than good.

Background Reading regarding Opalgate and the question of so-called "social justice warriors" in open source


It may seem strange to put a list of links for background reading near the start of an article, but I want to make sure that such material is available up front.  People can read about Opalgate here and the ongoing debate between various parties about it.  It's important background reading but somewhat peripheral to the overall problems involved.  It may or may not be the best example of the difficulties in running cross-cultural projects but it does highlight the difficulties that come in addressing diverse community bases, those which may have deep philosophical disagreements about things which people take very personally.

In the interest of full disclosure, I too worry that there is too much eagerness to liberate children from concepts of gender and too little thought about how this can and will be abused, and what the life costs for the children actually will be.  I believe that we must be human and humane to all, but I am concerned that the US is going down a path that strikes me as anything but that in the long run.  That doesn't mean the concerns of the trans community in the US should be ignored, but neither does it mean they should be paramount.  As communities we need to come together to solve problems, not fight culture wars.

Twitter is not a medium conducive to thoughtful exchange, so I have to cut some slack here; it is probably not the wisest medium for discussing controversial topics.  But people around the world have deep differences in views on major controversies.  My wife, for example, is far more opposed to abortion than I am, and having come to a deeper understanding of her culture, I don't disagree that in her cultural context it is more harmful.  But that brings me to another problem: many issues are contextual, and we cannot see how others really are impacted by such changes, particularly when they are forced from the outside.

But my view doesn't matter everywhere.  It matters in my family, my discussions with people I know, and so forth.  But most of the world is not my responsibility nor should it be.  These are not entirely easy issues and there should be room for disagreement.

Is Open Source Political?


Coraline Ada Ehmke's basic argument is that open source is inherently political, that it seeks a positive change in the world, and that it therefore should ally itself with others sharing the same drive to make the world a better place.  I think this viewpoint is misguided, but only because it is half-wrong.

Aristotle noted that all human relationships are necessarily political.  The three he chose as primary in Politics are illustrative:  master and slave (we could update to boss and worker); husband and wife; and king and subject.  To Aristotle, the human being alone is incomplete.  We are our relationships, and our politics follows from them.  While there has been an effort to separate the personal and the political in modern times, feminist historians have kept this tradition alive and well.  A notion of the political grounded in humans as social animals is fundamentally more conducive to justice than cold, mechanical, highly engineered social machinery.  Moreover, Aristotle notes that all communities are built on some concept of the good: humans only want things that seem good to them, so we can assume that all groups seek a better world, but we don't always know which ones deliver, and that is the problem.

Open source begins not with an ideology but with a conviction.  Not everybody shares the same conviction.  Not everyone participates in open source for the same reason.    But everyone has a reason, some conviction that what they are doing is good.  There is enough commonality for us all to work together, but that commonality is not as strong as one may think.

In a previous post on this blog I argued for a very different understanding of software freedom than Richard Stallman supposes, for example.  While he holds a liberal enumerated liberties view, I hold a traditionalist work-ownership view.  Naturally that leads to different things we look for in an ideal license.

And the diversity in viewpoint does not stop there.  Some come to open source because they believe that open source is a better way of writing software.  Some because they believe that open source software delivers benefits in use.  But regardless of our disagreement we share the understanding that open source software brings community and individual benefits.

In two ways then is open source software political:
  1. Communities require governance and this is inherently political, and
  2. To the extent there is a goal to transform the software industry to one of open source that is political.
The first, as we will see, is a major problem.  Open source communities are diverse in a way few Americans can fully comprehend (we like to think everyone is like us and that there is one right way, the American Way, whether in industry -- the right -- or formulations of rights -- the left).  Thus most discussions end up being Western-normative (and in particular American-normative) and disregard perspectives from places like India, Malaysia, Indonesia, and so forth.

However it is worth coming back to the point that what brings us together is an economic vision.  Yes, that is intrinsically and highly political, but it also has consequences for other causes and therefore it is worth being skeptical of alliances with groups in other directions.  What would an open source-based economy look like? What would the businesses look like? Will they be the corporations of today or the perpetual family businesses and trades of yesteryear?  And if the latter, what is the implication for the family?  Many of these questions (just like questions of same-sex marriage) depend in large part on the current social institutions in a culture -- the implications of an industrial, corporate, weak family society adopting something like same-sex marriage are very different than in an agrarian or family-business, strong family society.  My view is that these will likely have different answers in different places.

Thus when an open source community takes a position on, for example, gay rights in the name of providing a welcoming community, it makes the community openly hostile to a very large portion of the world, and I think that is not what we want.  Moreover such a decision is usually a product of white, Western privilege and effectively marginalizes those in so-called developing countries who want to see their countries develop economically in a very different direction than the US has.  Worse, this is not an unintended side effect but the whole point.

A brief detour into the argument over white privilege


A discussion of so-called white privilege is needed I think for three groups reading this:
  • Non-Americans who will have trouble understanding the idea as it applies to American society (like all social ideas, it does not apply to all societies or even where it does apply, it may not in the same way).
  • White Americans who seem to have trouble understanding what people of color in the US mean when they use the term.
  • Activists who want to use the idea as a political weapon to enforce a sort of orthodoxy

I mentioned Western-normative above.  It is worth pointing out that this forms part of a larger set of structures that define what is normal or central in a culture, and what is abnormal, marginalized (or perhaps liminal).  It is further worth noting that the perception of these models is more acute among those who are not treated as the paragons of normality.  In the US, the paragon of normality is the white, straight male.  But unspoken here is that it is the white, straight, urban, wealthy American male (or maybe European, they are white too).  Everyone else (women, people of color, Africans, Asians, etc.) should strive to be like these paragons of success (I, myself, having lived most of my life now either outside the US or in the rural parts, am most certainly not included in this paragon of normality, but nevertheless it took years of marriage to someone from a different culture and race to begin to be able to partially see a different perspective).

Now, it doesn't follow that white straight males live up to this image (which is one reason why white privilege theory has proven controversial among the arguably privileged), even where there is wealth, one is brought up in a nice neighborhood in the city, and so on.  But that isn't really the point.  The point is that society holds these things to be *normal* and everything else to be normal only to the extent it is like this model.  It would be better and more accurate to call this a model of normality rather than privilege, and to state at the outset that we cannot really walk a mile in the shoes of people from across many social borders (culture included).

White privilege is real, as is male privilege (in some areas, particularly employment), urban privilege, American privilege, Western privilege, even female privilege (in some areas, particularly family law).

Issues exist in a sticky web of culture, and no culture is perfect


These issues of privilege aren't necessarily wrong in context:  it seems unlikely that the workplace can be made less male-normative without men sharing equally in the duties and rights of childrearing, but enforcing that cuts against the goal by some feminists of liberating women from men (and also exists in tension with things like same-sex marriage and gender-nonessentialism).  Insisting that men get the same amount of parental leave as women cuts one direction, but insisting that single women get free IVF cuts the other (both of these are either the case in Sweden or efforts are being made to make them the case).  In other words, addressing male privilege requires a transformation of the economic and family order together, in such a way that having children becomes an economic investment rather than an economic burden.  But then that has implications for the idea of gay rights as we understand the concept in the West because if having and raising children becomes normative then one is providing a sort of parental privilege, and gender equality becomes based on heteronormativity.

But the ultimate white privilege is to deny it is a factor when one uses one's own perception that other cultures are homophobic or transphobic to justify one's own racist paternalism.  No need to understand why.  We are white.  We know what is right.  We just need to educate them so they can join the ranks of the elite culturally white enlightened liberals as well.  Most of the world, however, disagrees, and as maintainers of open source projects we have to somehow keep the peace. (Note I use the term liberal as it is used in the history of ideas --  in the West it is no less prevalent on the mainstream right than on the mainstream left, though the application may be different.)

Since many of these issues necessarily exist in tension with each other, there is no such thing as a perfect culture.  It isn't even clear the West does better than Southeast Asia on the whole (in fact I would say SE Asia does better than the West on the whole).  But every culture is an effort at these tradeoffs, and it is not the job of open source communities to push Western changes on the rest of the world.

What is Social Justice?  Two Theories and a Problem


If open source is inherently political then social justice must in some way matter to open source.  Naturally we must understand what social justice is and how it applies.  Certainly a sense of being treated fairly by the community is essential for contributors from all walks of life.  The cult of meritocracy is an effort at social justice within the community.  As some argue, it is not entirely without problems (see below), but for a technical community it is a start.

Western concepts of justice today tend to stress individuality, responsibility, and autonomy.  The idea is that justice is something that exists between individuals, and maybe between individuals and the state.  And while contemporary Western social justice theorists on the left try to relate the parts to the whole of society, it isn't clear that there is room for any parts other than the isolated individual and the state in their theories.  If one starts with the view that humans are born free but are everywhere in chains (Rousseau), then the job of the state is to liberate people from each other, and that leaves no room for any other parts.

The individualist view of justice, when seen as primary, breaks down in a number of important ways.  The most important is that it provides no real way of understanding parts and how they relate to the whole.  Thus the state becomes stronger and isolates people more from each other, and predictability becomes more important than human judgement.  Separatism cannot be tolerated, and assimilationism becomes the rallying cry when it comes to how the central model of normality should deal with those outside.  In other words, the only way this approach can deal with those on the margins is to destroy their culture and assimilate the individuals remaining.  Resistance must be made futile (Opalgate can be seen as such an effort).  For this reason, this view of justice is incompatible with real cultural pluralism.  This is not a question of the political spectrum in the US or Europe.  It is a fundamental cultural assumption in much of the West.  Interestingly, the insistence that the personal is political means that intellectual feminism already exists in tension with this cold, mechanical view of justice.

Another view of justice can be found in Thomas Aquinas's view that in addition to justice between individuals, there is a need to recognize that just as individuals are parts in relation to the whole, so are other organs within society.  In other words, justice is a function of power, and justice is in part about the just design and proper distribution of power and responsibility.  In this regard, Aquinas built on the thought experiments of Plato's Republic and Aristotle's Politics.  Key questions of social justice therefore include the structure of an open source community, the relationship between the parts of the community (how users and developers interact and share power and responsibility), the relationships between open source projects, and so forth.

In the end though, there was a reason why Socrates eventually rejected every formulation of justice he pondered.  Justice itself is complex and to formulate it removes a critical component of it, namely human judgement when weighing harms which are not directly comparable.  I think it is therefore quite necessary for people to remain humble about the topic and to realize that nobody sees all the pieces, and that we as humans learn more from disagreement than from agreement.  Therefore every one of us is ignorant to some extent on the nature of justice and so disagreements are healthy.

Open Source Projects and So-Called Social Justice Warriors


Coraline Ehmke, in her post on the Opalgate ticket, asked:

"Is this what the other maintainers want to be reflected in the project? Will any transgender developers feel comfortable contributing?"

This is a good question, but another question needs to be asked as well.  Given that a lot of people live in societies with very different family and social structures, should people feel comfortable using software if the maintainers of the project have come out as openly hostile to the traditional family structures of their culture?  Does not a community that is welcoming of all need to avoid the impulse to delegitimize social institutions in other cultures, ones where one necessarily lacks an understanding of how they play into questions of economic support and power?  If open source is already political, do we want to ally ourselves with groups that could alienate important portions of our user base by insisting that they change their way of life?

It is important that we maintain a community that is welcoming to all, but that means we have to work with people we disagree with.  A mere difference of opinion should never be sufficient to trigger a problem with the code of conduct and expressing an opinion outside community resources should never be sufficient to consider the community unduly unwelcoming.  A key component of the community is whether people can work together with people when they disagree, and forcing agreement or even silencing opposition is the opposite of social justice when it comes to a large-reaching global project.

Should open source communities eject social justice warriors as ESR suggests?  Not if they are willing to work comfortably with people despite disagreements on hot button issues.  Should we welcome them?  If they are willing to work with people comfortably despite disagreements on hot button issues.  Should we require civility?  Yes.  Should we as communities take stances on hot button issues internationally?  Absolutely not.  What about as individuals?  Don't we have a civic duty to engage in our own communities as we feel best?  And if both those are true, must we not be tolerant of a wide range of differences in opinion, even those we find deeply and horribly wrong?

Wednesday, September 17, 2014

PGObject Cookbook Part 2.1: Serialization and Deserialization of Numeric Fields

Preface


This article demonstrates the simplest cases regarding autoserialization and deserialization to the database of objects in PGObject.   It also demonstrates a minimal subset of the problems that three valued logic introduces and the most general solutions to those problems.  The next article in this series will address more specific solutions and more complex scenarios.

The Problems


Oftentimes we want to have database fields automatically turned into object types which are useful to an application.  The example here turns SQL numeric fields into Perl Math::BigFloat objects.  However the transformation isn't perfect and, if not carefully done, can be lossy.  Most application types don't support database nulls properly, and therefore a NULL making a round trip may end up with an unexpected value if we aren't careful.  Therefore we have to create our type in a way which can make round trips in a proper, lossless way.

NULLs introduce another subtle problem with such mappings, in that object methods are usually not prepared to handle them properly.  One solution here is to try to follow the basic functional programming approach and copy on write.  This prevents a lot of problems.  Most Math::BigFloat operations do not mutate the objects so we are relatively safe there, but we still have to be careful.

The simplest way to address this is to build a basic sensitivity to three-valued logic into one's approach.  However, this poses a number of problems, in that one can accidentally assign a real value where a null was intended (or the reverse), and that can impact things elsewhere.

A key principle for all our types is that they should handle a null round trip properly for the data type, i.e. a null from the db should be turned back into a null on database insert.  We generally allow programmers to check the types for nulls, but don't explicitly handle them with three-valued logic in the application (that's the programmer's job).

The Example Module and Repository


This article follows the code of PGObject::Type::BigFloat.  The code is licensed under the two-clause BSD license, as is the rest of the PGObject framework.  You can read the code to see the boilerplate; I won't be including it here.  I will note, though, that this type extends the Math::BigFloat library, which provides arbitrary precision arithmetic in Perl and is a good match for the PostgreSQL numeric types LedgerSMB uses.

NULL handling


To solve the problem of null inputs we extend the hashref slightly with a key _pgobject_undef and allow this to be set or checked by applications with a function "is_undef."  This is fairly trivial:

sub is_undef {
    my ($self, $set) = @_;
    $self->{_pgobject_undef} = $set if defined $set;
    return $self->{_pgobject_undef};
}

How PGObject Serializes


When a stored procedure is called, the mapper class calls PGObject::call_procedure with an enumerated set of arguments.  A query is generated to call the procedure, and each argument is checked for a "to_db" method.  That method, if it exists, is called and the output used instead of the argument provided.  This allows an object to specify how it is serialized.

The to_db method may return either a literal value or a hashref with two keys, type and value.  If the latter, the value is used as the value literal and the type is the cast type (i.e. it generates ?::type for the placeholder and binds the value to it).  This hash approach is automatically used when bytea arguments are found.

The code used by PGObject::Type::BigFloat is simple:

sub to_db {
    my $self = shift @_;
    return undef if $self->is_undef;
    return $self->bstr;
}

Any type of course can specify a to_db method for serialization purposes.
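
For illustration, a hypothetical type that needs an explicit cast could return the hashref form described above (the package name, fields, and cast type here are invented for this sketch):

package My::Type::Point;   # hypothetical type, not part of PGObject

sub to_db {
    my $self = shift;
    # Returning type plus value makes the framework generate ?::point
    # as the placeholder and bind the string to it.
    return { type => 'point', value => "($self->{x},$self->{y})" };
}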

How and When PGObject Deserializes


Unlike serialization, deserialization from the database can't happen automatically without the developer specifying which database types correspond to which application classes, because multiple types could serialize into the same application classes.  We might even want different portions of an application (for example in a database migration tool) to handle these differently.

For this reason, PGObject has what is called a "type registry" which specifies which types are deserialized and as what.  The type registry is optionally segmented into several "registries" but most uses will in fact simply use the default registry and assume the whole application wants to use the same mappings.  If a registry is not specified the default subregistry is used and that is consistent throughout the framework.

Registering a type is fairly straight forward but mostly amounts to boilerplate code in both the type handler and using scripts.  For this type handler:

sub register{
    my $self = shift @_;
    croak "Can't pass reference to register \n".
          "Hint: use the class instead of the object" if ref $self;
    my %args = @_;
    my $registry = $args{registry};
    $registry ||= 'default';
    my $types = $args{types};
    # Fall back to the floating point and numeric types if none are given
    $types = ['float4', 'float8', 'numeric'] unless defined $types and @$types;
    for my $type (@$types){
        my $ret =
            PGObject->register_type(registry => $registry, pg_type => $type,
                                  perl_class => $self);
        # Propagate the failure value if any single registration fails
        return $ret unless $ret;
    }
    return 1;
}

Then we can just call this in another script as:

PGObject::Type::BigFloat->register;

Or we can specify a subset of types or different types, or the like.
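
Based on the register() signature above, we could, for instance, register against a named registry and map only numeric (the registry name here is purely illustrative):

PGObject::Type::BigFloat->register(
    registry => 'myapp',      # hypothetical registry name
    types    => ['numeric'],  # skip float4 and float8
);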

The deserialization logic is handled by a method called 'from_db' which takes in the database literal and returns the blessed object.  In this case:

sub from_db {
    my ($self, $value) = @_;
    my $obj = "$self"->new($value);
    $obj->is_undef(1) if ! defined $value;
    return $obj;
}

This supports subclassing, which is in fact the major use case.

Use Cases


This module is used as the database interface for numeric types in the LedgerSMB 1.5 codebase.  We subclass this module and add support for localized input and output (with different decimal and thousands separators).  This gives us a data type which can present itself to the user in one format and to the database in another.  The module could be further subclassed to make nulls contagious (which in this module they are not) and the like.
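
A minimal sketch of that idea (the package and method names below are hypothetical and are not the actual LedgerSMB code):

package My::Localized::Number;   # hypothetical subclass for illustration
use parent 'PGObject::Type::BigFloat';

# Present the value to the user with a comma as the decimal separator;
# the inherited to_db still hands the database a plain numeric literal.
sub to_output {
    my $self = shift;
    my $str  = $self->bstr;
    $str =~ s/\./,/;
    return $str;
}

1;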

Caveats


PGObject::Type::BigFloat does not currently make null handling contagious, and as such it probably never will, as this is part of our philosophy of handing control to the programmer.  Those who do want contagious nulls can override additional methods from Math::BigFloat to provide them in subclasses.

A single null can go from the db into the application and return to the db and be serialized as a null, but a running total of nulls will be saved in the db as a 0.  To this point, that behavior is probably correct.  More specific handling of nulls in the application, however, is passed to the developer which can check the is_undef method.
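
A minimal sketch of that single-null round trip (assuming PGObject::Type::BigFloat is installed; variable names are illustrative, and depending on your Math::BigFloat version the constructor may warn about the undef argument):

use PGObject::Type::BigFloat;

# A NULL coming out of the database...
my $num = PGObject::Type::BigFloat->from_db(undef);
print $num->is_undef ? "db NULL\n" : "has a value\n";

# ...serializes back to the database as a NULL
my $literal = $num->to_db;   # undef, bound as NULL on insert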

Next In Series:  Advanced Serialization and Deserialization:  Dates, Times, and JSON

Monday, September 15, 2014

PGObject Cookbook Part 1: Introduction

Preface


I have decided to put together a PGObject Cookbook, showing the power of this framework.  If anyone is interested in porting the db-looking sides to other languages, please let me know.  I would be glad to provide whatever help my time and skills allow.

The PGObject framework is a framework for integrating intelligent PostgreSQL databases into Perl applications.  It addresses some of the same problems as ORMs but does so in a very different way.  Some modules are almost ORM-like, and more such modules are likely to be added in the future.  However, unlike an ORM, PGObject mostly serves as an interface to stored procedures, and whatever code generation routines are added are not intended to be quickly changed.  Moreover it only supports PostgreSQL because we make extended use of PostgreSQL-only features.

For those who are clearly not interested in Perl, this series may still be interesting as it not only covers how to use the framework but also various problems that happen when we integrate databases with applications.  And there are people who should not use this framework because it is not the right tool for the job.  For example, if you are writing an application that must support many different database systems, you probably will get more out of an ORM than you will this framework.  But you still may get some interesting stuff from this series so feel free to enjoy it.

Along the way this will explore a lot of common problems that happen when writing database-centric applications and how these can be solved using the PGObject framework.  Other solutions of course exist and hopefully we can talk about these in the comments.

Much of the content here (outside of the prefaces) will go into a documentation module on CPAN.  However I expect it to also be of far more general interest since the problems are common problems across frameworks.

Introduction


PGObject is written under the theory that the database will be built as a server of information and only loosely tied to the application.  Therefore stored procedures should be able to add additional parameters without expecting the application to know what to put there: if a new parameter can accept a null and provide the same answer as before, the application can be assured that the database is still usable.
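
As a rough illustration of what that looks like on the database side (the table and function here are invented for this sketch and are not part of PGObject):

-- An application that only knows about in_name can omit in_country
-- (or pass NULL) and get the same results it always did.
CREATE FUNCTION customer_search(in_name text, in_country text DEFAULT NULL)
RETURNS SETOF customer LANGUAGE sql AS
$$
    SELECT * FROM customer
     WHERE name ILIKE in_name || '%'
       AND (in_country IS NULL OR country = in_country);
$$;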

The framework also includes a fairly large number of other capabilities.  As we work through we will go through the main areas of functionality one at a time, building on the simplest capabilities and moving onto the more advanced.  In general these capabilities can be grouped into basic, intermediate, and advanced:

Basic Functionality



  1. registered types, autoserialization, and autodeserialization.
  2. The simple stored procedure mapper
  3. Aggregates and ordering
  4. Declarative mapped methods

Intermediate Functionality


  1. The Bulk Loader
  2. The Composite Type stored procedure mapper
  3. The database admin functions

Advanced Functionality


  1. Memoization of Catalog Lookups
  2. Writing your own stored procedure mapper

This series will cover all the above functionality and likely more.  As we get through the series, I hope that it will start to make sense and we will start to get a lot more discussion (and hopefully use) surrounding the framework.

Design Principles


The PGObject framework came out of a few years of experience building and maintaining LedgerSMB 1.3.  In general we took what we liked and what seemed to work well and rewrote those things that didn't.  Our overall approach has been based on the following principles:
  • SQL-centric: Declarative, hand-coded SQL is usually more productive than application programming languages.  The system should leverage hand-coded SQL.
  • Leveraging Stored Procedures and Query Generators: The system should avoid having people generate SQL queries themselves as strings and executing them.  It's better to store them persistently in the db  or generate well-understood queries in general ways where necessary.
  • Flexible and Robust: It should be possible to extend a stored procedure's functionality (and arguments) without breaking existing applications.
  • DB-centric but Loosely Coupled:  The framework assumes that databases are the center of the environment, and that it is a self-contained service in its own right.  Applications need not be broken because the db structure changed, and the DB should be able to tell the application what inputs it expects.
  • Don't Make Unnecessary Decisions for the Developer:  Applications may use a framework in many atypical ways and we should support them.  This means that very often instead of assuming a single database connection, we instead provide hooks in the framework so the developer can decide how to approach this.  Consequently you can expect your application to have to slightly extend the framework to configure it.
This framework is likely to be very different from anything else you have used.  While it shares some similarities with iBatis in the Java world, it is unique in the sense that the SQL is stored in the database, not in config files.  And while it was originally inspired by a number of technologies (including both REST and SOAP/WSDL), it is very much unlike any other framework I have come across.

Next in Series:  Registered Types:  Autoserialization and Deserialization between Numeric and Math::BigFloat.

Sunday, September 14, 2014

LedgerSMB 1.4.0 Released


15 September 2014, London. The LedgerSMB project - all-volunteer developers and contributors - today announced LedgerSMB 1.4.0.

Based on an open source code base first released in 1999, the LedgerSMB project was formed in 2006 and saw its 1.0 release in the same year. It has now seen continuous development for over eight years and that shows no signs of slowing down.

"LedgerSMB 1.4 brings major improvements that many businesses need," said Chris Travers, who helped found the project. "Businesses which do manufacturing or retail, or need features like funds accounting will certainly get much more out of this new release."

Better Productivity


LedgerSMB 1.4 features a redesigned contact management framework that allows businesses to better keep track of customers, vendors, employees, sales leads, and more. Contacts can be stored and categorized, and leads can be converted into sales accounts.

Additionally, a new import module has been included that allows businesses to upload csv text files to import financial transactions and much more. No longer is data entry something that needs to be done entirely by hand or involves customizing the software.

Many smaller enhancements are here as well. For example, shipping labels can now be printed for invoices and orders, and user management workflows have been improved.

Better Reporting


The reporting interfaces have been rewritten in LedgerSMB 1.4.0 in order to provide greater flexibility in both reporting and in sharing reports. Almost all reports now include a variety of formatting options including PDF and CSV formats. Reports can also be easily shared within an organization using stable hyperlinks to reports. Additionally the inclusion of a reporting engine means that it is now relatively simple to write third-party reports which offer all these features. Such reports can easily integrate with LedgerSMB or be accessed via a third party web page.

Additionally, the new reporting units system provides a great deal more flexibility in tracking money and resources as they travel through the system. Not only can one track by project or department, but funds accounting and other specialized reporting needs are possible to meet.

Better Integration


Integration of third-party line-of-business applications is also something which continues to improve. While integration has always been possible, owing to the open nature of the code and db structure, it has become easier as more logic is moved to where it can be easily discovered by applications.

There are two major improvement areas in 1.4. First additional critical information, particularly regarding manufacturing and cost of goods sold tracking, has been moved into the database where it can be easily shared by other applications. This also allows for better testability and support. Secondly LedgerSMB now offers a framework for web services, which are currently available for contact management purposes, allowing integrators to more easily connect programs together.

Commercial Options


LedgerSMB isn't just an open source project. A number of commercial companies offer support, hosting, and customization services for this ERP. A list of some of the most prominent commercial companies involved can be found at http://ledgersmb.org/topic/commercial-support

Thursday, September 11, 2014

Math and SQL Part 6: The Problem with NULLs

This will be the final installment on Math and SQL and will cover the problem with NULLs.  NULL handling is probably the most poorly thought-out feature of SQL and is inconsistent generally with the relational model.  Worse, a clear mathematical approach to NULLs is impossible with SQL because too many different meanings are attached to the same value.

Unfortunately, nulls are also indispensable because wider tables are more expressive than narrower tables.  This makes advice such as "don't allow nulls in your database" somewhat dangerous because one ends up having to add them back in fairly frequently.

At the same time understanding the problems that NULLs introduce is key to avoiding the worst of the problems and managing the rest.

Definition of a Null Set


A null set is simply a set with no members.  This brings us to the most obvious case of the use of a NULL, used when an outer join results in a row not being found.  This sort of use by itself doesn't do too much harm but the inherent semantic ambiguity of "what does that mean?" also means you can't just substitute join tables for nullable columns and solve the problems that NULLs bring into the database. This will hopefully become more clear below.

Null as Unknown


The first major problem surfaces when we ask the question, "when I do a left join and the row to the right is not found, does that mean we don't know the answer yet or that there is no value associated?"  A missing result from an outer join will sometimes mean that the answer is not yet known, if only because we are still inserting the data in stages.  But it can also mean that there is simply no value associated.  In almost all databases, both readings are possible in this situation.

But then there is no additional harm done in allowing NULLs to represent unknowns in the tables themselves, right?

Handling NULLs as unknown values complicates database design and introduces enough problems that many experts, like Chris Date, tend to be generally against their use.  The problem is that using joins doesn't solve the problem but instead only creates additional failure cases to be aware of.  So very often, people do use NULL in the database to mean unknown despite the problems.

NULL as unknown introduces problems to predicate logic because it introduces three-valued logic (true, false, and unknown), but these are typically only problems when one is storing a value (as opposed to a reference such as a key) in the table.  1 + NULL IS NULL.  NULL OR FALSE IS NULL.  NULL OR TRUE IS TRUE.  This makes things complicated.  But sometimes we must....
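
A quick way to see this behavior, for example in psql (purely illustrative):

SELECT 1 + NULL        AS addition,   -- NULL
       (NULL OR FALSE) AS or_false,   -- NULL
       (NULL OR TRUE)  AS or_true;    -- TRUE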

Null as Not Applicable


One severe antipattern that is frequently seen is the use of NULL to mean "Not Applicable" or "No Value."  There are a few data types which have no natural empty/no-op types.  Prime among these are numeric types.  Worse, Oracle treats NULL as the same value as an empty string for VARCHAR types.

Now, the obvious problem here is that the database doesn't know that such a NULL means "not applicable" rather than "unknown," and therefore you end up having to track this yourself, use COALESCE() functions to convert to sane values, etc.  In general, if you can avoid using NULL to mean "Not Applicable" you will find it worthwhile.

Now, if you have to do this, one strategy to make this manageable is to include other fields to tell you what the null means.  Consider for example:

CREATE TABLE wage_class (
   id int not null,
   label text not null
);

INSERT INTO wage_class VALUES(1, 'salary'), (2, 'hourly');

CREATE TABLE wage (
   ssn text not null,
   emp_id int not null,
   wage_class int not null references wage_class(id),
   hourly_wage numeric,
   salary numeric,
   check (wage_class = 1 or salary is null),      -- salary may be non-null only for wage_class 1 (salary)
   check (wage_class = 2 or hourly_wage is null)  -- hourly_wage may be non-null only for wage_class 2 (hourly)
);

This approach allows us to select and handle logic based on the wage class and therefore we know based on the wage_class field whether hourly_wage is applicable or not.  This is far cleaner and allows for better handling in queries than just putting nulls in and expecting them to be semantically meaningful.  This solution can also be quite helpful because it ensures that one does not accidentally process an hourly wage as a salary or vice versa.
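
For example, a report can branch on the wage class rather than guessing what a NULL means (the query is invented for illustration and the annualization factor is arbitrary):

SELECT w.emp_id,
       CASE wc.label
            WHEN 'salary' THEN w.salary
            ELSE w.hourly_wage * 2080   -- rough full-time annualization
       END AS annual_pay
  FROM wage w
  JOIN wage_class wc ON wc.id = w.wage_class;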

What Nulls Do to Predicate Logic


Because NULLs can represent unknowns, they introduce three-valued predicate logic.  This itself can be pretty nasty.  Consider the very subtle difference between:

   WHERE ssn like '1234%' AND salary < 50000

vs

    WHERE ssn like '1234%' AND salary < 50000 IS NOT FALSE

The latter will pull in hourly employees as well, as they have a NULL salary.

Nulls and Constraints


Despite all the problems, NULLs have become a bit of a necessary evil.  Constraints are a big part of the reason why.

Constraints are far simpler to maintain if they are self-contained in a tuple and therefore require no further table access to verify.  This means that wider tables allow more expressive constraints than narrower tables.

In the example above, we can ensure that every hourly employee has no salary, and every salaried employee has no hourly wage.  This level of mutual exclusion would not be possible if we were to break off salaries and wages into separate, joined tables.

Nulls and Foreign Keys


Foreign keys are a special case of NULLs where the use is routine and poses no problems.  NULL always means "no record referenced" in this context and because of the specifics of three-valued boolean logic, they always drop out of join conditions.

NULLs in foreign keys make foreign key constraints and 5th Normal Form possible in many cases where it would not be otherwise.  Consequently they can be used routinely here with few if any ill effects.
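
As a hypothetical example (the table is invented here), an optional self-reference shows both behaviors:

CREATE TABLE employee (
    id         int primary key,
    name       text not null,
    manager_id int references employee(id)   -- NULL means "no manager referenced"
);

-- Rows with a NULL manager_id simply drop out of the inner join;
-- the NULL never makes the join condition true.
SELECT e.name, m.name AS manager
  FROM employee e
  JOIN employee m ON m.id = e.manager_id;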

What Nulls Should Have Looked Like:  NULL, NOVALUE, UNKNOWN


In retrospect, SQL would be cleaner if we could be more verbose about what we mean by a NULL.  UNKNOWN could then be reserved for rare cases where we really must need to store a record with incomplete data in it.  NULL could be returned from outer joins, and NOVALUE could be used for foreign keys and places where we know the field is not applicable.

Tuesday, August 19, 2014

Math and SQL Part 5: Projection and Selection

The SELECT statement is the workhorse of SQL.  Updates and inserts are necessary, but selects are where the power is.  One of the significant issues many people have in understanding and using these is the lack of a clear understanding of the math involved, in part because there are a large number of implicit possibilities in the syntax.  In general folks learn to avoid the implicit aspects of the statement, but there are some implicit oddities that can't go away because they are baked into the language.  One of these is the confusion between selection and projection, and because SQL operates on bags instead of sets, neither of these works in SQL quite like it does in relational math.

In "Relational Theory and SQL," Chris Date says that tuples are unordered.  Tuples are strongly ordered, but what Date is getting at is that a relation has no natural ordering of columns, and therefore from any relation with a given ordering, a new relation can be created with a different ordering.  This process is called 'projection' and it is entirely implicit in SQL.  In the interest of being concise, in that book, Date does not stick with his typical approach of starting from clear definitions and discussing what these mean.

This being said, understanding projection, selection, and the difference between them makes understanding relational databases far easier in my view.

Because SQL operates on bags, I will use both SQL (for bag examples) and Set::Relation (for set examples) in order to discuss the math differences.  I will finally offer an idea of what a clearer English rendition of SQL's math semantics would be.

I:  Projection - Transform One Relation into Vertical Subset


Projection, represented by the Greek letter pi (π), creates a set of ordered tuples with an ordering based on the operation.  What projection does, essentially, is take a subset of the fields (or all of them) from each tuple, possibly re-ordering them in the process, in order to transform one relation into a derived relation.

In relational algebra, projection takes a set of tuples and returns a set of tuples with a subset of fields (possibly re-arranged).

For example, consider a relation R, where the tuple structure of R is (employee_id, first_name, last_name, badge_number, office_id, tax_id)


Suppose the data looks like this:

(employee_id, first_name, last_name, office_id, tax_id)
(1, Eric, Young, 1, 123-45-6789)
(2, Thomas, Young, 1, 324-55-3334)
(3, Julie, James, 1, 443-45-6234)
(4, Jared, Young, 2, 533-44-5442)
(5, Todd, Kals, 2, 532-44-2332)

Then we could write π(last_name, office_id)(R), and we would get:

(last_name, office_id)
(Young, 1)
(James, 1)
(Young, 2)
(Kals, 2)

Note that there is one fewer row in the output table because in both cases, you have a well-formed set.

The equivalent of the previous expression in SQL would be:

SELECT DISTINCT last_name, office_id FROM R;

From these examples a few things should be fairly clear.  First, tuples are strongly ordered, and the human-readable labels are a nicety; projection can be used to reorder the tuple elements arbitrarily.  Consequently tuples are consistently ordered within a relation, but that order is largely arbitrary, so tuples cannot be said to have a natural order.  Instead they derive their structure and order from the relation they are a part of.

Unfortunately SQL makes all kinds of implicit tuple transformations either by implicit projection or by some sort of reverse operation (I will call that "overlay" for the moment).  Because projection is entirely implicit in SQL, it is never stated.  The SQL statement above in fact doesn't perform a selection operation but merely performs a projection.

II:  Selection - Turn a Relation into a Horizontal Subset


Selection takes a relation and provides a subset of it based on arbitrary criteria.  The structure of the relation is unchanged, but a subset of the correspondences are in the new relation.  In other words, it selects tuples for further processing, and since the relation is a set, it is guaranteed to give a subset out.  It is represented by the sigma character (σ) and the subscript gives the equality condition.

Given the same relation R, we can perform σ(office_id=1)(R) and get:

(employee_id, first_name, last_name, office_id, tax_id)
(1, Eric, Young, 1, 123-45-6789)
(2, Thomas, Young, 1, 324-55-3334)
(3, Julie, James, 1, 443-45-6234)

Simple right?

In SQL terms, a plain selection is what is found in the where clause.  So the equivalent in SQL is simply:

SELECT * FROM R WHERE office_id = 1;

As you can see from these examples, selection and projection are transformed in SQL in some relatively non-intuitive ways.  At the same time, understanding relational databases demands a basic understanding of these two concepts.
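
Putting the two together against the same illustrative relation R, a selection followed by a projection is simply:

-- σ(office_id=1) then π(last_name): distinct last names in office 1
SELECT DISTINCT last_name FROM R WHERE office_id = 1;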

Some Notes on Set::Relation


In the course of putting this together I spent a little time with Set::Relation.  I found that the module seems to behave relatively well.  The projection method combined with members() outputs a set.  This was as  far as I took it in my quick spin.  As per the discussion below, Set::Relation operates under the assumption that relations are well-formed sets instead of bags.  This can be very helpful in operating in a highly formal relational environment.

Selection could be done using the semijoin method, or using members() and grep.
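
A rough sketch of both, under the assumption that Set::Relation::V2's constructor accepts a members arrayref of hashrefs, that projection takes an arrayref of attribute names, and that members returns an arrayref (check the module's documentation before relying on any of this):

use Set::Relation::V2;

my $r = Set::Relation::V2->new( members => [
    { last_name => 'Young', office_id => 1 },
    { last_name => 'James', office_id => 1 },
    { last_name => 'Young', office_id => 2 },
] );

# Projection: π(last_name)(R)
my $names = $r->projection(['last_name']);

# Selection via members() and grep, as mentioned above
my @office1 = grep { $_->{office_id} == 1 } @{ $r->members };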

Next:  On the Great Null Debate


As a sort of postscript, I have decided not to include strong definitions of joins in this.  The orthography doesn't lend itself well to HTML blogs, and the basic concepts are reasonably straight-forward from an SQL perspective (the exception perhaps being antijoins).  At any rate it is not worth the effort.

Thursday, August 14, 2014

Math and SQL part 4: Basic Data Types. Sets, Tuples, and Bags

A reasonable understanding of the relational model requires understanding the basic data types which make it up, both in the idealized model and in real-world applications.  This post discusses both the idealized model and the accommodations the standard implementations of it make to the messiness of the real world.  We won't deal with NULLs here.  That will be the subject of a future entry.

I Relation and Set


At its root, the relational model is about relations.  Consequently a relation needs to be understood on a mathematical level.  Since a relation is a form of set, let's explore sets first.

A set is an unordered list of unique items.  Thus, {a, b} is a set, but {a, b, b} is not (it is a multi-set or bag, however, see below).  Additionally since it is unordered, {a, b} = {b, a}.  We can project ordering onto a list for our convenience but this does not affect the underlying mathematics.  Sets can be finite or infinite.  In databases, we focus on finite sets (infinite sets can be calculated, but cannot be stored in set form on a finite filesystem).  Infinite sets are for math libraries and don't really concern us here.

A relation is a set of propositions, of correlating facts.  These are represented in tuples (see below).  At this point, just recognize that a tuple represents a correlation between two or more facts.  The individual facts are also trivial functions of the tuple as a whole: because each tuple is necessarily unique, each fact is a function of the correlation.

Consider a point in the format of (x, y). For any given point, x and y are trivial functions of it.  For (2, 2), x=2 is trivially true, as is y=2.  This seems tautological (and on one level it is) but it is also important because it gets to the basic operation of a relational database.  For more on this, see part 2 of this series.

II Row and Tuple


A tuple is an ordered set of items.  Unlike a set, the order is important and repetition is allowed.  (1, 2, 3), (1, 3, 3), and (1, 1, 1) are all valid tuples.  In tuples, the order ascribes meaning to each value.  For example, if the preceding tuples referred to points in three-dimensional Euclidean space, the first value would correspond to the distance along the x axis, the second to the distance along the y axis, and the last to the distance along the z axis.  This means that points (1, 2, 3) and (2, 1, 3) are different, and not equal (if they were sets with the same members they would be equal).

While relations are open ended sets (which presumably can be added to indefinitely), tuples are closed-ended.  While one can add more members, this involves a semantic shift in the data.  For example, points (1, 2) and (1, 2, 3) share the same x and y axis but a two dimensional space is different from a three dimensional space semantically.

Relational algebra works on the set level, providing basic set operations across entire relations.  The comparisons, however, are done on tuples and their members.

This point cannot be overemphasized.  Ideal relational databases operate on sets and tuples.  Everything else is secondary.  A well designed database will build on this basic point.  As a caution, SQL works on bags, not sets, so some care is needed to make this happen.

III Tuples, Sets, and Relations in Perl


Tuples and sets of course are generally useful in programming.  Understanding these is often a key to better performance, and in Perl this is somewhat counterintuitive.

A naive view from looking at the database would be to see a tuple as a %hash and a set as an @array.  In fact, it is the other way around.  Understanding this is helpful because if you really want a set, a hash is often a better tool than an array in Perl.

Arrays in Perl are ordered and allow for duplicates.  In this regard they are a tuple or a sequence.  Hashes on the other hand do not allow for duplicate keys and the keys are unordered.  Therefore, where you want a set, a hash is usually a better tool, and where you want a sequence or a tuple, an array is a better structure.

Suppose we want to determine which members of one sequence are not found in a specific set (currently represented in an array).  We might do something like:

my %sethash;
$sethash{$_} = 1 for @setarray;   # index the set for O(1) membership tests
my @excluded = grep {not exists $sethash{$_}} @sequencearray;

This will perform much better than:

my @excluded;
for my $element (@sequencearray){
    # Rescans the set array for every element of the sequence: O(n*m)
    push @excluded, $element unless grep {$_ eq $element} @setarray;
}

In Perl, set operations are usually easier to make perform well when using hash tables instead of arrays for sets.  This is because hash tables are unordered and enforce uniqueness among their keys.  Thus they are better optimized for the sorts of access patterns set operations are likely to use.

IV Set and Bag


SQL, for better or worse, generally compromises on the set operation and uses multi-sets (or bags) instead.  A multiset is like a set but it allows duplicates.  This is a compromise that allows real-world data to be easily input into the database.  This matters in import routines, because incoming data may not be perfectly unique and may require some processing during import to sanitize.

Unfortunately, set operations and bag operations are different and produce different results.  As a preview of the next article in the series, we could perform a projection and get duplicates.  For example, suppose we project y from y = x².  We will get duplicate values of y, because squaring a negative number gives the same result as squaring its positive counterpart.  In a set operation, we would just discard the duplicates.  In a bag operation, we return the duplicates.
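
In SQL terms (the table name is invented for illustration):

-- Bag semantics: duplicate y values come back
SELECT x * x AS y FROM points;

-- Set semantics: duplicates are discarded
SELECT DISTINCT x * x AS y FROM points;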

This distinction is key to understanding the difference between databases in the ideal world vs in the real world.

V A Note on Ordering


As you will note, sets are unordered.  Ordinality is not important.   We can order tuples thus turning them into a sequence (for purposes of limit and offset) but when we do so, we can still revert back to a set for further set operations.  Ordering is mathematically complicated in databases, but it is intuitively easy.  The key thing to remember is that ordering creates a sequence from a set, and that we have to create a set from the sequence before we do further set operations.

It is for this reason that the results of the following query will be in an order dependent on the query plan:

SELECT f.foo, b.bar
  FROM (select foo, id from foobar order by transdate desc limit 20) f
  JOIN barbaz b ON f.id = b.foo_id;

We have a subquery which generates a sequence, but it returns a set of the members of that sequence.  Those are then used in further set operations, and the order returned depends on how the planner decides to do those operations.

Next:  Projection and Selection