The Problem With Cup Typing

2008-08-22

First I should explain what I mean by cup typing. When you buy a cup of coffee, you have the choice of a short, tall, or grande sized cup. Sometimes you can also choose between decaf and regular. When you declare an integer variable in Java, you have the choice of byte, short, int, and long. Sometimes (in languages like C++) you can also choose between signed and unsigned. The similarity is obvious, and it doesn’t end with integers. Floating point numbers come in two different flavours, namely as regular “float” values (32-bit) and as “double” values (64-bit).

Characters come in the form of 7-bit, 8-bit, and 16-bit encodings. In statically typed programming languages, multiplicity is the rule rather than the exception. While Fortran and Pascal offer a moderate choice of two different integer types, Java offers four, plus a BigInteger implementation (“extra grande”) for really large numbers. However, it’s C# that takes the biscuit in cup typing, with 9 different integer types and 3 different real types. Database systems are keeping up with this trend: the popular MySQL RDBMS, for example, offers 5 different integer types and 3 different real types. Looking at the evolution from Fortran to C#, type plurality has clearly increased over time. We must ask two things: how did this come about, and is it useful? We appreciate the fact that we can buy coffee in different cup sizes to match our appetite, but does the same advantage apply to data types?
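To make the Java “menu” concrete, here is a small, self-contained snippet listing the graduated numeric types referred to above; the cup-size variable names are purely illustrative.

```java
import java.math.BigInteger;

public class CupSizes {
    public static void main(String[] args) {
        // Java’s four fixed-width integer “cup sizes”
        byte  small  = 127;                    // 8 bits
        short tall   = 32767;                  // 16 bits
        int   grande = 2147483647;             // 32 bits
        long  venti  = 9223372036854775807L;   // 64 bits

        // The “extra grande”: arbitrary precision, but a separate library type
        BigInteger extraGrande = new BigInteger("123456789012345678901234567890");

        // Two flavours of real numbers
        float  single = 3.14f;                 // 32 bits
        double dbl    = 3.141592653589793;     // 64 bits

        System.out.printf("%d %d %d %d %s %f %f%n",
                small, tall, grande, venti, extraGrande, single, dbl);
    }
}
```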

[Image: a coffee cup]

The first question is easy to answer. Graduated types result from the fact that computer architectures have evolved in powers of two. Over several decades, the register width of the CPU of an average PC has expanded from 8 to 16 to 32 to 64 bits. Each step facilitated the use of larger types, and numeric types in particular were closely matched to register width. Expressing data types in a machine-oriented way appears to be a C legacy, and quite a few newer programming languages have been strongly influenced by C.

It is my contention that while curly braces and ternary operators are an acceptable C-language tradition, graduated types are definitely not. Why not? Because they counter abstraction. They hinder rather than serve the natural expression of mathematical constructs. Have you ever wondered whether you should index an array with byte- or short-sized integers? Whether you should calculate an offset using int or long values? Whether method calls comply with type widening rules? Whether an arithmetic operation might overflow? Whether a type cast may lose significant bits? All of this is a complete waste of time in my view. Wouldn’t it be better to let the virtual machine worry about such low-level questions, or the library if no VM is present? Cup typing gets positively annoying when you have to write an API that is flexible enough to deal with parameters of different widths. If there’s no type hierarchy, you inevitably end up with multiple overloaded constructors and methods (one for each type), which add unnecessary bulk. The Java APIs are full of such examples, and the heavily overloaded valueOf() method is a case in point: it’s really ugly.
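As an illustration of that API bulk, here is a hedged sketch (the class and method names are hypothetical, not taken from any real library) of what a simple offset calculation looks like when every integer width needs its own overload, much like the many overloads of String.valueOf() in the JDK:

```java
// Hypothetical API sketch: without a single integer type, every width needs
// its own overload, just as String.valueOf() in the JDK is overloaded for
// int, long, float, double, char, and more.
public final class Offsets {

    // One overload per integer width, all funnelling into the widest one.
    public static long byteOffset(byte index)  { return byteOffset((long) index); }
    public static long byteOffset(short index) { return byteOffset((long) index); }
    public static long byteOffset(int index)   { return byteOffset((long) index); }
    public static long byteOffset(long index)  { return index * 16L; }  // 16 = illustrative element size

    private Offsets() {}  // no instances; pure utility class
}
```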

However, graduated types are beyond ugly; they are outright evil. They cause an enormous number of bugs, and the small numeric types are the prime offenders. I wonder how many times a signed or unsigned byte has caused erratic program behaviour by silently overflowing. Such bugs can be hard to find and, worse, they often don’t show up until certain border conditions are reached. Casts that shorten types are also among the usual suspects. I shall not even mention the insidious floating point operations that regularly unsettle newbie programmers with funny-looking computation results. What numeric types does one really need? Integer numbers and real numbers. One of each and no more. If you want to be generous as a language designer, you can throw in an optimised implementation of a complex number type and a rational number type, although in an object-oriented language with operator overloading it’s fairly easy to express these in a library. The fixed-point type (sometimes called the decimal type) is simply the subset of the rational type where the denominator is always a power of ten. So that’s really all you need: a clean representation of the basic mathematical number systems.
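A few lines of Java make the failure modes mentioned above concrete: a silently overflowing byte, a narrowing cast that drops significant bits, and a floating point result that looks funny to the uninitiated.

```java
public class SilentTrouble {
    public static void main(String[] args) {
        // A byte counter silently wraps around at its upper border condition.
        byte counter = 127;
        counter++;                          // no warning, no exception
        System.out.println(counter);        // prints -128

        // A narrowing cast quietly discards the high-order bits.
        int offset = 70000;
        short shortened = (short) offset;
        System.out.println(shortened);      // prints 4464, not 70000

        // The classic floating point surprise.
        System.out.println(0.1 + 0.2);      // prints 0.30000000000000004
    }
}
```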

At this point, you might object: “but the CPU register is only x bits wide,” or “how do I allocate an array of fifty thousand short values?”, or “can I still have 8-bit chars?” Unfortunately, there is no simple answer to these questions. The natural way to represent integers is to always use the machine’s native word width, but that alone does not solve the problem. First of all, the word width is architecture dependent. Second, it would be wasteful for large arrays that hold small numbers, and at the same time still too small for applications that need big integers. The solution is, of course, a variable size type, i.e. an integer representation that can grow from byte size to multiple word lengths. We have variable length strings, so why shouldn’t we have variable length numbers? It seems perfectly natural. There is certainly some overhead involved, because variable length types need special encoding; the overhead is most likely due to loading a descriptor value and/or to bit shifting operations. Variable length numbers don’t come for free, but they do offer tremendous advantages. They relieve the programmer from making type width decisions, from documenting those decisions, and, worse, from changing the type width later if the decision turned out to be inadequate. Furthermore, they eliminate the above-mentioned bugs resulting from silent overflows and type cast errors, not to mention API proliferation due to type plurality. Thus variable length numbers are generally preferable to the common fixed width types.
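Java’s BigInteger, although bolted on as a library class rather than built into the language, gives a taste of how such a variable length integer behaves: the value below grows well past any fixed register width without a single width decision, overflow, or cast.

```java
import java.math.BigInteger;

public class GrowingInteger {
    public static void main(String[] args) {
        // 100! needs over 500 bits; the representation simply grows as needed.
        BigInteger factorial = BigInteger.ONE;
        for (int i = 1; i <= 100; i++) {
            factorial = factorial.multiply(BigInteger.valueOf(i));
        }
        System.out.println(factorial.bitLength());  // well beyond any machine word
        System.out.println(factorial);
    }
}
```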

Of course, there are situations where you know that you will never need more than a byte. There are also situations where performance is paramount. In addition, APIs and libraries based on multiple fixed types are not going to disappear overnight. To provide backward compatibility and to offer optimisation pathways to the programmer, a language could present the fixed width types as subsets of the mathematical type. For example, if a language defines the keyword “int” for variable length integer numbers, then “int(8)” could mean a traditional byte, “int(16)” a short word, and so on; a sketch of this idea follows below. Admittedly, this is a bit like reintroducing cup typing through the back door, so the use of such subtypes for general purpose computations should be discouraged. Still, it’s always better to have a choice of fixed and variable types than to have no variable types at all.
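Here is a minimal sketch of how such an “int(8)” subtype might look if retrofitted as a library class; the name Int8 and its range-checked factory are purely illustrative assumptions, not an existing API.

```java
// Illustrative sketch only: a width-restricted subtype of the general integer.
// Unlike a plain (byte) cast, construction is range-checked, so it can never
// silently lose bits.
public final class Int8 {
    private final byte value;

    private Int8(byte value) { this.value = value; }

    public static Int8 of(long n) {
        if (n < Byte.MIN_VALUE || n > Byte.MAX_VALUE) {
            throw new ArithmeticException("value " + n + " does not fit in 8 bits");
        }
        return new Int8((byte) n);
    }

    public byte toByte() { return value; }

    @Override
    public String toString() { return Byte.toString(value); }
}
```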