Towards a consistent PHP type checking system

For most of its life, PHP has been a “weakly” typed language: values can freely shift from one state to another, and mould themselves to what an operation requires; or, in the case of objects, largely duck-typed: you don’t need a formal contract to call a method. There is, however, a trend in the language to adopt stronger typing constraints, first in function parameters, and now elsewhere in the language. As with so much in PHP, this is being done piecemeal, and without a clear road map. This is partly deliberate: a gentle introduction of optional features is less disruptive; but it is dangerous: compromises made now will shape the language for years to come.

I think it is time to grasp the nettle and say that there is a definite policy of introducing a new typing model to PHP, and a roadmap of what that means for the language. The implementation may still be incremental, but each change will be a step in a consistent direction. This post is my musings on what such a model might look like.

Disclaimer: I don’t know nearly as much as I’d like about the theory behind all this; feel free to point out errors, or point me to further reading.

Gradual Typing

PHP is not the first language to want an element of static typing in an otherwise dynamic language, and there is literature out there to learn from. In particular, there is “gradual typing”, developed by Walid Taha and Jeremy Siek, and adopted by Python, Hack, and others. Siek has written a very good introduction to gradual typing, including some thoughts on the pros and cons of static and dynamic typing. In particular, he notes:

The reason why research on type systems continues to flourish is that it is difficult to design and implement a type system that is expressive enough to enable the straightforward expression of all programs that we would like to write.

The solution offered by gradual typing is that the type system has a built-in mechanism for getting out of your way. This same mechanism also allows the gradual upgrade of existing code which is dynamically typed, hence the name, and the applicability to existing scripting languages.

In a nutshell, gradual typing says that everything has a type, but things where no specific type is known are given a special type (called “?”, “dynamic”, or “Any”) which exists outside the normal type hierarchy. On the one hand, an “Any” constraint allows everything, as though every type is a sub-type of “Any”; on the other hand, an “Any” value passes any constraint, which can’t really be expressed with “is sub-type of”.

Gradual typing has been explicitly adopted by Python, and there is a comprehensive explanation of Python’s type hints called PEP-483, which includes further discussion of the implications of Any and other special types.

What is your Type System for?

Yesterday, I came upon a Stack Exchange question asking “What is a type system?“, which in turn led me to a great article by Chris Smith titled “What to know before debating type systems“. It makes the case that “static typing” and “dynamic typing” are quite separate things, with quite different aims. A “type system”, in the theory of static typing, is a way to formulate proofs about the behaviour of a program; it is about knowing, without executing any code, that certain states are impossible. A dynamic typing system doesn’t attempt to prove anything, only to detect states that are undesirable after they’ve already happened.

It’s tempting to say “static typing happens at compile time, dynamic typing happens at run time”, but that’s just a consequence of the more fundamental difference – executing a program is not a rigorous proof, and without executing it you have no values to observe.

What does this have to do with our discussion? Gradual typing, as described by Siek and implemented by Python and Hack, is a mechanism for static typing: it can provide proofs that certain programs are invalid. Unlike a more rigorous type system, these proofs are rather weak, since the statement that “a value of type Any is acceptable for all type constraints” is not actually true at run-time: if your “Any” value is actually the string “hello”, a type constraint of “int” should be violated. At this point, a parallel system of dynamic typing has to take over, and detect that something went wrong.

PHP has no static type system, and adding one might be tricky – IDEs certainly do static analysis, but they have to make assumptions, such as that your project contains exactly one definition of every class and function used, rather than choosing one dynamically. Instead, “type hints” act like assertions: points in the code where values are checked, at run-time, and rejected if they are not valid – in short, dynamic typing. So, for now at least, “gradual typing” won’t help us much.

Hybrid Typing

Annotations in the source code can be used to guide both static and dynamic typing, but with subtly different meanings: to the static type checker, you say “this is always an int”; to a dynamic type constraint, you say “this should be an int”. For the static type checker, remaining types need to be deduced based on the operations you perform with them; a dynamic type constraint has nothing to deduce, because it is looking at a single value.

Perl6 apparently takes a “mostly runtime” approach; like everything in Perl6, its type system has a lot of novel concepts, and a lot of flexibility, and I have struggled to find a policy document that talks specifically of its attitude to type hints. A big difference, though, is that Perl6 relies heavily on overloading / multiple dispatch, so a large part of the type system is to enable that.

Dart’s Optional Typing feature uses a very explicit mixture: it uses gradual typing in an official static analyser, like Python and Hack; but it also includes a “checked mode” that turns all the type annotations into run-time assertions, more similar to PHP:

Essentially, checked mode is like running your program under the debugger with watchpoints that run a subtype check on every assignment, return, and so on.

The language’s designers have taken the deliberate design decision that the whole feature should be simple and approachable, and the checks can be described as Optional and Unsound – for instance, generics take an “optimistic” view of allowable variance. The “checked mode”, meanwhile, is considered strictly a development and debugging tool, and comes with this significant warning:

All these checks impose a considerable performance penalty, so one cannot usually afford to run them in production.

Given that it is checked mode that is most like PHP’s “type hints”, this should be taken as a warning – should the whole type checking apparatus be toggled off in production, like assertions?

Casts and Coercions

PHP’s type system has a rather complicated system of casts, both explicit and implicit. Built-in operators perform “type juggling” to get their arguments to match, with varying usefulness – the “==” operator is notorious for coercing values to match when you’d rather it didn’t. This actually comes as a consequence of being a web-oriented language: the protocols of the web are all text-based, so every input and output is actually a string; to be useful, these strings need to be cast to other data types at some point, and the language aims to make this easy.

This led to a long and sometimes heated discussion of if and how type hints should be implemented for “scalar” values – objects having been the easy case because in PHP they can never be cast, even explicitly. The solution eventually agreed is a 2-mode system, where by default, PHP will attempt to coerce variables to the right type – for instance, the string “123” can be passed to a function requiring an integer, and the function will see it as an integer. If the user prefers, they can assert that for calls made in their code (not calls made by others to their functions) no coercion should be allowed.

When considering a holistic type system, we need to somehow reconcile these various behaviours. All string values can be explicitly cast to integer (even if only as “0”); some string values can be implicitly coerced to integer; and some situations will attempt coercion where others will not.

Proposed Principles

First, let’s define some terms:

  • a “value” is a piece of data, plus metadata describing what it is, but not where it is; in the Zend Engine, values are stored as zvals
  • a “container” is something that holds a value; it is given a name and a meaning by the programmer; a local variable is the most obvious container, but a method parameter, an object property, and a return value are also containers (I have borrowed this term from Perl6 to be more general than “variable”, although possibly not with exactly the same meaning)
  • a “type” is a named set of values, with a common set of available operations
  • a “type constraint” is a specification of one or more types; or more specifically, a set of allowed values

We can say that a value has the following properties:

  • exactly one current type
  • zero or more other types to which it can be coerced or cast; these vary by situation, and may not be the same for all values of the same type (e.g. string(“123”) may coerce to int where string(“hello”) does not)
  • zero or more additional constraints it is known to pass, alongside the trivial constraint of type equality

A full typing system would say that containers all have the following properties:

  • a current value, of some specific type
  • a type constraint, which determines which values are expected by, i.e. can be assigned to, that container

Note that “int|string” is a valid type constraint, but not a valid type: a container can specify that it expects “any value of type int or any value of type string”, but there is no value which is “either an int or a string”. Allowing this distinction gives us freedom to make more complex type constraints, such as “subsets” and “shapes”, without considering them types in their own right.

Finally, we define a “cast” based on a signature and an implementation:

  • a cast’s signature consists of a type constraint for the values it knows how to convert from, and a concrete type it converts to; this allows us to define casts like “strings consisting only of digits can be cast to integers”
  • note that since the simplest type constraint is to match a single type, we can also express “all integers can be cast to floats” (in as much as that is true given limits in byte length)
  • casts can be grouped into “implicit” and “explicit”; an implicit cast is called automatically by the engine when appropriate, whereas an explicit cast is essentially sugar for calling some underlying conversion function (e.g. (int)$foo can be seen, conceptually, as sugar for intval($foo))

Implementation

The above terms are just a way of representing concepts that the language already has, implicitly, but we can use them to design extensions to the language in a more consistent way.

The recent Typed Properties RFC had to contain a number of special cases because the concept of a “container” was not there to attach a type constraint to. Notably, it had to disallow any assignment of a typed property by reference, because there was no way to stop an invalid value being assigned to the reference (and thus the property). If we move the type constraint check from the special case of “setting an object property”, to any “assignment to a container”, this problem goes away: the type constraint of the reference is the type constraint of the container it references.

Storing this type constraint against every “container” would require a fairly fundamental change to the engine’s internal structure; a key problem would be minimising the memory impact this would have on the vast majority of variables which would still have no constraints.

The other performance impact, already hinted at, is the cost of checking values against type constraints. This turns out to be non-trivial for nearly all current type hints – class hints must check not just class equivalence, but inheritance and interface implementation; scalar hints need only check type equality in “strict mode”, but in “weak mode” they also need to check for coercability; and the “callable” type hint has to evaluate a particularly complex assertion. A possible mitigation for this would be to trade memory for CPU time by caching the results of constraint checks against each value – note, not against the container.

Are types the future, or just an option?

The Python type hinting specification, PEP 484, makes a point of saying this:

It should also be emphasized that Python will remain a dynamically typed language, and the authors have no desire to ever make type hints mandatory, even by convention.

Some people would like PHP to make a similar commitment. In many ways, some kind of “mission statement” for PHP would be great for users and internals hackers alike – if the community could agree just what it would say…

1 Comment

  1. Chris Barnes

    The problem, as always, is backwards compatibility and the cost of regression testing and refactoring.

    https://www.youtube.com/watch?v=Nn-SGblUhi4&t=120s

    Best wishes

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.