Bohuslav Šimek

A few statistics about function usage across the PHP ecosystem

May 9, 2022

In one Czech fairy tale, there is a song about statistics. Its catchiest line is the chorus: "statistics is boring, but its data are engrossing". I especially agree with the second part. That's why, in this post, I look at how often individual PHP functions are used in open source projects across the PHP ecosystem and draw a few conclusions from the numbers.

Data set

Without a proper dataset, these statistics won't make any sense. Fortunately, there is a package repository, Packagist, which offers more than 323 970 open source packages, an ideal source of data for our analysis. Regrettably, some of them could not be cloned from Git, so the number of analyzed packages is slightly lower: 313 251. From the beginning, I decided to include only the default (master/main) branch of each repository. This can, at least in theory, affect the results, as it favors the latest development version of each package.

As for the list of functions, it has been compiled with the latest version of PHP (8.1) in mind. I decided to include functions removed in the last four PHP versions (7.3 to 8.1), as they can still be relevant. Some language constructs, such as isset, have also been selected; I consider them a cornerstone of PHP, so I naturally include them. Only standard, always available functions have been included. I know that this can be tricky, so I attached a list of selected functions to this post. One last reminder: in this post, I focus only on functions, not on any built-in objects. Those will be the subject of the second part.
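The post does not show how the occurrences were counted. Purely as a sketch, one way to count calls in a single file is PHP's built-in tokenizer. The helper name countFunctionUsage and the tracked list are hypothetical, and the code assumes PHP 8.0 or newer (it relies on the T_NAME_FULLY_QUALIFIED and T_NULLSAFE_OBJECT_OPERATOR tokens).

```php
<?php
declare(strict_types=1);

/**
 * Count how often the tracked functions and language constructs appear
 * in one string of PHP source code (hypothetical helper, not the tool
 * used for the statistics in this post).
 *
 * @param string[] $tracked lower-case names, e.g. ['isset', 'count']
 * @return array<string, int>
 */
function countFunctionUsage(string $source, array $tracked): array
{
    $counts = array_fill_keys($tracked, 0);

    // Remove whitespace and comments so that "previous" and "next" token
    // really mean the neighbouring pieces of code.
    $ignored = [T_WHITESPACE, T_COMMENT, T_DOC_COMMENT];
    $tokens = array_values(array_filter(
        token_get_all($source),
        fn ($token) => !is_array($token) || !in_array($token[0], $ignored, true)
    ));

    // Language constructs such as isset have dedicated token types.
    $constructs = [T_ISSET, T_EMPTY, T_UNSET, T_ECHO];
    // A name preceded by one of these is a method call, a declaration
    // or an instantiation, not a call to a global function.
    $notACall = [
        T_OBJECT_OPERATOR,
        T_NULLSAFE_OBJECT_OPERATOR,
        T_DOUBLE_COLON,
        T_FUNCTION,
        T_NEW,
    ];

    foreach ($tokens as $i => $token) {
        if (!is_array($token)) {
            continue; // single-character tokens such as "(" or ";"
        }
        [$id, $text] = $token;
        $name = strtolower(ltrim($text, '\\'));
        if (!isset($counts[$name])) {
            continue; // not a name we are tracking
        }
        if (in_array($id, $constructs, true)) {
            $counts[$name]++;
            continue;
        }
        $previous = $tokens[$i - 1] ?? null;
        $previousId = is_array($previous) ? $previous[0] : null;
        $isName = $id === T_STRING || $id === T_NAME_FULLY_QUALIFIED;
        $isCalled = ($tokens[$i + 1] ?? null) === '(';
        if ($isName && $isCalled && !in_array($previousId, $notACall, true)) {
            $counts[$name]++;
        }
    }

    return $counts;
}
```

Summing the result over every PHP file of a package would give per-package totals. The sketch deliberately ignores harder cases, such as dynamic calls through variables or namespaced functions that shadow a built-in name.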

Results

Unexpectedly, all of the functions have been used at least once. Even bizarre things like ezmlm_hash have at least three usages. This is quite an achievement, as the function's purpose is to "Calculate the hash value needed by EZMLM". Ezmlm used to be mailing list management software. But back to our results.

Name              Occurrences  Per package
isset               7 505 182        23.96
echo                3 336 574        10.65
empty               3 302 637        10.54
count               2 240 919         7.15
sprintf             2 220 083         7.09
is_array            1 731 813         5.53
substr              1 707 515         5.45
array_key_exists    1 376 405         4.39
unset               1 303 895         4.16
array_merge         1 200 980         3.83
complete file - csv

Looking at the ten most common functions, we see that isset is the most common one by miles. As I already mentioned, it's more a language construct than a function, and the same is true for the next two entries, echo and empty. Their dominance is hardly surprising. In fourth place comes the first "real function", count, closely followed by sprintf. The latter is not so staggering either; sprintf is a well-known string formatting function from the C world. The next function, is_array, is PHP-specific. Its position should not be astonishing, as PHP is a weakly typed language and the type of a variable can sometimes be a mystery. The seventh entry, substr, is again a familiar string primitive, although unlike sprintf it has no direct counterpart in the C standard library.

The next contender, array_key_exists, is a staple of a dynamic language and a testament to one of the most potent abstractions in PHP: the associative array. The unset construct also reminds us of PHP's dynamic behavior. The list of the ten most common functions concludes with array_merge. Overall, we can roughly divide the most commonly used functions into three groups:

  • functions handling variable types (isset, is_array, etc.),
  • functions working with strings (substr, sprintf) and
  • functions working with arrays (array_key_exists, array_merge).

This is hardly surprising given PHP's dynamic nature and the fact that commonly used functions are usually connected with arrays and strings in most languages.
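Part of the reason several of these checks appear side by side may be that they are not interchangeable. The following minimal sketch (the $config array and variable names are made up for illustration) shows how isset, array_key_exists, empty, and is_array answer different questions about the same data.

```php
<?php
declare(strict_types=1);

$config = ['debug' => null, 'retries' => 0, 'name' => 'demo'];

// isset() is false for a missing key AND for a key that holds null.
$debugIsSet = isset($config['debug']);              // false

// array_key_exists() only asks whether the key is present.
$debugExists = array_key_exists('debug', $config);  // true

// empty() treats "falsy" values such as 0, '' and null as empty.
$retriesEmpty = empty($config['retries']);          // true

// Neither isset() nor empty() raises a warning for an unknown key.
$missingIsSet = isset($config['missing']);          // false

// is_array() guards code that may receive a scalar instead of a list.
$value = 'demo';
$items = is_array($value) ? $value : [$value];      // ['demo']
```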

Name          Packages  % of packages
isset          164 616            53%
empty          134 913            43%
count          119 492            38%
is_array       115 825            37%
in_array        99 973            32%
array_merge     98 017            31%
explode         96 613            31%
str_replace     93 454            30%
implode         93 077            30%
sprintf         88 015            28%
complete file - csv

The second metric I chose is the share of all analyzed packages that use each function at least once. The ranking is similar to the previous one, though not identical: in_array, explode, str_replace, and implode enter the top ten. The language construct isset is again the winner, as nearly 53% of packages use it. The echo statement dropped out of the top ten and is now in 13th position. What is more astonishing is that almost 26% of projects use it; I would expect echo to be much less common in Composer packages.
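The two tables can be cross-checked with simple arithmetic. The sketch below assumes that "Per package" is the occurrence count divided by the 313 251 cloned packages and that "% of packages" is rounded to whole percent; both assumptions reproduce the published figures. The function name usageMetrics and the third value (occurrences per package that actually uses the function) are additions for illustration, not part of the original analysis.

```php
<?php
declare(strict_types=1);

const TOTAL_PACKAGES = 313251; // packages that could be cloned

/**
 * Derive the metrics used in the tables from raw counts.
 *
 * @return array{per_package: float, share_percent: int, per_using_package: float}
 */
function usageMetrics(int $occurrences, int $packagesUsing, int $totalPackages): array
{
    return [
        // average over ALL analyzed packages (first table)
        'per_package' => round($occurrences / $totalPackages, 2),
        // share of packages with at least one usage (second table)
        'share_percent' => (int) round(100 * $packagesUsing / $totalPackages),
        // average over only the packages that use the function
        'per_using_package' => round($occurrences / $packagesUsing, 2),
    ];
}

// [occurrences, packages using it], copied from the two tables
$data = [
    'isset'   => [7505182, 164616],
    'empty'   => [3302637, 134913],
    'count'   => [2240919, 119492],
    'sprintf' => [2220083, 88015],
];

foreach ($data as $name => [$occurrences, $packagesUsing]) {
    $m = usageMetrics($occurrences, $packagesUsing, TOTAL_PACKAGES);
    printf(
        "%-8s %6.2f per package, %2d%% of packages, %6.2f per using package\n",
        $name,
        $m['per_package'],
        $m['share_percent'],
        $m['per_using_package']
    );
}
```

For isset, this gives roughly 45.6 occurrences per package that uses it, noticeably higher than the 23.96 average over all packages, because almost half of the packages never call it.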

Name                       Occurrences
ezmlm_hash                           3
output_reset_rewrite_vars            8
hebrevc                              8
getprotobynumber                     8
timezone_location_get                9
imageaffinematrixconcat              9
output_add_rewrite_var               9
date_interval_format                10
timezone_name_get                   10
phpcredits                          11
complete file - csv

Bizarre yet interesting is the bottom of the ladder. The least used function is the already mentioned ezmlm_hash. One would expect functions such as phpcredits at the very bottom, but surprisingly enough, it's only the 10th least used PHP function. One of the runners-up is less surprising: hebrevc. It used to be one of the most bizarre PHP functions, so it's no wonder that it was deprecated in PHP 7.4 and removed in PHP 8.0. And what was its purpose? According to the manual, it should "Convert logical Hebrew text to visual text with newline conversion". Fortunately, we still have its sister function, hebrev, the 51st least used function. Both functions do pretty much the same thing, but the removed hebrevc additionally converts newlines into HTML line breaks.
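Because hebrevc differs from hebrev only in the newline step, it is an easy candidate for userland. A minimal sketch of such a replacement, assuming PHP 8 and that combining hebrev with nl2br is close enough to the removed function for your purposes (the name hebrevcPolyfill is hypothetical):

```php
<?php
declare(strict_types=1);

/**
 * Rough userland stand-in for the removed hebrevc().
 */
function hebrevcPolyfill(string $text, int $maxCharsPerLine = 0): string
{
    // hebrev() converts logical Hebrew text to visual order;
    // nl2br() then inserts an HTML line break before each newline.
    return nl2br(hebrev($text, $maxCharsPerLine));
}
```

Note that nl2br emits "<br />" by default, so the markup may differ slightly from what the original function produced.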

output_reset_rewrite_vars is another trip down the rabbit hole. According to the documentation, the function will "Reset URL rewriter values", which is far from self-explanatory. It is the counterpart of output_add_rewrite_var, which registers so-called URL rewriter values. The URL rewriter then appends these values as additional parameters to URLs and forms in the output.

The parade of bizarre functions continues with getprotobynumber, which returns the name of the network protocol associated with a protocol number. "Network protocol" here means a protocol carried over IP, typically a transport layer protocol such as UDP or TCP. Another entry, timezone_location_get, seems surprising, but it's yet another not-so-well-known alias, this time for the method DateTimeZone::getLocation(). Many of the least used functions are just procedural aliases for methods of the date and time classes. The same goes for diskfreespace, which is again an alias of another function (in this case disk_free_space). The list closes with phpcredits.
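To make the alias relationship concrete, here is a small sketch that calls three of the rarely used procedural functions next to their object-oriented counterparts. The time zone and interval values are arbitrary examples.

```php
<?php
declare(strict_types=1);

$zone = new DateTimeZone('Europe/Prague');

// timezone_location_get() is the procedural form of getLocation().
$locationViaFunction = timezone_location_get($zone);
$locationViaMethod = $zone->getLocation();

// timezone_name_get() is the procedural form of getName().
$nameViaFunction = timezone_name_get($zone);
$nameViaMethod = $zone->getName();

// date_interval_format() is the procedural form of DateInterval::format().
$interval = new DateInterval('P1DT2H');
$textViaFunction = date_interval_format($interval, '%d day, %h hours');
$textViaMethod = $interval->format('%d day, %h hours');

// getprotobynumber() looks the number up in the system's protocol
// database; 6 is typically reported as "tcp", and the call returns
// false when no entry can be found.
$protocol = getprotobynumber(6);
```

Since each pair returns the same data, the procedural variants add little beyond a second spelling, which may explain why so few packages bother with them.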

Overall, we can say that the least used things are:

  • obscure functions that should/could be implemented in userland (hebrevc, ezmlm_hash),
  • functionality that is obscure nowadays, such as the URL rewriter, and that cannot reasonably be done in userland, and
  • aliases providing a procedural API to selected objects (timezone_location_get).

Conclusion

In this post, I have shown how often individual PHP functions are used across Packagist packages. The most commonly used functions can be divided into three groups: functions handling variable types, functions working with strings, and functions working with arrays. The first group is connected with dynamic typing in PHP: one sometimes needs to find out what type a variable has. Fortunately, with gradual typing, this may no longer be a problem. Of course, gradual typing means that type declarations are not mandatory, but they can still remove some of the "boilerplate checks".
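As a small illustration of that last point, compare a function that checks its argument by hand with one that declares the type. This is a sketch with made-up function names; the union return type assumes PHP 8.0 or newer.

```php
<?php
declare(strict_types=1);

// Untyped version: the type has to be checked by hand.
function totalUntyped($prices)
{
    if (!is_array($prices)) {
        throw new InvalidArgumentException('Expected an array of prices.');
    }

    return array_sum($prices);
}

// Typed version: the engine rejects non-arrays with a TypeError,
// so the is_array() boilerplate disappears.
function totalTyped(array $prices): int|float
{
    return array_sum($prices);
}
```

The declaration only guarantees that $prices is an array, not what it contains, so checks on the elements may still be needed.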

As for the least used functions, they are mainly made up of obscure functions that could be implemented in userland, bizarre functionality such as the URL rewriter, and aliases providing a procedural API to selected objects. Whether PHP should provide, out of the box, a function for converting logical Hebrew text to visual text is a question for discussion.