Regular Expressions: Additional note

A question came up in the Regular Expressions class regarding character classes and ordering. It is worth noting that the locale (language) settings a person has on UNIX and Linux affect ranges in character classes like [a-z] and [A-Z].

You can check your language settings by running a command called "locale". If this outputs LC_COLLATE=C, then character classes will behave as I described in the class: ranges follow the order of the ASCII character table (http://www.ascii-code.com/).
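The value that "locale" reports for LC_COLLATE is derived from several environment variables. As a minimal sketch, assuming the standard POSIX precedence (LC_ALL overrides LC_COLLATE, which overrides LANG, with C as the fallback), the lookup can be written as a small shell function. The name effective_collate is hypothetical and used only for illustration.

```shell
# Hypothetical helper: print the collation locale the environment asks for.
# Assumes POSIX precedence: LC_ALL, then LC_COLLATE, then LANG, then "C".
# Empty values are treated the same as unset ones.
effective_collate() {
    printf '%s\n' "${LC_ALL:-${LC_COLLATE:-${LANG:-C}}}"
}

# Show the setting for the current shell session.
effective_collate
```

If this prints C (or POSIX), ranges should follow ASCII order. Comparing the result against the LC_COLLATE line printed by "locale" is a quick sanity check.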

However, if LC_COLLATE is set to almost anything else, the behavior may change. For example, with LC_COLLATE=en_US.UTF-8 (the default on the Mills and Farber clusters), ranges follow the collation (sorting) order of your language. This often means the order aAbBcC … yYzZ.

As an example, I have a directory with files starting with every letter of the alphabet, capital and lowercase, and my LC_COLLATE is set to en_US.UTF-8.

The following command lists all files starting with R, s, and S:

  ls [R-S]*

Whereas the following command lists only files starting with R and s:

  ls [R-s]*

In neither case are files starting with a lower-case r listed, because r sorts before R in this ordering and therefore falls outside both ranges.
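For comparison, the same two patterns can be tried under the C locale, where ranges follow ASCII order. This is a sketch only: the scratch directory and its file names are made up for illustration and are not the files from the example above. Note that the shell expands the pattern before the command runs, so prefixing ls with LC_ALL=C does not change the matching. The sketch starts a child shell in the C locale instead.

```shell
# Scratch directory with hypothetical file names (not the author's files).
demo_dir=$(mktemp -d)
touch "$demo_dir/a_lower" "$demo_dir/r_lower" "$demo_dir/R_upper" \
      "$demo_dir/s_lower" "$demo_dir/S_upper" \
      "$demo_dir/t_lower" "$demo_dir/T_upper"

# The shell that expands the pattern must itself run in the C locale,
# so launch a child shell with LC_ALL=C and let it do the globbing.
c_upper_range=$(cd "$demo_dir" && LC_ALL=C sh -c 'echo [R-S]*')
c_mixed_range=$(cd "$demo_dir" && LC_ALL=C sh -c 'echo [R-s]*')

echo "[R-S]* in the C locale: $c_upper_range"
echo "[R-s]* in the C locale: $c_mixed_range"

# Clean up the scratch directory.
rm -rf "$demo_dir"
```

In the C locale, [R-S] matches only R and S, while [R-s] spans everything from R to s in the ASCII table: the capitals R through Z, a few punctuation characters, and the lower-case letters a through s.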

This problem is further exacerbated by the fact that the behavior can vary a little from system to system. This leads most of the discussions on the Internet to the conclusion that character classes like [a-z] and [A-Z] should only be used when they would not be affected by these differences.
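One alternative that is often suggested, though not covered in the class, is to use the POSIX named classes [[:upper:]] and [[:lower:]] instead of letter ranges, since they do not depend on collation order. The sketch below uses a scratch directory with made-up file names. Be aware that in non-C locales these classes may also match letters outside of ASCII.

```shell
# Scratch directory with hypothetical file names for illustration.
demo_dir=$(mktemp -d)
touch "$demo_dir/r_lower" "$demo_dir/R_upper" \
      "$demo_dir/s_lower" "$demo_dir/S_upper"

# Named classes select by letter case rather than by position in a range.
upper_only=$(cd "$demo_dir" && echo [[:upper:]]*)
lower_only=$(cd "$demo_dir" && echo [[:lower:]]*)

echo "Upper-case names: $upper_only"
echo "Lower-case names: $lower_only"

# Clean up the scratch directory.
rm -rf "$demo_dir"
```

Another option is to run the command from a shell started with LC_ALL=C, at the cost of losing language-aware sorting for that session.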