Regular Expressions: Additional note
A question came up in the Regular Expressions class regarding character
classes and ordering. It is interesting to note that the language
settings a person has on UNIX and Linux will affect character classes
like [a-z] and [A-Z].
You can check your language settings by running the "locale" command.
If its output includes LC_COLLATE=C, then character classes behave as
I suggested in class: they follow the order of the ASCII character
table (http://www.ascii-code.com/).
However, if LC_COLLATE is set to almost anything else, the behavior
may change. For example, if LC_COLLATE=en_US.UTF-8 (the default on the
Mills and Farber clusters), your ranges will follow the collation
(sorting) order of your language. This often means the order
aAbBcC … yYzZ.
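Which collation applies is decided by several environment variables,
not by LC_COLLATE alone. The sketch below resolves the effective
setting by the POSIX precedence rule (LC_ALL first, then LC_COLLATE,
then LANG). The function name effective_collate is hypothetical, and
the final fallback to C is an assumption, since the default is left to
the implementation when none of the variables is set.

```shell
# Hypothetical helper: print the locale that governs collation.
# POSIX precedence: LC_ALL overrides LC_COLLATE, which overrides LANG.
effective_collate() {
    if [ -n "${LC_ALL:-}" ]; then
        printf '%s\n' "$LC_ALL"
    elif [ -n "${LC_COLLATE:-}" ]; then
        printf '%s\n' "$LC_COLLATE"
    elif [ -n "${LANG:-}" ]; then
        printf '%s\n' "$LANG"
    else
        # Assumption: many systems fall back to the C locale here.
        printf '%s\n' "C"
    fi
}

effective_collate
```

On systems that provide it, the "locale" command should report the same
value on its LC_COLLATE line.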
As an example, I have a directory with files starting with every
letter of the alphabet, capital and lowercase. The following command
prints all files starting with R, s, and S:
ls [R-S]*
Whereas the following command prints only files starting with R and s:
ls [R-s]*
In neither case were files starting with a lowercase r listed, because
r sorts before R in this collation order.
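For comparison, the same kind of pattern can be run with the collation
forced to C, where ranges follow ASCII order. This is a minimal sketch
under assumptions: the scratch directory and the four file names are
made up for illustration and are not the directory described above.
Note that the shell, not ls, expands the pattern, so the locale has to
be set in the shell that performs the expansion.

```shell
# Hypothetical scratch directory with one file per letter case.
range_demo=$(mktemp -d)
touch "$range_demo"/r_file "$range_demo"/R_file \
      "$range_demo"/s_file "$range_demo"/S_file

# Expand the range in a subshell whose collation is forced to C.
# The export happens before the glob is expanded.
c_matches=$(cd "$range_demo" && export LC_ALL=C && echo [R-S]*)

# Under C collation, only the uppercase names are expected:
# R_file S_file
echo "$c_matches"

rm -rf "$range_demo"
```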
This problem is further exacerbated by the fact that the behavior can
vary a little from system to system, which leads most discussions on
the Internet to conclude that character classes like [a-z] and [A-Z]
should only be used when they would not be affected by these
differences.
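One commonly suggested way to avoid ranges altogether is the POSIX
named classes [[:upper:]] and [[:lower:]], which select letters by case
rather than by position in a sort order. The sketch below is only an
illustration: the scratch directory and file names are hypothetical,
and it assumes a shell that supports named classes in file name
patterns, as bash does.

```shell
# Hypothetical scratch directory with mixed-case file names.
class_demo=$(mktemp -d)
touch "$class_demo"/apple "$class_demo"/Banana \
      "$class_demo"/cherry "$class_demo"/Date

# Named classes match by letter case, independent of collation order.
upper_matches=$(cd "$class_demo" && echo [[:upper:]]*)
lower_matches=$(cd "$class_demo" && echo [[:lower:]]*)

echo "uppercase: $upper_matches"
echo "lowercase: $lower_matches"

rm -rf "$class_demo"
```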