Situation
The wc command is a small application. It’s one of the core Linux utilities, so there is no need to install it. It’ll already be on your Linux computer. You can describe what it does in a very few words. It counts the lines, words, and bytes in a file or selection of files and prints the result in a terminal window. It can also take its input from the STDIN stream, meaning the text you want it to process can be piped into it. This is where wc really starts to add value.
It is a great example of the Linux mantra of “do one thing and do it well.” Because it accepts piped input, it can be used in multi-command incantations. As we’ll see, this little standalone utility is actually a great team player.
One way I use wc is as a placeholder in a complicated command or alias I’m cooking up. If the finished command has the potential to be destructive and delete files, I often use wc as a stand-in for the real, dangerous command. That way, during the development of the command I get visual feedback that each file is being processed as I expected. There’s no chance of anything bad happening while I’m wrestling with the syntax.
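A minimal sketch of that dry-run idea, assuming a hypothetical cleanup job that would eventually delete *.tmp files. The scratch directory and the filenames are invented for illustration; only mktemp, find, and wc are real commands here.

```shell
# Build a throwaway directory with hypothetical files to experiment on.
workdir=$(mktemp -d)
printf 'scratch\n' > "$workdir/old1.tmp"
printf 'more scratch\n' > "$workdir/old2.tmp"
printf 'keep me\n' > "$workdir/notes.txt"

# Dry run: wc -c stands in for the eventual rm, so the matched files
# are only counted and listed, never deleted.
dry_run=$(find "$workdir" -type f -name '*.tmp' -exec wc -c {} +)
printf '%s\n' "$dry_run"

# Tidy up the scratch directory once the experiment is over.
rm -rf "$workdir"
```

Once the listing shows exactly the files you expected, and nothing else, wc -c can be swapped for the real command.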
As simple as wc is, there are still a few small quirks that you need to know about.
Solution
Getting Started With wc
The simplest way to use wc is to pass the name of a text file on the command line.
wc lorem.txt
This causes wc to scan the file and count the lines, words, and bytes, and write them out to the terminal window.
Anything bounded by whitespace is considered a word. Whether the words come from a real language or not is irrelevant. If a file contains nothing but “frd g lkj”, it still counts as three words. Lines are sequences of characters terminated by a newline character. It doesn’t matter if the line wraps around in your editor or in the terminal window. Until wc encounters a newline, it’s still the same line.
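A quick sketch with made-up input shows how these rules play out. Strictly speaking, wc -l reports the number of newline characters it reads, so text with no newline at the end does not add to the line count. The tr -d ' ' calls are only there to strip the padding that some wc implementations add.

```shell
# Three whitespace-separated tokens count as three words.
word_count=$(printf 'frd g lkj\n' | wc -w | tr -d ' ')

# One newline means one line, however long the text is.
one_line=$(printf 'a long line that may wrap in the terminal\n' | wc -l | tr -d ' ')

# No trailing newline: the text is there, but no line is counted.
no_newline=$(printf 'no newline at the end' | wc -l | tr -d ' ')

echo "$word_count $one_line $no_newline"   # prints: 3 1 0
```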
Our first example found one line in the entire file. Here’s the content of the “lorem.txt” file.
cat lorem.txt
All of that counts as a single line because there are no newline characters within the text. Compare this to another file, “lorem2.txt”, and how wc interprets it.
wc lorem2.txt
cat lorem2.txt
This time, wc counts 15 lines because newline characters have been inserted into the text to start a new line at specific points. However, if you count the lines with text in them, you’ll see there are only 12. The other three lines are blank lines at the end of the file. These contain only newline characters. Even though there is no text in these lines, a new line has been started and so wc counts them as such.
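If you only want the lines that contain text, one option is to let grep do the counting instead. This is a sketch using an invented four-line sample, not the contents of lorem2.txt.

```shell
# Two lines of text followed by two blank lines (hypothetical sample).
sample() { printf 'first\nsecond\n\n\n'; }

# wc -l counts every newline, blank lines included.
all_lines=$(sample | wc -l | tr -d ' ')

# grep -c . counts only lines that contain at least one character.
text_lines=$(sample | grep -c .)

echo "$all_lines $text_lines"   # prints: 4 2
```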
We can pass as many files to wc as we like.
wc lorem.txt lorem2.txt
We get the statistics for each individual file and a total for all the files. We can also use wildcards so that we can select matching files instead of explicitly named files.
wc *.txt *.?
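When only the grand total matters, the final "total" line can be picked out of the report. The following is a sketch that uses two throwaway files with invented names and contents.

```shell
# Create two small hypothetical files in a scratch directory.
workdir=$(mktemp -d)
printf 'one\ntwo\n' > "$workdir/a.txt"
printf 'three\n' > "$workdir/b.txt"

# With more than one file, wc appends a final "total" line.
report=$(wc -l "$workdir"/*.txt)
printf '%s\n' "$report"

# Keep only the grand total: last line, first field.
total=$(printf '%s\n' "$report" | tail -n 1 | awk '{print $1}')
echo "$total"   # prints: 3

rm -rf "$workdir"
```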
The Command Line Options
By default, wc will display the lines, words, and bytes in each file. It’s the same as using the -l (lines), -w (words), and -c (bytes) options.
wc lorem.txt
wc -l -w -c lorem.txt
We can specify which combination of figures we wish to see.
wc -l lorem.txt
wc -w lorem.txt
wc -c lorem.txt
wc -l -c lorem.txt
Special attention should be paid to the last figure, generated by the -c (bytes) option. Many people mistake this for a character count. It actually counts bytes. The number of characters and the number of bytes might well be the same, but not always.
Let’s look at the contents of a file called “unicode.txt.”
cat unicode.txt
It has three words and a non-Latin alphabet character. We’ll let wc process the file with its default setting of bytes, and we’ll do it again but request characters with the -m (characters) option.
wc unicode.txt
wc -l -w -m unicode.txt
There are more bytes than there are characters.
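The same effect can be reproduced without the file. This sketch builds a hypothetical one-line sample ending in a three-byte UTF-8 character, written as octal escapes so the script itself stays plain ASCII. Note that the -m result depends on the locale the command runs in.

```shell
# "boy cat dog " is 12 ASCII bytes; \341\257\212 are the octal forms of
# the bytes 0xe1 0xaf 0x8a, which together encode one character in UTF-8.
sample() { printf 'boy cat dog \341\257\212\n'; }

byte_count=$(sample | wc -c | tr -d ' ')
char_count=$(sample | wc -m | tr -d ' ')

echo "bytes: $byte_count"   # prints: bytes: 16
# In a UTF-8 locale the character count is lower than the byte count;
# in the C locale wc -m falls back to counting bytes.
echo "chars: $char_count"
```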
Let’s have a look at the hex dump of the file and see what’s going on. The hexdump command’s -C (canonical) option displays the bytes in the file in lines of 16, with their plain ASCII equivalent (if there is one) shown at the end of the line. If there is no corresponding ASCII character, a period “.” is shown instead.
hexdump -C unicode.txt
In ASCII, a hexadecimal value of 0x20 represents a space character. If we count three values in from the left, we see the next value is a space character. So those first three values, 0x62, 0x6f, and 0x79, represent the letters in “boy.”
Hopping over the 0x20, we see another set of three hexadecimal values: 0x63, 0x61, and 0x74. These spell out “cat.” Hopping over the next space character we see three more values for the letters in “dog.” These are 0x64, 0x6f, and 0x67.
Right behind the word “dog” we can see a space character, 0x20, and five more hexadecimal values. The last two are newline (line feed) characters, 0x0a.
The other three bytes represent the non-Latin character, which we’ve ringed in green. It is a Unicode character, and it takes three bytes to encode it in UTF-8. These are 0xe1, 0xaf, and 0x8a.
So make sure you know what you’re counting, and that bytes and characters need not be the same. Usually, counting bytes is more useful because it tells you what is actually inside the file. Counting by characters gives you the number of things represented by the contents of the file.
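To check byte values like these yourself, you can dump a short string one byte at a time. This sketch swaps in od for hexdump because od is specified by POSIX, and it recreates only the tail of the sample text (the word “dog,” the space, the three-byte character, and the two newlines) rather than reading unicode.txt.

```shell
# od -An -tx1 prints each byte as two hex digits with no offset column.
# tr and sed only normalise the spacing of od's output.
hex=$(printf 'dog \341\257\212\n\n' | od -An -tx1 | tr -s ' \n' ' ' | sed 's/^ //; s/ $//')

echo "$hex"   # prints: 64 6f 67 20 e1 af 8a 0a 0a
```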
Taking Filenames From a File
There’s another way to provide filenames to wc. You can put the filenames in a file, and pass the name of that file to wc. It opens the file, extracts the filenames, and processes them as if they had been passed on the command line. This allows you to store an arbitrary collection of filenames for re-use.
But there’s a gotcha, and it’s a big one. The filenames must be null terminated, not newline terminated. That is, after each filename there must be a null byte of 0x00 instead of the usual newline byte 0x0a. It isn’t easy to create a file in this format with a text editor. Typically, files like this are generated by other programs. But, if you have such a file, this is how you would use it.
Here’s our file containing the filenames. Opening it in less shows you the strange “^@” characters that less uses to indicate null bytes.
less source-files-list.txt
To use the file with wc, we need to use the --files0-from (read input from) option and pass in the name of the file containing the filenames.
wc --files0-from=source-files-list.txt
The files are processed exactly as though they were provided on the command line.
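One way such a list might be generated is with find, whose -print0 action writes each path followed by a null byte. This is a sketch that assumes GNU wc and a find that supports -print0; the scratch directory, filenames, and the list name list.nul are all invented for illustration.

```shell
# Create two small hypothetical files in a scratch directory.
workdir=$(mktemp -d)
printf 'one\ntwo\n' > "$workdir/a.txt"
printf 'three\n' > "$workdir/b.txt"

# Write a null-terminated list of the matching files.
find "$workdir" -type f -name '*.txt' -print0 > "$workdir/list.nul"

# Feed the list to wc and keep the final "total" line.
files0_total=$(wc -l --files0-from="$workdir/list.nul" | tail -n 1)
echo "$files0_total"

rm -rf "$workdir"
```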
Piping Input to wc
A much more common, flexible, and productive way to send input to wc is to pipe the output from other commands into wc. We can demonstrate this with the echo command.
echo "Count this for me" | wc
echo -e "Count this\nfor me" | wc
The second echo command uses the -e (escaped characters) option to allow escaped sequences like the “\n” newline formatting code. This injects a new line, causing wc to see the input as two lines.
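The -e option is a feature of particular echo implementations, and some shells print it as ordinary text instead. In scripts, printf is a more portable way to get the same result, because it interprets “\n” in its format string itself. A small sketch:

```shell
# printf handles \n on its own, so no -e option is needed.
piped_lines=$(printf 'Count this\nfor me\n' | wc -l | tr -d ' ')
piped_words=$(printf 'Count this\nfor me\n' | wc -w | tr -d ' ')

echo "$piped_lines $piped_words"   # prints: 2 4
```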
Here’s a cascade of commands, each feeding its output to the next.
find ./* -type f | rev | cut -d'.' -f1 | rev | sort | uniq
- find looks for files (-type f) recursively, starting in the current directory.
- rev reverses the filenames.
- cut extracts the first field (-f1) by defining the field delimiter to be a period “.” and reading from the “front” of the reversed filename up to the first period it finds. We’ve now extracted the file extension, although it is still reversed.
- rev reverses the extracted first field.
- sort sorts them in ascending alphabetical order.
- uniq lists unique entries to the terminal window.
This command lists all of the unique file extensions in the current directory and any subdirectories.
If we added the -c (count) option to the uniq command it would count the occurrences of each extension type. But if we want to know how many different, unique file extensions there are, we can add wc as the last command on the line, and use the -l (lines) option.
find ./* -type f | rev | cut -d'.' -f1 | rev | sort | uniq | wc -l
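The difference between the two counts can be sketched with a fixed list of invented filenames standing in for the output of find. This variation swaps the rev, cut, rev sequence for a single sed expression, so it is not the pipeline above, just one way to reach the same result.

```shell
# Hypothetical file list standing in for the output of find.
list_files() {
  printf '%s\n' ./a.txt ./b.txt ./src/main.c ./src/util.c ./notes.md
}

# sed strips everything up to the last period, leaving the extension.
# uniq -c then reports how many times each extension occurs.
per_type=$(list_files | sed 's/.*\.//' | sort | uniq -c)
printf '%s\n' "$per_type"

# sort -u plus wc -l reports how many distinct extensions there are.
distinct=$(list_files | sed 's/.*\.//' | sort -u | wc -l | tr -d ' ')
echo "$distinct"   # prints: 3
```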
Here’s one last trick wc can do for you. It’ll tell you the length of the longest line in a file. Sadly, it doesn’t tell you which line it is. It just gives you the length.
wc -L taf.c
Beware, though, that wc -L reports display width, so each tab is counted as advancing to the next eight-column tab stop. Viewed in my editor, there are three two-space tabs at the start of that line. Its real length is 124 characters. So the figure reported is artificially expanded.
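The gap between display width and character count is easy to see on a tiny invented sample. This sketch assumes GNU wc for the -L option, and it uses awk as a substitute when you want the character count and the line number; awk is not something wc does for you.

```shell
# A hypothetical two-line sample; the second line starts with a tab.
sample() { printf 'short\n\tab\n'; }

# wc -L reports display width: the tab advances to column 8,
# then "ab" adds two more columns, so line 2 is the "longest".
display_width=$(sample | wc -L | tr -d ' ')
echo "$display_width"   # prints: 10

# awk counts the tab as one character and also reports the line number,
# so by character count line 1 ("short") is the longest.
longest=$(sample | awk 'length($0) > max { max = length($0); n = NR } END { print n, max }')
echo "$longest"   # prints: 1 5
```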
I’d treat this function with a big pinch of salt. And by that I mean don’t use it. Its output is misleading. Despite its quirks, wc is a great tool to drop into piped commands when you need to count all sorts of values, not just the words in a file.