Coding wc from GNU coreutils in Java

I ran into a coding challenge: write your own version of wc from GNU coreutils, i.e. the terminal application that counts bytes, words, lines and characters in its input. This is just a quick write-up of some observations and learnings; for the code, check out the finished jwc repo.

Implementation notes

The implementation basically consisted of:

Implementing parsing of the potential arguments (short options, long options and filenames). Normally a third-party library, something like Apache Commons CLI, would be the way to go, but I thought it could be fun to implement on my own.
Implementing logic for counting bytes, characters, lines, words and maximum line length for each file. This was relatively straightforward, although to be fair I stuck to single-byte encodings for this initial implementation. I might come back for multi-byte encodings another day.
Implementing the printing logic, i.e. what to print based on the command line arguments.
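To make the first step a bit more concrete, here is a minimal sketch of hand-rolled option parsing. It is not the parser from the jwc repo: the class name WcArgsSketch, the Option enum and the error messages are all made up for illustration. The only things taken from wc itself are the GNU option names (-c/--bytes, -m/--chars, -l/--lines, -w/--words, -L/--max-line-length).

```java
import java.util.ArrayList;
import java.util.EnumSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch, not the parser used in the jwc repo.
class WcArgsSketch {
    enum Option { BYTES, CHARS, LINES, WORDS, MAX_LINE_LENGTH }

    final Set<Option> options = EnumSet.noneOf(Option.class);
    final List<String> fileNames = new ArrayList<>();

    static WcArgsSketch parse(String... args) {
        WcArgsSketch parsed = new WcArgsSketch();
        boolean optionsEnded = false;
        for (String arg : args) {
            if (optionsEnded || arg.equals("-") || !arg.startsWith("-")) {
                // Plain operand: a file name ("-" conventionally means stdin)
                parsed.fileNames.add(arg);
            } else if (arg.equals("--")) {
                // Everything after "--" is a file name, even if it starts with "-"
                optionsEnded = true;
            } else if (arg.startsWith("--")) {
                parsed.options.add(fromLong(arg.substring(2)));
            } else {
                // Short options can be bundled, e.g. "-lw"
                for (char c : arg.substring(1).toCharArray()) {
                    parsed.options.add(fromShort(c));
                }
            }
        }
        return parsed;
    }

    private static Option fromShort(char c) {
        switch (c) {
            case 'c': return Option.BYTES;
            case 'm': return Option.CHARS;
            case 'l': return Option.LINES;
            case 'w': return Option.WORDS;
            case 'L': return Option.MAX_LINE_LENGTH;
            default: throw new IllegalArgumentException("invalid option -- '" + c + "'");
        }
    }

    private static Option fromLong(String name) {
        switch (name) {
            case "bytes": return Option.BYTES;
            case "chars": return Option.CHARS;
            case "lines": return Option.LINES;
            case "words": return Option.WORDS;
            case "max-line-length": return Option.MAX_LINE_LENGTH;
            default: throw new IllegalArgumentException("unrecognized option '--" + name + "'");
        }
    }

    public static void main(String[] args) {
        WcArgsSketch parsed = parse("-lw", "--bytes", "README.md");
        System.out.println(parsed.options + " " + parsed.fileNames);
    }
}
```

Collecting the options into an EnumSet means repeated flags (say, -l -l) collapse naturally, and the printing step can simply ask which counts were requested.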

The key counting logic looked like this:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// fileNames, wordCounterOptions and ProcessingResult are defined elsewhere in the repo
public List<String> process() {
  List<String> results = new ArrayList<>();
  for (String filename : this.fileNames) {
    int numberOfBytes = 0;
    int numberOfWords = 0;
    // TODO Count lines not just based on UNIX line endings, but also Windows
    int numberOfLines = 0;
    int maximumLineLength = 0;

    try {
      byte[] bytes = Files.readAllBytes(Paths.get(filename));
      boolean inWord = false;
      int currentLineLength = 0;

      for (byte b : bytes) {
        // Single-byte encoding assumed: each byte is treated as one character
        char ch = (char) b;

        if (!Character.isWhitespace(ch)) {
          // Non-whitespace case
          currentLineLength++;
          inWord = true;
        } else {
          // Whitespace case
          if (inWord) {
            // If we were in a word we add to count and "exit" the word
            numberOfWords++;
            inWord = false;
          }
          if (b == '\n') {
            // If we are at the end of a line, we increment the line count and
            // check if we've found a greater max line length
            numberOfLines++;
            maximumLineLength = Math.max(currentLineLength, maximumLineLength);
            currentLineLength = 0;
          } else {
            // If we've hit some other whitespace we simply increment the
            // current line length count
            currentLineLength++;
          }
        }
        // Increment number of bytes regardless
        numberOfBytes++;
      }
      // If the file ends in the middle of a word (no trailing whitespace),
      // that last word still needs to be counted
      if (inWord) {
        numberOfWords++;
      }
      // Same for a last line that isn't terminated by a newline
      maximumLineLength = Math.max(currentLineLength, maximumLineLength);
    } catch (IOException e) {
      // Note: any I/O failure is reported as a missing file here
      ProcessingResult processingResult =
          new ProcessingResult(filename, "No such file or directory");
      results.add(processingResult.toString());
      continue;
    }
    ProcessingResult processingResult =
        new ProcessingResult(
            filename,
            wordCounterOptions,
            numberOfBytes,
            numberOfWords,
            numberOfLines,
            maximumLineLength);
    results.add(processingResult.toString());
  }
  return results;
}

I had a look at the official GNU coreutils C implementation (981 lines at the time of writing!). Some points of inspiration to learn from:

This implementation is more memory efficient in that it doesn’t read the whole file into memory but rather processes it in chunks
It also accounts for carriage returns (\r), so Windows (\r\n) line endings are handled alongside UNIX (\n) ones, whereas my version only looks for \n
Multi-byte encoding is handled
There is a bunch of specialised handling to keep processing efficient, such as dedicated fast paths when only some of the counts are requested.
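The first point is easy to borrow. Below is a rough sketch of chunked counting in Java, under a few assumptions: the class name ChunkedCountSketch and the 16 KiB default buffer size are my own choices, the input is still treated as a single-byte encoding, and this is not how the GNU code is structured. The main thing to get right is that the "in a word" state has to survive from one chunk to the next.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: counts lines, words and bytes without loading the whole input.
class ChunkedCountSketch {
    // Returns {lines, words, bytes}
    static long[] count(InputStream in, int bufferSize) {
        byte[] buffer = new byte[bufferSize];
        long lines = 0, words = 0, bytes = 0;
        boolean inWord = false; // must persist across chunk boundaries
        try {
            int read;
            while ((read = in.read(buffer)) != -1) {
                bytes += read;
                for (int i = 0; i < read; i++) {
                    // Single-byte encoding assumed, as in the original version
                    char ch = (char) (buffer[i] & 0xFF);
                    if (Character.isWhitespace(ch)) {
                        if (inWord) {
                            words++;
                            inWord = false;
                        }
                        if (ch == '\n') {
                            lines++;
                        }
                    } else {
                        inWord = true;
                    }
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        if (inWord) {
            words++; // input ended in the middle of a word
        }
        return new long[] {lines, words, bytes};
    }

    static long[] count(InputStream in) {
        return count(in, 16 * 1024);
    }

    public static void main(String[] args) {
        byte[] data = "hello world\nfoo bar baz\n".getBytes(StandardCharsets.US_ASCII);
        long[] result = count(new ByteArrayInputStream(data));
        System.out.println(result[0] + " " + result[1] + " " + result[2]); // prints "2 5 24"
    }
}
```

With this shape, memory use is bounded by the buffer size rather than the file size, and the counters are longs so that large files don't overflow an int.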

A fun tidbit is the definition of the main method:

int
main (int argc, char **argv)
{
  ...
}

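At first I thought having the return type on a separate preceding line looked wildly off-base, but apparently this is the convention recommended by the GNU Coding Standards: the function name starts in the first column of its own line, which makes definitions easy to search for.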

GNU (Linux) vs BSD (macOS) implementations

I ended up doing some of the coding on an Ubuntu system (running under WSL 2) and some on macOS. To make sure I got things right, I looked up the man pages for wc on both systems as I was developing:

man wc

I was expecting uniformity, but it turns out the implementations differ in non-negligible ways.

The Ubuntu version lets the user select a --max-line-length option, while the macOS version only offers bytes, characters, lines and words. On macOS:

wc --max-line-length README.md
wc: illegal option -- -
usage: wc [-clmw] [file ...]

On macOS the bytes (-c) and characters (-m) flags are mutually exclusive. We can see here that the tool can display bytes and characters individually, but when both are given the later one cancels the earlier one, so with -cm the character count overrides bytes:

wc -c README.md
    9083 README.md
wc -m README.md
    8848 README.md
wc -cm README.md
    8848 README.md
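The gap between the two numbers comes from multi-byte characters: in UTF-8, anything outside ASCII takes more than one byte. As a small illustration (my own sketch, with a made-up class name, assuming the input is valid UTF-8), characters can be counted by skipping continuation bytes, which always have the bit pattern 10xxxxxx:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: byte count vs. character count for UTF-8 input.
class BytesVsCharsSketch {
    // Counts Unicode code points by counting every byte that is NOT a
    // UTF-8 continuation byte (10xxxxxx). Assumes the input is valid UTF-8.
    static long countUtf8Chars(byte[] bytes) {
        long chars = 0;
        for (byte b : bytes) {
            if ((b & 0xC0) != 0x80) {
                chars++;
            }
        }
        return chars;
    }

    public static void main(String[] args) {
        // "naïve café": ï and é each take two bytes in UTF-8
        byte[] bytes = "na\u00EFve caf\u00E9".getBytes(StandardCharsets.UTF_8);
        System.out.println(bytes.length + " bytes, " + countUtf8Chars(bytes) + " characters");
        // prints "12 bytes, 10 characters"
    }
}
```

Real wc implementations decode according to the current locale rather than hard-coding UTF-8, so this only mirrors the behaviour in a UTF-8 locale.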

Further, I noticed the macOS version won’t accept the --version flag to print the tool version:

wc --version
wc: illegal option -- -
usage: wc [-clmw] [file ...]

I ended up mimicking the Linux version, as this was the one I started out with. After some more digging, I found out that the difference stems from Linux distributions using GNU coreutils, while macOS sticks to BSD implementations of the same tools. Getting into why appears to be a bit of a rabbit hole, but apparently NeXTSTEP, an operating system that came out of Steve Jobs’ foray outside of Apple, was in part derived from BSD. When Jobs came back to Apple, the then-new Mac OS X (now macOS) ended up being based in part on NeXTSTEP, so the BSD tools came along for the ride.

GNU coreutils can be installed on macOS with Homebrew. The tools are then accessed by prefixing the tool name with g. We can see that gwc accepts --max-line-length:

gwc --max-line-length README.md
373 README.md

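Bytes and characters are no longer mutually exclusive. GNU wc always prints the counts in a fixed order (lines, words, characters, bytes, maximum line length), so the character count comes first here: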

gwc -cm README.md
8848 9083 README.md
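That fixed ordering is the core of the printing logic. A minimal sketch of it, assuming a made-up OutputOrderSketch class in which a null count means "not requested", and ignoring the column padding that the real tool applies:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: emits the requested counts in GNU wc's fixed order,
// regardless of the order the flags were given in. No column padding.
class OutputOrderSketch {
    static String format(String fileName, Long lines, Long words, Long chars,
                         Long bytes, Long maxLineLength) {
        List<String> columns = new ArrayList<>();
        // Order matters: lines, words, characters, bytes, maximum line length
        for (Long count : new Long[] {lines, words, chars, bytes, maxLineLength}) {
            if (count != null) {
                columns.add(Long.toString(count));
            }
        }
        columns.add(fileName);
        return String.join(" ", columns);
    }

    public static void main(String[] args) {
        // Equivalent of "gwc -cm README.md" with the counts from above
        System.out.println(format("README.md", null, null, 8848L, 9083L, null));
        // prints "8848 9083 README.md"
    }
}
```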

We are also able to print the version of the tool:

gwc --version
wc (GNU coreutils) 9.4
Copyright (C) 2023 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <https://gnu.org/licenses/gpl.html>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Written by Paul Rubin and David MacKenzie.