sed with Rust

As part of a project I’m working on, I sometimes find myself having to deal with quite large X12 files. For those who haven’t had the pleasure, X12 is an ANSI standard for “electronic data interchange.” It could be considered a very distant spiritual ancestor of XML; it was designed to meet the same kind of need, but way back in days of yore when individual bytes were valuable things to be cherished. The standards committee was originally chartered back in 1979, so it’s officially older than I am.
Anyway.
X12 has various issues, but the one that triggered this post is that a file often contains just a single line. An X12 file starts with a fixed-length 106-byte header called an ISA segment. Among other things, this header specifies three different terminator characters to be used in the rest of the document.
Once the fixed-length header is out of the way, the rest of the document comprises a sequence of variable-length records known as segments. The end of each segment is marked by a segment terminator character. A trailing newline is widely permitted by processing tools, but is not required.
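To make that concrete, here’s a heavily abridged, made-up example (a real ISA header is exactly 106 bytes, which I’ve elided); note that the whole document is one long line:
ISA*00*...(106-byte header elided)...~GS*HC*...~ST*837*0001~NM1*...~SE*4*0001~GE*1*1~IEA*1*000000001~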
The problem with everything being on a single line is that the vast majority
of the standard Unix toolbox for data processing and exploration is designed
to work with data a line at a time. For example, say you want to take a peek
at the beginning of a large file to see what sort of records it contains:
$ head < file
will print the first 10 lines to your terminal. If the entire
file you’re dealing with is a single 1.3 gigabyte line, this is less than
helpful.
Of course, the toolbox does contain ways of dealing with this kind of problem.
The traditional and most widely-used segment terminator in X12 is the tilde (~). If we wanted the first 10 lines of an X12 file, we could use sed to insert newlines after every tilde before piping the output to head:
$ sed -e 's/~/~\n/g' < INPUT | head
(This approach—which would be too naïve for most document formats—is safe in our case because X12 doesn’t support such unnecessary fripperies as “escaping”; it’s up to the author to ensure that the chosen terminators do not appear in any of the data fields. This is why each document can specify its own terminators.)
This works, but it has a number of problems:

- ~ as a terminator is extremely common, but not required. A general-purpose tool needs to look up the correct terminator in the header.
- It isn’t idempotent: if the input already contains newlines after the terminators, the substitution adds extra blank lines rather than leaving the file alone.
- Even though head is only going to take the first 10 lines, this can take some time on a large file.

There are other tools that will fix some of these issues; for example, we could use Perl:
$ perl -pe 's/~[\n\r]*/~\n/g' < INPUT | head
This solves the idempotency issue, but not the others, and it’s even more to remember.
What I’d really like is a small, self-contained tool that I can pass an X12 file to and rely on it to Do The Right Thing™ without any unnecessary incantations. Since I’m dealing with large source files it would also be nice if it was at least as fast as standard tools like sed. Sounds like a job for…
Happily, Rust makes it really easy to write this sort of command-line utility without the issues that plague such code in C.
Since we’re interested in speed, let’s set a quick baseline. I’m running Linux on an Intel Core i9-7940X. I’ll be testing with a 1.3 GB X12 file, containing no newlines, stored on a RAM disk.
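(If you want to reproduce the setup, one way to get a tmpfs-backed RAM disk on Linux is shown below; the size and mount point are just placeholders.)
$ sudo mkdir -p /mnt/ramdisk
$ sudo mount -t tmpfs -o size=2g tmpfs /mnt/ramdisk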
$ time sed -e 's/~/~\n/g' < testfile.x12 > /dev/null
# -> 7.65 seconds
Now we’ve got something to compare against, let’s try a straightforward Rust version:
use aho_corasick::AhoCorasick;
use std::io::{self, stdin, stdout, BufReader, BufWriter};

fn main() -> io::Result<()> {
    let reader = BufReader::new(stdin());
    let writer = BufWriter::new(stdout());

    // Replace every "~" with "~\n" as the stream is copied from
    // stdin to stdout.
    let patterns = &["~"];
    let replace_with = &["~\n"];
    let ac = AhoCorasick::new(patterns);
    ac.stream_replace_all(reader, writer, replace_with)
}
We aren’t doing terminator detection yet, and we only handle IO via STDIN and STDOUT, but timing this with the same file gives us 1.68 seconds. Not. Bad.
Reading the correct terminator out of the ISA segment is a simple improvement:
use aho_corasick::AhoCorasick;
use std::io::{self, stdin, stdout, Read, BufReader, BufWriter};
use std::str;

fn main() -> io::Result<()> {
    let mut reader = BufReader::new(stdin());
    let writer = BufWriter::new(stdout());

    // The ISA header is always exactly 106 bytes; its final byte is
    // the segment terminator used by the rest of the document.
    let mut isa = vec![0u8; 106];
    reader.read_exact(&mut isa)?;
    let terminator = str::from_utf8(&isa[105..=105]).unwrap();

    let patterns = &[terminator];
    let replace_with = &[format!("{}\n", terminator)];
    AhoCorasick::new(patterns)
        .stream_replace_all(reader, writer, replace_with)
}
This adds a few lines to the source, but doesn’t affect the runtime measurably. (The eagle-eyed might have noticed that this version fails to write the ISA segment; I’m ignoring that in favour of keeping the code short.)
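(If you did want to keep the header, the fix is small; something like the following two lines, echoing the ISA bytes straight back out before the replacement pass, would do it, assuming the writer is made mutable and std::io::Write is in scope:)
    // Pass the 106-byte header through unchanged, on its own line.
    writer.write_all(&isa)?;
    writer.write_all(b"\n")?;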
This is already a significant improvement over the sed one-liner in terms of speed, and it will automatically detect the correct terminator for us.
Unfortunately, we start to run into difficulty with this approach if we want to handle newlines correctly. We could easily replace all newlines, or we could replace a single newline following a terminator, but we can’t match “any number of newlines but only following a terminator.”
We could try applying a regular expression to the stream, but that seems like overkill for such a simple transformation. What if we process the bytestream ourselves without relying on the aho_corasick library?
use std::io::{self, stdin, stdout, Read, BufReader, BufWriter, ErrorKind};
use byteorder::{ReadBytesExt, WriteBytesExt};

fn main() -> io::Result<()> {
    let mut reader = BufReader::new(stdin());
    let mut writer = BufWriter::new(stdout());

    // Read the fixed-length ISA header and pull out the terminator byte.
    let mut isa = vec![0u8; 106];
    reader.read_exact(&mut isa)?;
    let terminator = isa[105];

    // Copy the stream one byte at a time, appending a newline after
    // every terminator we see.
    loop {
        match reader.read_u8() {
            Ok(c) => {
                writer.write_u8(c)?;
                if c == terminator {
                    writer.write_u8(b'\n')?;
                }
            }
            Err(ref e) if e.kind() == ErrorKind::UnexpectedEof => {
                return Ok(());
            }
            Err(err) => {
                return Err(err);
            }
        };
    }
}
Not much longer, and, although we aren’t handling newlines in this version, at least we have an easy place to add the code for them.
How does the performance compare?
13 seconds. Ouch. Turns out that stream_replace_all was doing quite a lot of work for us to make the operation efficient.
We can regain a lot of that time—at the cost of some more code—by managing our own buffer rather than relying on the BufReader and lots of 1-byte read_u8() calls:
use std::io::{self, stdin, stdout, Read, Write, BufReader, BufWriter, ErrorKind};
use byteorder::WriteBytesExt;

const BUF_SIZE: usize = 16384;

fn main() -> io::Result<()> {
    let mut reader = BufReader::new(stdin());
    let mut writer = BufWriter::new(stdout());

    let mut isa = vec![0u8; 106];
    reader.read_exact(&mut isa)?;
    let terminator = isa[105];

    let mut buf = vec![0u8; BUF_SIZE];
    loop {
        match reader.read(&mut buf) {
            Ok(0) => { return Ok(()) } // EOF
            Ok(n) => {
                let mut i = 0;
                let mut start = 0;
                loop {
                    if i == n {
                        // No terminator found in the rest of the
                        // buffer. Write it all out and read some more.
                        writer.write_all(&buf[start..n])?;
                        break;
                    }
                    if buf[i] == terminator {
                        // Write everything up to and including the
                        // terminator, followed by a newline.
                        writer.write_all(&buf[start..=i])?;
                        writer.write_u8(b'\n')?;
                        start = i + 1;
                    }
                    i += 1;
                }
            }
            Err(ref e) if e.kind() == ErrorKind::UnexpectedEof => {
                return Ok(());
            }
            Err(err) => {
                return Err(err);
            }
        };
    }
}
As well as greatly reducing the number of read_u8 calls, we also gain a huge speedup by writing each segment in a single call to write_all (or two, for the last segment in a buffer) rather than a write_u8 call per character.
The code is starting to get a bit more involved, but we’re down to 1.75 seconds. That’s a bit better! But… still slower than the first version. What magic is going on there? A look at the dependencies of the aho_corasick crate offers a clue: it depends on memchr.
Modern processors support a variety of different instructions for performing operations on vectorized data, often broadly lumped together under the heading of SIMD, for Single Instruction Multiple Data. Without getting into too much detail, this means that instead of checking each byte in the whole file to see if it is a terminator character it’s possible to load a bunch of them into a register in one go and compare them all in just a single cycle. Precisely how many depends on the exact features offered by the particular CPU the code is running on.
Sounds a bit complicated to manage, right? Under the hood it is, but the excellent memchr library in Rust hides all of the internals and gives us an interface that’s higher-level than the manual code above, easier to use, and faster.
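The interface is tiny, too. As a quick illustration (not part of the eventual tool), memchr takes a needle byte and a haystack slice and returns the index of the first match, picking the fastest SIMD routine available on the current CPU behind the scenes:
use memchr::memchr;

let haystack = b"GS*HC*SENDER*RECEIVER~ST*837*0001~";
assert_eq!(memchr(b'~', haystack), Some(21));
assert_eq!(memchr(b'!', haystack), None);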
Since nothing else has changed I’ve just included the main loop this time:
loop {
    let n = reader.read(&mut buf)?;
    if n == 0 { return Ok(()); }

    let mut start = 0;
    while start < n {
        // Finds the index of the next terminator in the buffer,
        // checking a chunk of buffer in parallel if possible.
        match memchr(terminator, &buf[start..n]) {
            Some(offset) => {
                writer.write_all(&buf[start..=start + offset])?;
                writer.write_u8(b'\n')?;
                start = start + offset + 1;
            }
            None => {
                // No terminator in the rest of this buffer; write it
                // out and read some more.
                writer.write_all(&buf[start..n])?;
                break;
            }
        }
    }
}
How much faster? This processes the same file in 1.01 seconds, using nicer code.
We’re still not actually processing newlines yet, and in fact doing so unfortunately makes the code a lot more intricate, since it introduces a few tricky edge cases. The problem is that we might have a terminator as the very last character in one buffer, followed by a series of newlines as the first few characters of the next buffer we read. This makes life a bit uglier (a sketch of one way to handle it is below), but doesn’t really change the performance, so we’ll stop here! If you’re interested in what the full version with newline and error handling looks like, you can check out the code on GitHub.
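To give a flavour of the extra bookkeeping, here’s a rough sketch of one way to handle it (the version on GitHub may differ in the details): carry a flag across reads recording whether the last byte written was a terminator, and swallow any newlines seen while that flag is set.
let mut after_terminator = false;
loop {
    let n = reader.read(&mut buf)?;
    if n == 0 { return Ok(()); }

    let mut start = 0;
    while start < n {
        // Drop newlines that directly follow a terminator, even when
        // they arrive at the start of a fresh buffer.
        if after_terminator {
            while start < n && (buf[start] == b'\n' || buf[start] == b'\r') {
                start += 1;
            }
            if start == n { break; }
            after_terminator = false;
        }
        match memchr(terminator, &buf[start..n]) {
            Some(offset) => {
                writer.write_all(&buf[start..=start + offset])?;
                writer.write_u8(b'\n')?;
                start += offset + 1;
                after_terminator = true;
            }
            None => {
                writer.write_all(&buf[start..n])?;
                break;
            }
        }
    }
}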
Let’s recap the different approaches, using the same 1.3 gigabyte test file.
| Method | Time |
|---|---|
| perl -le '$/="~";print $_, "~" while <>' | 6.25s |
| sed -e 's/~/~\n/g' | 7.25s |
| awk 'BEGIN{RS="~";OFS=""}{print $0, RS}' | 7.54s |
| Aho-Corasick | 1.68s |
| Simple manual loop over characters | 13s |
| Buffering; consolidated writes | 1.75s |
| ditto with memchr | 1.01s |
It shouldn’t be too surprising that we were able to beat the general-purpose tools—it’s really an unfair comparison because we’re doing so much less work.
In general, as long as you’re willing to accept the trade-off of writing and debugging quite a bit of code—and handling the edge cases that crop up along the way—it should almost always be possible to outperform a general-purpose tool for a single, specific use-case.
The real question is whether it’s worth the time and effort in any given case; the sed and perl alternatives are quick-and-dirty one-liners that can be written off the top of your head. A full version in GNU awk with all the features I needed, including terminator detection and newline handling, is only 14 lines. The Rust version weighs in at around 100 lines of code (some of which deal with CLI option processing and file handling), and took considerable time and experimentation before I ended up with the final version.
The final product is available on GitHub and is just what I wanted; it gives me an easy—and fast!—way to split X12 files into segments, either for quick inspection or further processing.
I'm Mike Clarke, an independent consultant and contractor specialising in high-performance APIs and systems development.
I hope you found the article useful!