sed with Rust

As part of a project I’m working on, I sometimes find myself having to deal with quite large X12 files. For those who haven’t had the pleasure, X12 is an ANSI standard for “electronic data interchange.” It could be considered a very distant spiritual ancestor of XML; it was designed to meet the same kind of need, but way back in days of yore when individual bytes were valuable things to be cherished. The standards committee was originally chartered back in 1979, so it’s officially older than I am.
Anyway.
X12 has various issues, but the one that triggered this post is that a file often contains just a single line. An X12 file starts with a fixed-length 106-byte header called an ISA segment. Among other things, this header specifies three different terminator characters to be used in the rest of the document.
Once the fixed-length header is out of the way, the rest of the document comprises a sequence of variable-length records known as segments. The end of each segment is marked by a segment terminator character. A trailing newline is widely permitted by processing tools, but is not required.
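To make that concrete, here’s a heavily abridged, made-up example (a real ISA header is exactly 106 bytes, which I’ve elided); note that the whole document is one long line:
ISA*00*...(106-byte header elided)...~GS*HC*...~ST*837*0001~NM1*...~SE*4*0001~GE*1*1~IEA*1*000000001~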
The problem with everything being on a single line is that the vast majority
of the standard Unix toolbox for data processing and exploration is designed
to work with data a line at a time. For example, say you want to take a peek
at the beginning of a large file to see what sort of records it contains:
$ head < file
will print the first 10 lines to your terminal. If the entire
file you’re dealing with is a single 1.3 gigabyte line, this is less than
helpful.
Of course, the toolbox does contain ways of dealing with this kind of problem.
The traditional and most widely-used segment terminator in X12 is the tilde (~). If we wanted the first 10 lines of an X12 file, we could use sed to insert newlines after every tilde before piping the output to head:
$ sed -e 's/~/~\n/g' < INPUT | head
(This approach—which would be too naïve for most document formats—is safe in our case because X12 doesn’t support such unnecessary fripperies as “escaping”; it’s up to the author to ensure that the chosen terminators do not appear in any of the data fields. This is why each document can specify its own terminators.)
This works, but it has a number of problems:

- ~ as a terminator is extremely common, but not required. A general-purpose tool needs to look up the correct terminator in the header.
- It isn’t idempotent: if the input already contains newlines after the terminators, the substitution adds extra blank lines rather than leaving the file alone.
- Even though head is only going to take the first 10 lines, this can take some time on a large file.

There are other tools that will fix some of these issues; for example, we could use Perl:
$ perl -pe 's/~[\n\r]*/~\n/g' < INPUT | head
This solves the idempotency issue, but not the others, and it’s even more to remember.
What I’d really like is a small, self-contained tool that I can pass an X12 file to and rely on it to Do The Right Thing™ without any unnecessary incantations. Since I’m dealing with large source files it would also be nice if it was at least as fast as standard tools like sed. Sounds like a job for…
Happily, Rust makes it really easy to write this sort of command-line utility without the issues that plague such code in C.
Since we’re interested in speed, let’s set a quick baseline. I’m running Linux on an Intel Core i9-7940X. I’ll be testing with a 1.3 GB X12 file, containing no newlines, stored on a RAM disk.
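(If you want to reproduce the setup, one way to get a tmpfs-backed RAM disk on Linux is shown below; the size and mount point are just placeholders.)
$ sudo mkdir -p /mnt/ramdisk
$ sudo mount -t tmpfs -o size=2g tmpfs /mnt/ramdisk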
$ time sed -e 's/~/~\n/g' < testfile.x12 > /dev/null
# -> 7.65 seconds
Now we’ve got something to compare against, let’s try a straightforward Rust version:
use aho_corasick::AhoCorasick;
use std::io::{self, stdin, stdout, BufReader, BufWriter};

fn main() -> io::Result<()> {
    let reader = BufReader::new(stdin());
    let writer = BufWriter::new(stdout());

    // Replace every "~" with "~\n" as the stream is copied from
    // stdin to stdout.
    let patterns = &["~"];
    let replace_with = &["~\n"];
    let ac = AhoCorasick::new(patterns);
    ac.stream_replace_all(reader, writer, replace_with)
}
We aren’t doing terminator detection yet, and we only handle IO via STDIN and STDOUT, but timing this with the same file gives us 1.68 seconds. Not. Bad.
Reading the correct terminator out of the ISA segment is a simple improvement:
use aho_corasick::AhoCorasick;
use std::io::{self, stdin, stdout, Read, BufReader, BufWriter};
use std::str;

fn main() -> io::Result<()> {
    let mut reader = BufReader::new(stdin());
    let writer = BufWriter::new(stdout());

    // The ISA header is always exactly 106 bytes; its final byte is
    // the segment terminator used by the rest of the document.
    let mut isa = vec![0u8; 106];
    reader.read_exact(&mut isa)?;
    let terminator = str::from_utf8(&isa[105..=105]).unwrap();

    let patterns = &[terminator];
    let replace_with = &[format!("{}\n", terminator)];
    AhoCorasick::new(patterns)
        .stream_replace_all(reader, writer, replace_with)
}
This adds a few lines to the source, but doesn’t affect the runtime measurably. (The eagle-eyed might have noticed that this version fails to write the ISA segment; I’m ignoring that in favour of keeping the code short.)
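(If you did want to keep the header, the fix is small; something like the following two lines, echoing the ISA bytes straight back out before the replacement pass, would do it, assuming the writer is made mutable and std::io::Write is in scope:)
    // Pass the 106-byte header through unchanged, on its own line.
    writer.write_all(&isa)?;
    writer.write_all(b"\n")?;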
This is already a significant improvement over the sed one-liner in terms of speed, and it will automatically detect the correct terminator for us.
Unfortunately, we start to run into difficulty with this approach if we want to handle newlines correctly. We could easily replace all newlines, or we could replace a single newline following a terminator, but we can’t match “any number of newlines but only following a terminator.”
We could try applying a regular expression to the stream, but that seems like overkill for such a simple transformation. What if we process the bytestream ourselves without relying on the aho_corasick library?
use std::io::{self, stdin, stdout, Read, BufReader, BufWriter, ErrorKind};
use byteorder::{ReadBytesExt, WriteBytesExt};

fn main() -> io::Result<()> {
    let mut reader = BufReader::new(stdin());
    let mut writer = BufWriter::new(stdout());

    // Read the fixed-length ISA header and pull out the terminator byte.
    let mut isa = vec![0u8; 106];
    reader.read_exact(&mut isa)?;
    let terminator = isa[105];

    // Copy the stream one byte at a time, appending a newline after
    // every terminator we see.
    loop {
        match reader.read_u8() {
            Ok(c) => {
                writer.write_u8(c)?;
                if c == terminator {
                    writer.write_u8(b'\n')?;
                }
            }
            Err(ref e) if e.kind() == ErrorKind::UnexpectedEof => {
                return Ok(());
            }
            Err(err) => {
                return Err(err);
            }
        };
    }
}
Not much longer, and, although we aren’t handling newlines in this version, at least we have an easy place to add the code for them.
How does the performance compare?
13 seconds. Ouch. Turns out that stream_replace_all was doing quite a lot of work for us to make the operation efficient.
We can regain a lot of that time—at the cost of some more code—by managing our own buffer rather than relying on the BufReader and lots of 1-byte read_u8() calls:
use std::io::{self, stdin, stdout, Read, Write, BufReader, BufWriter, ErrorKind};
use byteorder::WriteBytesExt;

const BUF_SIZE: usize = 16384;

fn main() -> io::Result<()> {
    let mut reader = BufReader::new(stdin());
    let mut writer = BufWriter::new(stdout());

    let mut isa = vec![0u8; 106];
    reader.read_exact(&mut isa)?;
    let terminator = isa[105];

    let mut buf = vec![0u8; BUF_SIZE];
    loop {
        match reader.read(&mut buf) {
            Ok(0) => { return Ok(()) } // EOF
            Ok(n) => {
                let mut i = 0;
                let mut start = 0;
                loop {
                    if i == n {
                        // No terminator found in the rest of the
                        // buffer. Write it all out and read some more.
                        writer.write_all(&buf[start..n])?;
                        break;
                    }
                    if buf[i] == terminator {
                        // Write everything up to and including the
                        // terminator, followed by a newline.
                        writer.write_all(&buf[start..=i])?;
                        writer.write_u8(b'\n')?;
                        start = i + 1;
                    }
                    i += 1;
                }
            }
            Err(ref e) if e.kind() == ErrorKind::UnexpectedEof => {
                return Ok(());
            }
            Err(err) => {
                return Err(err);
            }
        };
    }
}
As well as greatly reducing the number of read_u8 calls, we also gain a huge speedup by writing each segment in a single call to write_all (or two, for the last segment in a buffer) rather than a write_u8 call per character.
The code is starting to get a bit more involved, but we’re down to 1.75 seconds. That’s a bit better! But… still slower than the first version. What magic is going on there? A look at the dependencies of the aho_corasick crate offers a clue: it depends on memchr.
Modern processors support a variety of different instructions for performing operations on vectorized data, often broadly lumped together under the heading of SIMD, for Single Instruction Multiple Data. Without getting into too much detail, this means that instead of checking each byte in the whole file to see if it is a terminator character it’s possible to load a bunch of them into a register in one go and compare them all in just a single cycle. Precisely how many depends on the exact features offered by the particular CPU the code is running on.
Sounds a bit complicated to manage, right? Under the hood it is, but the excellent memchr library in Rust hides all of the internals and gives us an interface that’s higher-level than the manual code above, easier to use, and faster.
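The interface is tiny, too. As a quick illustration (not part of the eventual tool), memchr takes a needle byte and a haystack slice and returns the index of the first match, picking the fastest SIMD routine available on the current CPU behind the scenes:
use memchr::memchr;

let haystack = b"GS*HC*SENDER*RECEIVER~ST*837*0001~";
assert_eq!(memchr(b'~', haystack), Some(21));
assert_eq!(memchr(b'!', haystack), None);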
Since nothing else has changed I’ve just included the main loop this time:
loop {
    let n = reader.read(&mut buf)?;
    if n == 0 { return Ok(()); }

    let mut start = 0;
    while start < n {
        // Finds the index of the next terminator in the buffer,
        // checking a chunk of buffer in parallel if possible.
        match memchr(terminator, &buf[start..n]) {
            Some(offset) => {
                writer.write_all(&buf[start..=start + offset])?;
                writer.write_u8(b'\n')?;
                start = start + offset + 1;
            }
            None => {
                // No terminator in the rest of this buffer; write it
                // out and read some more.
                writer.write_all(&buf[start..n])?;
                break;
            }
        }
    }
}
How much faster? This processes the same file in 1.01 seconds, using nicer code.
We’re still not actually processing newlines yet, and in fact doing so unfortunately makes the code a lot more intricate, since it introduces a few tricky edge cases. The problem is that we might have a terminator as the very last character in one buffer, followed by a series of newlines as the first few characters of the next buffer we read. This makes life a bit uglier (a sketch of one way to handle it is below), but doesn’t really change the performance, so we’ll stop here! If you’re interested in what the full version with newline and error handling looks like, you can check out the code on GitHub.
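To give a flavour of the extra bookkeeping, here’s a rough sketch of one way to handle it (the version on GitHub may differ in the details): carry a flag across reads recording whether the last byte written was a terminator, and swallow any newlines seen while that flag is set.
let mut after_terminator = false;
loop {
    let n = reader.read(&mut buf)?;
    if n == 0 { return Ok(()); }

    let mut start = 0;
    while start < n {
        // Drop newlines that directly follow a terminator, even when
        // they arrive at the start of a fresh buffer.
        if after_terminator {
            while start < n && (buf[start] == b'\n' || buf[start] == b'\r') {
                start += 1;
            }
            if start == n { break; }
            after_terminator = false;
        }
        match memchr(terminator, &buf[start..n]) {
            Some(offset) => {
                writer.write_all(&buf[start..=start + offset])?;
                writer.write_u8(b'\n')?;
                start += offset + 1;
                after_terminator = true;
            }
            None => {
                writer.write_all(&buf[start..n])?;
                break;
            }
        }
    }
}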
Let’s recap the different approaches, using the same 1.3 gigabyte test file.
| Method | Time |
|---|---|
| perl -le '$/="~";print $_, "~" while <>' | 6.25s |
| sed -e 's/~/~\n/g' | 7.25s |
| awk 'BEGIN{RS="~";OFS=""}{print $0, RS}' | 7.54s |
| Aho-Corasick | 1.68s |
| Simple manual loop over characters | 13s |
| Buffering; consolidated writes | 1.75s |
| ditto with memchr | 1.01s |
It shouldn’t be too surprising that we were able to beat the general-purpose tools—it’s really an unfair comparison because we’re doing so much less work.
In general, as long as you’re willing to accept the trade-off of writing and debugging quite a bit of code—and handling the edge cases that crop up along the way—it should almost always be possible to outperform a general-purpose tool for a single, specific use-case.
The real question is whether it’s worth the time and effort in any given case; the sed and perl alternatives are quick-and-dirty one-liners that can be written off the top of your head. A full version in GNU awk with all the features I needed, including terminator detection and newline handling, is only 14 lines. The Rust version weighs in at around 100 lines of code (some of which deal with CLI option processing and file handling), and took considerable time and experimentation before I ended up with the final version.
The final product is available on GitHub and is just what I wanted; it gives me an easy—and fast!—way to split X12 files into segments, either for quick inspection or further processing.
I'm Mike Clarke, an independent consultant and contractor specialising in high-performance APIs and systems development.
I hope you found the article useful!