performance testing – markdown to html parser


In my earlier post on the markdown to html parser written in awk, I mentioned the possibility of migrating the code over to Perl for a speed increase.

This was driven mostly by two things:

  • Dependency on gawk-only functions, reducing portability
  • Something else I don’t remember; probably suspicion about the performance of gensub (see the sketch below)
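
To illustrate the portability point, here’s a hedged sketch (not the actual rules from markd.awk or markd.pl) of the kind of substitution involved. gensub() returns a modified copy of the string and supports backreferences, but it is a gawk extension; POSIX awk’s sub()/gsub() cannot reference capture groups at all. In Perl the same thing is a plain s///g:

$ echo '*word*' | gawk '{ print gensub(/\*([^*]+)\*/, "<i>\\1</i>", "g") }'
<i>word</i>
$ echo '*word*' | perl -pe 's/\*([^*]+)\*/<i>$1<\/i>/g'
<i>word</i>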

I did write the Perl version, and it reads very similarly to the awk script. The two versions do not have feature parity (the awk version has been deprecated; the Perl one has all the new features).

How much faster is it? Great question. My strategy for testing is to generate a sample markdown file which exercises all of the features present in both versions. We’ll write a script which makes copies of the sample file, each copy built by concatenating the sample with itself a number of times. In this way, we can simulate the time it takes to parse a markdown file which is 100, 1,000, 10,000, etc. lines long.

The sample

$ cat blog.md
#2019-01-14
##Markdown to html convertor

Started writing a markdown to html convertor using awk. 
The guts of the routine are quite simple so far,
the awk script parses the lines for '#' patterns.
The pattern must be found at the start of the line.
The script replaces this with the html <hx> and </hx>
Where x is the number of hashes the script found.

##TODO
Theres a few issues to address.
--The web doesnt respect my newlines.--
--We will need to inject those.--
This has been fixed!
If the last character is a '\' then we won't insert a break. \
See, like this!

The awk script needs to parse the markdown script for web special characters
like '<' and '>' for example and escape them (not sure how to escape yet)
Then start reading up on markdown codes for <b>bold</b>, <u>underline</u>, <i>italics</i>, etc.

It might be nice to have a ~~simple~~ css sheet **included** __as__ well, to pretty it up a bit.

The generated html is very flat.
Run it through a prettification routine after awk has made it. 

------

##Syntax
###Italics
Putting the \*word\* character around a word will write it in *italics*
To write an asterisk normally, like \*, prepend a \ character
>Remeber, that a slash at the end of a line is interpretted as \
>"don't insert a newline". If you want a \\
>At the end of a line, you need to type '\\'
>like this \\

Inline code example `$ echo "Hello, world!"`.

###Bold
Putting \*\*two asterisks\*\* around a word will make it **bold**

###Links
Check out this [link](20190118.md)!
external link: [google](http://google.com.au)

###Images
![Taiwan](taiwan.jpg)

Generating the test files

$ cat gentests.sh
#! /usr/bin/bash

# the file that we'll make the tests out of
basefile="blog.md"

# the directory to put the tests
# ignore error if the directory exists
mkdir tests 2>/dev/null

# clear all existing tests
# ignore errors if tests directory doesnt exist
echo "Clearing all .txt files from test/"
rm tests/*.txt 2>/dev/null

# make the test files
# i is a list of the number of times to repeat the concatenation
echo "Creating test files..."
for i in 1 10 20 50 100 200 500 1000 2000 5000 10000 20000 50000 100000 200000 300000 400000 500000 600000 700000 800000 900000
do
	# create (or truncate) the output file before appending
	> tests/$i.txt
	for (( c = 0; c < $i; c++ ))
	do
		cat "$basefile"
	done >> tests/$i.txt
	# redirecting once outside the loop opens the file a single time,
	# rather than reopening it on every iteration

	# let the user know what we're up to
	echo "lines: `wc -l tests/$i.txt`"
done

$ chmod +x gentests.sh
$ ./gentests.sh
Clearing all .txt files from tests/
Creating test files...
lines: 49 tests/1.txt
lines: 490 tests/10.txt
lines: 980 tests/20.txt
lines: 2450 tests/50.txt
lines: 4900 tests/100.txt
lines: 9800 tests/200.txt
lines: 24500 tests/500.txt
lines: 49000 tests/1000.txt
lines: 98000 tests/2000.txt
lines: 245000 tests/5000.txt
lines: 490000 tests/10000.txt
lines: 980000 tests/20000.txt
lines: 2450000 tests/50000.txt
lines: 4900000 tests/100000.txt
lines: 9800000 tests/200000.txt
lines: 14700000 tests/300000.txt
lines: 19600000 tests/400000.txt
lines: 24500000 tests/500000.txt

Great, we’ve made our test files. Be warned, this takes a very long time for the larger files.
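
As an aside: if the generation time ever becomes a problem, one alternative (a hedged sketch with a hypothetical genbig.sh, not what was used for these results) is to build a large file by repeated doubling, so that 2^k copies cost only k concatenations instead of 2^k:

$ cat genbig.sh
#! /usr/bin/bash
# hypothetical: build one large test file by repeated doubling
# 17 doublings yield 2^17 = 131072 copies of the base file; the
# trade-off is that only power-of-two multiples are available
cp blog.md tests/big.txt
for (( d = 0; d < 17; d++ ))
do
	cat tests/big.txt tests/big.txt > tests/tmp.txt
	mv tests/tmp.txt tests/big.txt
done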

Running the tests

$ cat runtests.sh
#! /usr/bin/bash
# runs the markdown parsers on the test files

# where the output will go
# ignore error if the directory exists
mkdir output 2>/dev/null

echo "clearing the results file"
echo "program,lines,seconds,file" > results.csv

progs=(markd.awk markd.pl)

for cm in "${progs[@]}"
do
	for testfile in tests/*.txt
	do
		echo "Running $cm on $testfile"
		fname=$(basename "$testfile")
		lines=$(wc -l "$testfile" | awk '{print $1}')
		
		# time how long it takes to run the program and output
		# to a file in directory output with the same name as the
		# test file. Redirect the result of time from stderr to stdout
		tm=$( (/usr/bin/time -f "%e" ./$cm "$testfile" > "output/$fname") 2>&1 | tr -d '\n')

		echo "$cm,$lines,$tm,$testfile >> results.csv
	done
done

$ chmod +x runtests.sh
$ ./runtests.sh

There’s a small gotcha in that bash script. Note that I used /usr/bin/time rather than just time. time is a shell keyword, not an external program, and it lacks the formatting capabilities of the time(1) program. I had to install the standalone time package for this to work.
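
As an aside, two tricks that I believe avoid hard-coding the path (worth verifying on your shell): anything that stops the word time from being parsed as the shell keyword makes bash fall back to the external binary.

# both of these should reach the external time(1) rather than the keyword
$ command time -f "%e" ./markd.pl tests/1.txt > /dev/null
$ \time -f "%e" ./markd.pl tests/1.txt > /dev/null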

The nice thing about this is that we can easily add more or different versions of parsers to the test suite.

Plotting the data

Ok, we have our results.csv file. To plot the data, I’m going to take a stab at using maxima. First problem: getting the data into maxima. A suggestion I found online is to store the data as a maxima batch file rather than as csv. The data should look something like this:

$ cat results.mac 
program: ["markd.awk","markd.awk","markd.awk","markd.awk","markd.awk","markd.awk","markd.awk","markd.awk","markd.awk","markd.awk","markd.awk","markd.awk","markd.awk","markd.awk","markd.awk","markd.awk","markd.awk","markd.awk","markd.awk","markd.pl","markd.pl","markd.pl","markd.pl","markd.pl","markd.pl","markd.pl","markd.pl","markd.pl","markd.pl","markd.pl","markd.pl","markd.pl","markd.pl","markd.pl","markd.pl","markd.pl","markd.pl","markd.pl"];
lines: [4900000,490000,49000,4900,490,49,9800000,980000,98000,9800,980,14700000,19600000,24500000,2450000,245000,24500,2450,22027754,4900000,490000,49000,4900,490,49,9800000,980000,98000,9800,980,14700000,19600000,24500000,2450000,245000,24500,2450,22027754];
seconds: [21.06,2.00,0.22,0.02,0.00,0.00,40.37,4.07,0.41,0.04,0.01,62.74,80.34,104.56,10.39,1.01,0.11,0.01,91.24,10.08,1.00,0.10,0.01,0.00,0.00,19.95,2.01,0.20,0.02,0.00,30.07,40.23,50.19,4.99,0.50,0.05,0.00,45.62];
file: ["tests/100000.txt","tests/10000.txt","tests/1000.txt","tests/100.txt","tests/10.txt","tests/1.txt","tests/200000.txt","tests/20000.txt","tests/2000.txt","tests/200.txt","tests/20.txt","tests/300000.txt","tests/400000.txt","tests/500000.txt","tests/50000.txt","tests/5000.txt","tests/500.txt","tests/50.txt","tests/700000.txt","tests/100000.txt","tests/10000.txt","tests/1000.txt","tests/100.txt","tests/10.txt","tests/1.txt","tests/200000.txt","tests/20000.txt","tests/2000.txt","tests/200.txt","tests/20.txt","tests/300000.txt","tests/400000.txt","tests/500000.txt","tests/50000.txt","tests/5000.txt","tests/500.txt","tests/50.txt","tests/700000.txt"];

I whipped up a quick awk script to do the conversion for us. First it looks for the field names on line 1 and stores them in the header array. Next it builds a two-dimensional array holding our data. Finally, it prints each header and its data in the correct format. The result is redirected to a file called results.mac and imported into maxima.

$ cat csv_to_maxima.awk
#! /usr/bin/awk -f

BEGIN {
	#input a comma separated file
	FS=",";

	# row number
	rn = 0;
}

# Assuming first line is headers
NR == 1 {
	# can only have as many fields as headers
	max_fields = NF;
	# loop through and save all the header names
	for(i = 1; i <= NF; i++ )
		header[i] = $i;
	next;
}

{
	++rn;
	for( i = 1; i <= NF; i++ ){
		# nested (true multidimensional) arrays are a gawk extension
		data[i][rn] = $i;
	}
}

END {
	for( i = 1; i <= max_fields; i++ ){
		print_row(header[i], data[i], rn);
	}
}

function print_row(title, arr, len,	strlike, j) {
	printf "%s: [", title;
	for( j = 1; j <= len; j++ ){
		if( j != 1 )
			printf ",";

		# quote anything that isn't purely numeric; reset the flag
		# each pass so a number following a string doesn't pick up
		# a stray closing quote
		strlike = 0;
		if( arr[j] ~ /[^0-9.]/ ){
			strlike = 1;
			printf "\"";
		}
		printf("%s", arr[j]);
		if( strlike )
			printf "\"";
	}
	print "];";
}

$ chmod +x csv_to_maxima.awk
$ ./csv_to_maxima.awk results.csv > results.mac

Nice. Now we can open the batch script in maxima. Tricky to plot though, because our data is all in a single series.

(%i166)	batch("/home/brandon/Documents/markd/results.mac")$
read and interpret file: /home/brandon/Documents/markd/results.mac
 <...snip...>
(%i174)	wxplot2d([discrete, lines, seconds], 
	    [x, 0, lmax(lines)], 
	    [y, 0, lmax(seconds)],
	    [style, points]
	);

I split it by using the sublist_indices and part functions, like so:

(%i177)	iawk: sublist_indices(program, lambda([x], x="markd.awk"));
        iperl: sublist_indices(program, lambda([x], x="markd.pl"));
(%o177)	[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19]
(%o178)	[20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38]

(%i181)	sawk: part(seconds, iawk);
        sperl: part(seconds, iperl);
(%o181)	[21.06,2.0,0.22,0.02,0.0,0.0,40.37,4.07,0.41,0.04,0.01,62.74,80.34,104.56,10.39,1.01,0.11,0.01,91.24]
(%o182)	[10.08,1.0,0.1,0.01,0.0,0.0,19.95,2.01,0.2,0.02,0.0,30.07,40.23,50.19,4.99,0.5,0.05,0.0,45.62]

(%i183)	lawk: part(lines, iawk);
	lperl: part(lines, iperl);
(%o183)	[4900000,490000,49000,4900,490,49,9800000,980000,98000,9800,980,14700000,19600000,24500000,2450000,245000,24500,2450,22027754]
(%o184)	[4900000,490000,49000,4900,490,49,9800000,980000,98000,9800,980,14700000,19600000,24500000,2450000,245000,24500,2450,22027754]

Which makes plotting as simple as:

wxplot2d([ [discrete, lawk, sawk], [discrete, lperl, sperl] ],
    [x, 0, lmax(lines)], 
    [y, 0, lmax(seconds)],
    [style, points],
    [legend, "awk", "perl"],
    [xlabel, "Lines in File"],
    [ylabel, "Time to Process (s)"],
    [title, "Markdown to HTML Parser Performance"]
);

Looking at the results, the Perl version of the parser is ~2x faster in every case large enough to measure: on the 24,500,000-line file, for instance, awk took 104.56 s against Perl’s 50.19 s.

Nice one.

Errata: Maxima is probably not the right tool to be using for plotting like this. It would be more sensible to use Python with pandas and matplotlib (oh my), or (dare I say it) Excel instead. An advantage of using maxima for this, though, is that now we’re ready to fit curves to the data. Maybe next time…

There are some nice concepts in this post. We automated testing and test file generation using bash. We created a small parser/interpreter for translating .csv to maxima batch files. And we plotted the results using maxima. awk was certainly a good tool for the job here. The conversion script is generic enough to be reused on a different dataset, mostly thanks to our use of nested arrays (which, by the way, are not portable: gawk only). The script could be extended further: it should let you specify whether headers are present (and decide what to do when they aren’t), it needs to be able to skip lines, it should probably be more explicit about the data structure it builds, and it might be nice to be able to choose the field separator. Not a bad prototype, though.

More errata: I messed up the gentests.sh script the first time, asking it to generate a file from 6,000,000 repetitions rather than 600,000. The resulting file was rather large and ate all my hard drive space (yup, not a lot of space on this hard drive) before throwing an error… Whoops.

Even more errata: who on earth would write 22,027,754 lines of markdown… That really is a lot. Considering that, ~46 seconds isn’t a bad time to process it. That said, there’s still scope for improvement. There are a lot of global substitutions in the Perl script which I expect are slowing it down. The next version should split each line into words and make substitutions on each word rather than on the whole line, which should stop us re-parsing the start of a line we’ve already handled (see the sketch below). The great news? Should we decide to implement it, we already have a performance testing process in place.
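
To give a flavour of that idea, here’s a hedged one-liner sketch (not the real markd.pl rules): perl’s -a switch splits each input line into words in @F, and the substitution then runs once per word. The obvious catch, which a real version would need to handle, is inline spans that contain spaces, like **two words**.

$ echo 'make **this** bold' | perl -ane 'print join(" ", map { s/\*\*(.+)\*\*/<b>$1<\/b>/; $_ } @F), "\n"'
make <b>this</b> bold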
