awk – markdown to html


Got myself a copy of The Rust Programming Language, the new 2018 edition.

[Image: The Rust Programming Language. Caption: is that a new keyboard???]

One of the exercises near the end of the book leads you to write your own multithreaded web server. The web server you end up with is very rudimentary, requiring you to hard-code each page that can be served. I’ve extended it somewhat to serve arbitrary files within the webroot. I’ve also included some (basic) mitigations against directory traversal attacks (webserver:port/../../../../../etc/passwd).

The server is (obviously) insecure and vulnerable to a plethora of attacks, so for now it is relegated to localhost only. (I’m getting off topic. We’re here to talk about awk.)

Problem 1: I don’t like writing html. I want to be able to write content, quickly, without having to think too much about <this> and </that>.

Enter markdown. This is a simple syntax for basic text formatting. Language features are summarised here. This is great because I can just zoom through writing, adding the occasional *this* or _that_. vim is actually clever enough to format on the fly in the terminal. (Nice one, vim.)

Problem 2: my roll-your-own multithreaded rust server doesn’t have a clue what markdown is. A user trying to open a file written in markdown just gets the text, not the pretty formatting.

Well that’s no good. The browser expects html. So why don’t we convert the markdown to html?

We could do this in rust (probably). But that sounds time expensive, and we already have the tools we need in awk! (Did I say awk or gawk? Note that the below makes use of the gensub function. “Generic substitute” is a gawk-only feature. Check yo’ version.) Let’s have a crack:

#! /usr/bin/awk -f

BEGIN {
	RS="\n";

	print "<!DOCTYPE html>"
	print "<html lang=\"en\">"
	print "<head><title>Markdown in HTML!</title></head>"
#	print "<link rel=\"stylesheet\" href=\"blogstyle.css\">"

	print "<body>"

	paraflag = 0;
}

We start off by printing out some html header stuff and initializing the paraflag variable to 0. You’ll note a commented-out line linking to a stylesheet. In the actual version of the script, this lets me make pretty pages.

[Image caption: Maybe not so pretty]

/^#/ {
	if( paraflag ){
		print "</p>";
		paraflag = 0;
	}
	
	num = split($0, tmp, "");	
	#count the number of hashes
	for( i=1; tmp[i] == "#"; i++){
		;
	}
	--i;
	sub(/^[#]* ?/, "<h"i">")

	print $0 "</h"i">"
	next;
}

In the next section, we match lines beginning with #. These are headers. We count the number of #’s and print out a string resembling <h1>Content here</h1>.
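The hash counting can also be done without splitting the line into characters. What follows is only a sketch, assuming a POSIX awk: match() sets RLENGTH to the length of the leading run of #, which is the heading level. The heading_line wrapper name and the clamp to h6 are my own additions for illustration, not part of markd.awk.

```sh
# heading_line: convert one markdown heading line to an HTML heading.
# Hypothetical helper for illustration; uses only POSIX awk features.
heading_line() {
    printf '%s\n' "$1" | awk '
    /^#/ {
        match($0, /^#+/)            # RLENGTH = number of leading hashes
        level = RLENGTH
        if (level > 6) level = 6    # HTML only defines h1 to h6 (assumed clamp)
        text = substr($0, RLENGTH + 1)
        sub(/^ +/, "", text)        # drop the optional space after the hashes
        print "<h" level ">" text "</h" level ">"
    }'
}

heading_line '## sub title 2'       # prints <h2>sub title 2</h2>
```

This avoids the per-character split and the empty for loop, at the cost of one extra substr call.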

{
	#protect the escaped characters
	gsub(/\\\*/, "\\&#42;");					#asterisk
	gsub(/\\\\$/, "\\&#92;");					#backslash
	gsub(/\\>/, "\\&gt;");						# >
	gsub(/\\\[/, "\\&#91;");					# [
	gsub(/\\\]/, "\\&#93;");					# ]
	gsub(/\\\(/, "\\&#40;");					# (
	gsub(/\\\)/, "\\&#41;");					# )

	#formatting sequences
	$0 = gensub(/\*\*(.*)\*\*/, "<b>\\1</b>", "g"); 			#Bold
	$0 = gensub(/\*(.*)\*/, "<i>\\1</i>", "g");				#italics
	$0 = gensub(/--([^-].*[^-])--/, "<strike>\\1</strike>", "g");		#strikethrough 


	#images
	$0 = gensub(/!\[([^\]]+)\]\(([^\)]+)\)/, "<img src=\"\\2\" alt=\"\\1\">", "g");
	#links
	$0 = gensub(/\[([^\]]+)\]\(([^\)]+)\)/, "<a href=\"\\2\">\\1</a>", "g");

	if( ! blockflag && /^>/){
		#$0 = "<blockquote>\n"$0
		print("<blockquote>");
		blockflag = 1;
	} else
	if( blockflag && !/^>/ ){
		$0 = "</blockquote>\n"$0;
		blockflag = 0;
	}
	gsub(/^>/, "");

	if( ! paraflag ){
		$0 = "<p>\n"$0;
		paraflag = 1;
	}
	if( paraflag ){
		if( !/\\$/ )			#put breaks inside paragraphs
			$0 = $0"<br>";
		else				#but not if theres a \
			gsub(/\\$/, "");
	}
	print $0;
}

The meat of the interpreter lives in the ‘every line’ block. We start by protecting characters which have been escaped by a backslash, replacing them with their html character codes so the later formatting rules leave them alone.
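One detail worth spelling out: in the replacement string of gsub, a bare & means "the matched text", so an html entity has to be written with an escaped ampersand. A minimal sketch of the protection step for two of the characters, assuming a POSIX awk (the protect_escapes wrapper is a hypothetical name for illustration):

```sh
# protect_escapes: turn backslash-escaped markdown characters into HTML entities.
# Hypothetical helper; only the asterisk and opening bracket are shown.
protect_escapes() {
    printf '%s\n' "$1" | awk '{
        # "\\&" in the awk source is a literal ampersand in the output
        gsub(/\\\*/, "\\&#42;")     # backslash + asterisk becomes &#42;
        gsub(/\\\[/, "\\&#91;")     # backslash + [ becomes &#91;
        print
    }'
}

protect_escapes 'a \*literal\* star'    # prints a &#42;literal&#42; star
```

Because the escaped characters no longer appear as * or [ in the line, the bold, italics and link rules that run afterwards cannot match them.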

Next we look for formatting sequences, such as **make this bold** and substitute in the correct html (for example, <b>make this bold</b>).
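One thing to watch with patterns like \*\*(.*)\*\* is that .* is greedy, so a line containing two bold spans is matched from the first ** to the last one. The sketch below shows one possible way around that, assuming a POSIX awk: the inner text is restricted to non-asterisk characters and each span is replaced in a loop. The emphasis and wrap names are hypothetical, and this is not the approach markd.awk takes.

```sh
# emphasis: convert **bold** and *italic* spans in one line of input.
# Hypothetical helper for illustration; no gensub required.
emphasis() {
    printf '%s\n' "$1" | awk '
    # wrap(s, re, n, tag): wrap every match of re in <tag>, stripping n marker
    # characters from each end of the match. out is a local variable.
    function wrap(s, re, n, tag,    out) {
        out = ""
        while (match(s, re)) {
            out = out substr(s, 1, RSTART - 1) "<" tag ">"
            out = out substr(s, RSTART + n, RLENGTH - 2 * n) "</" tag ">"
            s = substr(s, RSTART + RLENGTH)
        }
        return out s
    }
    {
        $0 = wrap($0, "\\*\\*[^*]+\\*\\*", 2, "b")   # bold first
        $0 = wrap($0, "\\*[^*]+\\*", 1, "i")         # then italics
        print
    }'
}

emphasis 'a **b** c **d** and *e*'   # prints a <b>b</b> c <b>d</b> and <i>e</i>
```

Bold has to run before italics, otherwise the single-asterisk rule would consume the double asterisks.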

Once we’ve dealt with formatting, we move on to links and images. The conversion goes from [Link Text](http://link.com/location) to <a href="http://link.com/location">Link Text</a>. The conversion is similar for images, except the source format is ![Image Text](http://image.com/link) and the result is an <img> tag.
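For awks without gensub, the same conversion can be approximated with match, substr and index. This is a rough sketch under that assumption, not the script's actual method; links_and_images is a hypothetical name, and the link text is assumed to contain no ] and the URL no ).

```sh
# links_and_images: rewrite [text](url) and ![text](url) in one line of input.
# Hypothetical helper; uses only POSIX awk features.
links_and_images() {
    printf '%s\n' "$1" | awk '{
        out = ""
        while (match($0, /!?\[[^]]+\]\([^)]+\)/)) {
            start = RSTART; len = RLENGTH
            m = substr($0, start, len)
            is_img = (substr(m, 1, 1) == "!")
            if (is_img) m = substr(m, 2)      # drop the leading !
            p = index(m, "](")                # boundary between text and url
            text = substr(m, 2, p - 2)
            url = substr(m, p + 2, length(m) - p - 2)
            if (is_img)
                tag = "<img src=\"" url "\" alt=\"" text "\">"
            else
                tag = "<a href=\"" url "\">" text "</a>"
            out = out substr($0, 1, start - 1) tag
            $0 = substr($0, start + len)      # continue after this match
        }
        print out $0
    }'
}

links_and_images '[Link Text](http://link.com/location)'
# prints <a href="http://link.com/location">Link Text</a>
```

Handling images and links in the same loop means the image rule cannot leave behind a stray ! for the link rule to trip over.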

Next we handle the block symbol >, which is used to create <blockquote> elements.
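The blockquote handling is a small two-state machine: open the tag on the first > line, close it on the first line without one. Here it is as a standalone sketch, assuming a POSIX awk. The blockquotes function and the inquote variable are names made up for this example, and the END rule is an addition that closes a quote still open at the end of the input.

```sh
# blockquotes: wrap runs of lines starting with > in <blockquote> tags.
# Hypothetical standalone filter; reads markdown on stdin.
blockquotes() {
    awk '
    /^>/ {
        if (!inquote) { print "<blockquote>"; inquote = 1 }
        sub(/^> ?/, "")             # strip the marker and one optional space
        print
        next
    }
    {
        if (inquote) { print "</blockquote>"; inquote = 0 }
        print
    }
    END { if (inquote) print "</blockquote>" }
    '
}

printf '> a block quote\n> not bad\n' | blockquotes
```

Run on the two quoted lines above, this prints the opening tag, the two lines without their markers, and the closing tag.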

Finally, we insert paragraph <p> symbols all over the place. Now that we’ve made all the necessary in-place modifications, we print the string with print $0 and move into END {}.

END {
	if( paraflag ) {
		print "</p>";
		paraflag = 0;
	}
	print "</body>"
	print "</html>"
}

The code finishes by printing a trailing </p> if required, and closes off the body and html tags.

That’s it! Save all that code in a file called markd.awk (get the whole code at the bottom of this post). Running the final code on a markdown file looks like this:

$ cat in.md
#Header 1
## sub title 2
### this is my title 3

*italics*
**bold**

--cross out--

> a block quote
> not bad
> [Link in a quote](https://faeredia.com)

#### End with a picture
![Picture from the Interwebs](https://images.unsplash.com/photo-1533709752211-118fcaf03312?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=500&q=60)
*image by [@markusspiske](https://unsplash.com/@markusspiske)*

*__EOF*

$ chmod +x markd.awk
$ ./markd.awk in.md > out.html
$ firefox out.html

Use whatever browser you want to view it. (Note that the image and the author’s name link to an external site.)

Nice one! To finish up, my rust web server inspects incoming requests for the .md extension. The webserver then runs our little awk markdown parser, and serves the result. Building the markdown parser into the server like this allows us to skip the step of manually regenerating the html files after every change to the markdown file.

“But wait!” I can hear you say. “Surely it is slow for the server to rebuild the html for *every* request.” Yes and no. We can see that for small markdown files like the above sample, our awk script is quite efficient.

$ time ./markd.awk in.md > out.html

real	0m0.008s
user	0m0.008s
sys	0m0.000s

For larger files?

$ for i in {1..10000}; do cat quick.md >> big.md; done
$ wc -l big.md
216198 big.md
$ time ./markd.awk big.md > out.html

real	0m1.477s
user	0m1.445s
sys	0m0.030s

So for a file of ~200,000 lines it takes 1.477s. Sure, that’s starting to get on the slow side.

A feature I’ve got planned for the rust web server is caching. Rather than perform the markdown to html conversion each time, it should save the output in the server cache. On the next request, the server should compare the timestamp of the cached version to the timestamp of the markdown source. If the cached file is out of date, the conversion is performed again. Otherwise, we avoid the 1.5s delay and serve up the cached version immediately.
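The timestamp check itself is simple enough to sketch in shell, even though the real thing would live inside the rust server. Everything here is an assumption for illustration: the serve_markdown name, the idea of keeping the cache as a file next to the source, and a shell whose test builtin supports -nt (bash and dash do).

```sh
# serve_markdown SRC CACHE CONVERTER [ARGS...]
# Hypothetical helper: print the cached HTML for SRC, regenerating it with
# CONVERTER only when the cache is missing or older than the source.
serve_markdown() {
    src=$1
    cache=$2
    shift 2
    if [ ! -e "$cache" ] || [ "$src" -nt "$cache" ]; then
        # write to a temporary file first so a failed run never
        # replaces a good cache with a partial one
        "$@" "$src" > "$cache.tmp" && mv "$cache.tmp" "$cache"
    fi
    cat "$cache"
}

# Example (not run here): serve_markdown in.md out.html ./markd.awk
```

The first request pays for the conversion; later requests only cost a file read until the markdown source is modified again.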

There is also room for optimization in the awk script itself. I expect that the constant operation on $0 is not particularly speedy. I would also consider replacing the gensub calls with a more portable solution and switching over to mawk, which *should* give a speed bonus. Or perhaps a perl script. Or consider doing it all in rust with the regex crate.

Anyway, that’s it for now. Nice one. Whole code for markd.awk is below.

#! /usr/bin/awk -f

BEGIN {
	RS="\n";

	print "<!DOCTYPE html>"
	print "<html lang=\"en\">"
	print "<head><title>Markdown in HTML!</title></head>"
#	print "<link rel=\"stylesheet\" href=\"blogstyle.css\">"

	print "<body>"

	paraflag = 0;
}

/^#/ {
	if( paraflag ){
		print "</p>";
		paraflag = 0;
	}
	
	num = split($0, tmp, "");	
	#count the number of hashes
	for( i=1; tmp[i] == "#"; i++){
		;
	}
	--i;
	sub(/^[#]* ?/, "<h"i">")

	print $0 "</h"i">"
	next;
}

{
	#protect the escaped characters
	gsub(/\\\*/, "\\&#42;");					#asterisk
	gsub(/\\\\$/, "\\&#92;");					#backslash
	gsub(/\\>/, "\\&gt;");						# >
	gsub(/\\\[/, "\\&#91;");					# [
	gsub(/\\\]/, "\\&#93;");					# ]
	gsub(/\\\(/, "\\&#40;");					# (
	gsub(/\\\)/, "\\&#41;");					# )

	#formatting sequences
	$0 = gensub(/\*\*(.*)\*\*/, "<b>\\1</b>", "g"); 			#Bold
	$0 = gensub(/\*(.*)\*/, "<i>\\1</i>", "g");				#italics
	$0 = gensub(/--([^-].*[^-])--/, "<strike>\\1</strike>", "g");		#strikethrough 


	#images
	$0 = gensub(/!\[([^\]]+)\]\(([^\)]+)\)/, "<img src=\"\\2\" alt=\"\\1\">", "g");
	#links
	$0 = gensub(/\[([^\]]+)\]\(([^\)]+)\)/, "<a href=\"\\2\">\\1</a>", "g");

	if( ! blockflag && /^>/){
		#$0 = "<blockquote>\n"$0
		print("<blockquote>");
		blockflag = 1;
	} else
	if( blockflag && !/^>/ ){
		$0 = "</blockquote>\n"$0;
		blockflag = 0;
	}
	gsub(/^>/, "");

	if( ! paraflag ){
		$0 = "<p>\n"$0;
		paraflag = 1;
	}
	if( paraflag ){
		if( !/\\$/ )			#put breaks inside paragraphs
			$0 = $0"<br>";
		else				#but not if theres a \
			gsub(/\\$/, "");
	}
	print $0;
}

END {
	if( paraflag ) {
		print "</p>";
		paraflag = 0;
	}
	print "</body>"
	print "</html>"
}
