I wanted to process a Unicode CSV file and extract its first two columns into JSON. With awk it seemed easy enough:
awk '
BEGIN {
    FS = ";"    # fields are separated by semicolons
    print "["
}
{
    # one JSON object per input line, each followed by a comma
    print " { \"code\": \"" $1 "\", \"description\": \"" $2 "\" },"
}
END {
    print "]"
}
' UnicodeData.txt | less
This gives you ALMOST parsable output. The one thing that spoils it is the trailing comma after the last record, which makes the whole JSON invalid (although some lenient parsers will still load it). And no, there is no way to tell awk to treat the last line specially: lines are processed one by one, so at any given moment there is no telling whether the current line is the last.
What we can do is tell awk to process lines with a single-line delay:
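awk '
BEGIN {
    FS = ";"
    print "["
}
NR > 1 {
    # print the record saved from the previous line, with a comma
    print " { \"code\": \"" code "\", \"description\": \"" description "\" },"
}
{
    # remember the current line for the next iteration
    code = $1
    description = $2
}
END {
    # print the last saved record without a comma; skip it on empty input
    if (NR > 0)
        print " { \"code\": \"" code "\", \"description\": \"" description "\" }"
    print "]"
}
' UnicodeData.txt | less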
Printing starts at the second line (NR>1), and what gets printed each time is the record saved from the previous line. All the main block does is store the current fields in variables that will be read on the next iteration. Essentially, what we have is a single-line delay mechanism. To catch up with the last line, we print it without a trailing comma in the END portion of the program.
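For comparison, a common alternative (not the delay approach described above) is to write the comma before each record instead of after it, so nothing needs to be saved between lines. The sketch below is only an illustration: it assumes a small made-up sample in the same semicolon-separated layout instead of the real UnicodeData.txt, and the shell variable names sample and json are hypothetical.

```shell
# Hypothetical three-line sample in the same layout as UnicodeData.txt.
sample='0041;LATIN CAPITAL LETTER A;Lu
0042;LATIN CAPITAL LETTER B;Lu
0043;LATIN CAPITAL LETTER C;Lu'

json=$(printf '%s\n' "$sample" | awk '
BEGIN {
    FS = ";"
    print "["
}
{
    # Close the previous record with a comma before starting a new one.
    if (NR > 1) print ","
    # printf does not add a newline, so the record stays open for now.
    printf " { \"code\": \"%s\", \"description\": \"%s\" }", $1, $2
}
END {
    # Terminate the last record, if there was one, and close the array.
    if (NR > 0) print ""
    print "]"
}
')

printf '%s\n' "$json"
```

Both versions produce the same layout. The delay version keeps each print statement a complete line, while this one relies on printf leaving the line open until the next record or END decides how to finish it.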
Valid JSON at last.