When it comes to command line text processing, from an abstract point of view, there are three major pillars

Date	09.03.2023

Learn GNU AWK

Further Reading

Faster execution

Changing the locale to ASCII (assuming the current locale is not ASCII and the input file has only ASCII characters) can give a significant speed boost. Using mawk is another way to speed up execution, provided you are not using GNU awk specific features. Among other feature differences, mawk doesn't support the {} form of quantifiers; see unix.stackexchange: How to specify regex quantifiers with mawk? for details. See also wikipedia: awk Versions and implementations.
$ # time shown is best result from multiple runs
$ # speed benefit will vary depending on computing resources, input, etc
$ # /usr/share/dict/words contains dictionary words, one word per line
$ time awk '/^([a-d][r-z]){3}$/' /usr/share/dict/words > f1
real    0m0.029s

$ time LC_ALL=C awk '/^([a-d][r-z]){3}$/' /usr/share/dict/words > f2
real    0m0.022s

$ time mawk '/^[a-d][r-z][a-d][r-z][a-d][r-z]$/' /usr/share/dict/words > f3
real    0m0.009s

$ # check that the results are same
$ diff -s f1 f2
Files f1 and f2 are identical
$ diff -s f2 f3
Files f2 and f3 are identical

$ # clean up temporary files
$ rm f[123]
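Since support for the {} form of quantifiers differs between implementations, a script that has to run under whichever awk is installed could probe for it first. The following is only a minimal sketch under stated assumptions: the file name sample.txt, the four sample words and the variable names re and result are made up for illustration and are not from the book.

```shell
# hypothetical sample input, not the dictionary file used in the timings
printf '%s\n' ayatas bzcsdw banana arbs > sample.txt

# probe: does this awk treat {3} as a quantifier?
# the END block exits with status 0 only if 'aaa' matched /^a{3}$/
if printf 'aaa\n' | awk '/^a{3}$/{found=1} END{exit !found}'; then
    re='^([a-d][r-z]){3}$'
else
    # fall back to the manually expanded form
    re='^[a-d][r-z][a-d][r-z][a-d][r-z]$'
fi

# use the chosen regexp as a dynamic regexp
result=$(awk -v re="$re" '$0 ~ re' sample.txt)
echo "$result"

# clean up temporary file
rm sample.txt
```

Either branch should select the same lines, here ayatas and bzcsdw, since both regexps describe three repetitions of a character from a-d followed by a character from r-z.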
Here's another example.
$ # count words containing exactly 3 lowercase 'a'
$ time awk -F'a' 'NF==4{cnt++} END{print +cnt}' /usr/share/dict/words
1102
real    0m0.034s

$ time LC_ALL=C awk -F'a' 'NF==4{cnt++} END{print +cnt}' /usr/share/dict/words
1102
real    0m0.023s

$ time mawk -F'a' 'NF==4{cnt++} END{print +cnt}' /usr/share/dict/words
1102
real    0m0.014s
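Before relying on the faster variant, it may be worth confirming on a small input that changing the locale does not change the result. The sketch below assumes a made-up word list (sample_words.txt and its contents are illustrative, not from the book) and reuses the counting program from the timings above.

```shell
# hypothetical sample input with ASCII-only words
printf '%s\n' banana alabama apple bahamas cat > sample_words.txt

# with 'a' as the field separator, a line with exactly 3 'a' has 4 fields
count_default=$(awk -F'a' 'NF==4{cnt++} END{print +cnt}' sample_words.txt)

# same program, with the locale set to ASCII for this one command only
count_c=$(LC_ALL=C awk -F'a' 'NF==4{cnt++} END{print +cnt}' sample_words.txt)

echo "default locale: $count_default"
echo "LC_ALL=C:       $count_c"

# clean up temporary file
rm sample_words.txt
```

Both counts should be 2 for this input (banana and bahamas), because alabama has four 'a' characters and therefore five fields. Setting LC_ALL=C as a prefix to the command, rather than exporting it, keeps the change limited to that single awk invocation.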

man awk, info awk and the online manual
Information about various implementations of awk
- awk FAQ — great resource, but last modified 23 May 2002
- grymoire: awk tutorial — covers information about different awk versions as well
- cheat sheet for awk/nawk/gawk
Q&A on stackoverflow/stackexchange are a good source of learning material and practice exercises
- awk Q&A on unix.stackexchange
- awk Q&A on stackoverflow
Learn Regular Expressions (has information on flavors other than POSIX too)
- regular-expressions — tutorials and tools
- rexegg — tutorials, tricks and more
- stackoverflow: What does this regex mean?
- online regex tester and debugger — not fully suitable for cli tools, but most of the POSIX syntax works
My repo on cli text processing tools
Related tools
- GNU datamash
- bioawk
- frawk — an efficient awk-like language, implemented in Rust
- hawk — based on Haskell
- miller — similar to awk/sed/cut/join/sort for name-indexed data such as CSV, TSV, and tabular JSON
  - See this news.ycombinator discussion for other tools like this
Miscellaneous
- unix.stackexchange: When to use grep, sed, awk, perl, etc
- awk-libs — lots of useful functions
- awkaster — Pseudo-3D shooter written completely in awk
- awk REPL — live editor on browser
ASCII reference and locale usage
- ASCII code table
- wiki.archlinux: locale
- shellhacks: Define Locale and Language Settings
Examples for some of the topics not covered in this book
- unix.stackexchange: rand/srand
- unix.stackexchange: strftime
- stackoverflow: arbitrary precision integer extension
- stackoverflow: recognizing hexadecimal numbers
- unix.stackexchange: sprintf and file close
- unix.stackexchange: user defined functions and array passing
- unix.stackexchange: rename csv files based on number of fields in header row
