Learn GNU AWK

Whole line duplicates


awk '!a[$0]++' is one of the most famous awk one-liners. It eliminates line-based duplicates while retaining the input order. The following example shows it in action, along with an illustration of how the logic works.
$ cat purchases.txt
coffee
tea
washing powder
coffee
toothpaste
tea
soap
tea

$ awk '{print +a[$0] "\t" $0; a[$0]++}' purchases.txt
0       coffee
0       tea
0       washing powder
1       coffee
0       toothpaste
1       tea
0       soap
2       tea

$ # only those entries with zero in first column will be retained
$ awk '!a[$0]++' purchases.txt
coffee
tea
washing powder
toothpaste
soap
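The one-liner packs the membership test and the increment into a single expression. If the condensed form feels cryptic, here's an equivalent long-form sketch (my addition, not from the book) that spells out the same logic:

$ # expanded version of !a[$0]++: print if unseen, then bump the count
$ awk '{ if (a[$0] == 0) print $0; a[$0]++ }' purchases.txt
coffee
tea
washing powder
toothpaste
soap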

Column-wise duplicates


Removing field-based duplicates is simple for a single field comparison. Just change $0 to the required field number after setting the appropriate field separator.
$ cat duplicates.txt
brown,toy,bread,42
dark red,ruby,rose,111
blue,ruby,water,333
dark red,sky,rose,555
yellow,toy,flower,333
white,sky,bread,111
light red,purse,rose,333

$ # based on last field
$ awk -F, '!seen[$NF]++' duplicates.txt
brown,toy,bread,42
dark red,ruby,rose,111
blue,ruby,water,333
dark red,sky,rose,555
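As noted above, only the field number needs changing. For instance, an illustrative run (my addition) deduplicating on the 2nd field:

$ # based on 2nd field
$ awk -F, '!seen[$2]++' duplicates.txt
brown,toy,bread,42
dark red,ruby,rose,111
dark red,sky,rose,555
light red,purse,rose,333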
For multiple field comparisons, separate the field numbers with , so that SUBSEP is used to combine the field values to generate the key. As mentioned before, SUBSEP defaults to the non-printing character \034, which is typically not used in text files.
$ # based on first and third field
$ awk -F, '!seen[$1,$3]++' duplicates.txt
brown,toy,bread,42
dark red,ruby,rose,111
blue,ruby,water,333
yellow,toy,flower,333
white,sky,bread,111
light red,purse,rose,333
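To make the role of SUBSEP visible, here's a sketch (not part of the original) that builds the same key explicitly instead of relying on the , syntax inside the array subscript:

$ # same result, with the SUBSEP-joined key constructed by hand
$ awk -F, '{ k = $1 SUBSEP $3 } !seen[k]++' duplicates.txt
brown,toy,bread,42
dark red,ruby,rose,111
blue,ruby,water,333
yellow,toy,flower,333
white,sky,bread,111
light red,purse,rose,333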

Duplicate count


In this section, the number of times a record has been seen determines the output.
First up, printing only a specific numbered duplicate.
$ # print only the second occurrence of duplicates based on 2nd field
$ awk -F, '++seen[$2]==2' duplicates.txt
blue,ruby,water,333
yellow,toy,flower,333
white,sky,bread,111

$ # print only the third occurrence of duplicates based on last field
$ awk -F, '++seen[$NF]==3' duplicates.txt
light red,purse,rose,333
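If it helps to see why those particular lines match, this illustrative sketch (my addition) prints the running occurrence count for the 2nd field alongside each line:

$ # running occurrence count per 2nd field
$ awk -F, '{print ++seen[$2] "\t" $0}' duplicates.txt
1       brown,toy,bread,42
1       dark red,ruby,rose,111
2       blue,ruby,water,333
1       dark red,sky,rose,555
2       yellow,toy,flower,333
2       white,sky,bread,111
1       light red,purse,rose,333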
Next, printing only the last copy of each duplicate. Since the count isn't known in advance, the tac command comes in handy again.
$ # reverse the input line-wise, retain first copy and then reverse again
$ tac duplicates.txt | awk -F, '!seen[$NF]++' | tac
brown,toy,bread,42
dark red,sky,rose,555
white,sky,bread,111
light red,purse,rose,333
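If tac isn't available, a two-pass awk sketch (my assumption, not from the book) achieves the same result: the first pass records the line number of the last occurrence for each key, and the second pass prints only those lines.

$ # first pass: remember last occurrence per key; second pass: print it
$ awk -F, 'NR==FNR{last[$NF]=FNR; next} FNR==last[$NF]' duplicates.txt duplicates.txt
brown,toy,bread,42
dark red,sky,rose,555
white,sky,bread,111
light red,purse,rose,333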
To get all the records based on a duplicate count, you can pass the input file twice and then use the two-file processing trick to make decisions.
$ # all duplicates based on last column
$ awk -F, 'NR==FNR{a[$NF]++; next} a[$NF]>1' duplicates.txt duplicates.txt
dark red,ruby,rose,111
blue,ruby,water,333
yellow,toy,flower,333
white,sky,bread,111
light red,purse,rose,333

$ # all duplicates based on last column, minimum 3 duplicates
$ awk -F, 'NR==FNR{a[$NF]++; next} a[$NF]>2' duplicates.txt duplicates.txt
blue,ruby,water,333
yellow,toy,flower,333
light red,purse,rose,333

$ # only unique lines based on 3rd column
$ awk -F, 'NR==FNR{a[$3]++; next} a[$3]==1' duplicates.txt duplicates.txt
blue,ruby,water,333
yellow,toy,flower,333
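When the input comes from a pipe and can't be read twice, a single-pass sketch (my addition; it buffers the whole input in memory, so it assumes the input fits) gets the same answer for the first case:

$ # buffer lines and keys, then print lines whose key occurred more than once
$ awk -F, '{cnt[$NF]++; line[NR]=$0; key[NR]=$NF}
       END{for (i=1; i<=NR; i++) if (cnt[key[i]]>1) print line[i]}' duplicates.txt
dark red,ruby,rose,111
blue,ruby,water,333
yellow,toy,flower,333
white,sky,bread,111
light red,purse,rose,333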
