How do I remove similar instances of lines using Unix commands?

Problem :

I have a file that contains lines that look like the following:

14|geauxtigers|90
14|geauxtigers|null

I want to remove all lines in the file that have null as the last term. Is there a way to do this with Unix commands?

I was going to read the file in with Java, look at adjacent lines, and remove a line whose third term is null when an adjacent line has the same first two terms. Is there a way to do this with Unix tools?

Edit: I don’t want to blindly remove every line with null as the third term. I might have the following entry:
15|lsu|null
I’d like to keep it, since it is the only entry with those first two terms. It’s just that, if there is another line with the same first two terms and a non-null third term, I want to keep only the non-null line.
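To make the goal concrete: given the sample lines above together with 15|lsu|null, the desired output would be

14|geauxtigers|90
15|lsu|null

since 14|geauxtigers|null has a non-null line with the same first two terms, while 15|lsu|null does not.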

Solution :

I would like to add one more answer, using awk:

awk -F'|' '{if($3!="null"){a=$1;b=$2;print}else{if(a!=$1 || b!=$2)print}}' yourFile
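
The same program, laid out with comments (a sketch of the same logic; like the one-liner, it assumes a null line never appears before its non-null partner, since a and b only hold the key of the most recent non-null line):

awk -F'|' '
$3 != "null" {           # non-null line: remember its key and print it
    a = $1; b = $2
    print
    next
}
a != $1 || b != $2 {     # null line whose key differs from the last non-null key: keep it
    print
}
' yourFile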

test

kent$  echo "14|geauxtigers|90
14|geauxtigers|null
foo|bar|blah
x|y|z
x|y|null"|awk -F'|' '{if($3!="null"){a=$1;b=$2;print}else{if(a!=$1 || b!=$2)print}}'    
14|geauxtigers|90
foo|bar|blah
x|y|z

If you simply want to drop every line whose last field is null (note this ignores the edit about keeping a null line that has no non-null partner), a plain grep is enough:

grep -v '|null$' yourfile.txt > filtered.txt
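
For example, with the sample data this also drops 15|lsu|null, which the edit says should be kept:

echo "14|geauxtigers|90
14|geauxtigers|null
15|lsu|null" | grep -v '|null$'
14|geauxtigers|90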

Assuming lines can come in any order and the result should be ordered numerically on the first field, here’s a Perl solution:

echo -e "2|asd|null
11|bla|asd
14|geauxtigers|90
2|asd|2
15|lsu|null
14|geauxtigers|null" | perl -e '
while (<>) {
  $line = $_;
  s@\|[^|]*$@@;            # strip the last field: the key is the first two fields
  # prefer a non-null line for each key: only skip storing when we already
  # have a line for this key and the current line ends in null
  $hash{$_} = $line unless exists $hash{$_} && $line =~ /\|null$/;
}
for $key (sort {$a<=>$b} keys %hash) {
  print $hash{$key}
}'
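
With that input, the output should be:

2|asd|2
11|bla|asd
14|geauxtigers|90
15|lsu|null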

Assuming the lines might appear in any order, scan the file twice, first collecting the non-null lines. I assume the “key” is the first two columns:

awk -F '|' '
  NR == FNR { if ($NF != "null") notnull[$1 FS $2]; next }   # pass 1: record keys that have a non-null line
  $NF == "null" && ($1 FS $2) in notnull { next }            # pass 2: skip null lines whose key has a non-null line
  { print }
' filename filename > file.nonulls
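
A quick check with the null line deliberately placed before its non-null partner (testfile here is just a hypothetical scratch file):

printf '%s\n' '14|geauxtigers|null' '15|lsu|null' '14|geauxtigers|90' > testfile
awk -F '|' '
  NR == FNR { if ($NF != "null") notnull[$1 FS $2]; next }
  $NF == "null" && ($1 FS $2) in notnull { next }
  { print }
' testfile testfile
15|lsu|null
14|geauxtigers|90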

If the null line always follows its non-null partner, a single pass is enough:

awk -F '|' '
  $NF != "null" { seen[$1 FS $2] }                 # remember keys that have a non-null line
  $NF == "null" && ($1 FS $2) in seen { next }     # skip a null line whose key was already seen
  { print }
' filename > file.nonulls
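
Run against the three sample lines from the question (sample.txt is a hypothetical file holding them), this should keep the lone 15|lsu|null:

printf '%s\n' '14|geauxtigers|90' '14|geauxtigers|null' '15|lsu|null' > sample.txt
awk -F '|' '
  $NF != "null" { seen[$1 FS $2] }
  $NF == "null" && ($1 FS $2) in seen { next }
  { print }
' sample.txt
14|geauxtigers|90
15|lsu|null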