From 98d77ecc96e9208cc6935cfd3c435bb455368a2d Mon Sep 17 00:00:00 2001
From: JustAnotherArchivist
Date: Tue, 22 Oct 2019 14:52:13 +0000
Subject: [PATCH] Deduplicate output

This uses mawk's extensions `-W interactive` and `delete array`; it
will probably work with certain other AWK implementations as well, but
for now it depends on mawk explicitly.
---
 wiki-recursive-extract-normalise | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/wiki-recursive-extract-normalise b/wiki-recursive-extract-normalise
index dd51fea..ca357c2 100755
--- a/wiki-recursive-extract-normalise
+++ b/wiki-recursive-extract-normalise
@@ -3,7 +3,7 @@
 # Everything that looks like a social media link (including YouTube) is run through social-media-extract-profile-link.
 # Everything else is run through website-extract-social-media.
 # This is done recursively until no new links are discovered anymore.
-# The output is further fed through url-normalise before and during processing to avoid equivalent but slightly different duplicates.
+# The output is further fed through url-normalise before and during processing to avoid equivalent but slightly different duplicates, and the output is deduplicated within each section at the end.
 
 verbose=
 while [[ $# -gt 0 ]]
@@ -80,4 +80,4 @@ do
 		done
 	done
 fi
-done
+done | mawk -W interactive '! /^\*/ { print; } /^\*/ && !seen[$0]++ { print; } /^==/ { delete seen; }'
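
Illustration of the new filter (a hypothetical sample, not part of the patch; the URLs are made up): the script's output uses `==`-prefixed section headings and `*`-prefixed link lines, which is what the patterns `/^==/` and `/^\*/` match. Piping

    == Example ==
    * https://example.org/profile
    * https://example.org/profile
    == Other ==
    * https://example.org/profile

through `mawk -W interactive '! /^\*/ { print; } /^\*/ && !seen[$0]++ { print; } /^==/ { delete seen; }'` prints

    == Example ==
    * https://example.org/profile
    == Other ==
    * https://example.org/profile

Duplicate `*` lines are dropped via the `seen` array, and `delete seen` clears that array at every heading, so deduplication is scoped to each section and the same link may still appear once per section. `-W interactive` makes mawk flush output line by line rather than block-buffering into the pipe.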