JustAnotherArchivist
256a94443e
Fix deduplication within each section processing
4 years ago
JustAnotherArchivist
98d77ecc96
Deduplicate output
This uses mawk's extensions `-W interactive` and `delete array`; it will probably work with certain other AWK implementations as well, but for now it depends on mawk explicitly.
4 years ago
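
The deduplication described in the commit above can be illustrated with a minimal Python sketch. It is not the script from the commit, which uses mawk; the function name `dedupe_sections` and the `== ` section-marker convention are hypothetical and chosen only for illustration. Clearing the set at each section start plays the role of `delete array`, and yielding lines one at a time mirrors the streaming behaviour that `-W interactive` provides.

```python
from typing import Callable, Iterable, Iterator


def dedupe_sections(
    lines: Iterable[str],
    # Hypothetical convention: a line starting with "== " opens a new section.
    is_section_start: Callable[[str], bool] = lambda line: line.startswith("== "),
) -> Iterator[str]:
    """Yield lines in order, dropping repeats within each section."""
    seen = set()
    for line in lines:
        if is_section_start(line):
            # Forget everything seen so far, like `delete array` in mawk.
            seen.clear()
            yield line
            continue
        if line not in seen:
            seen.add(line)
            # Yield immediately so the consumer sees output as it arrives.
            yield line
```

To use it as a filter, iterate over `sys.stdin`, write each yielded line to `sys.stdout`, and flush after every write so that downstream processes are not held up by buffering.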
JustAnotherArchivist
6ce64baf87
Remove redundant url-normalise after the extraction
Since all input is run through url-normalise before processing and all output of website and social media extraction is also normalised, it's not necessary to re-normalise again at the end.
4 years ago
JustAnotherArchivist
318183148e
Fix URL extraction from Facebook profile overview pages
4 years ago
JustAnotherArchivist
869ade27eb
Separate names in stderr annotations for the various url-normalise processes
4 years ago
JustAnotherArchivist
79f0bd4332
Normalise URLs everywhere to reduce duplicates
4 years ago
JustAnotherArchivist
dc4efcfbfb
One URL normalisation script to rule them all
Consolidate social media profile, YouTube, and (new) generic web page URL normalisation into one script
4 years ago
JustAnotherArchivist
0f13a1fadd
Add verbosity options, and annotate stderr on wiki-recursive-extract
4 years ago
JustAnotherArchivist
3ec816cd04
Add script for link extraction from social media profiles
4 years ago
JustAnotherArchivist
5285c406d9
Add script for recursive website and social media discovery
4 years ago
JustAnotherArchivist
2be9ca922e
Ignore more useless Facebook links
4 years ago
JustAnotherArchivist
c3b0e5543e
Add support for facebook.com/pg/something
4 years ago
JustAnotherArchivist
7c389f1fef
Add support for hashbang fragments on Twitter links
4 years ago
JustAnotherArchivist
c56736bc4a
Ignore /intent on Twitter
4 years ago
JustAnotherArchivist
4f34753788
Add support for Instagram posts and ignore spurious links from the CDN
4 years ago
JustAnotherArchivist
ad030f5d21
Add support for Facebook pages and groups
4 years ago
JustAnotherArchivist
cd0b3f6214
Ignore /vi/* on YouTube (video thumbnails)
4 years ago
JustAnotherArchivist
6f1cca73ad
Support hashtags
4 years ago
JustAnotherArchivist
c61efa03f0
Make social media normalisation script snscrape-independent
4 years ago
JustAnotherArchivist
e6008eb971
Add script for automatic social media discovery
4 years ago
JustAnotherArchivist
fed66542fa
Support python3 in any directory instead of just /usr/bin
4 years ago
JustAnotherArchivist
5982e131a4
Stop gracefully when encountering a SIGPIPE
4 years ago
JustAnotherArchivist
c13a1150df
Add support for WARC/1.1
4 years ago
JustAnotherArchivist
376cde7b8c
Fix broken block digest calculation on malformed HTTP responses
4 years ago
JustAnotherArchivist
b121cbd958
Write all log messages to stderr
4 years ago
JustAnotherArchivist
ed1270d988
Add support for upper-cased chunk lengths
4 years ago
JustAnotherArchivist
d4826abde2
Add record ID to log messages
4 years ago
JustAnotherArchivist
4925a912c0
Add youtube-filter-autogen-channels
4 years ago
JustAnotherArchivist
9b8f223776
Add wiki-sections-sort
4 years ago
JustAnotherArchivist
552a4147c2
Fix not returning complete body for non-chunked responses
Leftover from debugging
4 years ago
JustAnotherArchivist
0dc0de6b50
Add support for lists
4 years ago
JustAnotherArchivist
9d344df8c6
+x
4 years ago
JustAnotherArchivist
f6a7cbfc70
Fix --with-list-urls help message
4 years ago
JustAnotherArchivist
9743aa7c35
Add s3-bucket-list
4 years ago
JustAnotherArchivist
91adce786f
Add YouTube normalisation script
4 years ago
JustAnotherArchivist
5ca90c3b7d
Update tmux session commands
4 years ago
JustAnotherArchivist
679923d37d
Add support for Twitter hashtag extraction
4 years ago
JustAnotherArchivist
663383830c
Add support for lists
4 years ago
JustAnotherArchivist
d85d142def
Handle parameters on Twitter URLs
5 years ago
JustAnotherArchivist
5984565417
Handle Twitter URLs with trailing slash
5 years ago
JustAnotherArchivist
8647ccaa8f
Support subdomain-less Facebook URLs
5 years ago
JustAnotherArchivist
66ec0c93c4
Handle more Facebook URLs
5 years ago
JustAnotherArchivist
baa8a566bd
Add script for scraping MEP links from europarl.europa.eu
5 years ago
JustAnotherArchivist
c2413b2c4f
Add ArchiveBot wiki list helper
5 years ago
JustAnotherArchivist
72818019bc
Extract external links from Twitter
5 years ago
JustAnotherArchivist
b262d893da
Silence by default
5 years ago
JustAnotherArchivist
6fb9587a2b
More flexible normalisation
5 years ago
JustAnotherArchivist
06be216f4c
Print Instagram ignore immediately after upload instead of at the end
5 years ago
JustAnotherArchivist
1be4ed829b
Add helper for AB/chromebot-ing YouTube channels and users
5 years ago
JustAnotherArchivist
2a7a4ea6dc
Fix HTTPS handling
5 years ago