JustAnotherArchivist
5c907488e1
Handle broken pipe on stdout
4 years ago
JustAnotherArchivist
b38349e91f
Fix duplicate slashes
4 years ago
JustAnotherArchivist
f23e4cc71e
Retry on internal errors
4 years ago
JustAnotherArchivist
bfe5f59e25
Add marker loop detection
4 years ago
JustAnotherArchivist
66bdef3247
Take a bucket URL argument instead of hostname + bucketname
4 years ago
JustAnotherArchivist
e385c1d302
Limit curl to 10 seconds
4 years ago
JustAnotherArchivist
74162445aa
Replace curl-archivebot-ua with a more general curl-ua script that supports different UAs selected by aliases
4 years ago
JustAnotherArchivist
9d712d64d7
Ignore certain URLs on Twitter and Instagram entirely
4 years ago
JustAnotherArchivist
87826d4844
Use line variable instead of prefix+url
4 years ago
JustAnotherArchivist
163aacf13c
Print deletion URL on stderr
4 years ago
JustAnotherArchivist
486a593f15
Add support for more weird Facebook URLs
4 years ago
JustAnotherArchivist
256a94443e
Fix deduplication within each section processing
4 years ago
JustAnotherArchivist
98d77ecc96
Deduplicate output
This uses mawk's extensions `-W interactive` and `delete array`; it will probably work with certain other AWK implementations as well, but for now it depends on mawk explicitly.
4 years ago
JustAnotherArchivist
6ce64baf87
Remove redundant url-normalise after the extraction
Since all input is run through url-normalise before processing and all output of website and social media extraction is also normalised, it's not necessary to re-normalise again at the end.
4 years ago
JustAnotherArchivist
318183148e
Fix URL extraction from Facebook profile overview pages
4 years ago
JustAnotherArchivist
869ade27eb
Separate names in stderr annotations for the various url-normalise processes
4 years ago
JustAnotherArchivist
79f0bd4332
Normalise URLs everywhere to reduce duplicates
4 years ago
JustAnotherArchivist
dc4efcfbfb
One URL normalisation script to rule them all
Consolidate social media profile, YouTube, and (new) generic web page URL normalisation into one script
4 years ago
JustAnotherArchivist
0f13a1fadd
Add verbosity options, and annotate stderr on wiki-recursive-extract
4 years ago
JustAnotherArchivist
3ec816cd04
Add script for link extraction from social media profiles
4 years ago
JustAnotherArchivist
5285c406d9
Add script for recursive website and social media discovery
4 years ago
JustAnotherArchivist
2be9ca922e
Ignore more useless Facebook links
4 years ago
JustAnotherArchivist
c3b0e5543e
Add support for facebook.com/pg/something
4 years ago
JustAnotherArchivist
7c389f1fef
Add support for hashbang fragments on Twitter links
4 years ago
JustAnotherArchivist
c56736bc4a
Ignore /intent on Twitter
4 years ago
JustAnotherArchivist
4f34753788
Add support for Instagram posts and ignore spurious links from the CDN
4 years ago
JustAnotherArchivist
ad030f5d21
Add support for Facebook pages and groups
4 years ago
JustAnotherArchivist
cd0b3f6214
Ignore /vi/* on YouTube (video thumbnails)
4 years ago
JustAnotherArchivist
6f1cca73ad
Support hashtags
4 years ago
JustAnotherArchivist
c61efa03f0
Make social media normalisation script snscrape-independent
4 years ago
JustAnotherArchivist
e6008eb971
Add script for automatic social media discovery
4 years ago
JustAnotherArchivist
fed66542fa
Support python3 in any directory instead of just /usr/bin
4 years ago
JustAnotherArchivist
5982e131a4
Stop gracefully when encountering a SIGPIPE
4 years ago
JustAnotherArchivist
c13a1150df
Add support for WARC/1.1
4 years ago
JustAnotherArchivist
376cde7b8c
Fix broken block digest calculation on malformed HTTP responses
4 years ago
JustAnotherArchivist
b121cbd958
Write all log messages to stderr
4 years ago
JustAnotherArchivist
ed1270d988
Add support for upper-cased chunk lengths
4 years ago
JustAnotherArchivist
d4826abde2
Add record ID to log messages
4 years ago
JustAnotherArchivist
4925a912c0
Add youtube-filter-autogen-channels
4 years ago
JustAnotherArchivist
9b8f223776
Add wiki-sections-sort
4 years ago
JustAnotherArchivist
552a4147c2
Fix not returning complete body for non-chunked responses
Leftover from debugging
4 years ago
JustAnotherArchivist
0dc0de6b50
Add support for lists
4 years ago
JustAnotherArchivist
9d344df8c6
+x
4 years ago
JustAnotherArchivist
f6a7cbfc70
Fix --with-list-urls help message
4 years ago
JustAnotherArchivist
9743aa7c35
Add s3-bucket-list
4 years ago
JustAnotherArchivist
91adce786f
Add YouTube normalisation script
4 years ago
JustAnotherArchivist
5ca90c3b7d
Update tmux session commands
4 years ago
JustAnotherArchivist
679923d37d
Add support for Twitter hashtag extraction
4 years ago
JustAnotherArchivist
663383830c
Add support for lists
4 years ago
JustAnotherArchivist
d85d142def
Handle parameters on Twitter URLs
5 years ago