Commit Graph

1fa57d4 Fix extraction on Wix sites from JSON inside a data attribute by JustAnotherArchivist 2020-02-10 18:23:36 +0000
4a74216 Suppress output if there are no matched jobs by JustAnotherArchivist 2020-02-10 01:18:11 +0000
fe72d57 Add filtering based on substrings anywhere in the string and on regex by JustAnotherArchivist 2020-02-10 00:47:19 +0000
cf30a53 Add case-insensitive filtering by JustAnotherArchivist 2020-02-10 00:42:36 +0000
711e444 Highlight jobs that have been inactive for over 6 hours by JustAnotherArchivist 2020-02-02 05:28:17 +0000
b291903 Fix sorting on numerical columns by JustAnotherArchivist 2020-02-02 05:27:05 +0000
257b578 Add descending sort by JustAnotherArchivist 2020-02-02 05:18:08 +0000
6e7449d Support column names in any capitalisation by JustAnotherArchivist 2020-02-02 05:11:16 +0000
e5e7bdf Add more filtering options by JustAnotherArchivist 2020-02-02 05:08:48 +0000
c611420 Remove options from usage line by JustAnotherArchivist 2020-02-02 05:03:54 +0000
824eb5e Add script for getting an AB job overview table by JustAnotherArchivist 2020-02-02 04:38:28 +0000
34c1a58 Fix detection of multiple transfer encodings by JustAnotherArchivist 2019-12-11 02:34:56 +0000
195df08 Fix marker loop on some filenames due to lacking HTML entity processing by JustAnotherArchivist 2019-12-03 04:48:21 +0000
3cc3a1e Fix nested tags by JustAnotherArchivist 2019-12-03 04:43:35 +0000
5c90748 Handle broken pipe on stdout by JustAnotherArchivist 2019-11-23 02:41:58 +0000
b38349e Fix duplicate slashes by JustAnotherArchivist 2019-11-23 02:36:33 +0000
f23e4cc Retry on internal errors by JustAnotherArchivist 2019-11-23 02:31:14 +0000
bfe5f59 Add marker loop detection by JustAnotherArchivist 2019-11-22 17:04:07 +0000
66bdef3 Take a bucket URL argument instead of hostname + bucketname by JustAnotherArchivist 2019-11-22 17:00:02 +0000
e385c1d Limit curl to 10 seconds by JustAnotherArchivist 2019-11-18 01:40:42 +0000
7416244 Replace curl-archivebot-ua with a more general curl-ua script that supports different UAs selected by aliases by JustAnotherArchivist 2019-11-13 18:00:10 +0000
9d712d6 Ignore certain URLs on Twitter and Instagram entirely by JustAnotherArchivist 2019-11-13 13:55:18 +0000
87826d4 Use line variable instead of prefix+url by JustAnotherArchivist 2019-11-13 13:54:44 +0000
163aacf Print deletion URL on stderr by JustAnotherArchivist 2019-11-08 15:47:43 +0000
486a593 Add support for more weird Facebook URLs by JustAnotherArchivist 2019-10-22 16:24:54 +0000
256a944 Fix deduplication within each section processing by JustAnotherArchivist 2019-10-22 14:53:20 +0000
98d77ec Deduplicate output by JustAnotherArchivist 2019-10-22 14:52:13 +0000
6ce64ba Remove redundant url-normalise after the extraction by JustAnotherArchivist 2019-10-22 13:48:14 +0000
3181831 Fix URL extraction from Facebook profile overview pages by JustAnotherArchivist 2019-10-20 18:43:06 +0000
869ade2 Separate names in stderr annotations for the various url-normalise processes by JustAnotherArchivist 2019-10-20 18:42:50 +0000
79f0bd4 Normalise URLs everywhere to reduce duplicates by JustAnotherArchivist 2019-10-20 18:14:15 +0000
dc4efcf One URL normalisation script to rule them all by JustAnotherArchivist 2019-10-20 18:13:14 +0000
0f13a1f Add verbosity options, and annotate stderr on wiki-recursive-extract by JustAnotherArchivist 2019-10-20 18:10:21 +0000
3ec816c Add script for link extraction from social media profiles by JustAnotherArchivist 2019-10-20 18:05:48 +0000
5285c40 Add script for recursive website and social media discovery by JustAnotherArchivist 2019-10-20 17:16:56 +0000
2be9ca9 Ignore more useless Facebook links by JustAnotherArchivist 2019-10-19 18:38:12 +0000
c3b0e55 Add support for facebook.com/pg/something by JustAnotherArchivist 2019-10-19 18:16:59 +0000
7c389f1 Add support for hashbang fragments on Twitter links by JustAnotherArchivist 2019-10-19 18:14:32 +0000
c56736b Ignore /intent on Twitter by JustAnotherArchivist 2019-10-19 18:12:42 +0000
4f34753 Add support for Instagram posts and ignore spurious links from the CDN by JustAnotherArchivist 2019-10-19 13:09:20 +0000
ad030f5 Add support for Facebook pages and groups by JustAnotherArchivist 2019-10-19 13:08:50 +0000
cd0b3f6 Ignore /vi/* on YouTube (video thumbnails) by JustAnotherArchivist 2019-10-19 12:51:56 +0000
6f1cca7 Support hashtags by JustAnotherArchivist 2019-10-19 12:49:48 +0000
c61efa0 Make social media normalisation script snscrape-independent by JustAnotherArchivist 2019-10-19 12:04:57 +0000
e6008eb Add script for automatic social media discovery by JustAnotherArchivist 2019-10-19 11:59:26 +0000
fed6654 Support python3 in any directory instead of just /usr/bin by JustAnotherArchivist 2019-10-16 11:36:50 +0000
8f3c3bf Add readme and licence by JustAnotherArchivist 2019-09-30 23:35:51 +0000
5982e13 Stop gracefully when encountering a SIGPIPE by JustAnotherArchivist 2019-09-10 00:19:09 +0000
c13a115 Add support for WARC/1.1 by JustAnotherArchivist 2019-08-26 13:34:53 +0000
376cde7 Fix broken block digest calculation on malformed HTTP responses by JustAnotherArchivist 2019-07-29 14:52:21 +0000
b121cbd Write all log messages to stderr by JustAnotherArchivist 2019-07-29 13:23:37 +0000
ed1270d Add support for upper-cased chunk lengths by JustAnotherArchivist 2019-07-29 13:12:55 +0000
d4826ab Add record ID to log messages by JustAnotherArchivist 2019-07-29 13:08:46 +0000
4925a91 Add youtube-filter-autogen-channels by JustAnotherArchivist 2019-07-22 16:36:28 +0000
9b8f223 Add wiki-sections-sort by JustAnotherArchivist 2019-07-22 12:50:35 +0000
552a414 Fix not returning complete body for non-chunked responses by JustAnotherArchivist 2019-07-19 12:49:31 +0000
0dc0de6 Add support for lists by JustAnotherArchivist 2019-07-19 12:48:19 +0000
9d344df +x by JustAnotherArchivist 2019-07-13 21:37:33 +0000
f6a7cbf Fix --with-list-urls help message by JustAnotherArchivist 2019-07-13 21:34:54 +0000
9743aa7 Add s3-bucket-list by JustAnotherArchivist 2019-07-13 21:33:32 +0000
91adce7 Add YouTube normalisation script by JustAnotherArchivist 2019-06-17 01:54:45 +0000
5ca90c3 Update tmux session commands by JustAnotherArchivist 2019-06-16 16:19:38 +0000
679923d Add support for Twitter hashtag extraction by JustAnotherArchivist 2019-06-16 16:16:55 +0000
6633838 Add support for lists by JustAnotherArchivist 2019-06-16 16:14:36 +0000
d85d142 Handle parameters on Twitter URLs by JustAnotherArchivist 2019-05-25 00:06:05 +0000
5984565 Handle Twitter URLs with trailing slash by JustAnotherArchivist 2019-05-25 00:03:38 +0000
8647cca Support subdomain-less Facebook URLs by JustAnotherArchivist 2019-05-24 16:39:38 +0000
66ec0c9 Handle more Facebook URLs by JustAnotherArchivist 2019-05-24 12:46:16 +0000
baa8a56 Add script for scraping MEP links from europarl.europa.eu by JustAnotherArchivist 2019-05-21 14:53:31 +0000
c2413b2 Add ArchiveBot wiki list helper by JustAnotherArchivist 2019-05-18 14:58:23 +0000
7281801 Extract external links from Twitter by JustAnotherArchivist 2019-05-16 22:54:45 +0000
b262d89 Silence by default by JustAnotherArchivist 2019-05-13 23:50:20 +0000
6fb9587 More flexible normalisation by JustAnotherArchivist 2019-05-13 16:14:35 +0000
06be216 Print Instagram ignore immediately after upload instead of at the end by JustAnotherArchivist 2019-05-09 00:27:28 +0000
1be4ed8 Add helper for AB/chromebot-ing YouTube channels and users by JustAnotherArchivist 2019-05-08 23:49:30 +0000
2a7a4ea Fix HTTPS handling by JustAnotherArchivist 2019-05-06 23:19:36 +0000
a812cb5 More snscrape helper tools by JustAnotherArchivist 2019-05-06 22:48:36 +0000
3ee3ffc Generate commands for Blogspot by JustAnotherArchivist 2019-05-06 22:29:21 +0000
5090a8a Enumerate users on a Mastodon instance by JustAnotherArchivist 2019-05-05 21:32:55 +0000
0000d8f Add script to queue derive on IA by JustAnotherArchivist 2019-05-01 20:36:05 +0000
6dc711c Further helper scripts for snscrape: normalising usernames and extracting them from a list of URLs by JustAnotherArchivist 2019-04-30 15:23:15 +0000
e3a3745 Add uniqify by JustAnotherArchivist 2019-04-30 14:38:54 +0000
3210678 Proper script for tracking size of uploaded data by JustAnotherArchivist 2019-04-30 14:28:23 +0000
5c654cb Split out size formatting by JustAnotherArchivist 2019-04-30 14:27:58 +0000
de2cdc0 curl with ArchiveBot UA by JustAnotherArchivist 2019-04-30 13:42:03 +0000
89ccd68 Helper tools for snscrape and the wiki pages by JustAnotherArchivist 2019-04-30 13:40:56 +0000
f2e836d Add support for differently formatted digests by JustAnotherArchivist 2019-04-30 04:14:05 +0000
94c4f76 Fix crash when a digest is missing from a record by JustAnotherArchivist 2019-04-30 04:13:45 +0000
ef78a33 Colour only the header field names but not the values by JustAnotherArchivist 2019-04-30 02:29:34 +0000
9ce4653 Document colouring and usage by JustAnotherArchivist 2019-04-30 02:29:10 +0000
e7c5d82 Coloured WARCs?! by JustAnotherArchivist 2019-04-30 02:20:43 +0000
70b413f Better events: include raw WARC header data and separate HTTP requests into headers and body by JustAnotherArchivist 2019-04-30 02:19:20 +0000
641bc7a Fix infinite loop at end of WARC by JustAnotherArchivist 2019-04-30 02:16:42 +0000
a700e8e Add tcp-closer command by JustAnotherArchivist 2019-04-30 00:19:40 +0000
859c75a Add tool for WARC verification and extraction by JustAnotherArchivist 2019-04-28 23:34:21 +0000
e867a23 Replace urlencoded @ symbol by JustAnotherArchivist 2019-04-28 15:43:17 +0000
cbd9520 Workaround for hash no longer needed with current transfer.sh code by JustAnotherArchivist 2019-04-24 15:44:49 +0000
61431c2 Add VK scraping helper by JustAnotherArchivist 2019-04-21 23:53:43 +0000
d6ff566 Instagram always uses lower-case usernames by JustAnotherArchivist 2019-04-21 22:31:02 +0000
138c2a2 Get rid of post-processing now that snscrape (dev version) has clean URLs by JustAnotherArchivist 2019-04-18 17:09:24 +0000
