1fa57d4
Fix extraction on Wix sites from JSON inside a data attribute by
2020-02-10 18:23:36 +0000
4a74216
Suppress output if there are no matched jobs by
2020-02-10 01:18:11 +0000
fe72d57
Add filtering based on substrings anywhere in the string and on regex by
2020-02-10 00:47:19 +0000
cf30a53
Add case-insensitive filtering by
2020-02-10 00:42:36 +0000
711e444
Highlight jobs that have been inactive for over 6 hours by
2020-02-02 05:28:17 +0000
b291903
Fix sorting on numerical columns by
2020-02-02 05:27:05 +0000
257b578
Add descending sort by
2020-02-02 05:18:08 +0000
6e7449d
Support column names in any capitalisation by
2020-02-02 05:11:16 +0000
e5e7bdf
Add more filtering options by
2020-02-02 05:08:48 +0000
c611420
Remove options from usage line by
2020-02-02 05:03:54 +0000
824eb5e
Add script for getting an AB job overview table by
2020-02-02 04:38:28 +0000
34c1a58
Fix detection of multiple transfer encodings by
2019-12-11 02:34:56 +0000
195df08
Fix marker loop on some filenames due to lacking HTML entity processing by
2019-12-03 04:48:21 +0000
3cc3a1e
Fix nested tags by
2019-12-03 04:43:35 +0000
5c90748
Handle broken pipe on stdout by
2019-11-23 02:41:58 +0000
b38349e
Fix duplicate slashes by
2019-11-23 02:36:33 +0000
f23e4cc
Retry on internal errors by
2019-11-23 02:31:14 +0000
bfe5f59
Add marker loop detection by
2019-11-22 17:04:07 +0000
66bdef3
Take a bucket URL argument instead of hostname + bucketname by
2019-11-22 17:00:02 +0000
e385c1d
Limit curl to 10 seconds by
2019-11-18 01:40:42 +0000
7416244
Replace curl-archivebot-ua with a more general curl-ua script that supports different UAs selected by aliases by
2019-11-13 18:00:10 +0000
9d712d6
Ignore certain URLs on Twitter and Instagram entirely by
2019-11-13 13:55:18 +0000
87826d4
Use line variable instead of prefix+url by
2019-11-13 13:54:44 +0000
163aacf
Print deletion URL on stderr by
2019-11-08 15:47:43 +0000
486a593
Add support for more weird Facebook URLs by
2019-10-22 16:24:54 +0000
256a944
Fix deduplication within each section processing by
2019-10-22 14:53:20 +0000
98d77ec
Deduplicate output by
2019-10-22 14:52:13 +0000
6ce64ba
Remove redundant url-normalise after the extraction by
2019-10-22 13:48:14 +0000
3181831
Fix URL extraction from Facebook profile overview pages by
2019-10-20 18:43:06 +0000
869ade2
Separate names in stderr annotations for the various url-normalise processes by
2019-10-20 18:42:50 +0000
79f0bd4
Normalise URLs everywhere to reduce duplicates by
2019-10-20 18:14:15 +0000
dc4efcf
One URL normalisation script to rule them all by
2019-10-20 18:13:14 +0000
0f13a1f
Add verbosity options, and annotate stderr on wiki-recursive-extract by
2019-10-20 18:10:21 +0000
3ec816c
Add script for link extraction from social media profiles by
2019-10-20 18:05:48 +0000
5285c40
Add script for recursive website and social media discovery by
2019-10-20 17:16:56 +0000
2be9ca9
Ignore more useless Facebook links by
2019-10-19 18:38:12 +0000
c3b0e55
Add support for facebook.com/pg/something by
2019-10-19 18:16:59 +0000
7c389f1
Add support for hashbang fragments on Twitter links by
2019-10-19 18:14:32 +0000
c56736b
Ignore /intent on Twitter by
2019-10-19 18:12:42 +0000
4f34753
Add support for Instagram posts and ignore spurious links from the CDN by
2019-10-19 13:09:20 +0000
ad030f5
Add support for Facebook pages and groups by
2019-10-19 13:08:50 +0000
cd0b3f6
Ignore /vi/* on YouTube (video thumbnails) by
2019-10-19 12:51:56 +0000
6f1cca7
Support hashtags by
2019-10-19 12:49:48 +0000
c61efa0
Make social media normalisation script snscrape-independent by
2019-10-19 12:04:57 +0000
e6008eb
Add script for automatic social media discovery by
2019-10-19 11:59:26 +0000
fed6654
Support python3 in any directory instead of just /usr/bin by
2019-10-16 11:36:50 +0000
8f3c3bf
Add readme and licence by
2019-09-30 23:35:51 +0000
5982e13
Stop gracefully when encountering a SIGPIPE by
2019-09-10 00:19:09 +0000
c13a115
Add support for WARC/1.1 by
2019-08-26 13:34:53 +0000
376cde7
Fix broken block digest calculation on malformed HTTP responses by
2019-07-29 14:52:21 +0000
b121cbd
Write all log messages to stderr by
2019-07-29 13:23:37 +0000
ed1270d
Add support for upper-cased chunk lengths by
2019-07-29 13:12:55 +0000
d4826ab
Add record ID to log messages by
2019-07-29 13:08:46 +0000
4925a91
Add youtube-filter-autogen-channels by
2019-07-22 16:36:28 +0000
9b8f223
Add wiki-sections-sort by
2019-07-22 12:50:35 +0000
552a414
Fix not returning complete body for non-chunked responses by
2019-07-19 12:49:31 +0000
0dc0de6
Add support for lists by
2019-07-19 12:48:19 +0000
9d344df
+x by
2019-07-13 21:37:33 +0000
f6a7cbf
Fix --with-list-urls help message by
2019-07-13 21:34:54 +0000
9743aa7
Add s3-bucket-list by
2019-07-13 21:33:32 +0000
91adce7
Add YouTube normalisation script by
2019-06-17 01:54:45 +0000
5ca90c3
Update tmux session commands by
2019-06-16 16:19:38 +0000
679923d
Add support for Twitter hashtag extraction by
2019-06-16 16:16:55 +0000
6633838
Add support for lists by
2019-06-16 16:14:36 +0000
d85d142
Handle parameters on Twitter URLs by
2019-05-25 00:06:05 +0000
5984565
Handle Twitter URLs with trailing slash by
2019-05-25 00:03:38 +0000
8647cca
Support subdomain-less Facebook URLs by
2019-05-24 16:39:38 +0000
66ec0c9
Handle more Facebook URLs by
2019-05-24 12:46:16 +0000
baa8a56
Add script for scraping MEP links from europarl.europa.eu by
2019-05-21 14:53:31 +0000
c2413b2
Add ArchiveBot wiki list helper by
2019-05-18 14:58:23 +0000
7281801
Extract external links from Twitter by
2019-05-16 22:54:45 +0000
b262d89
Silence by default by
2019-05-13 23:50:20 +0000
6fb9587
More flexible normalisation by
2019-05-13 16:14:35 +0000
06be216
Print Instagram ignore immediately after upload instead of at the end by
2019-05-09 00:27:28 +0000
1be4ed8
Add helper for AB/chromebot-ing YouTube channels and users by
2019-05-08 23:49:30 +0000
2a7a4ea
Fix HTTPS handling by
2019-05-06 23:19:36 +0000
a812cb5
More snscrape helper tools by
2019-05-06 22:48:36 +0000
3ee3ffc
Generate commands for Blogspot by
2019-05-06 22:29:21 +0000
5090a8a
Enumerate users on a Mastodon instance by
2019-05-05 21:32:55 +0000
0000d8f
Add script to queue derive on IA by
2019-05-01 20:36:05 +0000
6dc711c
Further helper scripts for snscrape: normalising usernames and extracting them from a list of URLs by
2019-04-30 15:23:15 +0000
e3a3745
Add uniqify by
2019-04-30 14:38:54 +0000
3210678
Proper script for tracking size of uploaded data by
2019-04-30 14:28:23 +0000
5c654cb
Split out size formatting by
2019-04-30 14:27:58 +0000
de2cdc0
curl with ArchiveBot UA by
2019-04-30 13:42:03 +0000
89ccd68
Helper tools for snscrape and the wiki pages by
2019-04-30 13:40:56 +0000
f2e836d
Add support for differently formatted digests by
2019-04-30 04:14:05 +0000
94c4f76
Fix crash when a digest is missing from a record by
2019-04-30 04:13:45 +0000
ef78a33
Colour only the header field names but not the values by
2019-04-30 02:29:34 +0000
9ce4653
Document colouring and usage by
2019-04-30 02:29:10 +0000
e7c5d82
Coloured WARCs?! by
2019-04-30 02:20:43 +0000
70b413f
Better events: include raw WARC header data and separate HTTP requests into headers and body by
2019-04-30 02:19:20 +0000
641bc7a
Fix infinite loop at end of WARC by
2019-04-30 02:16:42 +0000
a700e8e
Add tcp-closer command by
2019-04-30 00:19:40 +0000
859c75a
Add tool for WARC verification and extraction by
2019-04-28 23:34:21 +0000
e867a23
Replace urlencoded @ symbol by
2019-04-28 15:43:17 +0000
cbd9520
Workaround for hash no longer needed with current transfer.sh code by
2019-04-24 15:44:49 +0000
61431c2
Add VK scraping helper by
2019-04-21 23:53:43 +0000
d6ff566
Instagram always uses lower-case usernames by
2019-04-21 22:31:02 +0000
138c2a2
Get rid of post-processing now that snscrape (dev version) has clean URLs by
2019-04-18 17:09:24 +0000