JustAnotherArchivist
0000d8ffd9
Add script to queue derive on IA
5 years ago
JustAnotherArchivist
6dc711c54e
Further helper scripts for snscrape: normalising usernames and extracting them from a list of URLs
5 years ago
JustAnotherArchivist
e3a37455ba
Add uniqify
5 years ago
JustAnotherArchivist
321067819c
Proper script for tracking size of uploaded data
5 years ago
JustAnotherArchivist
5c654cb16b
Split out size formatting
5 years ago
JustAnotherArchivist
de2cdc0aae
curl with ArchiveBot UA
5 years ago
JustAnotherArchivist
89ccd68b59
Helper tools for snscrape and the wiki pages
5 years ago
JustAnotherArchivist
f2e836d2e9
Add support for differently formatted digests
5 years ago
JustAnotherArchivist
94c4f76570
Fix crash when a digest is missing from a record
5 years ago
JustAnotherArchivist
ef78a3318c
Colour only the header field names but not the values
5 years ago
JustAnotherArchivist
9ce4653094
Document colouring and usage
5 years ago
JustAnotherArchivist
e7c5d82254
Coloured WARCs?!
5 years ago
JustAnotherArchivist
70b413f5c1
Better events: include raw WARC header data and separate HTTP requests into headers and body
5 years ago
JustAnotherArchivist
641bc7a207
Fix infinite loop at end of WARC
5 years ago
JustAnotherArchivist
a700e8e2fe
Add tcp-closer command
5 years ago
JustAnotherArchivist
859c75a591
Add tool for WARC verification and extraction
5 years ago
JustAnotherArchivist
e867a2327f
Replace urlencoded @ symbol
The fix for https://github.com/dutchcoders/transfer.sh/issues/215 led to @ being encoded as %40 in filenames in the URL returned, which is awkward when working with social media scrapes since ArchiveBot normalises it to @ again.
5 years ago
JustAnotherArchivist
cbd952024b
Workaround for hash no longer needed with current transfer.sh code
5 years ago
JustAnotherArchivist
61431c2054
Add VK scraping helper
5 years ago
JustAnotherArchivist
d6ff566c4d
Instagram always uses lower-case usernames
5 years ago
JustAnotherArchivist
138c2a2d39
Get rid of post-processing now that snscrape (dev version) has clean URLs
Keep the dirty URLs on Instagram because they're not that dirty and are linked from the profile pages. I usually throw it into ArchiveBot anyway such that it grabs the non-"taken-by" URLs as well.
5 years ago
JustAnotherArchivist
27b0d2da75
Better username capitalisation extraction method
5 years ago
JustAnotherArchivist
3aa828a0ac
transfer.kiska.pw -> transfer.notkiska.pw
5 years ago
JustAnotherArchivist
63f4a8b3d3
transfer.sh -> transfer.kiska.pw
5 years ago
JustAnotherArchivist
0168d50f62
Automatically fix capitalisation of Facebook and Twitter usernames
5 years ago
JustAnotherArchivist
db0104b3c8
Get correct capitalisation for a Facebook username
5 years ago
JustAnotherArchivist
4a1a9a10e0
Allow overriding the "remote filename"
5 years ago
JustAnotherArchivist
769f95808e
Add ix.io upload script
5 years ago
JustAnotherArchivist
c79721337b
+x
5 years ago
JustAnotherArchivist
c30dcf5985
Finding outdated Mastodon instances
5 years ago
JustAnotherArchivist
1748a6b607
Better workaround for the 5000 results limit; works for FoolFuuka 2.0.1 and up
5 years ago
JustAnotherArchivist
fd680551df
Add Bing, Reddit/Pushshift, and FoolFuuka scrapers
5 years ago
JustAnotherArchivist
ede77ad142
Filter Twitter hashtag scrapes based on account scrapes
5 years ago
JustAnotherArchivist
57ef544c6c
Fix line endings
5 years ago
JustAnotherArchivist
07c3e7baaa
Add snscrape helpers
5 years ago
JustAnotherArchivist
b7e3a703d8
Monitor how a pipeline's wget processes are faring
5 years ago
JustAnotherArchivist
168f61b39a
Quote filename so it works with any weird characters in the paths
(Last reconstructed commit from text file full of different versions)
5 years ago
JustAnotherArchivist
8f77c8c72a
xargs -r flag to not run the second find if the first produces no results (GNU extension)
5 years ago
JustAnotherArchivist
9d7a4096f9
Pipe into second find directly
5 years ago
JustAnotherArchivist
e3a4bf6a47
Replace slow lsof with procfs access
5 years ago
JustAnotherArchivist
4a83a54616
Print host for each stuck request
5 years ago
JustAnotherArchivist
2b2c65f034
Print PID
5 years ago
JustAnotherArchivist
fadb70e297
Fixed version which handles multiple roots correctly
5 years ago
JustAnotherArchivist
d10a1d3675
First set of little things
5 years ago
JustAnotherArchivist
a00607f28e
Initial commit
5 years ago