The little things give you away... A collection of various small helper stuff
JustAnotherArchivist c50a8fd796 Fix 'Dictionary mismatch' error when very small dicts are used because the temporary file isn't written to disk before zstdcat gets executed 2 years ago
LICENSE Initial commit 5 years ago
README.md Initial commit 5 years ago
alphabetseq Swap syntaxes 2 years ago
archivebot-blogspot Fix HTTPS handling 5 years ago
archivebot-high-memory Support python3 in any directory instead of just /usr/bin 4 years ago
archivebot-irccloud-paste Add archivebot-irccloud-paste 3 years ago
archivebot-jobid-calculation More snscrape helper tools 5 years ago
archivebot-jobs Pass through datetime, math, re, and time to --pyfilter 3 years ago
archivebot-list-stuck-requests Fix line endings 5 years ago
archivebot-log-extract-ignores Add archivebot-log-extract-ignores 3 years ago
archivebot-monitor-job-queue First set of little things 5 years ago
archivebot-youtube Add helper for AB/chromebot-ing YouTube channels and users 5 years ago
azure-storage-list Add --jsonl option 2 years ago
b64grep Add b64grep 2 years ago
bing-scrape Add Bing, Reddit/Pushshift, and FoolFuuka scrapers 5 years ago
bugzilla-url-list Add Bugzilla URL list generator 2 years ago
combine-by-prefix Add combine-by-prefix 2 years ago
curl-ua Add IE6 UA 3 years ago
deb-repo-urls Fix deb file URLs 3 years ago
dedupe Another alternative and performance/memory comparison 3 years ago
europarl-meps-collect Add script for scraping MEP links from europarl.europa.eu 5 years ago
foolfuuka-search Better workaround for the 5000 results limit; works for FoolFuuka 2.0.1 and up 5 years ago
format-size Split out size formatting 5 years ago
fos-ftp-upload First set of little things 5 years ago
get-crx4chrome-urls First set of little things 5 years ago
github-list-repos Fix org repo listing on new design/site structure 2 years ago
gitlab-list-repos Add support for other instances and full-instance listing 2 years ago
gofile.io-dl Add support for password-protected folders 2 years ago
ia-cdx-search Fix crash on an empty response 2 years ago
ia-derive Add script to queue derive on IA 5 years ago
ia-files-xml-to-jsonl Guarantee stable output order 2 years ago
ia-upload-progress Proper script for tracking size of uploaded data 5 years ago
ia-verify-file Add a timeout to prevent potentially indefinite blocking 2 years ago
ia-wait-item-tasks Add ia-wait-item-tasks 2 years ago
iasha1check Colourise sha1sum output 3 years ago
ix.io-upload Allow overriding the "remote filename" 5 years ago
kill-wpull-connections Merge kill-wpull-connections repository into little-things 3 years ago
killcx-all-https First set of little things 5 years ago
mastodon-enumerate-users Enumerate users on a Mastodon instance 5 years ago
mastodon-outdated Finding outdated Mastodon instances 5 years ago
parent-urls Refactor, strip query/fragment 3 years ago
pipelines-launch-in-tmux-windows First set of little things 5 years ago
pipelines-monitor-tmux-wget-outcomes Monitor how a pipeline's wget processes are faring 5 years ago
pipelines-stop-gracefully First set of little things 5 years ago
reddit-pushshift-search Add Bing, Reddit/Pushshift, and FoolFuuka scrapers 5 years ago
run-every-five-minutes First set of little things 5 years ago
s3-bucket-list Ignore TLS issues 3 years ago
s3-bucket-list-qwarc Record wrapper script in meta WARC as well 3 years ago
snscrape-extract Add support for Twitter hashtag extraction 4 years ago
snscrape-facebook-user Silence by default 5 years ago
snscrape-instagram-user Silence by default 5 years ago
snscrape-prepare-commands Add support for Twitter hashtag extraction 4 years ago
snscrape-tmux Update tmux session commands 4 years ago
snscrape-twitter-filter Filter Twitter hashtag scrapes based on account scrapes 5 years ago
snscrape-twitter-hashtag Extract external links from Twitter 5 years ago
snscrape-twitter-user Extract external links from Twitter 5 years ago
snscrape-upload Print Instagram ignore immediately after upload instead of at the end 5 years ago
snscrape-vk-user Silence by default 5 years ago
snscrape-wiki-transfer-merge Helper tools for snscrape and the wiki pages 5 years ago
social-media-extract-profile-link Fix decoding of links on Facebook profiles 4 years ago
sum-sizes Add sum-sizes 2 years ago
tar-many-files-progress First set of little things 5 years ago
tcp-closer Add tcp-closer command 5 years ago
transfer.archivete.am-upload Handle HTTP/2 lowercase headers 3 years ago
transfer.notkiska.pw-check-ia Switch to HTTPS 3 years ago
uniqify Add uniqify 5 years ago
url-normalise Normalise domain name to lower-case before further processing 4 years ago
warc-peek Add WARC/1.1 support 3 years ago
warc-size Split out size formatting 5 years ago
warc-tiny Fix compatibility with wpull 2.x 3 years ago
website-extract-social-media Add support for Facebook /pages/category/Category/Name-ID URLs 4 years ago
wget-spider-estimate-size First set of little things 5 years ago
wiki-list-to-main Add ArchiveBot wiki list helper 5 years ago
wiki-recursive-extract-normalise Fix deduplication within each section processing 4 years ago
wiki-sections-sort Add wiki-sections-sort 4 years ago
wiki-website-extract-social-media Add script for automatic social media discovery 4 years ago
wpull1-parallel-progress-monitor First set of little things 5 years ago
wpull1-progress-monitor First set of little things 5 years ago
wpull2-extract-remaining Clean up wpull DB commands 3 years ago
wpull2-log-extract-errors Treat NXDOMAIN and no A/AAAA record errors as ok 3 years ago
wpull2-requeue Print number of modified records on requeueing 2 years ago
wpull2-url-origin Clean up wpull DB commands 3 years ago
youtube-channel-list.py Add YouTube channel listing script 2 years ago
youtube-extract Handle ancient /?v= URLs 2 years ago
youtube-filter-autogen-channels Add youtube-filter-autogen-channels 4 years ago
zstdwarccat Fix 'Dictionary mismatch' error when very small dicts are used because the temporary file isn't written to disk before zstdcat gets executed 2 years ago

README.md

Over the past few years, I’ve written and accumulated a number of useful little things to help with archival-related tasks. This repository collects them. I hope someone finds some of them useful.

License (applies to all programs in this repository)

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see https://www.gnu.org/licenses/.