little-things

The little things give you away... A collection of various small helper tools

JustAnotherArchivist ca206e162e Disable deletion by default		1 week ago
.gitignore	Add infrastructure for simple C-based tools	2 years ago
.make-and-exec	Warnings are bad, mmkay?	11 months ago
.urldecode-test	Get rid of Makefile for more control; add proper debug build support	11 months ago
.warc-dump-responses-test	Add test for warc-dump-responses	5 months ago
.youtube-extract-rapid-test	Get rid of Makefile for more control; add proper debug build support	11 months ago
LICENSE	Initial commit	5 years ago
README.md	Initial commit	5 years ago
alphabetseq	Swap syntaxes	2 years ago
archivebot-blogspot	Fix HTTPS handling	5 years ago
archivebot-compress-db	Add archivebot-compress-db	4 months ago
archivebot-fix-queue-counters	Fix TypeError	1 year ago
archivebot-high-resources	Replace archivebot-high-memory with more capable archivebot-high-resources	4 months ago
archivebot-irccloud-paste	Add archivebot-irccloud-paste	3 years ago
archivebot-jobid-calculation	More snscrape helper tools	5 years ago
archivebot-jobs	Not-so-new new ArchiveBot domain	1 year ago
archivebot-list-stuck-requests	Fix line endings	5 years ago
archivebot-log-extract-ignores	Add archivebot-log-extract-ignores	3 years ago
archivebot-monitor-job-queue	First set of little things	5 years ago
archivebot-pipelines-count-jobs	Add archivebot-pipelines-count-jobs	5 months ago
archivebot-youtube	Add helper for AB/chromebot-ing YouTube channels and users	5 years ago
at-tracker-sample-user-item-size	Add at-tracker-sample-user-item-size	2 years ago
azure-storage-list	Add --jsonl option	2 years ago
b64grep	Add b64grep	2 years ago
base64url	Add base64url	2 years ago
bencode2json	Add bencode2json	1 year ago
bing-scrape	Fix extraction of search results	6 months ago
bugzilla-url-list	Add Bugzilla URL list generator	2 years ago
cdx-chunk	Add cdx-chunk	2 years ago
cloudflare-email-decode	Add cloudflare-email-decode	2 years ago
combine-by-prefix	Add combine-by-prefix	2 years ago
curl-ia	Add header mode (e.g. for tasks API)	1 year ago
curl-ua	Add IE6 UA	3 years ago
deb-repo-urls	Fix deb file URLs	3 years ago
dedupe	Another alternative and performance/memory comparison	3 years ago
dir-to-ia	Disable deletion by default	1 week ago
europarl-meps-collect	Add script for scraping MEP links from europarl.europa.eu	4 years ago
extract-urls-for-archiveteam-projects	Add wpull2-extract-ignored-offsite and extract-urls-for-archiveteam-projects	4 months ago
foolfuuka-search	Better workaround for the 5000 results limit; works for FoolFuuka 2.0.1 and up	5 years ago
format-size	Split out size formatting	5 years ago
fos-ftp-upload	First set of little things	5 years ago
get-crx4chrome-urls	First set of little things	5 years ago
github-list-repos	Fix org listings not including archived repos	1 week ago
gitlab-list-repos	Add support for other instances and full-instance listing	2 years ago
gofile.io-dl	Add support for password-protected folders	2 years ago
html-extract-stupid	Handle and	10 months ago
http-response-bodies	Add http-response-bodies	1 year ago
http-response-bodies.c	Fix extra LF between chunks	9 months ago
ia-cdx-search	Fix error when no arguments are provided	1 year ago
ia-cdx-search-subdomains	Fix URLs without a path	1 year ago
ia-derive	Queue derives with `ia tasks` instead of this manual curl rubbish	2 years ago
ia-files-xml-to-jsonl	Guarantee stable output order	2 years ago
ia-upload-progress	Proper script for tracking size of uploaded data	5 years ago
ia-upload-stream	Fix single-part upload	1 week ago
ia-verify-file	Add a timeout to prevent potentially indefinite blocking	2 years ago
ia-wait-item-tasks	Handle error tasks by exiting non-zero	11 months ago
iasha1check	Fix output sometimes appearing after prompt	1 year ago
ix.io-upload	Allow overriding the "remote filename"	5 years ago
kill-connections	Handle processes with too many open connections	2 years ago
kill-wpull-connections	Merge kill-wpull-connections repository into little-things	3 years ago
killcx-all-https	First set of little things	5 years ago
mastodon-enumerate-users	Enumerate users on a Mastodon instance	5 years ago
mastodon-outdated	Finding outdated Mastodon instances	5 years ago
moinmoin-url-list	Add moinmoin-url-list	3 months ago
parent-urls	Refactor, strip query/fragment	3 years ago
pipelines-launch-in-tmux-windows	First set of little things	5 years ago
pipelines-monitor-tmux-wget-outcomes	Monitor how a pipeline's wget processes are faring	5 years ago
pipelines-stop-gracefully	First set of little things	5 years ago
reddit-pushshift-search	Add Bing, Reddit/Pushshift, and FoolFuuka scrapers	5 years ago
run-every-five-minutes	First set of little things	5 years ago
s3-bucket-find-direct-url	Add support for PermanentRedirect error responses	6 months ago
s3-bucket-list	Enable line buffering on list URLs FD	7 months ago
s3-bucket-list-qwarc	Add JSONL output option for S3 listing	2 years ago
snscrape-extract	Add support for Twitter hashtag extraction	4 years ago
snscrape-facebook-user	Silence by default	5 years ago
snscrape-instagram-user	Silence by default	5 years ago
snscrape-prepare-commands	Add support for Twitter hashtag extraction	4 years ago
snscrape-tmux	Update tmux session commands	4 years ago
snscrape-twitter-filter	Filter Twitter hashtag scrapes based on account scrapes	5 years ago
snscrape-twitter-hashtag	Extract external links from Twitter	4 years ago
snscrape-twitter-user	Extract external links from Twitter	4 years ago
snscrape-upload	Print Instagram ignore immediately after upload instead of at the end	5 years ago
snscrape-vk-user	Silence by default	5 years ago
snscrape-wiki-transfer-merge	Helper tools for snscrape and the wiki pages	5 years ago
social-media-extract-profile-link	Fix decoding of links on Facebook profiles	4 years ago
sum-sizes	Avoid float roundtrip for integer values	5 months ago
tar-many-files-progress	First set of little things	5 years ago
tcp-closer	Add tcp-closer command	5 years ago
torrent-tiny	Fix negative ints	1 year ago
transfer.archivete.am-upload	Handle HTTP/2 lowercase headers	3 years ago
transfer.notkiska.pw-check-ia	Switch to HTTPS	3 years ago
uniqify	Add uniqify	5 years ago
uniqify-recent	Add uniqify-recent	2 months ago
url-normalise	Normalise domain name to lower-case before further processing	4 years ago
urldecode	Add URL/percent decoding tool	2 years ago
urldecode.c	Fix unused argc and argv error	11 months ago
urlsort	Add urlsort	2 years ago
warc-dump-responses	Add warc-dump-responses	1 year ago
warc-dump-responses.c	Fix error when the terminating CRLFCRLF of a record is truncated	5 months ago
warc-peek	Allow negative offsets to peek near the end of the file	1 year ago
warc-size	Split out size formatting	5 years ago
warc-tiny	Fix empty files being considered valid WARCs	9 months ago
website-extract-social-media	Add support for Facebook /pages/category/Category/Name-ID URLs	4 years ago
wget-spider-estimate-size	First set of little things	5 years ago
wiki-list-to-main	Add ArchiveBot wiki list helper	4 years ago
wiki-recursive-extract-normalise	Fix deduplication within each section processing	4 years ago
wiki-sections-sort	Add wiki-sections-sort	4 years ago
wiki-website-extract-social-media	Add script for automatic social media discovery	4 years ago
wpull1-parallel-progress-monitor	First set of little things	5 years ago
wpull1-progress-monitor	First set of little things	5 years ago
wpull2-extract-ignored	Remove filtering of onsite URLs because it's unreliable	4 months ago
wpull2-extract-remaining	Clean up wpull DB commands	3 years ago
wpull2-log-colourise	Add wpull2-log-colourise	1 year ago
wpull2-log-extract-errors	Treat NXDOMAIN and no A/AAAA record errors as ok	3 years ago
wpull2-requeue	Error on unknown options	2 years ago
wpull2-url-origin	Clean up wpull DB commands	3 years ago
youtube-channel-list.py	Use _type instead of key check hack	1 year ago
youtube-extract	Exclude backslashes in channel patterns	1 year ago
youtube-extract-rapid	Add youtube-extract-rapid	2 years ago
youtube-extract-rapid.c	Add youtube-extract-rapid	2 years ago
youtube-filter-autogen-channels	Add youtube-filter-autogen-channels	4 years ago
zstdwarccat	Fix 'Dictionary mismatch' error when very small dicts are used because the temporary file isn't written to disk before zstdcat gets executed	2 years ago

README.md

Over the past few years, I’ve written and accumulated a number of useful little things to help with archival-related tasks. This repository collects them. I hope someone finds some of them useful.

License (applies to all programs in this repository)

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see https://www.gnu.org/licenses/.