JustAnotherArchivist
f025c4e9f3
Add extensive debug logging
3 years ago
JustAnotherArchivist
dbe1ed71ab
"Freeze" log file object before writing to WARC to ensure that further log messages aren't picked up
This is a workaround for https://github.com/webrecorder/warcio/issues/90
3 years ago
JustAnotherArchivist
8ca2a6bde5
Fix exceptions on journal errors
3 years ago
JustAnotherArchivist
733506aed7
Remove obsolete TODO
The fcntl.flock call does not use LOCK_NB, so if it fails with an OSError, that is Really Bad™ and can't be handled cleanly.
3 years ago
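To illustrate the distinction the commit message draws: with LOCK_NB, an OSError (specifically BlockingIOError) is the normal signal that another holder has the lock, whereas a blocking flock call simply waits, so an OSError from it points to something more serious. The following is a minimal sketch for Unix-like systems only; try_lock and the temporary file are hypothetical and not taken from qwarc.

```python
import fcntl
import tempfile


def try_lock(fp):
    """Return True if an exclusive lock was acquired without waiting."""
    try:
        fcntl.flock(fp, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        # Expected, recoverable outcome in non-blocking mode: someone else holds the lock.
        return False
    return True


with tempfile.NamedTemporaryFile() as first:
    with open(first.name, 'rb') as second:
        # Blocking form (no LOCK_NB): waits until the lock is available.
        fcntl.flock(first, fcntl.LOCK_EX)
        # A second open file description cannot take the lock while `first` holds it.
        contended = try_lock(second)
        fcntl.flock(first, fcntl.LOCK_UN)
        # Once released, the non-blocking attempt succeeds.
        free = try_lock(second)
```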
JustAnotherArchivist
c7fac0ec3f
Add WARC journalling with rollback on errors
Inspired by the implementation in wpull but structured differently to avoid reopening the journal file constantly.
3 years ago
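The rollback half of this idea can be sketched as follows: remember the file offset before a write and truncate back to it if the write fails. This is only an illustration under assumptions; append_with_rollback and its fail parameter are hypothetical, the on-disk journal that would persist the offset across crashes is omitted, and qwarc's actual implementation may differ.

```python
import os


def append_with_rollback(path, payload, fail=False):
    """Append payload to path; on any error, truncate back to the previous size."""
    with open(path, 'ab') as fp:
        # Offset of the end of the file before this write; a journal would persist this.
        start = fp.seek(0, os.SEEK_END)
        try:
            fp.write(payload)
            if fail:
                # Hypothetical hook to simulate an error in the middle of a record.
                raise RuntimeError('simulated write failure')
            fp.flush()
        except Exception:
            # Roll back: discard everything written since `start`.
            fp.truncate(start)
            raise
```

After a failed call, the file contains exactly what it held before the call, so a partially written record never remains in the output.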
JustAnotherArchivist
a91cc23d47
Simplify get_software_info's signature to just the extra dependency packages
As a consequence, SpecDependencies.extra can now be any data type that can be put into JSON; unhashable types previously caused a crash due to the lru_cache.
3 years ago
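The crash mentioned above follows from how functools.lru_cache works: it hashes the call arguments to build its cache key, so an unhashable argument such as a list raises TypeError. A small sketch of that failure mode, using hypothetical function names rather than qwarc's actual get_software_info:

```python
import functools
import json


@functools.lru_cache(maxsize=None)
def cached_info(extra):
    # lru_cache must hash `extra` to look up the cache, before this body ever runs.
    return json.dumps({'extra': extra})


def uncached_info(extra):
    # Without the cache, anything JSON-serialisable is acceptable.
    return json.dumps({'extra': extra})


cached_info(('a', 'b'))  # A tuple is hashable, so this works.

try:
    cached_info(['a', 'b'])  # A list is not hashable.
except TypeError as e:
    error = str(e)
```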
JustAnotherArchivist
820384fe1e
Stop deduping small responses
For small responses, the additional headers for the revisit outweigh the payload truncation savings. The chosen limit of 100 bytes is completely arbitrary and not backed by any real-world data.
4 years ago
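A size cutoff like this could be combined with digest-based deduplication roughly as in the sketch below. The Deduper class, the use of SHA-1, and the choice to treat the limit as exclusive are all assumptions for illustration; only the 100-byte figure comes from the commit message.

```python
import hashlib

MIN_DEDUPE_SIZE = 100  # bytes; arbitrary, per the commit message


class Deduper:
    """Track payload digests seen so far within one process (hypothetical helper)."""

    def __init__(self):
        self._seen = {}  # payload digest -> ID of the first record with that payload

    def classify(self, record_id, payload):
        """Return ('revisit', original_id) for a repeated payload, else ('response', None)."""
        if len(payload) < MIN_DEDUPE_SIZE:
            # Too small: the revisit headers would cost more than the payload saves.
            return ('response', None)
        digest = hashlib.sha1(payload).hexdigest()
        if digest in self._seen:
            return ('revisit', self._seen[digest])
        self._seen[digest] = record_id
        return ('response', None)
```

Small payloads are never entered into the digest table, so they are always written in full.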
JustAnotherArchivist
461cedbbde
Avoid temporary files created by warcio due to not knowing the record payload length
4 years ago
JustAnotherArchivist
1214409a0b
Flush big responses to a temporary file instead of trying to keep everything in-memory
4 years ago
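One standard-library way to get this behaviour is tempfile.SpooledTemporaryFile, which buffers in memory and switches to a real temporary file once a size limit is exceeded. Whether qwarc uses this class is not stated in the commit; buffer_response and its 1 MiB default are hypothetical.

```python
import tempfile


def buffer_response(chunks, max_in_memory=1024 * 1024):
    """Collect response chunks, spilling to disk once max_in_memory bytes are exceeded."""
    buf = tempfile.SpooledTemporaryFile(max_size=max_in_memory)
    for chunk in chunks:
        buf.write(chunk)
    # Rewind so the caller can read the body from the start.
    buf.seek(0)
    return buf
```

The caller reads from the returned object the same way in both cases and should close it when done.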
JustAnotherArchivist
93df9cd18d
Get rid of the temporary extra log file and read the plain file instead
4 years ago
JustAnotherArchivist
08c3d55376
Add comment on block digest workaround (cf. f14a664b)
4 years ago
JustAnotherArchivist
413435b7fb
Work around warcio not writing the correct WARC-Profile header for revisit records on WARC/1.1
https://github.com/webrecorder/warcio/issues/94
4 years ago
JustAnotherArchivist
8ee9b20718
Remove WARC-Target-URI header from warcinfo record
WARC 1.1 specification, section 5.14: "A ‘warcinfo’ record shall not have a WARC-Target-URI field."
4 years ago
JustAnotherArchivist
f14a664b1c
Work around warcio not writing a block digest for warcinfo records ( https://github.com/webrecorder/warcio/issues/87 )
The length has to be set manually because otherwise warcio will automatically remove the header again.
4 years ago
JustAnotherArchivist
bd14ab3901
Fix crash due to closing the log handler on reaching the max WARC size
4 years ago
JustAnotherArchivist
08117630b0
Remove warcinfo record in each data WARC and refer to the process's warcinfo record in the meta WARC instead
4 years ago
JustAnotherArchivist
26aab15605
urn:X-qwarc instead of urn:qwarc
4 years ago
JustAnotherArchivist
50d46ad51c
Use log filename in the target URI of the log resource record
4 years ago
JustAnotherArchivist
e093211496
Set content type for resource records
4 years ago
JustAnotherArchivist
ae46b53401
Always write a WARC-Warcinfo-ID header
4 years ago
JustAnotherArchivist
23fcdd4026
Write microsecond dates for request and response records
4 years ago
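For illustration, a UTC timestamp with microsecond precision can be formatted with the standard library as shown below. The helper name warc_date is hypothetical, and the exact format string qwarc uses is not given in the commit.

```python
from datetime import datetime, timezone


def warc_date(dt):
    """Format an aware datetime as a UTC timestamp with six fractional digits."""
    return dt.astimezone(timezone.utc).strftime('%Y-%m-%dT%H:%M:%S.%fZ')


example = warc_date(datetime(2020, 1, 2, 3, 4, 5, 678901, tzinfo=timezone.utc))
```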
JustAnotherArchivist
3030ad10ab
Mark private API accordingly
4 years ago
JustAnotherArchivist
e0b4104d21
Remove log handler before writing log record since that requires closing the stream
4 years ago
JustAnotherArchivist
6cfd352f68
Write WARC/1.1 files
4 years ago
JustAnotherArchivist
e1ad5c232e
Write warcinfo and resource records in meta WARC on firing up qwarc rather than at the end
4 years ago
JustAnotherArchivist
a5dfd5c805
Write spec file + its dependencies and command line to meta WARC
4 years ago
JustAnotherArchivist
e99e2304c9
Write meta WARC with log file
4 years ago
JustAnotherArchivist
85d78cee13
Add warcinfo record with version information on Python, system, and dependencies
4 years ago
JustAnotherArchivist
9cff6bd5c1
Only open a WARC file when necessary to avoid producing empty WARCs at the end
4 years ago
JustAnotherArchivist
8647d6b396
Use f-strings instead of str.format
4 years ago
JustAnotherArchivist
be5673cfbf
Add record deduplication within a process
5 years ago
JustAnotherArchivist
e892a6b6a7
Initial commit
5 years ago