JustAnotherArchivist
2e1dc59e9d
Fix log level of one message
3 years ago
JustAnotherArchivist
f025c4e9f3
Add extensive debug logging
3 years ago
JustAnotherArchivist
ce7f8fdc92
Make optional arguments to fetch kwarg-only
3 years ago
JustAnotherArchivist
3c8b45b3a6
Refactor cleanup code
- Run the cleanup code on exceptions (e.g. ^C). There were several effects of that not happening previously; most notably, the log file was not written to the meta WARC.
- Cancel remaining tasks, which avoids a pile of asyncio warnings and errors on crashes.
- Close the DB before the WARC, or rather, close the WARC last. This is mostly a semantic change to further ensure that the log written to the meta WARC is as complete as possible.
3 years ago
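The cleanup order described in this commit can be sketched roughly as follows. This is a minimal illustration, not the project's code: the `run` and `worker` names and the `events` list (a stand-in for the real DB, log, and WARC objects) are assumptions, and a `RuntimeError` stands in for ^C or a crash.

```python
import asyncio


async def run(events):
    # `events` is a hypothetical stand-in that records the cleanup order.
    async def worker(n):
        try:
            await asyncio.sleep(3600)
        except asyncio.CancelledError:
            events.append(f'cancelled {n}')
            raise

    tasks = [asyncio.ensure_future(worker(n)) for n in range(2)]
    try:
        await asyncio.sleep(0)  # Give the workers a chance to start
        raise RuntimeError('simulated crash')  # Stand-in for ^C or an error
    finally:
        # Cancel whatever is still running and wait for it to finish,
        # so that no pending tasks are left behind when the loop closes.
        for task in tasks:
            task.cancel()
        await asyncio.gather(*tasks, return_exceptions=True)
        # Close the DB first and the WARC last, so that the log written
        # to the meta WARC is as complete as possible.
        events.append('close DB')
        events.append('write log to meta WARC')
        events.append('close WARC')
```

Because the cleanup lives in a `finally` block, it runs whether the body finishes normally or raises.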
JustAnotherArchivist
dcd5455388
Fix crash on starting a run while the DB is locked
3 years ago
JustAnotherArchivist
168fa78736
Avoid locking the DB when there are no subitems to insert
3 years ago
JustAnotherArchivist
4484d6c588
Add Item representation
3 years ago
JustAnotherArchivist
5675118877
Rename id to id_ to avoid clash with builtin
3 years ago
JustAnotherArchivist
a1e693739e
Replace DB locking with an async context manager
3 years ago
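One way such a lock could look is sketched below. This is an assumption-laden illustration only: the helper name `exclusive_db_lock`, the retry delay, and the use of SQLite's `BEGIN EXCLUSIVE` are hypothetical choices, not taken from the commit.

```python
import asyncio
import contextlib
import sqlite3


@contextlib.asynccontextmanager
async def exclusive_db_lock(db, retry_delay=0.1):
    # Hypothetical helper: take an exclusive SQLite lock, yielding control
    # to other tasks while the database is locked by someone else.
    # Assumes `db` was opened with isolation_level=None (autocommit mode).
    while True:
        try:
            db.execute('BEGIN EXCLUSIVE')
            break
        except sqlite3.OperationalError as e:
            if 'locked' not in str(e):
                raise
            await asyncio.sleep(retry_delay)
    try:
        yield db.cursor()
    except BaseException:
        db.execute('ROLLBACK')  # Discard changes if the body failed
        raise
    else:
        db.execute('COMMIT')  # Commit releases the lock
```

With this shape, callers write `async with exclusive_db_lock(db) as cursor:` and cannot forget to release the lock.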
JustAnotherArchivist
15203bd991
Handle redirect traps/loops
3 years ago
JustAnotherArchivist
f8f5258197
Track redirect depth
3 years ago
JustAnotherArchivist
a3d6fb35f8
Turn response handlers into kwarg-only functions for easier extendability without breaking existing code
3 years ago
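A small sketch of why keyword-only handlers are easier to extend. The handler name, its parameters (`url`, `attempt`, `status`, `exc`), its return value, and the `redirectLevel` argument are all hypothetical here and not the project's actual signature.

```python
def my_handler(*, url, attempt, status, exc, **kwargs):
    # Hypothetical handler: returns True if the request should be retried.
    # **kwargs absorbs arguments that a later version of the caller may add.
    if exc is not None or status >= 500:
        return attempt < 3
    return False


def call_handler(handler, **context):
    # The caller passes everything by name, so new context values can be
    # added without changing the positions of existing arguments.
    return handler(**context)
```

Passing an extra keyword such as `redirectLevel=0` leaves this handler unaffected, while a positional call like `my_handler(url, 1, 200, None)` raises `TypeError`.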
JustAnotherArchivist
6cc4adb901
Remove stray TODO
The DB creation operates with a DB lock, so that code can't run while another process is filling the DB; it would block on obtaining the lock a few lines prior instead.
3 years ago
JustAnotherArchivist
c5604ef965
Simplify header merging
3 years ago
JustAnotherArchivist
59ae1183d2
Add fromResponse parameter for URL completion and automatic Referer header
3 years ago
JustAnotherArchivist
2324216016
Add baseUrl and evaluate incomplete URLs relative to it
3 years ago
JustAnotherArchivist
b30ccf8bf8
Move response/exception history to ClientResponse.qhistory
It is rarely necessary to access the history, and the tuple return value clutters the spec file code.
As a consequence, it's no longer possible to return None if an error occurred without losing the history.
To replace that, this also introduces a DummyClientResponse, which is kind of ClientResponse-like, has the same qhistory attribute, and evaluates to False when cast to bool (such that the intuitive `if response` works as expected).
3 years ago
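The falsy placeholder idea can be sketched as below. Only the class name and the `qhistory` attribute come from the commit message; the constructor signature and the `fetch_failed` helper are assumptions made for illustration.

```python
class DummyClientResponse:
    """Sketch of a falsy stand-in returned when a retrieval failed."""

    def __init__(self, qhistory=()):
        # Keep the response/exception history available to the caller.
        self.qhistory = tuple(qhistory)

    def __bool__(self):
        # Makes `if response:` behave like the old `None` return value.
        return False


def fetch_failed(history):
    # Hypothetical helper: what a failed fetch might hand back.
    return DummyClientResponse(qhistory=history)
```

A spec file can then write `if not response: return` and still inspect `response.qhistory` when it needs the details.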
JustAnotherArchivist
e69527c715
Add defaultResponseHandler on the Item level
3 years ago
JustAnotherArchivist
03336e4988
Add item to response handler arguments (e.g. for logging)
3 years ago
JustAnotherArchivist
6bdcfe71f0
Refactor database creation and item generation: call `Item.generate()` on every qwarc run and dedupe its output, allowing the addition of further items by modifying the spec file
3 years ago
JustAnotherArchivist
c878241f24
Switch from concurrent.futures.CancelledError to asyncio.CancelledError
Since Python 3.8, the latter does not inherit from the former anymore.
3 years ago
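A short illustration of catching the asyncio variant. The `cancel_and_wait` function is a made-up example, not code from the project.

```python
import asyncio


async def cancel_and_wait():
    task = asyncio.ensure_future(asyncio.sleep(3600))
    await asyncio.sleep(0)  # Let the task start
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        # On Python 3.8+, `except concurrent.futures.CancelledError`
        # would not catch this, and neither would `except Exception`.
        return 'cancelled'
    return 'finished'
```

Since Python 3.8, `asyncio.CancelledError` derives from `BaseException`, so a broad `except Exception` no longer swallows cancellations either.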
JustAnotherArchivist
749158b97a
Use the Future's result directly rather than awaiting again
The asyncio documentation does not specify whether awaiting a Future multiple times is supported or not: https://bugs.python.org/issue41275
3 years ago
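As a sketch of the pattern, assuming tasks are collected with `asyncio.wait` (the `job` and `collect` names are illustrative only):

```python
import asyncio


async def job(n):
    await asyncio.sleep(0)
    return n * 2


async def collect():
    tasks = {asyncio.ensure_future(job(n)) for n in range(3)}
    done, pending = await asyncio.wait(tasks)
    # The futures in `done` have already completed, so read the stored
    # result instead of awaiting them a second time.
    return sorted(future.result() for future in done)
```

`Future.result()` re-raises the exception if the task failed, so error handling works the same way as with `await`.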
JustAnotherArchivist
a85e80ffa2
Configurable request timeout
3 years ago
JustAnotherArchivist
429ac94689
Make it possible to override and remove headers
3 years ago
JustAnotherArchivist
e40be54578
Document verify_ssl parameter
3 years ago
JustAnotherArchivist
d3437bde19
Move default headers to qwarc.const
3 years ago
JustAnotherArchivist
1678075a89
Log traceback on exceptions raised from an item
4 years ago
JustAnotherArchivist
b1a1c03f7e
Handle STOP file and high memory usage before full disk to allow stopping while the disk is above the limit
4 years ago
JustAnotherArchivist
dd44d9b174
Adjust logging levels: log individual request failures only at WARNING and cancelled tasks at ERROR level
4 years ago
JustAnotherArchivist
91035d769c
Catch exceptions in Item.process and mark the items as errors instead of crashing
4 years ago
JustAnotherArchivist
69984765b3
Fix taskType typo silencing cancellation warnings
4 years ago
JustAnotherArchivist
c263ad0b03
Return ClientResponse object from fetch only if the retrieval was successful
If an exception was raised and caught, the object is still present in the history.
4 years ago
JustAnotherArchivist
cb0d11284e
Write only successful retrievals (i.e. ones that don't cause an exception) to WARC
4 years ago
JustAnotherArchivist
1214409a0b
Flush big responses to a temporary file instead of trying to keep everything in-memory
4 years ago
JustAnotherArchivist
08d96b37c5
Support deep/multiple inheritance from Item
4 years ago
JustAnotherArchivist
9d8de13775
Add Item.flush_subitems to flush the new subitems to the database while the item is still being processed
This also renames add_item to add_subitem for clarity.
4 years ago
JustAnotherArchivist
50b936b18c
Refactor QWARC class to keep relevant variables in instance attributes instead of local variables
4 years ago
JustAnotherArchivist
c5d8d93166
Remove stray whitespace
4 years ago
JustAnotherArchivist
7d53577522
Add parameter for disabling SSL/TLS certificate validation
4 years ago
JustAnotherArchivist
50d46ad51c
Use log filename in the target URI of the log resource record
4 years ago
JustAnotherArchivist
a5dfd5c805
Write spec file + its dependencies and command line to meta WARC
4 years ago
JustAnotherArchivist
d751844626
Fix starting another item before stopping on STOP file or memory limit exceedance
4 years ago
JustAnotherArchivist
2b0778f9b5
Remove leftovers from initial code rewrite
4 years ago
JustAnotherArchivist
ab22966fef
Add to log which item a message is coming from
4 years ago
JustAnotherArchivist
6fafd32685
Error when the retries are exceeded
4 years ago
JustAnotherArchivist
8647d6b396
Use f-strings instead of str.format
4 years ago
JustAnotherArchivist
5008e6e8cd
Deduplicate items
4 years ago
JustAnotherArchivist
46c95e2157
Disable decoding the response content
chardet can be very slow (https://github.com/chardet/chardet/issues/29 https://github.com/psf/requests/issues/2359 ) and the decoding may be unnecessary if it's binary content.
4 years ago
JustAnotherArchivist
ad22a2327a
Support adding headers to individual requests
5 years ago
JustAnotherArchivist
67076f964c
Add support for POST requests
5 years ago