JustAnotherArchivist
2e1dc59e9d
Fix log level of one message
3 years ago
JustAnotherArchivist
f025c4e9f3
Add extensive debug logging
3 years ago
JustAnotherArchivist
ce7f8fdc92
Make optional arguments to fetch kwarg-only
3 years ago
JustAnotherArchivist
3c8b45b3a6
Refactor cleanup code
- Run the cleanup code on exceptions (e.g. ^C). There were several effects of that not happening previously; most notably, the log file was not written to the meta WARC.
- Cancel remaining tasks, which avoids a pile of asyncio warnings and errors on crashes.
- Close the DB before the WARC, or rather, close the WARC last. This is mostly a semantic change to further ensure that the log written to the meta WARC is as complete as possible.
3 years ago
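The cleanup ordering described in commit 3c8b45b3a6 might look roughly like the sketch below. This is not the qwarc code: `worker`, `run`, and the log strings are hypothetical stand-ins, and a `RuntimeError` plays the role of a crash or ^C. It only illustrates running cleanup in a `finally` block, cancelling the remaining tasks, and closing the WARC last.

```python
import asyncio


async def worker(n, log):
    # Hypothetical long-running item task.
    try:
        await asyncio.sleep(3600)
    except asyncio.CancelledError:
        log.append(f"worker {n} cancelled")
        raise


async def run(log):
    tasks = [asyncio.ensure_future(worker(i, log)) for i in range(3)]
    try:
        await asyncio.sleep(0)  # give the workers a chance to start
        raise RuntimeError("simulated crash")  # stand-in for an exception or ^C
    finally:
        # Cancel whatever is still running and wait for the cancellations
        # to complete, so no pending-task warnings are emitted on exit.
        for task in tasks:
            task.cancel()
        await asyncio.gather(*tasks, return_exceptions=True)
        # Close the DB first and the WARC last, so that the log written
        # to the meta WARC is as complete as possible.
        log.append("db closed")
        log.append("warc closed")


def main():
    log = []
    try:
        asyncio.run(run(log))
    except RuntimeError:
        log.append("crash propagated")
    return log
```

Calling `main()` returns the recorded sequence of events, which makes the ordering easy to inspect.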
JustAnotherArchivist
dcd5455388
Fix crash on starting a run while the DB is locked
3 years ago
JustAnotherArchivist
168fa78736
Avoid locking the DB when there are no subitems to insert
3 years ago
JustAnotherArchivist
4484d6c588
Add Item representation
3 years ago
JustAnotherArchivist
5675118877
Rename id to id_ to avoid clash with builtin
3 years ago
JustAnotherArchivist
a1e693739e
Replace DB locking with an async context manager
3 years ago
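As an illustration of what DB locking via an async context manager (commit a1e693739e) can look like, here is a minimal sketch. It assumes an SQLite database, which the commit message does not state, and the names `locked`, `open_db`, `insert_item`, and the `items` table are all hypothetical rather than taken from qwarc.

```python
import asyncio
import contextlib
import sqlite3


def open_db():
    # isolation_level=None leaves transaction control to explicit statements.
    db = sqlite3.connect(":memory:", isolation_level=None)
    db.execute("CREATE TABLE items (id_ TEXT)")
    return db


@contextlib.asynccontextmanager
async def locked(db, retry_delay=0.1):
    # Try to take an exclusive lock; yield to the event loop while waiting.
    while True:
        try:
            db.execute("BEGIN EXCLUSIVE")
            break
        except sqlite3.OperationalError as e:
            if "locked" not in str(e):
                raise
            await asyncio.sleep(retry_delay)
    try:
        yield db
    except BaseException:
        db.execute("ROLLBACK")
        raise
    else:
        db.execute("COMMIT")


async def insert_item(db, id_):
    async with locked(db) as conn:
        conn.execute("INSERT INTO items (id_) VALUES (?)", (id_,))
```

With this shape, callers write `async with locked(db):` and cannot forget to release the lock, since commit or rollback happens on leaving the block.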
JustAnotherArchivist
15203bd991
Handle redirect traps/loops
3 years ago
JustAnotherArchivist
f8f5258197
Track redirect depth
3 years ago
JustAnotherArchivist
a3d6fb35f8
Turn response handlers into kwarg-only functions for easier extendability without breaking existing code
3 years ago
JustAnotherArchivist
6cc4adb901
Remove stray TODO
The DB creation operates with a DB lock, so that code can't run while another process is filling the DB; it would block on obtaining the lock a few lines prior instead.
3 years ago
JustAnotherArchivist
c5604ef965
Simplify header merging
3 years ago
JustAnotherArchivist
59ae1183d2
Add fromResponse parameter for URL completion and automatic Referer header
3 years ago
JustAnotherArchivist
2324216016
Add baseUrl and evaluate incomplete URLs relative to it
3 years ago
JustAnotherArchivist
b30ccf8bf8
Move response/exception history to ClientResponse.qhistory
It is rarely necessary to access the history, and the tuple return value clutters the spec file code.
As a consequence, it's no longer possible to return None if an error occurred without losing the history.
To replace that, this also introduces a DummyClientResponse, which is kind of ClientResponse-like, has the same qhistory attribute, and evaluates to False when cast to bool (such that the intuitive `if response` works as expected).
3 years ago
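The falsy placeholder idea from commit b30ccf8bf8 can be sketched as follows. Only the names `DummyClientResponse` and `qhistory` come from the commit message; the constructor signature and the `describe` helper are assumptions made for illustration, not the actual qwarc implementation.

```python
class DummyClientResponse:
    """Placeholder returned when no usable response exists (sketch only)."""

    def __init__(self, qhistory=()):
        # Same attribute name as on the real response object, so callers
        # can inspect the request/exception history either way.
        self.qhistory = tuple(qhistory)

    def __bool__(self):
        # Evaluates to False so that `if response:` reads naturally.
        return False


def describe(response):
    # Hypothetical caller: branch on truthiness, but keep the history.
    if response:
        return "success"
    return f"failed after {len(response.qhistory)} attempt(s)"
```

A caller can then write `if response:` for the common case and still reach `response.qhistory` on failure, which returning `None` would not allow.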
JustAnotherArchivist
e69527c715
Add defaultResponseHandler on the Item level
3 years ago
JustAnotherArchivist
03336e4988
Add item to response handler arguments (e.g. for logging)
3 years ago
JustAnotherArchivist
6bdcfe71f0
Refactor database creation and item generation: call `Item.generate()` on every qwarc run and dedupe its output, allowing the addition of further items by modifying the spec file
3 years ago
JustAnotherArchivist
c878241f24
Switch from concurrent.futures.CancelledError to asyncio.CancelledError
Since Python 3.8, the latter does not inherit from the former anymore.
3 years ago
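To illustrate the point of commit c878241f24: code that cancels an asyncio task should catch `asyncio.CancelledError`, since catching `concurrent.futures.CancelledError` no longer matches it on Python 3.8 and later. The coroutine names below are hypothetical.

```python
import asyncio


async def sleeper():
    await asyncio.sleep(3600)


async def cancel_and_report():
    task = asyncio.ensure_future(sleeper())
    await asyncio.sleep(0)  # let the task start
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        # This is the exception class raised for a cancelled asyncio task.
        return "cancelled"
    return "finished"


def main():
    return asyncio.run(cancel_and_report())
```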
JustAnotherArchivist
749158b97a
Use the Future's result directly rather than awaiting again
The asyncio documentation does not specify whether awaiting a Future multiple times is supported or not: https://bugs.python.org/issue41275
3 years ago
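A small sketch of the pattern named in commit 749158b97a: once `asyncio.wait` reports a task as done, its value is read with `.result()` instead of awaiting the task a second time. `fetch_stub` and `collect` are hypothetical names, not qwarc functions.

```python
import asyncio


async def fetch_stub(n):
    # Stand-in for a real fetch; just yields once and returns a value.
    await asyncio.sleep(0)
    return n * 2


async def collect():
    tasks = [asyncio.ensure_future(fetch_stub(i)) for i in range(3)]
    done, pending = await asyncio.wait(tasks)
    # The tasks in `done` are finished, so .result() returns immediately
    # (or re-raises the task's exception) without another await.
    return sorted(task.result() for task in done)


def main():
    return asyncio.run(collect())
```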
JustAnotherArchivist
a85e80ffa2
Configurable request timeout
3 years ago
JustAnotherArchivist
429ac94689
Make it possible to override and remove headers
3 years ago
JustAnotherArchivist
e40be54578
Document verify_ssl parameter
3 years ago
JustAnotherArchivist
d3437bde19
Move default headers to qwarc.const
3 years ago
JustAnotherArchivist
1678075a89
Log traceback on exceptions raised from an item
4 years ago
JustAnotherArchivist
b1a1c03f7e
Handle STOP file and high memory usage before full disk to allow stopping while the disk is above the limit
4 years ago
JustAnotherArchivist
dd44d9b174
Adjust logging levels: log individual request failures only at WARNING and cancelled tasks at ERROR level
4 years ago
JustAnotherArchivist
91035d769c
Catch exceptions in Item.process and mark the items as errors instead of crashing
4 years ago
JustAnotherArchivist
69984765b3
Fix taskType typo silencing cancellation warnings
4 years ago
JustAnotherArchivist
c263ad0b03
Return ClientResponse object from fetch only if the retrieval was successful
If an exception was raised and caught, the object is still present in the history.
4 years ago
JustAnotherArchivist
cb0d11284e
Write only successful retrievals (i.e. ones that don't cause an exception) to WARC
4 years ago
JustAnotherArchivist
1214409a0b
Flush big responses to a temporary file instead of trying to keep everything in-memory
4 years ago
JustAnotherArchivist
08d96b37c5
Support deep/multiple inheritance from Item
4 years ago
JustAnotherArchivist
9d8de13775
Add Item.flush_subitems to flush the new subitems to the database while the item is still being processed
This also renames add_item to add_subitem for clarity.
4 years ago
JustAnotherArchivist
50b936b18c
Refactor QWARC class to keep relevant variables in instance attributes instead of local variables
4 years ago
JustAnotherArchivist
c5d8d93166
Remove stray whitespace
4 years ago
JustAnotherArchivist
7d53577522
Add parameter for disabling SSL/TLS certificate validation
4 years ago
JustAnotherArchivist
50d46ad51c
Use log filename in the target URI of the log resource record
4 years ago
JustAnotherArchivist
a5dfd5c805
Write spec file + its dependencies and command line to meta WARC
4 years ago
JustAnotherArchivist
d751844626
Fix starting another item before stopping on STOP file or memory limit exceedance
4 years ago
JustAnotherArchivist
2b0778f9b5
Remove leftovers from initial code rewrite
4 years ago
JustAnotherArchivist
ab22966fef
Add to log which item a message is coming from
4 years ago
JustAnotherArchivist
6fafd32685
Error when the retries are exceeded
4 years ago
JustAnotherArchivist
8647d6b396
Use f-strings instead of str.format
4 years ago
JustAnotherArchivist
5008e6e8cd
Deduplicate items
4 years ago
JustAnotherArchivist
46c95e2157
Disable decoding the response content
chardet can be very slow (https://github.com/chardet/chardet/issues/29 https://github.com/psf/requests/issues/2359 ) and the decoding may be unnecessary if it's binary content.
4 years ago
JustAnotherArchivist
ad22a2327a
Support adding headers to individual requests
5 years ago
JustAnotherArchivist
67076f964c
Add support for POST requests
5 years ago