
Stop deduping small responses

For small responses, the additional headers for the revisit outweigh the payload truncation savings. The chosen limit of 100 bytes is completely arbitrary and not backed by any real-world data.
tags/v0.2.2
JustAnotherArchivist 4 years ago
parent
commit
820384fe1e
1 changed file with 1 addition and 1 deletion

qwarc/warc.py (+1, -1)

@@ -132,7 +132,7 @@ class WARC:
 )
 payloadDigest = responseRecord.rec_headers.get_header('WARC-Payload-Digest')
 assert payloadDigest is not None
-if self._dedupe and responseRecord.payload_length > 0: # Don't "deduplicate" empty responses
+if self._dedupe and responseRecord.payload_length > 100: # Don't deduplicate small responses; the additional headers are typically larger than the payload dedupe savings...
     if payloadDigest in self._dedupeMap:
         refersToRecordId, refersToUri, refersToDate = self._dedupeMap[payloadDigest]
         responseHttpHeaders = responseRecord.http_headers
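
The trade-off described in the commit message can be estimated roughly. The following is a minimal sketch, not code from qwarc: the function names and the example values are hypothetical, and it counts only the revisit-specific WARC headers (WARC-Profile and the three WARC-Refers-To fields), ignoring any other differences between a response record and a revisit record.

```python
# Illustrative sketch only; these helpers are hypothetical and not part of qwarc.

# Profile URI defined by WARC 1.1 for revisits based on an identical payload digest
REVISIT_PROFILE = 'http://netpreserve.org/warc/1.1/revisit/identical-payload-digest'


def revisit_header_overhead(refers_to_record_id, refers_to_uri, refers_to_date):
    """Estimate the bytes added by the headers specific to a revisit record."""
    extra_headers = [
        ('WARC-Profile', REVISIT_PROFILE),
        ('WARC-Refers-To', refers_to_record_id),
        ('WARC-Refers-To-Target-URI', refers_to_uri),
        ('WARC-Refers-To-Date', refers_to_date),
    ]
    # Each header is serialized as "Name: value" followed by CRLF.
    return sum(len(f'{name}: {value}\r\n'.encode('utf-8')) for name, value in extra_headers)


def dedupe_saves_space(payload_length, overhead):
    """Deduplication only pays off if the dropped payload is larger than the added headers."""
    return payload_length > overhead


# Example with made-up values: a short URI already costs well over 100 bytes of headers.
overhead = revisit_header_overhead(
    '<urn:uuid:12345678-1234-1234-1234-123456789abc>',
    'https://example.org/',
    '2020-01-01T00:00:00Z',
)
print(overhead, dedupe_saves_space(50, overhead))
```

Because the overhead grows with the length of the referenced URI, the real break-even point varies from record to record; a fixed threshold such as the 100 bytes chosen in this commit is a simplification, as the commit message itself notes.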

