Megawarc

megawarc is useful if you have a .tar full of .warc.gz files and you really want one big .warc.gz. With megawarc you get your .warc.gz, but you can still restore the original .tar.

The megawarc tool looks for .warc.gz files in the .tar file and creates three files, which together make up the megawarc:

  • FILE.warc.gz is the concatenated .warc.gz
  • FILE.tar contains any non-warc files from the .tar
  • FILE.json.gz contains metadata
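Concatenation works here because the gzip format allows several compressed members back to back in one file. The following is a minimal Python sketch of that idea, not megawarc's own code: the `append_member` helper is hypothetical and the payloads are made-up stand-ins for real .warc.gz files.

```python
import gzip
import io

def append_member(out, payload_gz):
    """Append one already-gzipped blob to out; return (offset, size)."""
    offset = out.tell()
    out.write(payload_gz)
    return offset, len(payload_gz)

# Two stand-in "warc.gz" files (illustrative data only).
first = gzip.compress(b"WARC record one\n")
second = gzip.compress(b"WARC record two\n")

big = io.BytesIO()
locations = [append_member(big, first), append_member(big, second)]

# The concatenation is itself a valid multi-member gzip stream.
combined = gzip.decompress(big.getvalue())
```

Recording an offset and a size for each appended member is what makes it possible to find the individual files again later.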

You need the JSON file to reconstruct the original .tar from the .warc.gz and .tar files. The JSON file has the location of every file from the original .tar file.
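As a rough illustration, the metadata can be inspected with a few lines of Python. This sketch assumes the layout described under "Metadata format" (one JSON object per line, each with a "target" that names its "container"). The function names `read_metadata` and `count_by_container` are hypothetical and not part of megawarc.

```python
import gzip
import json

def read_metadata(json_gz_path):
    """Yield one dict per non-empty line of a FILE.json.gz metadata file."""
    with gzip.open(json_gz_path, "rt", encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)

def count_by_container(json_gz_path):
    """Count how many entries ended up in the warc and how many in the tar."""
    counts = {}
    for entry in read_metadata(json_gz_path):
        container = entry["target"]["container"]
        counts[container] = counts.get(container, 0) + 1
    return counts
```

For example, `count_by_container("FILE.json.gz")` would show how many of the original entries were recognized as .warc.gz files and how many were left in the tar.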

Metadata format

One line with a JSON object per file in the .tar.

{
   "target": {
     "container": "warc" or "tar", // where is this file?
     "offset": number,             // where in the tar/warc does this file start?
                                   // for files in the tar this includes the tar header, which is
                                   // copied to the tar.
     "size": number                // how long is this file, in bytes? (it ends at offset + size)
                                   // for files in the tar, this includes the padding to 512 bytes
   },
   "src_offsets": {
     "entry": number,              // where is this file in the original tar?
     "data": number,               // where does the data start? entry+512
     "next_entry": number          // where does the next tar entry start
   },
   "header_fields": {
     ...                           // parsed fields from the tar header
   },
   "header_string": string         // the tar header for this entry
}
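To show how these fields might be used, here is a small Python sketch that pulls the bytes for one entry out of the megawarc and checks the documented relation between "entry" and "data". It assumes "size" is a byte count starting at "offset". The helper names `read_target` and `data_starts_after_header` are hypothetical and not part of megawarc.

```python
TAR_BLOCK = 512  # tar headers occupy one 512-byte block

def read_target(entry, warc_path, tar_path):
    """Return the raw bytes that a metadata entry points to."""
    target = entry["target"]
    path = warc_path if target["container"] == "warc" else tar_path
    with open(path, "rb") as f:
        f.seek(target["offset"])
        return f.read(target["size"])

def data_starts_after_header(entry):
    """Check the documented relation data == entry + 512."""
    src = entry["src_offsets"]
    return src["data"] == src["entry"] + TAR_BLOCK
```

If the original .warc.gz files are copied into FILE.warc.gz byte for byte (an assumption here), the bytes returned for a "warc" entry should form a complete gzip stream, so `gzip.decompress(read_target(entry, "FILE.warc.gz", "FILE.tar"))` would give the uncompressed WARC records of that one file.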

Usage

megawarc convert FILE

Converts the tar file (containing .warc.gz files) to a megawarc. It creates FILE.warc.gz, FILE.tar and FILE.json.gz from FILE.

megawarc pack FILE INFILE_1 [[INFILE_2] ...]

Creates a megawarc with basename FILE and recursively adds the given files and directories to it, as if they were in a tar file. It creates FILE.warc.gz, FILE.tar and FILE.json.gz.

megawarc restore FILE

Converts the megawarc back to the original tar. It reads FILE.warc.gz, FILE.tar and FILE.json.gz to make FILE.