Warc Extractor

The most popular thing I have ever built is my warc-extractor. I built it while working on a university project that was effectively pulling various datasets from the internet (Twitter, the Internet Archive, conventional scraping, etc.) and experimenting with various ways to visualize this data. As it was a university project, I was expected to upload everything I had built to a public repository, in this case GitHub.

At the time, there were almost no tools available for dealing with warc files. The only one available was the official warc [1] project, which to this day remains completely abandoned. My main goal was to create a script that could extract all the text from a warc file; however, that evolved into a general utility that acted as an “unzip” script for warc files. Once the project finished, I uploaded everything to GitHub [2] as an archive and expected nothing more to happen.

Interest in the project grew organically; I have done nothing to promote it. Even so, it has maintained a small but remarkably steady volume of traffic for years now: about half a dozen unique visitors or downloads every week for at least seven years. From my perspective, this is a truly enormous number of people. I am genuinely glad that there is a small community out there that finds my tool useful.

I am committed to fixing any bugs that get reported (when I find time), and to keeping the tool as accessible and up to date as possible. However, I don’t intend to add any more features. There are a lot more warc tools floating around these days, and I would recommend that anyone who needs more functionality try them out.

Just this last month I uploaded warc-extractor to PyPI [3], separate from the rest of ArchiveTools, which remains an archive of the original project. It can now be installed using pip:

python3 -m pip install warc-extractor

Once installed, the script can be run much as it was before. To dump all warc files in the current directory, just type:

warc-extractor -dump content

Additional help can be found via the built-in --help flag as well as at the repository.
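
For anyone curious what "unzipping" a warc file involves, here is a minimal sketch in plain Python. It is not the code warc-extractor uses; the function name iter_warc_records and the sample data are made up for illustration. It assumes each record is a version line, a block of "Name: value" headers, an empty line, and then exactly Content-Length bytes of content, and it ignores folded (multi-line) headers.

```python
import io


def iter_warc_records(stream):
    """Yield (headers, body) pairs from a binary stream of WARC records.

    Hypothetical helper for illustration only; not part of warc-extractor.
    """
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.strip():
            continue  # skip the blank lines that separate records
        version = line.strip().decode("utf-8", "replace")
        if not version.startswith("WARC/"):
            raise ValueError(f"expected a WARC version line, got {version!r}")

        # Read "Name: value" header lines until the empty line that ends them.
        headers = {}
        while True:
            line = stream.readline()
            if not line.strip():
                break
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip().lower()] = value.strip()

        # The content block is exactly Content-Length bytes long.
        body = stream.read(int(headers["content-length"]))
        yield headers, body


if __name__ == "__main__":
    # Tiny in-memory sample with two records (made-up data).
    sample = (
        b"WARC/1.0\r\nWARC-Type: resource\r\nContent-Length: 5\r\n\r\nhello\r\n\r\n"
        b"WARC/1.0\r\nWARC-Type: metadata\r\nContent-Length: 3\r\n\r\nabc\r\n\r\n"
    )
    for headers, body in iter_warc_records(io.BytesIO(sample)):
        print(headers["warc-type"], len(body))
```

To try this on a real file, the stream could come from open(path, "rb"), or from gzip.open(path, "rb") for a compressed .warc.gz file.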

I appreciate all the interest and hope that you will continue to find this simple script useful in the future.

  1. https://github.com/internetarchive/warc
  2. https://github.com/recrm/ArchiveTools
  3. https://pypi.org/project/warc-extractor/

Published by

ryan

This is the personal blog of Ryan Chartier. I post all of my long form content here.
