BitTorrent data mining with Python and C

Author: Niall O'Higgins <niallo@p2presearch.com>
Date: July 10th 2008

Protocol overview

What BitTorrent Does

What BitTorrent Doesn't Do

Filling In the Gaps

Because BitTorrent itself lacks mechanisms for solving some essential problems, third-party technologies and sites have arisen to fill the gaps.

The BitTorrent Network?

You could say that the "BitTorrent Network" consists more of HTTP sites and RSS feeds than of BitTorrent protocol traffic itself!

For this reason, the vast majority of useful analysis can be conducted using nothing more than HTTP requests and RSS parsers.

No BitTorrent needed at all!
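
As a rough illustration of the point above, here is a minimal sketch of pulling torrent listings out of an RSS 2.0 feed using only the Python standard library. The feed content, the example.com URLs and the parse_feed() helper are made up for this example; they are not taken from any real aggregator or from our crawler. In practice the XML text would be fetched over HTTP first.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed, embedded so the example runs without network access.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example torrent feed</title>
    <item>
      <title>example-file-1</title>
      <link>http://tracker.example.com/torrents/1.torrent</link>
      <pubDate>Thu, 10 Jul 2008 12:00:00 GMT</pubDate>
    </item>
    <item>
      <title>example-file-2</title>
      <link>http://tracker.example.com/torrents/2.torrent</link>
      <pubDate>Thu, 10 Jul 2008 12:05:00 GMT</pubDate>
    </item>
  </channel>
</rss>
"""


def parse_feed(xml_text):
    """Return one dict per <item> with its title, link and publication date."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title"),
            "link": item.findtext("link"),
            "published": item.findtext("pubDate"),
        })
    return items


for entry in parse_feed(SAMPLE_FEED):
    print(entry["published"], entry["title"], entry["link"])
```

Real feeds vary a lot (Atom, namespaced extensions, broken XML), so a production crawler would likely want a more forgiving parser than this.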

Why Python

Python is of course an excellent choice for both HTTP and RSS operations.

Python for BitTorrent crawling

Crawling BitTorrent aggregators consists mostly of making lots of HTTP requests and parsing lots of RSS feeds.

The threading module provides concurrency. The GIL doesn't matter much for I/O-bound work.
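
A minimal sketch of that approach, assuming Python 3 and a simple worker-pool design: a fixed number of threads pull URLs from a shared queue and fetch them. The crawl() and fetch_url() names, the worker count and the timeout are assumptions for illustration, not our actual crawler. Because each thread spends most of its time waiting on the network, the GIL is released and the fetches overlap.

```python
import queue
import threading
import urllib.request


def fetch_url(url, timeout=10):
    """Fetch one URL and return the response body as bytes."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()


def crawl(urls, fetch=fetch_url, num_workers=8):
    """Fetch all urls using a pool of threads; return {url: body or exception}."""
    tasks = queue.Queue()
    for url in urls:
        tasks.put(url)

    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            try:
                url = tasks.get_nowait()
            except queue.Empty:
                return  # no work left, let the thread exit
            try:
                body = fetch(url)
            except Exception as exc:
                body = exc  # record the failure instead of killing the thread
            with lock:
                results[url] = body

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Passing the fetch function as a parameter makes it easy to swap in a stub for testing, or a fetcher that adds rate limiting per site.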

Python for BitTorrent crawling II

So where does C come in?

While Python rocks, some things are worth writing in C. In our case, that is metadata (.torrent file) parsing.

We have around 300,000 metadata files. These can be over a megabyte in size. A parser written in C can be quite fast.
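
To show what the parsing involves: .torrent files are bencoded, a small format with integers (i42e), byte strings (4:spam), lists (l...e) and dictionaries (d...e). The following is a minimal Python sketch of a recursive decoder, shown only to illustrate the format. It is not our C parser, and the bdecode() name and the sample data are made up for this example.

```python
def bdecode(data, pos=0):
    """Decode one bencoded value from bytes; return (value, next_position)."""
    lead = data[pos:pos + 1]
    if lead == b"i":
        # Integer: i<digits>e
        end = data.index(b"e", pos)
        return int(data[pos + 1:end]), end + 1
    if lead == b"l":
        # List: l<value>...e
        pos += 1
        items = []
        while data[pos:pos + 1] != b"e":
            item, pos = bdecode(data, pos)
            items.append(item)
        return items, pos + 1
    if lead == b"d":
        # Dictionary: d<key><value>...e
        pos += 1
        result = {}
        while data[pos:pos + 1] != b"e":
            key, pos = bdecode(data, pos)
            value, pos = bdecode(data, pos)
            result[key] = value
        return result, pos + 1
    if lead.isdigit():
        # Byte string: <length>:<bytes>
        colon = data.index(b":", pos)
        length = int(data[pos:colon])
        start = colon + 1
        return data[start:start + length], start + length
    raise ValueError("invalid bencode at offset %d" % pos)


# Hypothetical, heavily trimmed metadata for illustration only.
SAMPLE = (b"d8:announce30:http://tracker.example.com/ann"
          b"4:infod6:lengthi1024e4:name8:demo.txtee")

meta, _ = bdecode(SAMPLE)
print(meta[b"info"][b"name"], meta[b"info"][b"length"])
```

A decoder like this does a lot of slicing and recursion per value, which is where a C implementation working directly on the buffer can save time on large files.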

Other advantages of C

Also, C gives you a bit more control over some things than the Python stdlib does. For example, Python's mmap module currently lacks an 'offset' parameter, which is very useful for P2P apps.

But of course this could be fixed.
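
For comparison, later Python releases do accept an offset argument to mmap.mmap(). The sketch below assumes Python 3 and shows why it is handy for P2P work: it maps only the window of a large file that contains a requested byte range, such as a single piece. The read_range() helper is hypothetical. Note that the offset must be a multiple of mmap.ALLOCATIONGRANULARITY, so the start position is rounded down and the difference is skipped inside the mapping.

```python
import mmap
import os


def read_range(path, start, length):
    """Return `length` bytes at byte offset `start`, mapping only that window."""
    gran = mmap.ALLOCATIONGRANULARITY
    aligned = (start // gran) * gran  # offset must be granularity-aligned
    delta = start - aligned           # bytes to skip inside the mapping
    with open(path, "rb") as f:
        size = os.fstat(f.fileno()).st_size
        if start < 0 or length < 0 or start + length > size:
            raise ValueError("requested range lies outside the file")
        if length == 0:
            return b""
        with mmap.mmap(f.fileno(), delta + length,
                       access=mmap.ACCESS_READ, offset=aligned) as m:
            return m[delta:delta + length]
```

For example, read_range(path, piece_index * piece_length, piece_length) would return one piece, where piece_index and piece_length are values taken from the torrent metadata.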

Analysis

Analysis of over 320,000 torrents.

Is anything here surprising?

Analysis II

The blue line shows the number of individual torrents added to the BitTorrent network over the past 7 days.

Analysis III

An additive graph of content added over the past 7 days.

...

More info at our blog, http://blog.p2presearch.com