The overall goal of the data mining process is to extract information from a data set and transform it into an understandable structure for further use. Raw data mining can be seen as ripping all the information out of a database. This results in a huge amount of traffic at the server level and causes latency on the web service it provides. For that reason, some sysadmins configure their servers to block any IP that generates a huge (abnormal) load within a certain timeframe. To bypass this protection, we need a setup that spreads our requests over different IPs. Tor comes in handy here, at least if the server farm is not blocking the Tor exit nodes. :-) Keep in mind that Tor reuses a circuit for about ten minutes by default, so the exit IP changes periodically rather than on every single request.
With scripted data mining, we want to use wget to rip the data and tunnel it through the Tor anonymity network to avoid being IP-blocked at the server farm. One way to do this is to combine wget, Tor, and Privoxy.
Explanation: Tor is a SOCKS proxy that sends your data over a network in a pretty anonymous fashion. The problem is that Tor does not offer an HTTP proxy, and wget can only talk to HTTP proxies. To get around this, you can install Privoxy, which lets you connect to Tor through a simple HTTP proxy.
So, let's get started.
Step 1 - Install the stuff
You can install everything you need with the following command:
sudo apt-get install -y tor tor-geoipdb privoxy
Step 2 - Configuration
There are a few things that need to be configured.
1. /etc/wgetrc
Find the line starting with:
#http_proxy =
Replace the whole line with:
http_proxy = http://localhost:8118
2. /etc/privoxy/config
Add the following to the top of the file:
listen-address localhost:8118
forward-socks5 / 127.0.0.1:9050 .
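A typo in either line is easy to miss, so a quick check of the file before starting the services can save some head scratching. This is a minimal sketch, assuming the Privoxy configuration lives in /etc/privoxy/config (the usual location on Debian and Ubuntu); check_privoxy_config is a hypothetical helper name, and the patterns only look for the two directives used in this article.

```shell
# check_privoxy_config (hypothetical helper): verify that the two directives
# from this article are present in the Privoxy configuration file.
check_privoxy_config() {
    conf="${1:-/etc/privoxy/config}"   # assumed default path

    # Privoxy must listen on port 8118 on the loopback interface.
    grep -Eq '^[[:space:]]*listen-address[[:space:]]+(localhost|127\.0\.0\.1):8118' "$conf" \
        || { echo "missing listen-address"; return 1; }

    # All requests (/) must be forwarded to Tor's SOCKS port 9050.
    grep -Eq '^[[:space:]]*forward-socks5t?[[:space:]]+/[[:space:]]+127\.0\.0\.1:9050[[:space:]]+\.' "$conf" \
        || { echo "missing forward-socks5"; return 1; }

    echo "ok"
}

# Example usage (not executed here):
#   check_privoxy_config /etc/privoxy/config
```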
Step 3 - Start everything up
sudo service tor restart; sudo service privoxy restart
Now when you use the wget command, your data will be tunneled through the Tor network. When you run wget, you will see lines like the following:
Resolving localhost... 127.0.0.1
Connecting to localhost|127.0.0.1|:8118... connected.
The :8118 shows that your connection is going to Privoxy, which in turn forwards it to Tor.
Note: Your download speeds will be significantly reduced because your data is tunneled through the Tor network. The configuration of Tor itself is outside the scope of this article.
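The wget output only proves that the request reached Privoxy, not that it actually left through Tor. One way to check the last hop is to ask the Tor Project's check service, which (as far as I know) answers at https://check.torproject.org/api/ip with a small JSON body containing an "IsTor" field. The sketch below assumes that endpoint and response shape; is_tor_response and check_tor_exit are hypothetical helper names.

```shell
# Address of the Privoxy listener configured in this article.
PROXY="http://localhost:8118"

# is_tor_response (hypothetical helper): succeed if the JSON body passed as
# the first argument reports "IsTor": true.
is_tor_response() {
    printf '%s' "$1" | grep -Eq '"IsTor"[[:space:]]*:[[:space:]]*true'
}

# check_tor_exit (hypothetical helper): fetch the check page through Privoxy
# and report whether the request left through a Tor exit node.
check_tor_exit() {
    body=$(wget -qO- -e use_proxy=yes -e "https_proxy=$PROXY" \
        https://check.torproject.org/api/ip) || return 2
    if is_tor_response "$body"; then
        echo "via Tor: $body"
    else
        echo "NOT via Tor: $body"
        return 1
    fi
}

# Example usage (needs network access, not executed here):
#   check_tor_exit
```

Running check_tor_exit again after a while should show a different IP once Tor has switched circuits.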