Dark Web OSINT With Python and OnionScan: Part One



You may have heard of this awesome tool called OnionScan that is used to scan hidden services in the dark web looking for potential data leaks. Recently the project released some cool visualizations and a high level description of what their scanning results looked like. What they didn’t provide is how to actually go about scanning as much of the dark web as possible, and then how to produce those very cool visualizations that they show.
At a high level we need to do the following:
  1. Set up a server somewhere to host our scanner 24/7 because it takes some time to do the scanning work.
  2. Get TOR running on the server.
  3. Get OnionScan set up.
  4. Write some Python to handle the scanning and some of the other data management to deal with the scan results.
  5. Write some more Python to make some cool graphs. (Part Two of the series)
Let’s get started!

Setting up a Digital Ocean Droplet

If you already use Amazon, or have your own Linux server somewhere, you can skip this step. For the rest of you, you can use my referral link here to get a $10 credit with Digital Ocean that will get you a couple of months free (full disclosure: I make money in my Digital Ocean account if you start paying for your server, so feel free to bypass that referral link and pay for your own server). I am assuming you are running Ubuntu 16.04 for the rest of the instructions.

  1. The first thing you need to do is to create a new Droplet by clicking on the big Create Droplet button.
  2. Next select an Ubuntu 16.04 configuration, and select the $5.00/month option (unless you want something more powerful).
  3. You can pick a datacenter wherever you like, and then scroll to the bottom and click Create.
It will begin creating your droplet, and soon you should receive an email with how to access your new Linux server. If you are on Mac OSX or Linux get your terminal open. If you are on Windows then grab Putty from here.
  • On Mac OSX it is: Finder -> Applications -> Utilities -> Terminal
  • On Linux: Click your start menu and search for Terminal
Now you are going to SSH into your new server. Windows Putty users just punch in the IP address that you received in your email and hit Enter. You will be authenticating as the root user, using the password you were provided in your email.
For Mac OSX and Linux people you will type the following into your terminal:
You will be forced to enter your password a second time, and then you have to change your password. Once that is done you should be logged into your server.

Installing Prerequisites

Now we need to install the prerequisites for our upcoming code and for OnionScan. Follow each of these steps carefully. The instructions are the same for Mac OSX, Linux or Windows because the commands are all being run on the server.
Feel free to copy and paste each command instead of typing it out. Hit Enter on your keyboard after each step and watch for any problems or errors.
Now we need to install the Go requirements (OnionScan is written in Go). The following instructions are from Ryan Frankel’s post here.
Ok beauty, we have Go installed. Now let’s get OnionScan set up by entering the following:
Now if you just type:
And hit Enter you should get the onionscan command line usage information. If this all worked then you have successfully installed OnionScan. If you for some reason close your terminal and you can’t run the onionscan binary anymore just simply do a:
and it will fix it for you.
Now we need to make a small modification to the TOR configuration to allow our Python script to request a new identity (a new IP address) which we will use when we run into scanning trouble later on. We have to enable this by doing the following:
This will give you output that will include the bottom line that looks like this:
16:3E73307B3E434914604C25C498FBE5F9B3A3AE2FB97DAF70616591AAF8
Copy this line and then type:
This will open a simple text editor. Now go to the bottom of the file by hitting the following keystrokes (or endlessly scrolling down):
CTRL+W CTRL+V
Paste in the following values at the bottom of the file:
Now hit CTRL+O to write the file and CTRL+X to exit the file editor. Now type:
This will restart TOR and it should have our new settings in place. Note that if you want to use a password other than PythonRocks you will have to follow the steps above substituting your own password in place, and you will also have to later change the associated Python code.
We are almost ready to start writing some code. The last step is to grab my list of .onion addresses (at last count around 7182 addresses) so that your script has a starting point to start scanning hidden services.
Whew! We are all set up and ready to start punching out some code. At this point you can switch to your local machine, or if you are comfortable writing code on a Linux server, by all means go for it. Personally, I find it easier to use WingIDE on my local machine.

A Note About Screen

You will notice that in both sets of instructions I have you run the screen command. This is a handy way to keep your session alive even if you get disconnected from your server. When you want to jump back into that session, you simply SSH back into the server and execute:
This will be handy later on when you start doing your scanning work, as it can take days for it to complete fully.

Writing an OnionScan Wrapper

OnionScan is a great tool but we need to be able to systematically control it, and process the results. As well, TOR connections are notoriously unstable so we need a way to kill a stuck scan process and grab a fresh IP address from the TOR network. Let’s get coding! Crack open a new Python file, name it onionrunner.py and start punching out the following (you can download the full code here).
  • Lines 1-12: we import all of the required modules that we are going to be using in this script.
  • Lines 14-15: we initialize two empty lists to hold our full onion list and the list of onions we are working through during the current scanning session.
  • Lines 17-18: we utilize an Event object that will help us to coordinate two threads that will be executing. We have to set the Event object first so that by default our main thread will execute later. More on these threads later.
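The line numbers in the bullets above refer to the full listing from the download link. As a rough sketch of the state they describe (not the original listing), the setup might look something like this. The list names onions and session_onions are assumptions on my part; identity_lock is the name used later in this post.

```python
import threading

# Master list of every known hidden service, plus the working list
# for the current scanning session (both names are assumptions).
onions = []
session_onions = []

# Event used to coordinate the main thread with the timeout handler.
# It starts out set, so a wait() in the main thread returns right away.
identity_lock = threading.Event()
identity_lock.set()
```

An Event that is set lets every wait() call through immediately; once it is cleared, wait() blocks until somebody sets it again.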
Now we have to build some helper functions that will deal with loading our master list of onions and to be able to continue adding newly discovered onions to this list:
  • Line 23: we define our get_onion_list function that is going to load our master list.
  • Lines 26-33: we check to see if the onion_master_list.txt file is present (26) and if it is we crack it open (28) and then read the contents back and split it so that each line gets appended to a list called stored_onions (30). If the file isn’t present then we output an error message (32) and exit the script (33).
  • Lines 35-37: we simply output the total number of onions loaded (35) and return the list back from the function (37).
  • Line 41: we define our store_onion function that takes a single parameter onion which is the hidden service we wish to add to the master list.
  • Lines 45-46: we crack open the master list file (45) and then write out the hidden service address (46).
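Here is a minimal sketch of what those two helpers could look like, written for Python 3. It is an approximation of the behaviour described in the bullets, not the original listing: the optional path parameter, the wording of the error message and the non-zero exit code are my own assumptions.

```python
import os
import sys

MASTER_LIST = "onion_master_list.txt"


def get_onion_list(path=MASTER_LIST):
    """Load the master list of hidden services, one address per line."""
    if not os.path.exists(path):
        # Message wording is illustrative only.
        print("[!] No onion master list found at %s" % path)
        sys.exit(1)

    with open(path) as fd:
        # Strip whitespace and skip blank lines.
        stored_onions = [line.strip() for line in fd if line.strip()]

    print("[*] Total onions for scanning: %d" % len(stored_onions))
    return stored_onions


def store_onion(onion, path=MASTER_LIST):
    """Append a newly discovered hidden service to the master list."""
    with open(path, "a") as fd:
        fd.write("%s\n" % onion)
```

Opening the file in append mode means a newly discovered onion survives a crash or restart, since it is on disk as soon as it is found.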
Now we will implement the function that deals with running the onionscan binary to do the actual scanning work. Keep adding code in your editor:
  • Line 53: we define the run_onionscan function to take one parameter onion that is the address of our hidden service.
  • Line 58: here we are using the subprocess.Popen class to start onionscan, passing in the command line arguments --jsonReport and --simpleReport=false, which will give us JSON output on STDOUT and disable the normal output from OnionScan. The final two parameters are telling Popen that we want to communicate with stdout and stderr, meaning we want to be able to retrieve the output of both.
  • Lines 61-62: here is where we have a bit of magic. We create a new Timer object that is provided from the threading module. A Timer will run for a specified time, and then execute a function when that time has been reached unless you cancel the Timer. In this case we are setting it to 300 seconds (5 minutes) and then telling it to call the handle_timeout function when 300 seconds have been hit. We also pass in the process object and the current onion we are processing. This will allow us to handle when onionscan executes for 5 minutes which could indicate that our Tor connection has gone down or that the hidden service can’t be reached any longer, so we want to be able to kill the onionscan, request a new IP from the Tor network, and continue working through our list of hidden services. We start the timer on line 62.
  • Line 65: here we are waiting for OnionScan to return the JSON results from the scan and we store it in the stdout variable.
  • Lines 68-70: if we reach this line then we know that OnionScan was finished before the 300 seconds are up, so we check if the Timer is still running (68) and then cancel the Timer (69) and return the JSON output (70).
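The Timer trick is worth seeing in isolation. Below is a hedged sketch of the pattern for Python 3, not the original listing: run_with_timeout and its on_timeout callback are hypothetical names, and instead of checking whether the Timer is still alive it records the timeout in a flag, which is my own variation. The run_onionscan wrapper at the bottom shows how it might be wired up with the two flags mentioned above.

```python
import subprocess
import threading


def run_with_timeout(command, timeout, on_timeout):
    """Run command; call on_timeout(process) if it runs longer than timeout seconds.

    Returns the captured stdout, or None if the timeout fired.
    """
    process = subprocess.Popen(command,
                               stdout=subprocess.PIPE,
                               stderr=subprocess.PIPE)
    timed_out = threading.Event()

    def _expired():
        # Runs in the Timer's own thread once the timeout is reached.
        timed_out.set()
        on_timeout(process)

    timer = threading.Timer(timeout, _expired)
    timer.start()
    try:
        # Blocks until the child exits, either normally or because it was killed.
        stdout, _ = process.communicate()
    finally:
        # Cancelling is harmless if the Timer has already fired.
        timer.cancel()

    return None if timed_out.is_set() else stdout


def run_onionscan(onion, timeout=300, on_timeout=lambda process: process.kill()):
    """Scan one hidden service, giving up after timeout seconds."""
    command = ["onionscan", "--jsonReport", "--simpleReport=false", onion]
    return run_with_timeout(command, timeout, on_timeout)
```

The same wrapper works for any command line tool that can hang, which is why the timeout action is passed in as a callback rather than hard-coded.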
So there you have a neat trick to deal with some timing issues when running command line binaries. Now let’s implement the actual timeout handling function to deal with killing OnionScan and requesting a new IP from the Tor network. Keep on adding code:
  • Line 79: we define the handle_timeout function that takes the process parameter (our Popen object) and the onion parameter which is the current hidden service we are scanning.
  • Line 85: here we are clearing the identity_lock which will halt our main thread (you’ll see in a bit). This will allow us to do the process killing, and grab a new identity without the main thread trying to process a new hidden service. We want to be able to cleanly deal with the onionscan process that has timed out before continuing on to a new hidden service.
  • Lines 88-92: here we are using the kill() function that our process object has to kill off the onionscan process that took too long to execute.
  • Line 95: we now connect to our local Tor controller port and store the connection object in the torcontrol variable.
  • Line 98: we authenticate to the Tor controller using our PythonRocks password that you set at the beginning of this blog post. Remember if you decided to use a different password, make sure you put it in here.
  • Line 101: we send the signal to the local Tor controller that we would like a new identity (IP address).
  • Line 104: we pause execution until the new IP address has been acquired.
  • Lines 109-110: here we are re-adding the current hidden service back into our session list. This is because we didn’t get a full scan done on the hidden service so we want to make sure we re-scan it at some point in the future. We then shuffle the list (110) so that we don’t end up just grabbing this same hidden service again. If this hidden service is not working properly or is down, you would end up in an infinite loop of timeouts, kills, re-add to list, rescan. This is why we shuffle!
  • Line 113: we set the identity_lock object again so that the main thread is now notified to continue executing, which will load a fresh hidden service for scanning.
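Putting those steps together, a sketch of the timeout handler might look like the following. I am assuming the stem library for talking to the Tor controller and the default control port of 9051; neither is confirmed by the text above. The request_new_identity helper and the new_identity parameter are hypothetical names I introduced so that the Tor part can be swapped out when testing.

```python
import random
import threading
import time

identity_lock = threading.Event()
identity_lock.set()
session_onions = []  # assumed name for the current session's work list


def request_new_identity(password="PythonRocks", port=9051):
    """Ask the local Tor controller for a new identity (assumes stem is installed)."""
    from stem import Signal
    from stem.control import Controller

    with Controller.from_port(port=port) as torcontrol:
        torcontrol.authenticate(password)
        torcontrol.signal(Signal.NEWNYM)
        # Tor rate-limits NEWNYM, so wait until it will accept another one.
        time.sleep(torcontrol.get_newnym_wait())


def handle_timeout(process, onion, new_identity=request_new_identity):
    """Kill a stuck scan, rotate the Tor identity and requeue the onion."""
    identity_lock.clear()  # halt the main thread while we clean up
    try:
        try:
            process.kill()
        except OSError:
            pass  # the process had already exited

        try:
            new_identity()
        except Exception as error:
            print("[!] Could not get a new Tor identity: %s" % error)

        # Requeue the onion, then shuffle so we do not retry it immediately.
        session_onions.append(onion)
        random.shuffle(session_onions)
    finally:
        identity_lock.set()  # let the main thread carry on
```

Setting the Event in a finally block means the main thread is released even if something unexpected goes wrong inside the handler.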
Now we need to implement the function that will handle processing the JSON results that OnionScan hands back to us. March on good Python soldier:
  • Line 121: we define our process_results function to take in the onion parameter and the json_response respectively.
  • Lines 126-127: if the onionscan_results directory doesn’t exist (126) we create it (127) because that’s how we roll.
  • Lines 130-131: here we are writing out the JSON results to a file that is named by the hidden service that we just scanned. Pretty straightforward.
  • Lines 134-135: we do a bit of string conversion to get the JSON string into a format we can use (134) and then we decode the JSON (135) to turn it into a native Python dictionary.
  • Lines 137-144: there are three fields that we are interested in that could contain additional .onion domains that we may want to add to our list of scan targets. The linkedSites, relatedOnionDomains and relatedOnionServices keys all will return lists. If they are set appropriately we hand the list off to our add_new_onions function.
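As a sketch of the result handling just described (Python 3, and not the original listing): the version below saves the raw JSON and then collects the three keys. To keep the example self-contained it returns the discovered addresses instead of handing them to add_new_onions, and the <onion>.json file name and results_dir parameter are assumptions.

```python
import json
import os

RESULTS_DIR = "onionscan_results"
ONION_KEYS = ("linkedSites", "relatedOnionDomains", "relatedOnionServices")


def process_results(onion, json_response, results_dir=RESULTS_DIR):
    """Save the raw report to disk and return any related addresses it lists."""
    if not os.path.exists(results_dir):
        os.mkdir(results_dir)

    # Popen hands back bytes, so convert to text before saving and parsing.
    if isinstance(json_response, bytes):
        json_response = json_response.decode("utf-8", errors="ignore")

    with open(os.path.join(results_dir, "%s.json" % onion), "w") as fd:
        fd.write(json_response)

    scan_result = json.loads(json_response)

    discovered = []
    for key in ONION_KEYS:
        # A key may be missing or null when nothing was found.
        discovered.extend(scan_result.get(key) or [])
    return discovered
```

The caller can then pass the returned list straight to whatever function maintains the master list.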
Let’s implement that function now.
  • Line 152: we define our add_new_onions function to take in the list of .onion domains we have just discovered.
  • Lines 157-159: we walk through the list of onions (157) and then check to make sure that we don’t already have this onion in our master list and that it is a .onion domain (159). There are cases where OnionScan will discover sites that are not in the dark web, and we’ll get to those in our visualization post.
  • Lines 163-166: we add the new onion to our master list (163), we add it to our current session list of onions to scan (164), we shuffle the session list again (165) and then we store the onion in our onion_master_list.txt file (166).
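A minimal sketch of that logic follows, again as an approximation rather than the original listing. The store parameter is a hypothetical hook standing in for the function that writes to onion_master_list.txt, so that the example runs on its own.

```python
import random

onions = []          # assumed name for the master list
session_onions = []  # assumed name for the current session's work list


def add_new_onions(new_onion_list, store=None):
    """Queue newly discovered .onion addresses that we have not seen before."""
    for linked_onion in new_onion_list:
        # Skip anything we already know about and anything outside the dark web.
        if linked_onion in onions or not linked_onion.endswith(".onion"):
            continue

        onions.append(linked_onion)
        session_onions.append(linked_onion)
        random.shuffle(session_onions)

        if store is not None:
            store(linked_onion)  # e.g. append to the master list file
```

For example, add_new_onions(["fresh.onion", "example.com"]) would queue fresh.onion and ignore example.com.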
Now let’s start putting the finishing touches on this script.
  • Line 171: we call our get_onion_list function that will load up all of our stored hidden service addresses.
  • Lines 174-175: we shuffle the onions up (174) and then create a copy of the list and store it in our session_onions variable (175).
  • Line 177: we initialize a counter variable that we will use to determine when we are finished looping over all of our hidden services.
Now it’s time to put the main loop in place that will be responsible for kickstarting OnionScan for each hidden service that we have stored.
  • Line 179: we create our while loop that will stop executing once we have worked through all of our hidden services.
  • Line 183: we wait for our Event object to be set before continuing execution. You will remember that execution only halts here if our handle_timeout function is busy grabbing a new Tor identity. Once the identity_lock is set again we will move past this line.
  • Line 187: we remove a hidden service from our list and store it in the onion variable.
  • Lines 190-195: we are testing to see if we have already scanned the hidden service by checking to see if the JSON file exists (190) and if so we increment our count variable (193) and then we go back to the top of the while loop using the continue keyword (195).
  • Line 198: since we have not yet scanned the current hidden service, we kick off the scan process and return the result in the aptly named result variable.
  • Lines 201-206: if we get a good result back we test the length of the JSON string (203) and if it is greater than zero we pass the JSON string and hidden service off to our process_results function for storage (204) and then increment our count variable before returning to the top of the while loop.
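To show how the pieces fit together, here is a hedged sketch of the main loop wrapped in a function. The scan_all name and its scan and process parameters are hypothetical: they stand in for the scanning and result-processing functions so the loop can run on its own. The extra check for an empty session list is my own addition to avoid popping from an empty list.

```python
import os
import threading

identity_lock = threading.Event()
identity_lock.set()


def scan_all(onions, session_onions, scan, process,
             results_dir="onionscan_results"):
    """Work through session_onions until every known onion has been handled."""
    count = 0
    while count < len(onions):
        # Blocks only while the timeout handler is fetching a new identity.
        identity_lock.wait()
        if not session_onions:
            break

        print("[*] Running %d of %d." % (count, len(onions)))
        onion = session_onions.pop()

        # Skip anything that already has a saved report.
        if os.path.exists(os.path.join(results_dir, "%s.json" % onion)):
            count += 1
            continue

        print("[*] Onionscanning %s" % onion)
        result = scan(onion)

        # A timed-out scan returns nothing; the handler has requeued the onion.
        if result:
            process(onion, result)
            count += 1
    return count
```

Because already-scanned onions are skipped based on the saved JSON files, you can stop the script and restart it later without losing your place.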
Whew! That is a lot of code, but hopefully you have learned a few new Python coding tricks along the way, and it might give you ideas on how you can wrap other scanning software in a similar way as we did with OnionScan. Now for the moment of truth…

Let it Rip!

Now you are ready to start scanning! Simply run:

And you should start seeing output like the following:


# python onionrunner.py
[*] Total onions for scanning: 7182
[*] Running 0 of 7182.
[*] Onionscanning nfokjthabqzfndmj.onion
[*] Running 1 of 7182.
[*] Onionscanning gmts3xxfrbfxdm3a.onion

If you check the onionscan_results directory you should see JSON files that are named by the hidden service that was scanned. Let this puppy run as long as you can tolerate. In the second post we are going to process these JSON files and begin to create some visualizations. For bonus points you can also push those JSON files into Elasticsearch (or modify onionrunner.py to do so on the fly) and analyze the results using Kibana!
If you don’t want to wait to get all of the data yourself, you can download the scan results for 8,167 onions from here.