Scavenger - Crawler Searching For Credential Leaks On Different Paste Sites


Just the code of my OSINT bot searching for sensitive data leaks on different paste sites.
Search terms:
  • credentials
  • private RSA keys
  • WordPress configuration files
  • MySQL connect strings
  • onion links
  • links to files hosted within the onion network (PDF, DOC, DOCX, XLS, XLSX)
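
For illustration, the detection of these terms could be based on regular expressions along the following lines. This is a minimal sketch with made-up patterns, not the bot's actual ones:
import re

# Illustrative patterns only -- the expressions used by the bot may differ.
PATTERNS = {
    "credentials":   re.compile(r"^[\w.+-]+@[\w.-]+:\S+$", re.MULTILINE),  # email:password lines
    "rsa_key":       re.compile(r"-----BEGIN RSA PRIVATE KEY-----"),
    "wp_config":     re.compile(r"define\(\s*'DB_PASSWORD'"),              # WordPress wp-config.php
    "mysql_connect": re.compile(r"mysqli?_connect\s*\("),
    "onion_link":    re.compile(r"\b[a-z2-7]{16,56}\.onion\b"),
    "onion_file":    re.compile(r"\b[a-z2-7]{16,56}\.onion/\S+\.(?:pdf|docx?|xlsx?)\b", re.I),
}

def classify(paste_text):
    """Return the names of all patterns matching a paste."""
    return [name for name, rx in PATTERNS.items() if rx.search(paste_text)]
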
Keep in mind:
  1. This bot is not beautiful.
  2. The code is not complete so far. Some parts, like integrating the credentials into a database, are missing in this online repository.
  3. If you want to use this code, feel free to do so. Keep in mind that you have to customize things to get it running on your system.

IMPORTANT
The bot can be run in two major modes:
  • API mode
  • Scraping mode (using TOR)
I highly recommend using the API mode. It is the intended method of scraping pastes from Pastebin.com and it is only fair to do so. The only things you need are a Pastebin.com PRO account and to whitelist your public IP on their site.
To start the bot in API mode, just run the program in the following way:
python run.py -0
However, it is not always possible to use this intended method, as you might be behind NAT and therefore not have a public IP of your own (whitelisting your IP is not reasonable here). That is why a scraping mode is implemented, in which fast TOR circuit cycles combined with plausible user agents are used to avoid IP blocking and Cloudflare captchas.
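
A minimal sketch of what such scraping over TOR could look like, assuming the requests library and TOR's SOCKS proxy on port 9050. The helper and the user agent pool are illustrative, not the bot's actual code:
import random
import requests

# TOR exposes a SOCKS5 proxy on localhost:9050; the socks5h:// scheme
# needs the PySocks extra (pip install requests[socks]).
TOR_PROXIES = {
    "http":  "socks5h://127.0.0.1:9050",
    "https": "socks5h://127.0.0.1:9050",
}

# A small pool of plausible user agents to rotate through (illustrative).
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:115.0) Gecko/20100101 Firefox/115.0",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Safari/605.1.15",
]

def fetch_over_tor(url):
    """Fetch a URL through TOR with a randomly chosen user agent.

    With MaxCircuitDirtiness 30, TOR itself swaps the circuit (and thus
    the visible IP) at most every 30 seconds."""
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    return requests.get(url, proxies=TOR_PROXIES, headers=headers, timeout=30)
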
To start the bot in scraping mode, run it in the following way:
python run.py -1
Important note: you need the TOR service installed on your system, listening on port 9050. Additionally, you need to add the following line to your /etc/tor/torrc file.
MaxCircuitDirtiness 30
This sets the maximum circuit lifetime of TOR to 30 seconds.

Usage
To learn how to use the software, just call the run.py script with the -h/--help argument.
python run.py -h
Output:
  [Scavenger ASCII art banner]

  usage: run.py [-h] [-0] [-1] [-2] [-ps]

  Control software for the different modules of this paste crawler.

  optional arguments:
    -h, --help            show this help message and exit
    -0, --pastebinCOMapi  Activate Pastebin.com module (using API)
    -1, --pastebinCOMtor  Activate Pastebin.com module (standard scraping using
                          TOR to avoid IP blocking)
    -2, --pasteORG        Activate Paste.org module
    -ps, --pStatistic     Show a simple statistic.
So far I have only implemented the Pastebin.com module, and I am working on Paste.org. I will add more modules and update this script over time.

Just start the Pastebin.com module separately...
python P_bot.py
Pastes are stored in data/raw_pastes until there are more than 48000 of them. When there are more than 48000, they get filtered, zipped and moved to the archive folder. All pastes which contain credentials are stored in data/files_with_passwords.
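
The rotation could be implemented roughly like this (a sketch under the assumption of those folder names; the filtering step is omitted):
import os
import time
import zipfile

RAW_DIR, ARCHIVE_DIR, LIMIT = "data/raw_pastes", "archive", 48000

def archive_if_full():
    """Zip the raw pastes and move them to the archive once the limit is exceeded."""
    files = os.listdir(RAW_DIR)
    if len(files) <= LIMIT:
        return
    archive_name = os.path.join(ARCHIVE_DIR, "pastes_%d.zip" % int(time.time()))
    with zipfile.ZipFile(archive_name, "w", zipfile.ZIP_DEFLATED) as zf:
        for name in files:
            path = os.path.join(RAW_DIR, name)
            zf.write(path, arcname=name)  # store, then remove the original
            os.remove(path)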

Keep in mind that at the moment only combinations like USERNAME:PASSWORD and other simple combinations are detected. However, there is a tool to search for proxy logs containing credentials.
You can search for proxy logs (URLs with username and password combinations) by using the getProxyLogs.py file.
python getProxyLogs.py data/raw_pastes
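
A sketch of the kind of pattern such a search could use (illustrative, not necessarily the expression in getProxyLogs.py):
import re

# URLs that embed credentials, e.g. http://user:secret@host/path (illustrative).
PROXY_CRED_RE = re.compile(r"\bhttps?://([^\s:/@]+):([^\s/@]+)@[\w.-]+\S*", re.I)

def find_proxy_credentials(text):
    """Return (username, password) pairs embedded in URLs."""
    return PROXY_CRED_RE.findall(text)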

If you want to search the raw data for specific strings, you can do so using searchRaw.py (really slow).
python searchRaw.py SEARCHSTRING
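
The slowness comes from naively scanning every paste file line by line, roughly like this (a sketch, not the script's exact code):
import os
import sys

def search_raw(directory, needle):
    """Scan every paste file in the directory for a substring."""
    for name in os.listdir(directory):
        path = os.path.join(directory, name)
        with open(path, errors="ignore") as fh:
            for lineno, line in enumerate(fh, 1):
                if needle in line:
                    print("%s:%d: %s" % (path, lineno, line.rstrip()))

if __name__ == "__main__":
    search_raw("data/raw_pastes", sys.argv[1])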

To see statistics of the bot, just call
python status.py 

The file findSensitiveData.py searches a folder (with pastes) for sensitive data like credit cards, RSA keys or mysqli_connect strings. Keep in mind that this script uses grep and is therefore really slow on a large number of paste files. If you want to analyze a large number of pastes, I recommend an ELK stack.
python findSensitiveData.py data/raw_pastes 
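
Conceptually it boils down to running one grep per pattern over the folder; the patterns below are illustrative stand-ins, not the script's actual expressions:
import subprocess

# Illustrative stand-in patterns -- findSensitiveData.py's own expressions may differ.
PATTERNS = [
    r"-----BEGIN RSA PRIVATE KEY-----",
    r"mysqli?_connect",
    r"[0-9]{4}[- ]?[0-9]{4}[- ]?[0-9]{4}[- ]?[0-9]{4}",  # naive credit card shape
]

def grep_folder(folder):
    """Run one recursive grep per pattern; spawning grep repeatedly is what makes this slow."""
    for pattern in PATTERNS:
        subprocess.run(["grep", "-rElI", pattern, folder])

grep_folder("data/raw_pastes")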

There are two scripts, stalk_user.py/stalk_user_wrapper.py, which can be used to monitor a specific Twitter user. This means every tweet the user posts gets saved and every contained URL gets downloaded. To start the stalker, just execute the wrapper.
python stalk_user_wrapper.py
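
A rough sketch of how such monitoring could look with the tweepy library, polling the timeline; the actual scripts may work differently, and the keys and target name are placeholders:
import time
import tweepy

# Placeholder credentials -- supply your own Twitter API keys.
auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")
api = tweepy.API(auth)

seen = set()
while True:
    # Poll the timeline and handle tweets we have not stored yet.
    for tweet in api.user_timeline(screen_name="target_user", tweet_mode="extended"):
        if tweet.id in seen:
            continue
        seen.add(tweet.id)
        print(tweet.full_text)  # save the tweet text
        for url in tweet.entities.get("urls", []):
            print("contained URL:", url["expanded_url"])  # download target
    time.sleep(60)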