Monthly Archives: August 2012

Original Post: https://bitsmash.wordpress.com/2012/07/29/finding-compromised-hosts-with-the-google-search-api/

As mentioned before, there were some problems with the simple Google dorking approach:

  • Google don’t like automated scraping. Though I doubt this qualifies as automated scraping, I do store the results, which may breach the API’s terms of service. Furthermore, after a few searches, the Python script set off suspected-abuse triggers and was temporarily blocked from accessing the API.
  • Without automation, doubling the search = doubling the effort. Expansion of the script is definitely required to find more types of shell.
  • Subsequent similar Google searches are likely to return a lot of results that you’ve already seen.
  • There are a lot of false positives. I mentioned before that a number of results are people asking for help with compromised machines. Similarly, there are unwanted results like pastebinned copies of the shell source, etc.

Fortunately, all of these problems have solutions. Or a couple, depending on how you feel about sticking to TOS agreements.

Keeping Google Happy: The good way

Google provide several hints for distinguishing your traffic from automated scraping traffic. For applications written in Python, Java, etc., they recommend adding the “userip” parameter to each HTTP request, set to the IP address of the end user on whose behalf the request is made.
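
As a rough sketch of what that looks like in Java, a request URL might be assembled as below. Only the “userip” parameter comes from the guidance above; the endpoint (the since-retired AJAX Search API, as far as I remember it), the `v` and `start` parameters, and the `SearchUrlBuilder` name are assumptions made for illustration.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class SearchUrlBuilder {
    // Assumed endpoint (retired AJAX Search API); swap in whichever API you actually use.
    static final String ENDPOINT = "https://ajax.googleapis.com/ajax/services/search/web";

    /** Builds a search URL that carries the end user's IP in the "userip" parameter. */
    public static String build(String query, String userIp, int start) {
        try {
            return ENDPOINT
                    + "?v=1.0"                                      // assumed API version parameter
                    + "&q=" + URLEncoder.encode(query, "UTF-8")
                    + "&start=" + start                             // assumed result offset parameter
                    + "&userip=" + URLEncoder.encode(userIp, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a conforming JVM
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // 203.0.113.7 is a documentation-only address, used here as a placeholder
        System.out.println(build("example query", "203.0.113.7", 0));
    }
}
```

The IP passed in should be the real end user's address; putting a made-up value there defeats the purpose of the parameter.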

Keeping Google Happy: The bad way

The problem is, when making large volumes of requests, especially for a fixed set of search terms, it’s still possible to set off abuse triggers, despite making these efforts to follow their TOS. The solution, if honest attempts to distinguish your traffic from fraudulent traffic fail, is to tunnel your requests through enough different proxies that not too many of these requests come from one location.

With Tor and Java, this is easy. When Tor is started through the Tor Browser Bundle, a Tor SOCKS proxy opens on port 9050 by default, and the Tor command channel (the control port) listens for instructions on port 9051. Newer Tor Browser releases use 9150 and 9151 instead, so check your own setup. You can tunnel traffic from a Java program through a SOCKS proxy by setting the following system properties at the beginning of your code:

// Route the JVM's socket connections through Tor's local SOCKS listener.
// (The legacy "proxySet" property is ignored by modern JVMs, so it is omitted.)
System.setProperty("socksProxyHost", "localhost");
System.setProperty("socksProxyPort", "9050");
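
System properties affect the whole JVM. If you would rather proxy only selected connections, `java.net.Proxy` can be passed to individual connections instead. A minimal sketch, assuming the same localhost:9050 listener; the `TorProxy` class and `socks` helper are names I made up for the example:

```java
import java.net.InetSocketAddress;
import java.net.Proxy;

public class TorProxy {
    /** Returns a SOCKS proxy handle for a local Tor listener (host and port are assumptions). */
    public static Proxy socks(String host, int port) {
        return new Proxy(Proxy.Type.SOCKS, new InetSocketAddress(host, port));
    }

    public static void main(String[] args) {
        Proxy tor = socks("localhost", 9050);
        System.out.println(tor.type() + " proxy at " + tor.address());
        // With Tor running, a single request can be sent through it like this:
        //   new java.net.URL("https://example.org/").openConnection(tor);
    }
}
```

This keeps other traffic from the same program, such as the control-port connection, off the proxy.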

The command channel on port 9051 accepts plain-text commands over an ordinary socket connection. To tell Tor to use a new route:

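// Needs java.io.* and java.net.*; call from a method that throws IOException.
try (Socket socket = new Socket(Proxy.NO_PROXY)) {       // reach the control port directly, not via SOCKS
    socket.connect(new InetSocketAddress("localhost", 9051));
    Writer out = new OutputStreamWriter(socket.getOutputStream(), "US-ASCII");
    BufferedReader in = new BufferedReader(new InputStreamReader(socket.getInputStream(), "US-ASCII"));
    out.write("AUTHENTICATE \"password\"\r\n");          // your control-port password
    out.write("SIGNAL NEWNYM\r\n");                      // ask Tor to use fresh circuits
    out.flush();
    System.out.println(in.readLine());                   // "250 OK" if authentication succeeded
    System.out.println(in.readLine());                   // "250 OK" if the signal was accepted
}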

After a few seconds, Tor will begin routing traffic through a different path, and you can continue scraping with wild abandon.
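
Deciding when to ask for a new route is up to you. One simple option is to rotate after a fixed number of requests. The sketch below only does the counting; the `CircuitRotation` class and the threshold of 3 are hypothetical, and the actual NEWNYM signal would still be sent over the control port as described above. Tor may also rate-limit NEWNYM, so there is little point in rotating very often.

```java
public class CircuitRotation {
    private final int requestsPerCircuit;
    private int sinceRotation = 0;

    public CircuitRotation(int requestsPerCircuit) {
        if (requestsPerCircuit < 1) {
            throw new IllegalArgumentException("requestsPerCircuit must be >= 1");
        }
        this.requestsPerCircuit = requestsPerCircuit;
    }

    /** Call once per outgoing request; returns true when it is time to request a new route. */
    public boolean recordRequest() {
        sinceRotation++;
        if (sinceRotation >= requestsPerCircuit) {
            sinceRotation = 0;
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        CircuitRotation rotation = new CircuitRotation(3); // arbitrary illustrative threshold
        for (int request = 1; request <= 7; request++) {
            boolean rotate = rotation.recordRequest();
            System.out.println("request " + request + (rotate ? ": rotate now" : ""));
        }
    }
}
```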

Scaling Searches

Pretty self-explanatory:

// i is passed to each shell search as the result page offset; resultCount is the number of pages to fetch
for (int i = 1; i <= resultCount; i++) {
    getWSOShells(i);
    getR57Shells(i);
    // ... more shell types
}
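
If the list of shell types keeps growing, one method per type gets unwieldy. A data-driven variant might look like the sketch below. Everything in it is an assumption for illustration: the `SearchPlan` class, the job format, and the query strings, which are deliberate placeholders and not the searches I actually ran.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SearchPlan {
    /** Expands shell types x result pages into a flat list of "type|query|page" jobs. */
    public static List<String> jobs(Map<String, String> queriesByShellType, int resultCount) {
        List<String> jobs = new ArrayList<>();
        for (int page = 1; page <= resultCount; page++) {
            for (Map.Entry<String, String> entry : queriesByShellType.entrySet()) {
                jobs.add(entry.getKey() + "|" + entry.getValue() + "|" + page);
            }
        }
        return jobs;
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the shell types in insertion order
        Map<String, String> queries = new LinkedHashMap<>();
        queries.put("WSO", "<WSO search terms>"); // placeholder, not a real query
        queries.put("R57", "<R57 search terms>"); // placeholder, not a real query
        for (String job : jobs(queries, 2)) {
            System.out.println(job);
        }
    }
}
```

Adding a new shell type then means adding one map entry instead of another method and another line in the loop.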

Trimming The Fat

To ensure that I don’t have to spend any time wading through duplicated results, I had the scraper pile the information into a database. The table only had four columns: Page Title, URL, Shell Type, Date Found. (PageTitle, URL) was a unique index, so any attempt to insert a shell that had already been discovered was rejected by the database.
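
To show the idea without a database driver, here is an in-memory stand-in for that unique index. It is not the schema or code the scraper used; the `ShellStore` class and its `insert` method are hypothetical, and a real version would lean on the database's own unique constraint.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ShellStore {
    // Each key is the (page title, URL) pair, mirroring the unique index
    private final Set<List<String>> seen = new HashSet<>();

    /** Returns true if the result is new, false if that (title, URL) pair was already stored. */
    public boolean insert(String pageTitle, String url, String shellType) {
        // shellType is part of the row but, like the date found, not part of the key
        return seen.add(Arrays.asList(pageTitle, url));
    }

    public static void main(String[] args) {
        ShellStore store = new ShellStore();
        System.out.println(store.insert("Some title", "http://example.org/a", "WSO")); // new pair
        System.out.println(store.insert("Some title", "http://example.org/a", "WSO")); // duplicate
        System.out.println(store.insert("Some title", "http://example.org/b", "R57")); // new URL
    }
}
```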

Only thing left to do is throw together a pretty stats page for the discovered shells.

Reducing False Positives

This is the only task I have yet to complete. More in-depth inspection of the compromised sites (automated or otherwise) isn’t something I’m interested in. It is one step too far into the ethical grey area.

The moral of the story is that it’s very easy to set up a site on an old, out-of-date CMS, forget about it, and have it compromised without ever knowing. And it’s even easier for people with worse intentions than me to find your compromised site with zero effort or technical ability and use it for something bad. Be careful what you leave lying around.
