Abstract: Understanding the need to crawl notorious sites of the World Wide Web for Leaked/Compromised/Hacked data and to place a mechanism in place to report such findings so that the necessary action may be taken at a quicker pace to minimize the impact of the attack.
As we know in today’s world no amount of security can assure a system impenetrable, the least we can do is step up our guard and place a mechanism in place that minimizes damage in case of a worst case scenario.
Hackers have perfected few techniques to exploit money from their plunders of hacked data.
Hacked Data may contain email credentials, credentials of social networks, API keys, Subnet IPs, Password hashes, Machine configuration info etc. They sell the data to the victim’s rivals/competitors or in certain cases they end up blackmailing the victim.
Hackers /cyber criminals tend to share the results of their data heist on the open web on sites such as pastebin, slexy, reddit, 4chan and many other loosely moderated sites. They often share glimpses of the hacked data in order to gain attention and to pull up some interested buyers for their entire data dump.
This makes it evident that we need to be on the constant lookout for such data leaks in various forums, text sharing sites, social media etc. Since the data to be monitored is large it would be impractical to do it manually, hence we need a system/application in place to do the same. Once the data that is leaked comes through to us, it is upto the security team to take the necessary action which may be anything from changing the passwords/api keys or suspending the accounts etc or whatever action is apt for the situation.
ScopeTo define a monitoring system which identifies data leaks of a specific Individual/company along with plausible data sources and tools which generate reports. The action to be taken on the data leak completely depends on the type of the system/data which is not in the scope of this
Everyone Else is doing it?
Yes! A lot of the big companies do have a system in place for the sole purpose of looking for data leaks of their respective companies on the open web. Ever since the infamous hack “50 days of Lulz” everyone is rushing towards this approach. Cyber Security related companies constantly do this.
Data Source 1: As you can see in the above diagram, the data from text sharing sites are pulled up for analysis via their API and using regular expressions in our Pattern matching engine we shall pull up any leaked data.
Data Source 2: There are few twitter bots out there such as @ which monitor hacker’s playgrounds, forums and their popular sharing platforms and tweet in case of any leaks detected.
Data from such bots can be useful as it provides a defined amount of data to search, passing it to our PR-Engine will do the rest of filtering.
Data Source 3: Using custom search engine searches and using tools such as scumblr and integrating it with our system would help us get the leaked data at a quicker rate.
The key thing to be considered here is how quick we can get the data that interests us and make sure it is attained with minimum resource consumed.
Tools: There are no fully fledged commercial tools for this purpose. On exploring I found a few good tools.
Scrumblr & Sketchy: This is a tool developed and open sourced by Netflix. The purpose of the tool is to collect information on the web that interests you/ your company. This tool is currently being used by Netflix Security team.
HaveIBeenPwned: This is a online tool where you can search for a keyword it shows you if your account is compromised. It has API support too.
Amazon also monitors the web; there have been multiple instances where users are alerted that their API keys of their instances are on the open web. We are not aware which tool they use for this purpose.
However there is an open source tool called Security Monkey which monitors policy changes and alerts on insecure configurations in an AWS account.
I happened to try out pystemon which is an open sourced tool built using python.
Below are the results.
Step 1: I posted a test email Id with some data to the text sharing site called slexy.
Step 2: I configured my system to be able to run pystemon.
Step 3: I set up the regular expression I was looking for in the tool configuration.
Step 4: Run the program
Step 5: Within a minute, I managed to find the text which I had shared in step 1 downloaded along with all the information surrounding it into the Alerts folder.
This is just a simple demonstration on how humongous data can be mined easily with the tools available, on customizing such tools we can set the path to effective monitoring of the web for confidential data leaks. The thing common in all tools is that they have used python. Python is usually used to scrape data from large dumps and it is effective in doing so.
Conclusion: Using the information in this document as a precursor and setting up an effective system or an application consisting of multiple inbound data sources, to monitor the wide web and minimize the impact on the customers/victims thereby adding more Trust towards the brand which would not only be essential but pivotal in today’s world where security can be an illusion.