Friday, 15 Feb

Spidering the "Dark Web"

From: R/W Web
Written by Sarah Perez / February 14, 2008 10:28 AM

For some, the term "dark web" simply means all the online data that search engine spiders can't reach, crawl, or index. For the University of Arizona's AI Lab, however, the "Dark Web" is a research project that studies the social phenomenon of terrorism using techniques including social network analysis, content analysis, link analysis, web metrics, video analysis, data and text mining, sentiment and affect analysis, and authorship analysis. Using sophisticated mathematical tools, the project aims to collect all web content generated by international terrorist groups, including content found on web sites, forums, chat rooms, blogs, social networking sites, videos, virtual worlds, and more.

The Dark Web Project

Federally funded through the National Science Foundation, the Dark Web's spiders have been crawling the web for the past five years. As of 2007, the lab estimated that there were about 50,000 sites with extremist/terrorist content once it looked beyond traditional web pages. That is ten times the 5,000 terrorist web sites that Dr. Gabriel Weimann of the University of Haifa estimated in 2006. From 2006 to 2007, the lab found that the greatest increase in terrorist activity was on new "web 2.0" sites, a term it uses for any new-generation web site, including video sites, blogs, and virtual worlds.

Currently, the Dark Web collection consists of the complete contents of only 1,000 web sites in Arabic, Spanish, and English, plus the partial contents of 10,000 other sites. At 2 TB, it is the largest open-source extremist/terrorist collection in the academic world. Researchers who would like to use this data in their own studies can contact the research center for access.

Where the Bad Guys Are

So far, the Dark Web has determined the following:

  • Forums: 300 terrorist forums found, some with more than 30,000 members; nearly 1,000,000 messages posted.
  • Blogs, social networking sites, and virtual worlds: Many transient sites have been identified before they disappear; more than 30 (self-proclaimed) terrorist or extremist groups have been found in virtual world sites, though researchers have so far been unable to determine who is just "playing terrorist" and who is for real.
  • Videos and multimedia content: 1,000,000 images and 15,000 videos from web sites and specialty multimedia file-hosting third-party servers; more than 50% of the videos are related to Improvised Explosive Devices.


[Image: Second Life Griefers - A "Terrorist Attack?"]

How They Find the Data

The Dark Web project uses various tools for collection, analysis, and visualization:

  • Web site spidering: Their focused spiders can access password-protected sites and perform randomized (human-like) fetching. The spiders are trained to fetch all HTML, PDF, and Word files, links, PHP, CGI, and ASP files, images, audio, and videos in a web site. Selected web sites are spidered every 2 to 3 months.
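To make the spidering description concrete, here is a minimal Python sketch of three of the ideas it mentions: filtering URLs by file type, randomizing the order and timing of fetches, and scheduling a revisit 2 to 3 months later. This is not the lab's code. The extension groups, the delay range, the 60-to-90-day window, and the function names classify, fetch_plan, and next_visit are all assumptions made for illustration. The sketch performs no network access and does not attempt logins to password-protected sites.

```python
import posixpath
import random
from datetime import date, timedelta
from urllib.parse import urlparse

# Hypothetical extension groups, loosely following the file types the article lists.
FILE_TYPES = {
    "document": {".html", ".htm", ".pdf", ".doc"},
    "dynamic": {".php", ".cgi", ".asp"},
    "image": {".jpg", ".jpeg", ".png", ".gif"},
    "audio": {".mp3", ".wav"},
    "video": {".avi", ".wmv", ".flv", ".mp4"},
}


def classify(url):
    """Return the file-type group for a URL, or None if the extension is not listed."""
    path = urlparse(url).path  # drops the query string, e.g. "?id=3"
    ext = posixpath.splitext(path)[1].lower()
    for group, extensions in FILE_TYPES.items():
        if ext in extensions:
            return group
    return None  # note: extensionless pages are skipped in this simple sketch


def fetch_plan(urls, rng, min_delay=2.0, max_delay=30.0):
    """Shuffle the wanted URLs and pair each with a random pause in seconds."""
    wanted = [u for u in urls if classify(u) is not None]
    rng.shuffle(wanted)  # avoid a fixed, machine-like crawl order
    return [(u, rng.uniform(min_delay, max_delay)) for u in wanted]


def next_visit(last_visit, rng, min_days=60, max_days=90):
    """Pick the next spidering date, roughly 2 to 3 months after the last one."""
    return last_visit + timedelta(days=rng.randint(min_days, max_days))


if __name__ == "__main__":
    rng = random.Random()
    urls = [
        "http://example.org/forum/view.php?id=3",
        "http://example.org/files/report.pdf",
        "http://example.org/style.css",
    ]
    for url, pause in fetch_plan(urls, rng):
        print(f"wait {pause:.1f}s, then fetch {url} ({classify(url)})")
    print("revisit on", next_visit(date.today(), rng))
```

A real spider would also need the fetching itself, link extraction, and session handling; the point here is only that "human-like" behavior can be approximated by randomizing both the order of requests and the pauses between them.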

 

