Web scraping tourism reviews

Friday, April 30, 2010
by Peter Johnson

I’m sure that many tourism business owners have spent a lot of time investigating review sites like Trip Advisor and Yelp, reading up on what their customers are saying. This is good business practice, and tourism operators should always keep an ear open to any praise or critique.

It is easy to look at reviews for one particular business, but what about at the regional or provincial level? What about comparing reviews across a destination? Which areas are reviewed the most and which are reviewed the least? Do users of Trip Advisor and Yelp leave a path of reviews as they travel or do only the best/worst experiences get mentioned? What is the percentage of positive vs. negative reviews? What is the overall quality of these reviews? These are just some of the questions that I’ve been thinking about recently.

Last fall I started a small research project that ‘scraped’ reviews from Trip Advisor for Nova Scotia. Web scraping is a somewhat controversial technique that uses software “agents” to harvest information from websites. In a basic sense, it is an automated version of copying specific information from a web site and pasting it into a spreadsheet. The tool I used to accomplish this is . I ended up with nearly 6,000 reviews in total, each including the user, date, location, star rating (out of 5), and comments. A very rich data source! I did some basic analysis by dividing the reviews into three categories (accommodation, restaurants, and attractions) and then geolocating them at one of 77 different named destinations. I presented this preliminary material at the 2009 annual meeting in Guelph, Ontario. You can take a look at the Slideshare here:

View presentation.
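
For readers curious what "copying specific information from a web site and pasting it into a spreadsheet" looks like when automated, here is a minimal Python sketch. It is not the tool used in this project. The HTML markup, class names, sample reviews, and the function names `scrape_reviews` and `reviews_to_csv` are all invented for illustration, and real review pages are structured differently. It uses only the Python standard library and pulls out the same five fields mentioned above: user, date, location, rating, and comments.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical markup and made-up reviews, for illustration only.
SAMPLE_HTML = """
<div class="review">
  <span class="user">traveller_a</span>
  <span class="date">2009-08-14</span>
  <span class="location">Halifax</span>
  <span class="rating">4</span>
  <p class="comments">Friendly staff, great harbour view.</p>
</div>
<div class="review">
  <span class="user">traveller_b</span>
  <span class="date">2009-09-02</span>
  <span class="location">Lunenburg</span>
  <span class="rating">2</span>
  <p class="comments">Room was small and noisy.</p>
</div>
"""

FIELDS = ("user", "date", "location", "rating", "comments")


class ReviewParser(HTMLParser):
    """Collects one dict per <div class="review"> block."""

    def __init__(self):
        super().__init__()
        self.reviews = []
        self._current = None  # the review currently being read, if any
        self._field = None    # the field whose text we are inside, if any

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "div" and css_class == "review":
            self._current = {}
        elif self._current is not None and css_class in FIELDS:
            self._field = css_class

    def handle_data(self, data):
        # Only keep text that sits inside one of the named fields.
        if self._current is not None and self._field is not None:
            previous = self._current.get(self._field, "")
            self._current[self._field] = previous + data

    def handle_endtag(self, tag):
        if tag == "div" and self._current is not None:
            self.reviews.append(self._current)
            self._current = None
        self._field = None


def scrape_reviews(html):
    """Return a list of review dicts extracted from an HTML string."""
    parser = ReviewParser()
    parser.feed(html)
    parser.close()
    return [
        {name: review.get(name, "").strip() for name in FIELDS}
        for review in parser.reviews
    ]


def reviews_to_csv(reviews):
    """Write the reviews as CSV text that a spreadsheet can open."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(reviews)
    return buffer.getvalue()


if __name__ == "__main__":
    print(reviews_to_csv(scrape_reviews(SAMPLE_HTML)))
```

A real scraper would also have to fetch the pages, follow pagination, and respect the site's terms of use, which is a large part of why the technique is controversial.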