
Now that Google Summer of Code 2016 has officially begun, I am excited to be working with FOSSASIA yet again. This time I have been assigned to the Loklak project, where I will spend the summer implementing indexing (harvesters/scrapers) for services such as weibo.com, angel.co, meetup.com and Instagram.

Loklak is a server application that collects messages from various sources, including Twitter. The server contains a search index and a peer-to-peer index sharing interface. If you would like to stay anonymous when searching, want to archive tweets or messages about specific topics, or are looking for a tool to create statistics about tweet topics, then Loklak is an option well worth considering.



Loklak Weibo: Now going beyond Twitter!

So far, Loklak has done a wonderful job of collecting a billion or more tweets from Twitter. The highlight of this service has been anonymous search of Twitter without the use of any authentication key. The next step is to go beyond Twitter and collect data from Chinese Twitter-like services, for instance Weibo.com.


The major challenge, however, is understanding the Chinese annotations, especially coming from a non-Chinese background. Now that I have the support of a Chinese friend from the Loklak community, we hope it will be an easier task 🙂

The trick is simple: parse the HTML page of the search results. It has been suggested that we use jsoup, the Java HTML parser library, in loklak_server. It provides a very convenient API for fetching a URL, parsing the HTML, and extracting and manipulating the data. jsoup is designed to deal with all varieties of HTML, so for now it is considered a suitable choice.

In our approach, the HTML page generated by the search query http://s.weibo.com/weibo/<search-string> will be parsed, instead of going the traditional way of calling the API with an authentication key.
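
Since the search string becomes part of the URL path, it probably needs percent-encoding whenever it contains spaces or Chinese characters. Here is a minimal sketch of that step using only the Java standard library (Java 10 or later). The class name WeiboSearchUrl and the method build are hypothetical names for illustration and are not part of loklak_server; the base URL is the one quoted above.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of loklak_server.
class WeiboSearchUrl {
    // Search URL prefix as quoted in this post.
    static final String BASE = "http://s.weibo.com/weibo/";

    // Percent-encode the query as UTF-8 and append it to the base URL.
    static String build(String query) {
        // URLEncoder produces form encoding, where a space becomes "+".
        // Inside a URL path a space should be "%20", so convert it.
        // A literal "+" in the query has already become "%2B" at this point.
        String encoded = URLEncoder.encode(query, StandardCharsets.UTF_8)
                .replace("+", "%20");
        return BASE + encoded;
    }

    public static void main(String[] args) {
        System.out.println(build("Berlin"));
        System.out.println(build("Berlin Wall"));
    }
}
```

The resulting string can then be handed to the HTML parser in place of the plain concatenation of base URL and query.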

Sample code snippet to extract the title of the page:

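import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class WeiboTitle {
    public static void main(String[] args) {
        String q = "Berlin"; // the search query goes here
        String title = null;
        try {
            // Fetch the search results page and parse it into a DOM
            Document doc = Jsoup.connect("http://s.weibo.com/weibo/" + q).get();
            title = doc.title();
        } catch (IOException e) {
            e.printStackTrace();
        }
        System.out.println(title);
    }
}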

Watch this space for upcoming details on implementing this technique to parse the entire page and get the desired results…

Feedback and suggestions are welcome.


Reblogged from blog.loklak.net.
