
Google Summer of Code 2016 has officially begun, and I am excited to be working with FOSSASIA yet again. This time I have been assigned to the Loklak project, where I will spend the summer implementing indexing (harvesters/scrapers) for services such as weibo.com, angel.co, meetup.com and Instagram.

Loklak is a server application that collects messages from various sources, including Twitter. The server contains a search index and a peer-to-peer index-sharing interface. If you want to stay anonymous while searching, archive Tweets or messages about specific topics, or create statistics about Tweet topics, Loklak is a good option to consider.



LOKLAK SHUOSHUO: ANOTHER FEATHER IN THE CAP!

Work on Loklak Weibo to extract the desired information is still in progress, and another post explaining the intricacies of the implementation will follow soon. In the meantime, I have made an attempt at parsing the HTML page of QQShuoShuo.com (another Chinese Twitter-like service).

Just like last time, the major challenge is understanding the Chinese annotations, since I come from a non-Chinese background. Google Translate helps me test the retrieved data by letting me match each phrase and/or line.

I have made use of JSoup, the Java HTML parser library, which fetches a URL, parses and scrapes its HTML, and helps extract and manipulate the data. JSoup is designed to deal with all varieties of HTML, so for now it is considered a suitable choice.


/**
 *  Shuoshuo Crawler
 *  By Jigyasa Grover, @jig08
 **/

package org.loklak.harvester;

import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.*;
import org.jsoup.select.Elements;

public class ShuoshuoCrawler {
    public static void main(String[] args) {
        try {
            // Fetch and parse the home page.
            Document shuoshuoHTML = Jsoup.connect("http://www.qqshuoshuo.com/").get();

            // The recommended talks sit inside the element with id "list2".
            Element recommendedTalkBox = shuoshuoHTML.getElementById("list2");
            if (recommendedTalkBox == null) {
                System.err.println("Element with id \"list2\" not found; the page layout may have changed.");
                return;
            }
            Elements recommendedTalksList = recommendedTalkBox.getElementsByTag("li");

            // Size the array from the number of <li> items instead of a fixed 100.
            String[] recommendedTalksResult = new String[recommendedTalksList.size()];
            int i = 0;
            for (Element recommendedTalk : recommendedTalksList) {
                recommendedTalksResult[i++] = recommendedTalk.text();
            }

            System.out.println("Total Recommended Talks: " + recommendedTalksResult.length);
            for (int k = 0; k < recommendedTalksResult.length; k++) {
                System.out.println("Recommended Talk " + k + ": " + recommendedTalksResult[k]);
            }
        } catch (IOException e) {
            // Network or HTTP failure while fetching the page.
            e.printStackTrace();
        }
    }
}

The recommended talks from qqshuoshuo.com are now stored as an array of Strings. Running the crawler prints:

Total Recommended Talks: 10
Recommended Talk 0: 不会在意无视我的人,不会忘记帮助过我的人,不会去恨真心爱过我的人。
Recommended Talk 1: 喜欢一个人是一种感觉,不喜欢一个人却是事实。事实容易解释,感觉却难以言喻。
Recommended Talk 2: 一个人容易从别人的世界走出来却走不出自己的沙漠
Recommended Talk 3: 有什么了不起,不就是幸福在左边,我站在了右边?
Recommended Talk 4: 希望我跟你的爱,就像新闻联播一样没有大结局
Recommended Talk 5: 你会遇到别的女子和她举案齐眉,而我自会有别的男子与我白首相携。
Recommended Talk 6: 既然爱,为什么不说出口,有些东西失去了,就再也回不来了!
Recommended Talk 7: 凡事都有可能,永远别说永远。
Recommended Talk 8: 都是因为爱,而喜欢上了怀旧;都是因为你,而喜欢上了怀念。
Recommended Talk 9: 爱是老去,爱是新生,爱是一切,爱是你。

A similar approach can now be used to do the same for the Latest QQ talk and QQ talk Leaderboard sections.
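
To give a rough idea of how the same extraction could be reused for those other sections, here is a minimal sketch. It moves the "find a box by id, collect its li items" step into a helper and runs it on an inline HTML snippet instead of the live site. The class name ShuoshuoSections, the method extractListItems and the id "latestTalks" are my own placeholders for illustration; only "list2" comes from the crawler above, and the real ids of the other sections still have to be looked up in the page source.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ShuoshuoSections {

    /**
     * Returns the text of every <li> inside the element with the given id,
     * or an empty list when no such element exists.
     */
    public static List<String> extractListItems(Document page, String boxId) {
        List<String> items = new ArrayList<>();
        Element box = page.getElementById(boxId);
        if (box == null) {
            return items; // section missing: nothing to extract
        }
        for (Element li : box.getElementsByTag("li")) {
            items.add(li.text());
        }
        return items;
    }

    public static void main(String[] args) {
        // Inline stand-in for the real page, so the sketch runs without network access.
        // "latestTalks" is a hypothetical id, not taken from qqshuoshuo.com.
        String html = "<div id=\"list2\"><ul><li>talk A</li><li>talk B</li></ul></div>"
                    + "<div id=\"latestTalks\"><ul><li>talk C</li></ul></div>";
        Document page = Jsoup.parse(html);

        for (String id : new String[] {"list2", "latestTalks"}) {
            List<String> talks = extractListItems(page, id);
            System.out.println(id + ": " + talks.size() + " item(s) " + talks);
        }
    }
}
```

Returning an empty list for a missing id keeps the caller simple: a section that moves or disappears shows up as zero results rather than a NullPointerException.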


Watch this space for upcoming details on using this technique to parse the entire page and get the desired results…

Feedback and suggestions are welcome.
