
Google Summer of Code 2016 has officially begun, and I am excited to be working with FOSSASIA yet again. This time I have been assigned to the Loklak project, where I shall spend the summer implementing indexing (harvesters/scrapers) for services such as weibo.com, angel.co, meetup.com and Instagram.

Loklak is a server application which is able to collect messages from various sources, including Twitter. This server contains a search index and a peer-to-peer index sharing interface.

If you would like to stay anonymous when searching, want to archive Tweets or messages about specific topics, or are looking for a tool to create statistics about Tweet topics, then Loklak is the best option to consider.




NOW GET WORDPRESS BLOG UPDATES WITH LOKLAK!

 

Loklak shall soon be spoiling its users!

Next, it will bring in tiny tweet-like cards showing the blog posts (title, publishing date, author and content) from a given WordPress blog URL.

This feature is certain to further Loklak’s mission of building a comprehensive and extensive social network dispensing useful information.


To implement this feature, I have again made use of JSoup, the Java HTML parser library, as it provides a very convenient API for fetching and parsing HTML from a URL and for extracting and manipulating the data in it.

The information is scraped using JSoup once the blog URL, in the format "https://[username].wordpress.com/", is passed as an argument to the function scrapeWordpress(String blogURL){..}, which returns a JSONObject as the result.
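Since the scraper expects a URL of this particular shape, it may be worth checking the input before fetching anything. The following is only a minimal sketch using java.net.URI from the standard library. The names WordPressUrlSketch and isWordPressBlogUrl are hypothetical and not part of Loklak, and treating the username as a single subdomain label is an assumption made for illustration.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

class WordPressUrlSketch {

    private static final String SUFFIX = ".wordpress.com";

    // Returns true if the URL looks like https://[username].wordpress.com/
    // Assumption: the username is a single, non-empty subdomain label.
    static boolean isWordPressBlogUrl(String blogURL) {
        try {
            URI uri = new URI(blogURL);
            String host = uri.getHost();
            if (host == null || !"https".equalsIgnoreCase(uri.getScheme())) {
                return false;
            }
            host = host.toLowerCase(Locale.ROOT);
            if (!host.endsWith(SUFFIX)) {
                return false;
            }
            String username = host.substring(0, host.length() - SUFFIX.length());
            return !username.isEmpty() && !username.contains(".");
        } catch (URISyntaxException e) {
            // Not a syntactically valid URI at all
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isWordPressBlogUrl("https://loklaknet.wordpress.com/")); // true
        System.out.println(isWordPressBlogUrl("https://example.com/"));             // false
    }
}
```

A check like this could run before the URL is handed to scrapeWordpress, so that obviously unsuitable input is rejected without a network round trip.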

Here is the code for the scraper:

/**
 *  WordPress Blog Scraper
 *  By Jigyasa Grover, @jig08
 **/

package org.loklak.harvester;

import java.io.IOException;

import org.json.JSONArray;
import org.json.JSONObject;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class WordPressBlogScraper {
    public static void main(String[] args) {

        String blogURL = "https://loklaknet.wordpress.com/";
        System.out.println(scrapeWordpress(blogURL));
    }

    public static JSONObject scrapeWordpress(String blogURL) {

        Document blogHTML;
        try {
            blogHTML = Jsoup.connect(blogURL).get();
        } catch (IOException e) {
            // Without the page there is nothing to scrape: return an empty
            // result instead of hitting a NullPointerException further down
            e.printStackTrace();
            return new JSONObject();
        }

        // Every post on the page is wrapped in its own <article> tag
        Elements articles = blogHTML.getElementsByTag("article");

        JSONArray blog = new JSONArray();

        for (Element article : articles) {
            JSONObject blogpost = new JSONObject();
            blogpost.put("blog_url", blogURL);
            blogpost.put("title", lastText(article, "entry-title"));
            blogpost.put("posted_on", lastText(article, "posted-on"));
            blogpost.put("author", lastText(article, "byline"));
            blogpost.put("content", lastText(article, "entry-content"));
            blog.put(blogpost);
        }

        JSONObject final_blog_info = new JSONObject();
        final_blog_info.put("Wordpress blog: " + blogURL, blog);

        return final_blog_info;
    }

    // Text of the last element inside the article that carries the given class,
    // or null when there is none (JSONObject.put then leaves the key out)
    private static String lastText(Element article, String className) {
        Element match = article.getElementsByClass(className).last();
        return match == null ? null : match.text();
    }
}

 

Here, an HTTP connection is simply established and the text is extracted using “element_name”.text() from inside specific tags, which are located by identifiers such as classes or ids. The tags holding the required information were identified by exploring the web page’s HTML source code.
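The same class-based lookup can be tried without any network access by parsing an HTML string. This is a minimal sketch only: the markup below is made up for illustration (it merely borrows the class names the scraper looks for), and ArticleTextSketch and firstText are hypothetical names. It needs the JSoup library on the classpath.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class ArticleTextSketch {

    // Made-up markup that only mimics the class names the scraper looks for
    static final String SAMPLE_HTML =
          "<article>"
        + "<h1 class=\"entry-title\">Hello Loklak</h1>"
        + "<span class=\"posted-on\">June 22, 2016</span>"
        + "<span class=\"byline\">jigyasa</span>"
        + "<div class=\"entry-content\"><p>First <b>post</b>.</p></div>"
        + "</article>";

    // Parses the HTML and returns the text of the first element with the
    // given class inside the first <article>, or null if either is missing
    static String firstText(String html, String className) {
        Document doc = Jsoup.parse(html);
        Element article = doc.getElementsByTag("article").first();
        if (article == null) {
            return null;
        }
        Element match = article.getElementsByClass(className).first();
        return match == null ? null : match.text();
    }

    public static void main(String[] args) {
        System.out.println(firstText(SAMPLE_HTML, "entry-title"));   // Hello Loklak
        System.out.println(firstText(SAMPLE_HTML, "entry-content")); // First post.
    }
}
```

Note that text() returns the combined text of an element and its children with the markup stripped, which is why the nested tags in the content disappear from the output.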

The result thus obtained is in the form of a JSON Object:

{
  "Wordpress blog: https://loklaknet.wordpress.com/": [
    {
      "posted_on": "June 19, 2016",
      "blog_url": "https://loklaknet.wordpress.com/",
      "author": "shivenmian",
      "title": "loklak_depot \u2013 The Beginning: Accounts (Part 3)",
      "content": "So this is my third post in this five part series on loklak_depo... As always, feedback is duly welcome."
    },
    {
      "posted_on": "June 19, 2016",
      "blog_url": "https://loklaknet.wordpress.com/",
      "author": "sopankhosla",
      "title": "Creating a Loklak App!",
      "content": "Hello everyone! Today I will be shifting from course a...ore info refer to the full documentation here. Happy Coding!!!"
    },
    {
      "posted_on": "June 17, 2016",
      "blog_url": "https://loklaknet.wordpress.com/",
      "author": "leonmakk",
      "title": "Loklak Walls Manual Moderation \u2013 tweet storage",
      "content": "Loklak walls are going to....Stay tuned for more updates on this new feature of loklak walls!"
    },
    {
      "posted_on": "June 17, 2016",
      "blog_url": "https://loklaknet.wordpress.com/",
      "author": "Robert",
      "title": "Under the hood: Authentication (login)",
      "content": "In the second post of .....key login is ready."
    },
    {
      "posted_on": "June 17, 2016",
      "blog_url": "https://loklaknet.wordpress.com/",
      "author": "jigyasa",
      "title": "Loklak gives some hackernews now !",
      "content": "It's been befittingly said  \u... Also, Stay tuned for more posts on data crawling and parsing for Loklak. Feedback and Suggestions welcome"
    },
    {
      "posted_on": "June 16, 2016",
      "blog_url": "https://loklaknet.wordpress.com/",
      "author": "Damini",
      "title": "Does tweets have emotions?",
      "content": "Tweets do intend some kind o...t of features: classify(feat1,\u2026,featN) = argmax(P(cat)*PROD(P(featI|cat)"
    },
    {
      "posted_on": "June 15, 2016",
      "blog_url": "https://loklaknet.wordpress.com/",
      "author": "sudheesh001",
      "title": "Dockerize the loklak server and publish docker images to IBM Containers on Bluemix Cloud",
      "content": "Docker is an open source...nd to create and deploy instantly as well as scale on demand."
    }
  ]
}

 

The next step would be to "writeToBackend" the result and then parse the JSONObject as desired.
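As for parsing the JSONObject, a consumer could walk the array stored under the "Wordpress blog: ..." key. The sketch below is one possible way to do it with org.json, not the actual Loklak backend code. BlogResultReaderSketch, summarize and sample are hypothetical names, and the sample entry reuses a title and author from the output shown earlier.

```java
import java.util.ArrayList;
import java.util.List;

import org.json.JSONArray;
import org.json.JSONObject;

class BlogResultReaderSketch {

    // Builds one "title by author" line per post found under the blog's key
    static List<String> summarize(JSONObject result, String blogURL) {
        List<String> lines = new ArrayList<>();
        JSONArray posts = result.optJSONArray("Wordpress blog: " + blogURL);
        if (posts == null) {
            return lines; // key missing, e.g. when the scrape produced nothing
        }
        for (int i = 0; i < posts.length(); i++) {
            JSONObject post = posts.getJSONObject(i);
            // optString returns "" rather than throwing when a field is absent
            lines.add(post.optString("title") + " by " + post.optString("author"));
        }
        return lines;
    }

    // Small stand-in for the scraper's result, with a single post
    static JSONObject sample(String blogURL) {
        JSONObject post = new JSONObject();
        post.put("blog_url", blogURL);
        post.put("title", "Creating a Loklak App!");
        post.put("author", "sopankhosla");
        JSONObject result = new JSONObject();
        result.put("Wordpress blog: " + blogURL, new JSONArray().put(post));
        return result;
    }

    public static void main(String[] args) {
        String blogURL = "https://loklaknet.wordpress.com/";
        for (String line : summarize(sample(blogURL), blogURL)) {
            System.out.println(line);
        }
    }
}
```

Using the opt* accessors keeps the reader tolerant of posts where the scraper found no title or author.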

Feel free to ask questions regarding the scraper code; I shall be happy to assist.

Feedback and Suggestions welcome 🙂
