
Google Summer of Code 2016 has officially begun, and I am excited to be working with FOSSASIA once again. This time I have been assigned to the Loklak project, where I will spend the summer implementing indexing (harvesters/scrapers) for services such as weibo.com, angel.co, meetup.com and Instagram.

Loklak is a server application that collects messages from various sources, including Twitter. The server contains a search index and a peer-to-peer index-sharing interface. If you want to stay anonymous while searching, archive Tweets or messages about specific topics, or create statistics about Tweet topics, Loklak is an option well worth considering.



LET’S ‘MEETUP’ WITH LOKLAK

Loklak has already started to expand beyond the realm of Twitter and is working its way toward covering a much wider range of social networks. Now, Loklak aims to bring in data crawled from meetup.com to create a close-knit community. In tune with Meetup’s mission to revitalize local community and help people around the world self-organize, Loklak strives to revolutionize the social networking scenario of the present world.


To extract information about a specific group on meetup.com, namely the group name, location, description, group topics/tags and recent meetups (day, date, time, RSVPs, reviews, introductory lines etc.), I used the URL http://www.meetup.com/<group-name>/ and scraped the information from the HTML page itself.

As in previous experiments with other web pages, I have made use of JSoup, the Java HTML parser library, in loklak_server. It provides a very convenient API for fetching and parsing HTML from a URL and for extracting and manipulating the data. JSoup is designed to deal with all varieties of HTML, so for now it is considered a suitable choice.

The scraped information about recent meetups is stored in a two-dimensional array, recentMeetupsResult[][], and the data inside can then be used as needed.
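
The row layout of that array can be illustrated in isolation. The following is a minimal sketch, not code from loklak_server: the class name MeetupRows and the method group are hypothetical, and it assumes that every three consecutive paragraph texts describe one meetup (date and time, attendance and review, information).

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of loklak_server: groups a flat list of
// paragraph texts into rows of three, mirroring recentMeetupsResult[i][0..2].
final class MeetupRows {

    // Assumed layout: date & time, attendance & review, information.
    static final int FIELDS_PER_MEETUP = 3;

    static String[][] group(List<String> paragraphs) {
        // Integer division drops a trailing, incomplete group of paragraphs.
        int rows = paragraphs.size() / FIELDS_PER_MEETUP;
        String[][] result = new String[rows][FIELDS_PER_MEETUP];
        for (int idx = 0; idx < rows * FIELDS_PER_MEETUP; idx++) {
            // Row = which meetup, column = which field of that meetup.
            result[idx / FIELDS_PER_MEETUP][idx % FIELDS_PER_MEETUP] = paragraphs.get(idx);
        }
        return result;
    }

    public static void main(String[] args) {
        // Placeholder texts standing in for scraped <p> contents.
        List<String> paragraphs = Arrays.asList(
                "date-1", "attendance-1", "info-1",
                "date-2", "attendance-2", "info-2",
                "leftover paragraph");
        for (String[] row : group(paragraphs)) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

Computing the row and column from a single running index keeps the array zero-based and makes it easy to ignore a trailing paragraph that does not belong to any meetup.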

A sample code snippet for reference:

/**
 *  Meetups Scraper
 *  By Jigyasa Grover, @jig08
 **/

package org.loklak.harvester;

import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.*;
import org.jsoup.select.Elements;

public class MeetupsScraper {
    public static void main(String args[]){
        
        Document meetupHTML = null;
        String meetupGroupName = "Women-Who-Code-Delhi";
        // fetch group name here
        Element groupDescription = null;
        String groupDescriptionString = null;
        Element topicList = null;
        Elements topicListStrings = null;
        String[] topicListArray = new String[100];
        Integer numberOfTopics = 0;
        Element recentMeetupsSection = null;
        Elements recentMeetupsList = null;
        Integer numberOfRecentMeetupsShown = 0;
        Integer i = 0, j = 0;
        String recentMeetupsResult[][] = new String[100][3];
        
        // recentMeetupsResult[i][0] == date && time
        // recentMeetupsResult[i][1] == Attendance && Review
        // recentMeetupsResult[i][2] == Information
                
        try{
            meetupHTML = Jsoup.connect("http://www.meetup.com/" + meetupGroupName).userAgent("Mozilla").get();
            
            groupDescription = meetupHTML.getElementById("groupDesc");
            groupDescriptionString = groupDescription.text();
            System.out.println(meetupGroupName + "\n\tGroup Description: \n\t\t" + groupDescriptionString);
            
            topicList = meetupHTML.getElementById("topic-box-2012");
            topicListStrings = topicList.getElementsByTag("a");
            
            int p = 0;
            for(Element topicListStringsIterator : topicListStrings){
                if(p >= topicListArray.length){
                    break; // topicListArray holds at most 100 topics
                }
                topicListArray[p] = topicListStringsIterator.text();
                p++;
            }
            numberOfTopics = p;
            
            System.out.println("\nGroup Topics:");
            for(int l = 0; l<numberOfTopics; l++){
                System.out.println("\n\tTopic Number "+ l + " : " + topicListArray[l]);
            }
            
            recentMeetupsSection = meetupHTML.getElementById("recentMeetups");
            recentMeetupsList = recentMeetupsSection.getElementsByTag("p");
            
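            // Every three consecutive <p> elements describe one meetup:
            // date & time, attendance & review, information.
            i = 0;
            
            for(Element recentMeetups : recentMeetupsList ){
                if(i / 3 >= recentMeetupsResult.length){
                    break; // recentMeetupsResult holds at most 100 meetups
                }
                
                recentMeetupsResult[i / 3][i % 3] = recentMeetups.text();
                i++;
                
            }
            
            // Count only complete groups of three paragraphs.
            numberOfRecentMeetupsShown = i / 3;
            
            for(int k = 0; k < numberOfRecentMeetupsShown; k++){
                System.out.println("\n\nRecent Meetup Number" + (k + 1) + " : \n" + 
                        "\n\t Date & Time: " + recentMeetupsResult[k][0] + 
                        "\n\t Attendance: " + recentMeetupsResult[k][1] + 
                        "\n\t Information: " + recentMeetupsResult[k][2]);
            }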

        }catch (IOException e) {
            e.printStackTrace();
        }
        
    }
}

Here, an HTTP connection is established and the text is extracted with element.text() from inside specific tags, which are located by their id or tag name. The tags holding the required information were identified by exploring the web page’s HTML source code.

The above produces the following output:

Women-Who-Code-Delhi
    Group Description: 
        Mission: Women Who Code is a global nonprofit organization dedicated to inspiring women to excel in technology careers by creating a global, connected community of women in technology. The organization tripled in 2013 and has grown to be one of the largest communities of women engineers in the world. Empowerment: Women Who code is a professional community for women in tech. We provide an avenue for women to pursue a career in technology, help them gain new skills and hone existing skills for professional advancement, and foster environments where networking and mentorship are valued. Key Initiatives: - Free technical study groups - Events featuring influential tech industry experts and investors - Hack events - Career and leadership development Current and aspiring coders are welcome.  Bring your laptop and a friend!  Support Women Who Code: Donating to Women Who Code, Inc. (#46-4218859) directly impacts our ability to efficiently run this growing organization, helps us produce new programs that will increase our reach, and enables us to expand into new cities around the world ensuring that women and girls everywhere have the opportunity to pursue a career in technology. Women Who Code (WWCode) is dedicated to providing an empowering experience for everyone who participates in or supports our community, regardless of gender, gender identity and expression, sexual orientation, ability, physical appearance, body size, race, ethnicity, age, religion, or socioeconomic status. Because we value the safety and security of our members and strive to have an inclusive community, we do not tolerate harassment of members or event participants in any form. Our Code of Conduct applies to all events run by Women Who Code, Inc. If you would like to report an incident or contact our leadership team, please submit an incident report form. WomenWhoCode.com

Group Topics:

    Topic Number 0 : Django
    Topic Number 1 : Web Design
    Topic Number 2 : Ruby
    Topic Number 3 : HTML5
    Topic Number 4 : Women Programmers
    Topic Number 5 : JavaScript
    Topic Number 6 : Python
    Topic Number 7 : Women in Technology
    Topic Number 8 : Android Development
    Topic Number 9 : Mobile Technology
    Topic Number 10 : iOS Development
    Topic Number 11 : Women Who Code
    Topic Number 12 : Ruby On Rails
    Topic Number 13 : Computer programming
    Topic Number 14 : WWC

Recent Meetup Number1 : 
     Date & Time: April 2 · 10:30 AM
     Attendance: 13 Women Who Code-rs | 5.001
     Information: Brought to you in collaboration with Women Techmakers Delhi.According to a survey, only 11% of open source participants are women. People find it intimidating to get... Learn more


Recent Meetup Number2 : 
     Date & Time: March 3 · 3:00 PM
     Attendance: 21 Women Who Code-rs | 5.001
     Information: “Behold, the number five is at hand. Grab it and shake and harness the power of networking.” Women Who Code Delhi is proud to present Social Hack Eve, a networking... Learn more


Recent Meetup Number3 : 
     Date & Time: Oct 18, 2015 · 9:00 AM
     Attendance: 20 Women Who Code-rs | 4.502
     Information: Hello Ladies :) Google Women Techmakers is looking for women techies to present a talk in one of the segments of Google DevFest Delhi 2015 planned for October 18, 2015... Learn more


Recent Meetup Number4 : 
     Date & Time: Jul 5, 2015 · 12:00 PM
     Attendance: 24 Women Who Code-rs | 4.001 | 1 Photo
     Information: Agenda: Learning how to use and develop open source software, and contribute to huge existing open source projects.A series of talks by some of this year’s GSoC... Learn more

In this sample, only text has been retrieved. More advanced versions could include the hyperlinks and multimedia embedded in the web page and integrate the extracted information into a suitable format.
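
As one small step toward a more structured format, the attendance field could be parsed into a number. The following is only a sketch under assumptions: the class AttendanceField and the method attendeeCount are hypothetical names, and it assumes the field starts with the attendee count, as it does in the sample output above (for example "13 Women Who Code-rs | 5.001"). Other groups or a changed page layout may not follow this pattern.

```java
import java.util.OptionalInt;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of loklak_server: pulls the leading
// attendee count out of a scraped attendance string.
final class AttendanceField {

    // One or more digits at the start of the text, optionally preceded by spaces.
    private static final Pattern LEADING_COUNT = Pattern.compile("^\\s*(\\d+)\\b");

    static OptionalInt attendeeCount(String attendance) {
        if (attendance == null) {
            return OptionalInt.empty();
        }
        Matcher matcher = LEADING_COUNT.matcher(attendance);
        if (!matcher.find()) {
            return OptionalInt.empty(); // text does not start with a number
        }
        try {
            return OptionalInt.of(Integer.parseInt(matcher.group(1)));
        } catch (NumberFormatException e) {
            return OptionalInt.empty(); // digit run too long for an int
        }
    }

    public static void main(String[] args) {
        System.out.println(attendeeCount("13 Women Who Code-rs | 5.001"));
        System.out.println(attendeeCount("no count here"));
    }
}
```

Returning an OptionalInt rather than a sentinel such as -1 makes the "no count found" case explicit to the caller.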


Watch this space for more details on the implementation of the crawlers.
Feedback and suggestions are welcome.
