Near real-time Big Data from social network sites like Facebook and Twitter has become an attractive source for analytics in recent years, owing to its availability and popularity, even though its genuineness or correctness cannot always be guaranteed. Apache Spark is an efficient Big Data processing engine that offers faster solutions than Hadoop and can be used effectively to find patterns of relevance. Many organizations now advertise their job vacancies through tweets, which saves time and cost in recruitment. This article discusses how Spark can be used to analyze and filter these numerous job advertisements in real time from among millions of other streaming tweets, and to classify them into job categories to facilitate an effective job search.
Twitter, a popular social network site, contributes huge volumes of data whose value extends beyond social and commercial interests. Twitter users can express or share opinions, feelings, or information about events, products, health, or anything else in short messages, termed tweets, restricted to 140 characters. A hashtag is the convention of prefixing a word in a tweet with the symbol ‘#’ to indicate the keyword or topic of the tweet; it was meant for categorizing tweets by topic and aids searching. According to Twitter Inc., more than 500 million tweets are sent per day. Streaming data of this size can be handled efficiently by the Spark ecosystem, which is considered a second-generation Big Data processing engine.
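Because hashtags follow a fixed convention, they are easy to pull out of a tweet programmatically. A minimal sketch in plain Python (the pattern below is simplified and ignores the Unicode letters real Twitter hashtags may contain):

```python
import re

# Simplified pattern: '#' followed by word characters.
# Real Twitter hashtags also permit many Unicode letters.
HASHTAG_RE = re.compile(r"#(\w+)")

def extract_hashtags(tweet):
    """Return the lowercased hashtags found in a tweet."""
    return [tag.lower() for tag in HASHTAG_RE.findall(tweet)]
```

For example, `extract_hashtags("#IT #Job alert: VMware Quality System Engineer")` yields `["it", "job"]`.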
Tweets often carry the latest information on a topic because they are updated frequently, and analysis of tweets can reveal useful information. In most tweet-analytics work, analysis has been performed on previously collected data; less work has been done on streaming Twitter data. Beyond the information gathered by those studies, tweets contain other valuable information that, if extracted on time, is certain to have practical and immediate application in the life of the common man. Owing to the popularity of social network sites, advertisers are now heavily targeting social network users. Many organizations and individuals now tweet about job vacancies and hiring details instead of publishing them in print or online media, which saves time, cost, and effort in dissemination. These vacancies are intended to be filled immediately, and job seekers can exploit them effectively if they have near real-time access to these tweets. This blog is about analyzing these vacancies in real time, filtering the specific job-vacancy tweets from among millions of others, and classifying them by job category, without following any Twitter account. The proposed model collects and analyzes job-vacancy tweets in near real time and uses machine learning to classify them by job type and location for an effective job search. Spark Streaming handles the streaming tweets, and since Spark is highly scalable, the model can be deployed on a cloud cluster to scale on demand.
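The filtering step of the model can be sketched in plain Python; the hashtag whitelist and tweet format below are illustrative assumptions, and in the actual pipeline this predicate would run inside a Spark Streaming job rather than over a list:

```python
# Hashtags that mark a tweet as a candidate job advertisement
# (an illustrative whitelist, not the article's actual list).
JOB_TAGS = {"#job", "#jobs", "#career", "#hiring", "#vacancy"}

def is_job_tweet(tweet):
    """True if the tweet carries at least one job-related hashtag."""
    return any(word.lower() in JOB_TAGS for word in tweet.split())

def filter_job_tweets(stream):
    """Keep only candidate job advertisements from a stream of tweets."""
    return [t for t in stream if is_job_tweet(t)]
```

In Spark the same predicate would simply be passed to a `filter` transformation on each micro-batch.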
Apache Spark Cluster For Big Data Analytics
Big Data processing requirements triggered a paradigm shift from traditional data processing, resulting in MapReduce-based frameworks such as Hadoop. Though Hadoop has been used extensively for Big Data processing for years, Apache Spark, a better-performing alternative, can be seen as a giant step forward. The open-source Apache Spark ecosystem integrates batch and stream processing and comprises libraries supporting machine learning, graph processing, and SQL querying. Spark, which originated at Berkeley and is now licensed under the Apache Foundation, offers much faster performance and a wider variety of features than the widely adopted Hadoop Big Data processing system. Hadoop is a mature batch processing system with many completed projects and much available expertise, but it has its limitations. Hadoop is written in Java and relies mainly on two functions, Map and Reduce; every operation must be expressed in terms of these two functions, which complicates programming. A Spark program can be written in Java, Python, or Scala; Spark offers many functions beyond map and reduce and, above all, provides an interactive mode, the Spark shell, which makes programming much simpler than in Hadoop. Hadoop persists data back to the hard disk after each map or reduce operation, while Spark processes data in memory, so repetitive operations on the same data run much faster.
Consequently, Spark’s memory requirement is much higher than Hadoop’s: if the data fits in memory, Spark works faster, but otherwise it has to move data back and forth to disk, which degrades its performance. Since Hadoop is a batch processing system, its users must depend on other platforms such as Storm for real-time data processing, Mahout for machine learning, or Giraph for graph processing. The Spark ecosystem, in contrast, includes Spark Streaming, MLlib, GraphX, and Spark SQL for real-time data processing, machine learning, graph processing, and SQL querying respectively, giving Spark a competitive advantage over Hadoop.
A Spark application has a driver program that runs the main function and performs parallel operations on the nodes of a Spark cluster. Spark introduces the Resilient Distributed Dataset (RDD): a collection of immutable objects partitioned across the nodes of a cluster for parallel operations, which can be persisted in memory for repetitive or iterative use. By saving the time of reading from and writing to disk, Spark can outperform Hadoop by up to 100 times, especially in machine learning applications where iterative operations on the same data are common. RDDs are formed by transformations from files or other RDDs, and each RDD retains the lineage information by which it was formed. Since Big Data analytics involves applying machine learning and data mining techniques to Big Data, Spark offers MLlib, a machine learning library that includes popular algorithms for classification, clustering, and association. The integration of MLlib into the Spark ecosystem is another advantage for Spark, while Hadoop struggles with Mahout, its machine learning platform. Spark Streaming facilitates stream data processing even though Spark is basically a batch processing engine: the incoming data stream is grouped into batches of sub-second intervals and processed by the batch Spark engine, bringing its powerful features to near real-time processing.
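The micro-batch idea behind Spark Streaming can be illustrated without a cluster: items arriving on a stream are grouped by arrival time into small fixed-interval batches, and each batch is handed to the ordinary batch engine. A plain-Python sketch of that grouping (the timestamps and the 0.5-second interval are made up for illustration):

```python
from itertools import groupby

def micro_batches(stream, interval=0.5):
    """Group (timestamp, item) pairs into batches of `interval` seconds,
    mimicking how Spark Streaming discretises a stream into RDDs.
    Assumes timestamps arrive in increasing order, as on a live stream."""
    # Each item falls into the batch numbered floor(timestamp / interval).
    keyed = [(int(ts // interval), item) for ts, item in stream]
    return [[item for _, item in group]
            for _, group in groupby(keyed, key=lambda kv: kv[0])]
```

For example, `micro_batches([(0.1, "a"), (0.4, "b"), (0.7, "c")])` produces two batches, `[["a", "b"], ["c"]]`, each of which Spark would process as one RDD.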
Job Categorization Using Machine Learning
Filtering job advertisements based on allied hashtags gives good results, but many unethical practitioners use popular tags like ‘#job’ or ‘#career’ in tweets with totally unrelated content to gain attention. Machine-learning-based classification is performed to remove such tweets and to categorize the relevant advertisements into classes such as IT, Construction, Driving, and Healthcare, as given in the table below, for ease of search.
| Sample streaming tweets |
| #IT #Job alert: VMware Quality System Engineer | VMware |
| Blown an interview? Maybe not. Here’s how to recover: |
| #Job #Germantown: Systems Administrator (Sea): IT Project |
| JOB OPENING: Project Financial Controls Specialist – IRC at (Minneapolis, MN) #job |
| Abc Tea TRAINING #Transportation #Job: DRIVERS |
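Before a trained MLlib classifier is in place, the categorization idea can be sketched with a simple keyword lookup. The category keyword sets below are illustrative assumptions, not the article’s actual feature set; a real system would learn such features with an MLlib classifier instead of hard-coding them:

```python
# Illustrative keyword lists; a real pipeline would train an
# MLlib classifier rather than hard-code these features.
CATEGORY_KEYWORDS = {
    "IT": {"it", "systems", "vmware", "engineer", "administrator"},
    "Transportation": {"transportation", "drivers", "driving"},
    "Finance": {"financial", "controls", "accounting"},
}

def categorize(tweet):
    """Assign a tweet to the category sharing the most keywords with it."""
    words = {w.strip("#:,.?").lower() for w in tweet.split()}
    best, hits = "Unclassified", 0
    for category, keywords in CATEGORY_KEYWORDS.items():
        n = len(words & keywords)
        if n > hits:
            best, hits = category, n
    return best
```

On the sample tweets above, a driver-recruitment tweet lands in Transportation, while the interview-advice tweet, which shares no keywords with any category, stays unclassified and can be discarded.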
Searching With Spark SQL
Tweets classified under the various job categories can be stored in a database, which can then be queried to find vacancies belonging to a particular job category. Spark SQL provides this querying functionality: a job seeker queries a category and receives the advertisements for that category as the result, which is especially useful when the client searches for jobs from a hand-held device.
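Spark SQL queries are ordinary SQL over DataFrames registered as temporary views. The shape of such a category query can be shown with Python’s built-in sqlite3 as a cluster-free stand-in; the table name, columns, and rows below are assumptions for illustration only:

```python
import sqlite3

# In Spark this table would be a DataFrame registered via
# createOrReplaceTempView("jobs"); sqlite3 stands in here.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobs (tweet TEXT, category TEXT, location TEXT)")
conn.executemany(
    "INSERT INTO jobs VALUES (?, ?, ?)",
    [("#IT #Job alert: VMware QE", "IT", "Palo Alto"),
     ("#Transportation #Job: DRIVERS", "Transportation", "Dallas")],
)

def vacancies(category):
    """Return all stored job tweets for one category."""
    cur = conn.execute(
        "SELECT tweet, location FROM jobs WHERE category = ?", (category,))
    return cur.fetchall()
```

The same `SELECT ... WHERE category = ?` statement, run through Spark SQL, would be distributed across the cluster transparently.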
Deployment On Cloud
This Spark application can run on a standalone Spark cluster with a master and a few worker nodes, or it can be deployed on a cloud such as Amazon EC2. The spark-ec2 script allows launching, managing, and shutting down multiple named Spark clusters in the Amazon EC2 cloud; each cluster’s machines are placed in corresponding EC2 security groups.
An Amazon EC2 key pair needs to be created for the secure shell (ssh) connection. An EC2 cluster can then be launched by running the following from the spark-ec2 directory on the local machine:

./spark-ec2 -k <keypair> -i <key-file> -s <num-slaves> launch <cluster-name>

To ssh into the cluster, the command used is:

./spark-ec2 -k <keypair> -i <key-file> login <cluster-name>