Data Mining with Weka
In today’s world of data explosion, managing information becomes extremely difficult, which can lead to overload and chaos. Luckily, there are tools, technologies and methodologies that help manage the abundant data and extract valuable insights from it. One of the most important of these methodologies is data mining, and one such tool is Weka. Before we learn more about Weka, let’s first talk about what data mining is.
What is Data Mining?
Data mining is a key process in analyzing big data. It is the computational process of uncovering patterns in large raw data sets, patterns that support making the right business decisions and designing strategies for organizational growth.
Raw data in the data mining process can come from sources such as:
- CSV files (comma separated values)
- Data warehouse
- Transactional Data
- Text, flat files, etc.
The data mining process is mainly applied to a data warehouse (a collection of a large amount of data), using queries to generate results. The process identifies relationships in the input data, analyzes patterns and extracts information, which is then transformed into a user-understandable format such as dashboards, tables, charts or reports.
The resulting output is called “information”, so it can be concluded that knowledge discovery is the main aim of the data mining process.
Figure: Data Mining Process
Let’s try to understand the data mining process with an example: market basket analysis. In a grocery shop or a mall, the best discount offer to give a customer is decided by analyzing which products are most often bought together. For example, a customer who buys milk will most likely also buy bread.
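To make the milk and bread example concrete, here is a minimal Python sketch of the counting behind market basket analysis. The baskets are hypothetical, invented for illustration, and the helper names (`pair_counts`, `support`, `confidence`) are our own; this is not how Weka implements association mining, only the basic idea.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets, invented for illustration only.
transactions = [
    {"milk", "bread", "eggs"},
    {"milk", "bread"},
    {"milk", "butter"},
    {"bread", "jam"},
    {"milk", "bread", "butter"},
]


def pair_counts(baskets):
    """Count how often each unordered pair of items shares a basket."""
    counts = Counter()
    for basket in baskets:
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return counts


def support(baskets, items):
    """Fraction of baskets that contain every item in `items`."""
    items = set(items)
    return sum(1 for basket in baskets if items <= basket) / len(baskets)


def confidence(baskets, antecedent, consequent):
    """Estimated share of baskets with `antecedent` that also hold `consequent`."""
    both = support(baskets, {antecedent, consequent})
    return both / support(baskets, {antecedent})


if __name__ == "__main__":
    print(pair_counts(transactions).most_common(3))
    print("support(milk, bread):", support(transactions, {"milk", "bread"}))
    print("confidence(milk -> bread):", confidence(transactions, "milk", "bread"))
```

In this toy data, milk and bread appear together in three of the five baskets, and three of the four baskets with milk also contain bread, which is the kind of pattern a discount offer could be built on.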
Having understood the methodology, let’s now talk about one of the best tools for data mining: Weka.
What is Weka?
Weka is a data mining and visualization tool that contains a collection of machine learning algorithms for data mining tasks. It is open source software issued under the GNU General Public License. It presents results in the form of charts, trees, tables, etc.
Weka’s native data format is the Attribute-Relation File Format (ARFF). So data in another format is typically converted into ARFF before we start mining with it in Weka (Weka also ships converters for some common formats such as CSV).
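To give a feel for what the conversion involves, here is a minimal Python sketch that turns CSV text into ARFF text. The function name `csv_to_arff` and the sample rows are our own, and the sketch makes simplifying assumptions: a column is treated as numeric only if every value parses as a number, everything else becomes a nominal attribute, and values are assumed not to need quoting.

```python
import csv
import io


def _is_number(text):
    """Return True if `text` parses as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def csv_to_arff(csv_text, relation):
    """Convert CSV text with a header row into ARFF text.

    Assumptions: no values need quoting or escaping, and empty cells
    are missing values, which ARFF writes as '?'.
    """
    rows = list(csv.reader(io.StringIO(csv_text.strip())))
    header, data = rows[0], rows[1:]
    lines = [f"@relation {relation}", ""]
    for i, name in enumerate(header):
        values = [row[i] for row in data if row[i] != ""]
        if values and all(_is_number(v) for v in values):
            lines.append(f"@attribute {name} numeric")
        else:
            # Nominal attribute: list the distinct labels in braces.
            labels = ",".join(sorted(set(values)))
            lines.append(f"@attribute {name} {{{labels}}}")
    lines += ["", "@data"]
    for row in data:
        lines.append(",".join(v if v != "" else "?" for v in row))
    return "\n".join(lines) + "\n"


# Illustrative sample rows, not the full weather data set.
sample_csv = """outlook,temperature,play
sunny,85,no
overcast,83,yes
rainy,70,yes
"""

if __name__ == "__main__":
    print(csv_to_arff(sample_csv, "weather"))
```

The output starts with an `@relation` line, declares one `@attribute` per column, and lists the rows after `@data`, which is the basic shape of an ARFF file.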
Features of Weka:
Data Preprocessing: Cleaning the data during the data gathering and selection phase. It removes missing fields or fills them with a default value, and resolves conflicts.
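As a rough illustration of filling missing fields with a default value, the Python sketch below replaces missing entries in a column with the mean (for numbers) or the most frequent value (for labels). The function `fill_missing` is our own helper, not a Weka API, and the choice of mean and most frequent value as defaults is an assumption made for this example.

```python
from collections import Counter
from statistics import mean

MISSING = "?"  # ARFF marks a missing value with a question mark


def fill_missing(column):
    """Return a copy of `column` with missing entries replaced by a default.

    Numeric columns use the mean of the known values; other columns use
    the most frequent known value.
    """
    known = [v for v in column if v != MISSING]
    if not known:
        return list(column)  # nothing to derive a default from
    if all(isinstance(v, (int, float)) for v in known):
        default = mean(known)
    else:
        default = Counter(known).most_common(1)[0][0]
    return [default if v == MISSING else v for v in column]


if __name__ == "__main__":
    print(fill_missing([85, MISSING, 75]))
    print(fill_missing(["sunny", MISSING, "sunny", "rainy"]))
```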
Data Classification and Prediction: Classifies data based on the relationships between items and predicts a label for new data. For example, a bank, based on its available loan data, classifies customers and predicts the label ‘risky’ or ‘safe’ for each.
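The bank example can be sketched with a very simple one-rule classifier in Python: for each value of a single attribute, predict the label seen most often in the training data. The loan records, the `income` attribute and the function names are all hypothetical, and a real tool such as Weka offers far more capable classifiers; this only shows the classify-then-predict idea.

```python
from collections import Counter, defaultdict

# Hypothetical loan records as (income level, label); invented for illustration.
training = [
    ("low", "risky"), ("low", "risky"), ("low", "safe"),
    ("medium", "safe"), ("medium", "risky"), ("medium", "safe"),
    ("high", "safe"), ("high", "safe"),
]


def train_one_rule(records):
    """Learn the majority label for each attribute value.

    Also returns the overall majority label, used as a fallback.
    """
    by_value = defaultdict(Counter)
    for value, label in records:
        by_value[value][label] += 1
    rule = {value: counts.most_common(1)[0][0]
            for value, counts in by_value.items()}
    fallback = Counter(label for _, label in records).most_common(1)[0][0]
    return rule, fallback


def predict(rule, fallback, value):
    """Predict a label; unseen attribute values get the fallback label."""
    return rule.get(value, fallback)


if __name__ == "__main__":
    rule, fallback = train_one_rule(training)
    for income in ("low", "medium", "high", "unknown"):
        print(income, "->", predict(rule, fallback, income))
```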
Clustering: Groups related data into clusters and is used to discover distinct groups. For example, suppose we have weather data and, based on it, want to decide whether to play outside or not. In such a case, we can use Weka to visualize the overall data and make the decision according to the charts.
As shown in the above image, the current data is loaded from the weather.arff file and has 5 attributes: outlook, temperature, humidity, windy and play. Temperature is the selected attribute, and we need to decide about playing, i.e. whether to play outside or not. Weka applies data mining to the available data and produces a result that is displayed in the chart in the right corner (blue = play outside, red = do not play). The chart visualizes the play attribute with respect to temperature, so as per the above: “if temperature is 64 to 75 => play outside”.
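The chart is essentially a tally of the play attribute within temperature ranges. Here is a minimal Python sketch of that tally. The (temperature, play) pairs are illustrative values in the shape of the weather data, not necessarily the exact contents of weather.arff, and the range edges are chosen by hand for this example rather than taken from Weka.

```python
from collections import Counter

# Illustrative (temperature, play) pairs; not the exact weather.arff rows.
samples = [
    (64, "yes"), (65, "no"), (68, "yes"), (69, "yes"), (70, "yes"),
    (71, "no"), (72, "yes"), (75, "yes"), (80, "no"), (85, "no"),
]


def tally_by_range(rows, edges):
    """Count class labels in each half-open range [edges[i], edges[i + 1])."""
    tallies = [Counter() for _ in range(len(edges) - 1)]
    for value, label in rows:
        for i in range(len(edges) - 1):
            if edges[i] <= value < edges[i + 1]:
                tallies[i][label] += 1
                break
    return tallies


if __name__ == "__main__":
    edges = [64, 76, 86]  # hand-picked ranges: 64 to 75 and 76 to 85
    for i, counts in enumerate(tally_by_range(samples, edges)):
        print(f"{edges[i]} to {edges[i + 1] - 1}:", dict(counts))
```

With these sample values, "yes" dominates the lower range and "no" the upper one, which is the sort of pattern the colored chart makes visible at a glance.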
Weka also provides various data mining techniques such as filters, classification and clustering. Here is another example of a data mining technique: classification using the J48 algorithm.
Figure: Classification Algorithm
The figure shows the result of the J48 classification algorithm in Weka, displayed as a tree view. By following the tree in the visualizer, one can decide whether to play outside or not.
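J48 is Weka’s implementation of the C4.5 decision tree learner, which picks the attribute to split on using an entropy-based measure. The Python sketch below computes plain information gain, a simplification: C4.5 uses gain ratio, a normalized form of it, and also handles numeric attributes and pruning, none of which is shown here. The rows are hypothetical and the function names are our own.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total)
                for n in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Entropy reduction in `target` from splitting `rows` on `attribute`."""
    base = entropy([row[target] for row in rows])
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row[target])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return base - remainder


# Hypothetical rows in the shape of the weather data; invented for illustration.
rows = [
    {"outlook": "sunny", "windy": "false", "play": "no"},
    {"outlook": "sunny", "windy": "true", "play": "no"},
    {"outlook": "overcast", "windy": "false", "play": "yes"},
    {"outlook": "overcast", "windy": "true", "play": "yes"},
    {"outlook": "rainy", "windy": "false", "play": "yes"},
    {"outlook": "rainy", "windy": "true", "play": "no"},
    {"outlook": "sunny", "windy": "false", "play": "yes"},
    {"outlook": "rainy", "windy": "false", "play": "yes"},
]

if __name__ == "__main__":
    for attribute in ("outlook", "windy"):
        print(attribute, round(information_gain(rows, attribute, "play"), 3))
```

The attribute with the larger gain separates the play labels better, so a tree learner working this way would place it nearer the root of the tree.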