Download free pdf data mining books r

Coupling Rattle with R delivers a very sophisticated data mining environment with all the power, and more, of the many commercial offerings. Data Mining and Business Analytics with R utilizes the open source software R for the analysis, exploration, and simplification of large high-dimensional data sets. As a result, readers are provided with the needed guidance to model and interpret complicated data and become adept at building powerful models for prediction and classification.

Highlighting both underlying concepts and practical computational skills, Data Mining and Business Analytics with R begins with coverage of standard linear regression and the importance of parsimony in statistical modeling. The book includes important topics such as penalty-based variable selection (LASSO); logistic regression; regression and classification trees; clustering; principal components and partial least squares; and the analysis of text and network data.

The book is also a valuable reference for practitioners who collect and analyze data in the fields of finance, operations management, marketing, and the information sciences. Manipulate your data using popular R packages such as ggplot2, dplyr, and so on to gather valuable business insights from it. Apply effective data mining models to perform regression and classification tasks. Who This Book Is For: If you are a budding data scientist, or a data analyst with a basic knowledge of R, and want to get into the intricacies of data mining in a practical manner, this is the book for you.

No previous experience of data mining is required. What You Will Learn: master relevant packages such as dplyr, ggplot2 and so on for data mining; learn how to effectively organize a data mining project through the CRISP-DM methodology; implement data cleaning and validation tasks to get your data ready for data mining activities; execute exploratory data analysis both the numerical and the graphical way; develop simple and multiple regression models along with logistic regression; apply basic ensemble learning techniques to join together results from different data mining models; perform text mining analysis from unstructured PDF files and textual data; and produce reports to effectively communicate objectives, methods, and insights of your analyses. In Detail: R is widely used to leverage data mining techniques across many different industries, including finance, medicine, scientific research, and more.

This book will empower you to produce and present impressive analyses from data, by selecting and implementing the appropriate data mining techniques in R. It will let you gain these powerful skills while immersing yourself in a one-of-a-kind data mining crime case, where you will be asked to help resolve a real fraud case affecting a commercial company by means of both basic and advanced data mining techniques. While moving along the plot of the story, you will effectively learn and practice on real data.

R is widely used in leveraging data mining techniques across many different industries, including government, finance, insurance, medicine, scientific research and more. This book presents 15 different real-world case studies illustrating various techniques in rapidly growing areas. It is an ideal companion for data mining researchers in academia and industry looking for ways to turn this versatile software into a powerful analytic tool. R code, data, and color figures for the book are provided at the RDataMining website.

Helps data miners to learn to use R in their specific area of work and see how R can apply in different industries. Presents various case studies in real-world applications, which will help readers to apply the techniques in their work. Provides code examples and sample data for readers to easily learn the techniques by running the code themselves.

This is the sixth version of this successful text, and the first using Python. It covers both statistical and machine learning algorithms for prediction, classification, visualization, dimension reduction, recommender systems, clustering, text mining and network analysis. It also includes: a new co-author, Peter Gedeck, who brings both experience teaching business analytics courses using Python and expertise in the application of machine learning methods to the drug-discovery process; a new section on ethical issues in data mining; updates and new material based on feedback from instructors teaching MBA, undergraduate, diploma and executive courses, and from their students; more than a dozen case studies demonstrating applications for the data mining techniques described; end-of-chapter exercises that help readers gauge and expand their comprehension of and competency with the material presented; and a companion website with more than two dozen data sets, and instructor materials including exercise solutions, PowerPoint slides, and case solutions. Data Mining for Business Analytics: Concepts, Techniques, and Applications in Python is an ideal textbook for graduate and upper-undergraduate level courses in data mining, predictive analytics, and business analytics.

This new edition is also an excellent reference for analysts, researchers, and practitioners working with quantitative methods in the fields of business, finance, marketing, computer science, and information technology. It also gives insight into some of the challenges faced when deploying these tools.

Extensively classroom-tested, the text is ideal for students in customer and business analytics or applied data mining as well as professionals in small- to medium-sized organizations. The book offers an intuitive understanding of how different analytics algorithms work. Where necessary, the authors explain the underlying mathematics in an accessible manner.

Each technique presented includes a detailed tutorial that enables hands-on experience with real data. The authors also discuss issues often encountered in applied data mining projects and present the CRISP-DM process model as a practical framework for organizing these projects.

Showing how data mining can improve the performance of organizations, this book and its R-based software provide the skills and tools needed to successfully develop advanced analytics capabilities.

R Data Analysis and Visualization, by Tony Fischetti (Packt Publishing Ltd). Book Description: Master the art of building analytical models using R. About This Book: load, wrangle, and analyze your data using the world's most powerful statistical programming language; build and customize publication-quality visualizations of powerful and stunning R graphs; develop key skills and techniques with R to create and customize data mining algorithms; use R to optimize your trading strategy and build up your own risk management system; and discover how to build machine learning algorithms, prepare data, and dig deep into data prediction techniques with R. Who This Book Is For: this course is for data scientists or quantitative analysts who are looking to learn R and take advantage of its powerful analytical design framework.

This makes it a useful teaching tool in learning R for the specific task of data mining, and also a good memory aid! Rattle is simple to use, quick to deploy, and allows us to rapidly work through the data processing, modelling, and evaluation phases of a data mining project. When we need to fine-tune and further develop our data mining projects, we can migrate from Rattle to R.

Rattle can save the current state of a data mining task as a Rattle project. A Rattle project can then be loaded at a later time or shared with other users. Projects can be loaded, modified, and saved, allowing checkpointing and parallel explorations. Projects also retain all of the R code for transparency and repeatability. The R code can be loaded into R outside of Rattle to repeat any data mining task. However, it also provides a stepping stone to more sophisticated processing and modelling in R itself.

It is worth emphasising that the user is not limited to how Rattle does things. For sophisticated and unconstrained data mining, the experienced user will progress to interacting directly with R.

The typical workflow for a data mining project was introduced above. In the context of Rattle, it can be summarised as: 1. Load a Dataset. 2. Select variables and entities for exploring and mining. 3. Explore the data to understand how it is distributed or spread. 4. Transform the data to suit our data mining purposes. 5. Build our Models. 6. Evaluate the models on other datasets. 7. Export the models for deployment. It is important to note that at any stage the next step could well be a step to a previous stage.

We illustrate a typical workflow that is embodied in the Rattle interface in Figure 1. The stages are: identify data and select variables (start by getting as much data as we can and then cull); clean and transform the data and build and tune models (we may loop around here many times); evaluate the models (performance, structure, complexity, and deployability); and deploy and monitor the model (is the model run manually on demand or on an automatic schedule?).

R and Rattle are free software in terms of allowing anyone the freedom to do as they wish with them. This is also referred to as open source software to distinguish it from closed source software, which does not provide the source code. Closed source software usually has quite restrictive licenses associated with it, aimed at limiting our freedom using it.

R and Rattle can be obtained for free. On 7 January 2009, the New York Times carried a front page technology article on R where a vendor representative was quoted: I think it addresses a niche market for high-end data analysts that want free, readily available code.

We have customers who build engines for aircraft. I am happy they are not using freeware when I get on a jet. This is a common misunderstanding about the concept of free and open source software.

R, being free and open source software, is in fact a peer-reviewed software product that a number of the world's top statisticians have developed and others have reviewed. In fact, anyone is permitted to review the R source code.

Over the years, many bugs and issues have been identified and rectified by a large community of developers and users. On the other hand, a closed source software product cannot be so readily and independently verified or viewed by others at will.

Bugs and enhancement requests need to be reported back to the vendor. Customers then need to rely on a very select group of vendor-chosen people to assure the software, rectify any bugs in it, and enhance it with new algorithms. Bug fixes and enhancements can take months or years, and generally customers need to purchase the new versions of the software.

Both scenarios, open source and closed source, see a lot of effort put into the quality of their software. With open source, though, we all share it, whereas we can share and learn very little about the algorithms we use from closed source software. It is worthwhile to highlight another reason for using R in the context of free and commercial software. In obtaining any software, due diligence is required in assessing what is available. However, what is finally delivered may be quite different from what was promised or even possible with the software, whether it is open source or closed source, free or commercial.

With free open source software, we are free to use it without restriction. If we find that it does not serve our purposes, we can move on with minimal cost. With closed source commercial purchases, once the commitment is made to buy the software and it turns out not to meet our requirements, we are generally stuck with it, having made the financial commitment, and have to make do.

It incorporates all of the standard statistical tests, models, and analyses, as well as providing a comprehensive language for managing and manipulating data. New technology and ideas often appear first in R. It reflects well on a very competent community of computational statisticians. Because R is open source, unlike closed source software, it has been reviewed by many internationally renowned statisticians and computational scientists. R runs on many operating systems and different hardware.

Have you ever tried getting support from the core developers of a commercial vendor? Whilst the advantages might flow from the pen with a great deal of enthusiasm, it is useful to note some of the disadvantages or weaknesses of R, even if they are perhaps transitory! There are several simple-to-use graphical user interfaces (GUIs) for R that encompass point-and-click interactions, but they generally do not have the polish of the commercial offerings.

However, some very high-standard books are increasingly plugging the documentation gaps. R is a software application that many people freely devote their own time to developing.

Problems are usually dealt with quickly on the open mailing lists, and bugs disappear with lightning speed. Users who do require it can purchase support from a number of vendors internationally.

R holds its data in memory, which can be a restriction when doing data mining. There are various solutions, including using 64-bit operating systems that can access much more memory than 32-bit ones. Laws in many countries can directly affect data mining, and it is very worthwhile to be aware of them and their penalties, which can often be severe.

There are basic principles relating to the protection of privacy that we should adhere to. Please take that responsibility seriously. Think often and carefully about what you are doing. Some basic familiarity with R will be gained through our travels in data mining using the Rattle interface and some excursions into R. In this respect, most of what we need to know about R is contained within the book. But there is much more to learn about R and its associated packages.

The book covers the basic data structures, reading and writing data, subscripting, manipulating, aggregating, and reshaping data. Introductory Statistics with R (Dalgaard), as mentioned earlier, is a good introduction to statistics using R. Moving more towards areas related to data mining, Data Analysis and Graphics Using R (Maindonald and Braun) provides excellent practical coverage of many aspects of exploring and modelling data using R. The Elements of Statistical Learning (Hastie et al.).

Bivand et al. Moving on from R itself and into data mining, there are very many general introductions available.

One that is commonly used for teaching in computer science is Han and Kamber. It provides a comprehensive generic introduction to most of the algorithms used by a data miner. It is presented at a level suitable for information technology and database graduates.

Chapter 2: Getting Started

New ideas are often most effectively understood and appreciated by actually doing something with them.

So it is with data mining. Fundamentally, data mining is about practical application—application of the algorithms developed by researchers in artificial intelligence, machine learning, computer science, and statistics. This chapter is about getting started with data mining. Our aim throughout this book is to provide hands-on practice in data mining, and to do so we need some computer software.

There is a choice of software packages available for data mining. These include commercial closed source software (which is also often quite expensive) as well as free open source software.

Open source software, whether freely available or commercially available, is always the best option, as it offers us the freedom to do whatever we like with it, as discussed in Chapter 1. This includes extending it, verifying it, tuning it to suit our needs, and even selling it. Such software is often of higher quality than commercial closed source software because of its open nature. For our purposes, we need some good tools that are freely available to everyone and can be freely modified and extended by anyone.

Therefore we use the open source and free data mining tool Rattle, which is built on the open source and free statistical software environment R.

See Appendix A for instructions on obtaining the software. Now is a good time to install R. Much of what follows for the rest of the book, and specifically this chapter, relies on interacting with R and Rattle. The aim is to build a model that captures the essence of the knowledge discovered from our data.
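
As a minimal sketch, assuming R itself has already been installed and a CRAN mirror is reachable, the Rattle package is obtained from within R:

install.packages("rattle")   # download and install Rattle and its dependencies from CRAN

Depending on the platform, Rattle may ask to install additional graphical toolkit libraries the first time it is started.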

Be careful, though. Once we have quality data, Rattle can build a model with just four mouse clicks, but the effort is in preparing the data and understanding and then fine-tuning the models. In this chapter, we use Rattle to build our first data mining model—a simple decision tree model, which is one of the most common models in data mining.

We cover starting up and quitting from R, an overview of how we interact with Rattle, and then how to load a dataset and build a model. Once the enthusiasm for building a model is satisfied, we then review the larger tasks of understanding the data and evaluating the model.

This assumes that we have already installed R, as detailed in Appendix A. One way or another, we should see a window Figure 2.

We will generally refer to this as the R Console. These include options for working with script files, managing packages, and obtaining help. We start Rattle by loading rattle into the R library using library(). We supply the name of the package to load as the argument to the command. The rattle command is then entered with an empty argument list, as shown below. The prompt indicates that R is awaiting user commands. Tip: The key to using Rattle, as hinted at in the status bar on starting up Rattle, is to supply the appropriate information for a particular tab and to then click the Execute button to perform the action.
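
In code, the start-up sequence just described is only two commands at the R prompt (the exact start-up messages vary between versions of the package):

library(rattle)   # load the rattle package into the current R session
rattle()          # start the Rattle graphical interface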

Always make sure you have clicked the Execute button before proceeding to the next step. To exit from Rattle, we simply click the Quit button. For R, the startup message (Figure 2) tells us to type q() to quit. We type this command into the R Console, including the parentheses, so that the command is invoked rather than simply listing its definition.

The workspace refers to all of the datasets and any other objects we have created in the current R session.

We can save all of the objects currently available in a workspace between different invocations of R. We do so by choosing the y option. We might be in the middle of some complex analysis and wish to resume it at a later time, so this option is useful. Many users generally answer n each time here, having already captured their analyses into script files. Script files allow us to automatically regenerate the results as required, and perhaps avoid saving and managing very large workspace files.

If we do not actually want to quit, we can answer c to cancel the operation and return to the R Console. The amount of such effort should not be underestimated, but we do skip this step for now. Once we have processed our data, we are ready to build a model—and with Rattle we can build the model with just a few mouse clicks.

Using a sample dataset that someone else has already prepared for us, in Rattle we simply: 1. Click on the Execute button. Rattle will notice that no dataset has been identified, so it will take action, as in the next step, to ensure we have some data. This is covered in detail in Section 2. Click on Yes within the resulting popup. The weather dataset is provided with Rattle as a small and simple dataset to explore the concepts of data mining. The dataset is described in detail in Chapter 3.
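
The same dataset can also be peeked at directly from the R Console, since it ships with the rattle package; a minimal sketch:

library(rattle)       # the weather dataset is bundled with the rattle package
data(weather)         # load the example Canberra weather observations
dim(weather)          # number of observations (rows) and variables (columns)
str(weather[, 1:5])   # a quick look at the structure of the first few variables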

Click on the Model tab. This is where we tell Rattle what kind of model we want to build and how it should be built. The Model tab is described in more detail in Section 2.

Once we have specified what we want done, we ask Rattle to do it by clicking the Execute button. For simple model builders for small datasets, Rattle will only take a second or two before we see the results displayed in the text view window. The data comes from a weather monitoring station located in Canberra, Australia, via the Australian Bureau of Meteorology. Each observation is a summary of the weather conditions on a particular day.

It has been processed to include a target variable that indicates whether it rained the day following the particular observation.

Using this historic data, we have built a model to predict whether it will rain tomorrow. Weather data is commonly available, and you might be able to build a similar model based on data from your own region. With only one or two more clicks, further models can be built. A few more clicks and we have an evaluation chart displaying the performance of the model.

Then, with just a click or two more, we will have the model applied to a new dataset to generate scores for new observations. Now to the details. We will continue to use Rattle and also the simple command line facility. The command line is not strictly necessary in using Rattle, but as we develop our data mining capability, it will become useful. We will load data into Rattle and explain the model that we have built. We will build a second model and compare their performances.

We will then apply the model to a new dataset to provide scores for a collection of new observations. Now we want to illustrate loading any data (perhaps our own data) into Rattle. If we have followed the four steps in Section 2, we will need to reset Rattle first: simply click the New button within the toolbar. We are asked to confirm that we would like to clear the current project. Either way, we need to have a fresh Rattle ready so that we can follow the examples below.

On starting Rattle, we can, without any other action, click the Execute button in the toolbar. Click on Yes to do so, to see the data listed, as shown in Figure 2. The dataset consists of observations on 24 variables, as noted in the status bar. The first variable has a role other than the default Input role.

Rattle uses heuristics to initialise the roles. Within R, a dataset is actually known as a data frame, and we will see this terminology frequently. The dataset summary is shown in Figure 2. The types will generally be Numeric if the data consists of numbers (like temperature, rainfall, and wind speed) or Categoric if the data consists of characters from the alphabet (like the wind direction, which might be N or S, etc.).

An Ident is often one of the variables (columns) in the data that uniquely identifies each observation (row) of the data. The Comments column includes general information like the number of unique or distinct values the variable has and how many observations have a missing value for a variable.

To build a decision tree model, one of the most common data mining models, click the Execute button (decision trees are the default). A textual representation of the model is shown in Figure 2. The target variable (which stores the outcome we want to model or predict) is RainTomorrow, as we see in the Data tab window of Figure 2. Rattle automatically chose this variable as the target because it is the last variable in the data file and is a binary (i.e., two-valued) categoric variable.
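
For readers who prefer to see what happens underneath, a rough equivalent can be run at the R Console with the rpart package, the decision tree implementation that Rattle uses; a sketch, assuming the weather data frame loaded earlier (the exact tree will differ a little from Rattle's, which first partitions the data into training and other subsets):

library(rpart)    # recursive partitioning: decision trees
library(rattle)   # for the bundled weather data
data(weather)

# Drop the identifier columns and the RISK_MM outcome variable before modelling.
# (Variable names here follow the weather data frame shipped with rattle.)
vars  <- setdiff(names(weather), c("Date", "Location", "RISK_MM"))
model <- rpart(RainTomorrow ~ ., data = weather[vars], method = "class")

print(model)               # the textual tree shown in Rattle's Model tab
plot(model); text(model)   # a basic drawing of the tree

The rattle package also provides fancyRpartPlot() for a more polished rendering of the same tree, similar in spirit to what the Draw button produces.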

Using the weather dataset, our modelling task is to learn about the prospect of it raining tomorrow given what we know about today. The textual presentation of the model appears in Figure 2. For now, we might click on the Draw button provided by Rattle to obtain the plot that we see in Figure 2.

The plot provides a better idea of why it is called a decision tree. This is just a different way of representing the same model.

This is yet another way to represent the same model. The rules are listed here, and we explain them in detail next. The rules are perhaps the more readable representation of the model. We traverse the tree by following the branches corresponding to the tests at each node.
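
From the R Console, the same rule listing can be produced for the rpart model built in the earlier sketch; asRules() is a helper exported by the rattle package:

library(rattle)
asRules(model)   # print each root-to-leaf path of the tree as a readable rule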

The leaf nodes include a node number for reference, a decision of No or Yes to indicate whether it will RainTomorrow, the number of training observations, and the strength or confidence of the decision. The interpretation of the probability will be explained in more detail in Chapter 11, but we provide an intuitive reading here.

We can read it as saying that if the atmospheric pressure reduced to mean sea level at 3 pm was less than hectopascals and the amount of sunshine today was less than 8. That is to say that on most days when we have previously seen these conditions as represented in the data it has rained the following day.

Rule number 4 has two conditions: the atmospheric pressure at 3 pm greater than or equal to hectopascals and cloud cover at 3 pm less than 7.

When these conditions hold, the historic data tells us that it is unlikely to be raining tomorrow. We now have our first model. We have data-mined our historic observations of weather to help provide some insight about the likelihood of it raining tomorrow.

A realistic data mining project, though, will precede modelling with quite an extensive exploration of data, in addition to understanding the business, understanding what data is available, and transforming such data into a form suitable for modelling. There is a lot more involved than just building a model. We look now at exploring our data to better understand it and to identify what we might want to do with it.

We will cover exploratory data analysis in detail in Chapters 5 and 6. We present here an initial flavour of exploratory data analysis. One of the first things we might want to know is how the values of the target variable RainTomorrow are distributed. A histogram might be useful for this. The simplest way to create one is to go to the Data tab, click on the Input role for RainTomorrow, and click the Execute button.
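
Equivalently, the distribution of the target can be checked at the R Console; a quick base R sketch using the weather data frame from earlier:

table(weather$RainTomorrow)                  # counts of No and Yes days
prop.table(table(weather$RainTomorrow))      # the same counts as proportions
barplot(table(weather$RainTomorrow),
        main = "Distribution of RainTomorrow")   # a simple bar chart of the two classes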

The plot of Figure 2 shows how the values of RainTomorrow are distributed. We can see from Figure 2 that the distribution is quite skewed. This is typical of data mining, where even greater skewness is not uncommon. We can display other simple plots from the Explore tab by selecting the Distributions option.

Then click on Execute to display the plots in Figure 2. The plots begin to tell a story about the data. We sketch the story here, leaving the details to Chapter 5. The top two plots are known as box-and-whisker plots. The top left plot tells us that the maximum temperature is generally higher the day before it rains (the plot above the x-axis label Yes) than before the days when it does not rain (above the No).

The top right plot suggests an even more dramatic skew for the amount of sunshine the day prior to the prediction. Generally we see that if there is less sunshine the day before, then the chance of rain Yes seems to be increased. Both box plots also give another clue about the distribution of the values of the target variable. The width of the boxes in a box plot provides a visual indication of this distribution.

Each bottom plot overlays three separate plots that give further insight into the distribution of the observations. The three plots within each figure are a histogram (bars), a density plot (lines), and a rug plot (short spikes on the x-axis), each of which we now briefly describe.

The histogram has partitioned the numeric data into segments of equal width, showing the frequency for each segment. The density plots tend to convey a more accurate picture of the distribution of the data. Because the density plot is a simple line, we can also display the density plots for each of the target classes (Yes and No). Along the x-axis is the rug plot. The short vertical lines represent actual observations. This can give us an idea of where any extreme values are, and the dense parts show where more of the observations lie.
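
A base R sketch of how such a combined plot can be drawn for one numeric variable (Sunshine is used here purely as an example, with missing values dropped):

x <- na.omit(weather$Sunshine)                        # hours of sunshine, missing values removed
hist(x, freq = FALSE, col = "grey90",
     main = "Sunshine", xlab = "Hours of sunshine")   # histogram on the density scale
lines(density(x), lwd = 2)                            # smooth density estimate drawn over the bars
rug(x)                                                # short spikes marking individual observations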

These plots are useful in understanding the distribution of the numeric data. Rattle similarly provides a number of simple standard plots for categoric variables. A selection is shown in Figure 2. All three plots show a different view of the one variable, WindDir9am, as we now describe. The top plot of Figure 2 is a bar chart. The bar chart has been sorted from the overall most frequent to the overall least frequent categoric value.

We note that each value of the variable (e.g., each wind direction) has three bars. The first bar is the overall frequency (i.e., the count over all observations). The second and third bars show the breakdown for the values across the respective values of the categoric target variable (i.e., No and Yes). We can see that the distribution within each wind direction differs between the three groups, some more than others. Recall that the three groups correspond to all observations (All), observations where it did not rain on the following day (No), and observations where it did (Yes).

The lower two plots show essentially the same information, in different forms. The bottom left plot is a dot plot. The breakdown into the levels of the target variable is compactly shown as dots within the same row. The bottom right plot is a mosaic plot, with all bars having the same height. The relative frequencies between the values of WindDir9am are now indicated by the widths of the bars. A mosaic plot allows us to easily identify levels that have very different proportions associated with the levels of the target variable.
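
A base R sketch of the corresponding summaries for a categoric variable, using the WindDir9am variable discussed above:

tab <- table(weather$WindDir9am, weather$RainTomorrow)   # wind direction by rain tomorrow
tab                                                      # the counts behind the bar chart
mosaicplot(tab, main = "WindDir9am by RainTomorrow",
           xlab = "Wind direction at 9 am", ylab = "Rain tomorrow")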

We can see that a north wind direction has a higher proportion of observations where it rains the following day. That is, if there is a northerly wind today, then the chance of rain tomorrow seems to be increased.

These examples demonstrate that data visualisation or exploratory data analysis is a powerful tool for understanding our data—a picture is worth a thousand words. We actually learn quite a lot about our data even before we start to specifically model it. Many data miners begin to deliver significant benefits to their clients simply by providing such insights.

We delve further into exploring data in Chapter 5. We have illustrated above how to then build our first model. It is now time to evaluate the performance or quality of the model. Evaluation is a critical step in any data mining process, and one that is often left underdone.

For the sake of getting started, we will look at a simple evaluation tool. The confusion matrix (also referred to as the error matrix) is a common mechanism for evaluating model performance. The validation dataset is used to test different parameter settings or different choices of variables whilst we are data mining. It is important to note that this dataset should not be used to provide any error estimations of the final results from data mining since it has been used as part of the process of building the model.

The testing dataset is only to be used to predict the unbiased error of the final results. It is important not to use this testing dataset in any way in building or even fine-tuning the models that we build. Otherwise, it no longer provides an unbiased estimate of the model performance. The testing dataset and, whilst we are building models, the validation dataset, are used to test the performance of the models we build. This often involves calculating the model error rate.

A confusion matrix simply compares the decisions made by the model with the actual decisions. This will provide us with an understanding of the level of accuracy of the model in terms of how well the model will perform on new, previously unseen, data. Two tables are presented. The first lists the actual counts of observations and the second the percentages.

That is, 35 days out of the 56 days are correctly predicted as not raining. In terms of how correct the model is, we observe that it correctly predicts rain for 10 days out of the 15 days on which it actually does rain. We also see six days when we are expecting rain and none occurs (called the false positives). If we were using this model to help us decide whether to take an umbrella or raincoat with us on our travels tomorrow, then it is probably not a serious loss in this circumstance—we had to carry an umbrella without needing to use it.

Perhaps more serious though is that there are five days when our model tells us there will be no rain yet it rains (called the false negatives). We might get inconveniently wet without our umbrella. The concepts of true and false positives and negatives will be further covered in a later chapter. The performance measure here tells us that we are going to get wet more often than we would like. This is an important issue—the fact that the different types of errors have different consequences for us.
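
A sketch of how the same kind of confusion matrix can be computed at the R Console, reusing the vars selection from the earlier snippet and holding out a portion of the weather data to play the role of the validation dataset (Rattle's own partition sizes will differ):

library(rpart)
set.seed(42)                                # make the random split reproducible
n     <- nrow(weather)
train <- sample(n, round(0.7 * n))          # 70% of the observations for training
fit   <- rpart(RainTomorrow ~ ., data = weather[train, vars], method = "class")

# Predict on the held-out rows and tabulate predictions against what actually happened.
pred   <- predict(fit, newdata = weather[-train, vars], type = "class")
actual <- weather$RainTomorrow[-train]
table(actual, pred)                              # counts: rows are actual, columns predicted
round(100 * prop.table(table(actual, pred)), 1)  # the same matrix expressed as percentages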

To do so, we can select the Validation and then the Training options and for completeness the Full option from the Data line of the Evaluate tab and then Execute each. The resulting performance will be reported. We reproduce all four here for comparison, including the count and the percentages.

That is not surprising since the tree was built using the training dataset, and so it should be more accurate on what it has already seen. This provides a hint as to why we do not validate our model on the training dataset—the evaluation will provide optimistic estimates of the performance of the model.

This is more likely how accurate the model will be longer-term as we apply it to new observations. We have loaded some data, explored it, cleaned and transformed it, built a model, and evaluated the model. The model is now ready to be deployed.

Of course, there is a lot more to what we have just done than what we have covered here. The remainder of the book provides much of these details.

Before proceeding to the details, though, we might review how we interact with Rattle and R. We have seen the Rattle interface throughout this chapter and we now introduce it more systematically. The interface is based on a set of tabs through which we progress as we work our way through a data mining project.

For any tab, once we have set up the required information, we will click the Execute button to perform the actions. Take a moment to explore the interface a little.

Notice the Help menu and that the help layout mimics the tab layout. The Rattle interface is designed as a simple interface to a powerful suite of underlying tools for data mining. The general process is to step through each tab, left to right, performing the corresponding actions.

For any tab, we configure the options and then click the Execute button (or F2) to perform the appropriate tasks. It is important to note that the tasks are not performed until the Execute button (or F2, or the Execute menu item under Tools) is clicked. The Status Bar at the base of the window will indicate when the action is completed. Messages from R (e.g., error messages) may also appear in the R Console. Since Rattle is a simple graphical interface sitting on top of R itself, it is important to remember that some errors encountered by R on loading the data (and in fact during any operation performed by Rattle) may be displayed in the R Console.

This allows us to review the R commands that perform the corresponding data mining tasks. The R code snippets can be copied as text from the Log tab and pasted into the R Console from which Rattle is running, to be directly executed. This allows us to deploy Rattle for basic tasks yet still gives us the full power of R to be deployed as needed, perhaps through using more command options than are exposed through the Rattle interface. This also allows us the opportunity to export the whole session as an R script file.

The log serves as a record of the actions taken and allows those actions to be repeated directly and automatically through R itself at a later time. Simply select to display the Log tab and click on the Export button. This will export the log to a file that will have an R extension. We now traverse the main elements of the Rattle user interface, specif- ically the toolbar and menus.

We begin with a basic concept—a project.

Projects

A project is a packaging of a dataset, variable selections, explorations, and models built from the data. Rattle allows projects to be saved for later resumption of the work or for sharing the data mining project with other users.

A project is typically saved to a file with a rattle extension. In fact, the file is a standard binary RData file used by R to store objects in a more compact binary form.
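
As a small sketch of what that means in practice (the filename myproject.rattle below is purely hypothetical), a saved project can be opened in a plain R session without Rattle running:

loaded <- load("myproject.rattle")   # an ordinary RData load restores the saved objects
loaded                               # the names of the objects that were restored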

Any R system can load such a file and hence have access to these objects, even without running Rattle. Loading a rattle file into Rattle using the Open button will load that project into Rattle, restoring the data, models, and other displayed information related to the project, including the log and summary information. We can then resume our data mining from that point. From a file system point of view, we can rename the files (as well as the filename extension, though that is not recommended) without impacting the project file itself—that is, the filename has no formal bearing on the contents, so use it to be descriptive.

It is best to avoid spaces and unusual characters in the filenames.

Toolbar

The most important button on the Toolbar (Figure 2) is the Execute button. All action is initiated with an Execute, often with a click of the Execute button. A keyboard shortcut for Execute is the F2 function key. A menu item for Execute is also available. It is worth repeating that the user interface paradigm used within Rattle is to set up the parameters on a tab and then Execute the tab.

The next few buttons on the Toolbar relate to the concept of a project within Rattle. Projects were discussed above. Clicking on the New button will restore Rattle to its pristine startup state with no dataset loaded. This can be useful when a source dataset has been modified externally to Rattle and R. We might, for example, have manipulated our data in a spreadsheet or database program and re-exported the data to a CSV file. To reload this file into Rattle, if we have previously loaded it into the current Rattle session, we need to clear Rattle, as with a click of the New button.

We can then specify the filename and reload it. The Report button will generate a formatted report based on the cur- rent tab. A number of report templates are provided with Rattle and will generate a document in the open standard ODT format, for the open source and open standards supporting LibreOffice.

Whilst support for user-generated reports is limited, the log provides the necessary commands used to generate the ODT file. We can thus create our own ODT templates and apply them within the context of the current Rattle session. The Export button is available to export various objects and entities from Rattle.

Details are available together with the specific sections in the following chapters. The nature of the export depends on which tab is active and within the tab, which option is active.

The Export button is not available for all tabs and options.

Menus

The menus (Figure 2) provide access to much of the same functionality as the toolbar. A key point in introducing menus is that they can be navigated from the keyboard and contain keyboard shortcuts so that we can navigate more easily through Rattle using the keyboard.

The Project menu provides access to the Open and Save options for loading and saving projects from or to files. The Tools menu provides access to some of the other toolbar functions as well as access to spe- cific tabs. The Settings menu allows us to control a number of optional characteristics of Rattle.

This includes tooltips and the use of the more modern Cairo graphics device. Extensive help is available through the Help menu. The structure of the menu follows that of the tabs of the main interface. On selecting a help topic, a brief text popup will display some basic information.

Discover how to write code for various prediction models, stream data, and time-series data.

You will also be introduced to solutions written in R based on RHadoop projects. You will finish this book feeling confident in your ability to know which data mining algorithm to apply in any situation. This book assumes familiarity with only the very basics of R, such as the main data types, simple functions, and how to move data around.

No prior experience with data mining packages is necessary; however, you should have a basic understanding of data mining concepts and processes.

Book Description: Being able to deal with the array of problems that you may encounter during complex statistical projects can be difficult.
