<?xml version='1.0' encoding='UTF-8'?><?xml-stylesheet href="http://www.blogger.com/styles/atom.css" type="text/css"?><feed xmlns='http://www.w3.org/2005/Atom' xmlns:openSearch='http://a9.com/-/spec/opensearchrss/1.0/' xmlns:georss='http://www.georss.org/georss' xmlns:gd='http://schemas.google.com/g/2005' xmlns:thr='http://purl.org/syndication/thread/1.0'><id>tag:blogger.com,1999:blog-4375349027250083196</id><updated>2011-07-28T15:55:42.054-07:00</updated><title type='text'>Exploratory Data Analysis</title><subtitle type='html'></subtitle><link rel='http://schemas.google.com/g/2005#feed' type='application/atom+xml' href='http://exploratorydataanalysis.blogspot.com/feeds/posts/default'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4375349027250083196/posts/default?max-results=100'/><link rel='alternate' type='text/html' href='http://exploratorydataanalysis.blogspot.com/'/><link rel='hub' href='http://pubsubhubbub.appspot.com/'/><author><name>Jim Albert</name><uri>http://www.blogger.com/profile/12622333572321654094</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><generator version='7.00' uri='http://www.blogger.com'>Blogger</generator><openSearch:totalResults>41</openSearch:totalResults><openSearch:startIndex>1</openSearch:startIndex><openSearch:itemsPerPage>100</openSearch:itemsPerPage><entry><id>tag:blogger.com,1999:blog-4375349027250083196.post-8111854516265172157</id><published>2009-04-08T04:35:00.000-07:00</published><updated>2009-04-08T04:55:58.314-07:00</updated><title type='text'>Binning Baseball Ages</title><content type='html'>To motivate the issues on the next EDA topic, I collected the ages of all 651 pitchers who played major league baseball during the 2008 season.&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;I have a vector age that contains the ages in years for these 
pitchers.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;A histogram is a standard way of graphing this data.  Before I graph, I should select reasonable bins; I could use the default selection of bins chosen by the R hist command, but I typically like more control over my graphical displays.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Here I'm interested in the number of players at each possible age: 20, 21, 22, and so on.  So I choose cutpoints 19.5, 20.5, ..., 46.5 that cover the range of the data, so that there will be no confusion about data falling on bin boundaries.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;cutpoints=seq(19.5,46.5)&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Now I can call the hist function with the optional breaks argument.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;hist(age,breaks=cutpoints)&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;img src="http://4.bp.blogspot.com/_V8g1rNtmHuM/SdyNdlSb6FI/AAAAAAAAAVQ/E7jkbJ1XPBk/s320/hist0.jpg" style="float:left; margin:0 10px 10px 0;cursor:pointer; cursor:hand;width: 320px; height: 319px;" border="0" alt="" id="BLOGGER_PHOTO_ID_5322284399010244690" /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;What do I see in this 
display?&lt;/div&gt;&lt;div&gt;&lt;ul&gt;&lt;li&gt;The shape of the data looks a bit right-skewed.  I'm a little surprised by the number of pitchers who are 40 or older.&lt;/li&gt;&lt;li&gt;The most popular ages among MLB pitchers are 25 and 27.&lt;/li&gt;&lt;li&gt;Looking more carefully, it might seem a little odd that we have 79 pitchers at each of ages 25 and 27, but only 66 pitchers who are age 26.&lt;/li&gt;&lt;/ul&gt;&lt;div&gt;What is causing this odd behavior in the frequencies for popular ages?  We don't see this behavior for the bins with small counts.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Actually, this "odd behavior" is just an implication of the basic EDA idea that&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;LARGE COUNTS HAVE LARGER VARIABILITY THAN SMALL COUNTS&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;So we typically will see this type of behavior whenever we construct a histogram.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;When we plot a histogram, it would seem desirable to remove this "variability problem" so it is easier to make comparisons.  For example, when we compare the counts to expected counts assuming a Gaussian model, it will be hard to compare the residuals for bins with large counts against those for bins with small counts, since the two have unequal variability.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;This discussion motivates the construction of a rootogram and eventually a suspended rootogram to make comparisons with a symmetric curve.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;By the way, can we fit a Gaussian curve to our histogram?  
On R, we&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;ul&gt;&lt;li&gt;first plot the histogram using the freq=FALSE option -- the vertical scale will be DENSITY rather than COUNTS&lt;/li&gt;&lt;li&gt;use the curve command to add a normal curve where the mean and standard deviation are found by the mean and sd of the ages&lt;/li&gt;&lt;/ul&gt;&lt;div&gt;Here are the commands and the resulting graph.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;hist(age,breaks=cutpoints,freq=FALSE)&lt;/div&gt;&lt;div&gt;curve(dnorm(x,mean(age),sd(age)),add=TRUE,col="red",lwd=3)&lt;/div&gt;&lt;div&gt;&lt;img src="http://2.bp.blogspot.com/_V8g1rNtmHuM/SdyQeCl_NeI/AAAAAAAAAVY/a7ah80--xFU/s320/hist2.jpg" style="float:left; margin:0 10px 10px 0;cursor:pointer; cursor:hand;width: 320px; height: 319px;" border="0" alt="" id="BLOGGER_PHOTO_ID_5322287705411761634" /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;It should be clear that a Gaussian curve is not a good model for baseball ages.&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4375349027250083196-8111854516265172157?l=exploratorydataanalysis.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' 
type='application/atom+xml' href='http://exploratorydataanalysis.blogspot.com/feeds/8111854516265172157/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4375349027250083196&amp;postID=8111854516265172157' title='40 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4375349027250083196/posts/default/8111854516265172157'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4375349027250083196/posts/default/8111854516265172157'/><link rel='alternate' type='text/html' href='http://exploratorydataanalysis.blogspot.com/2009/04/binning-baseball-ages.html' title='Binning Baseball Ages'/><author><name>Jim Albert</name><uri>http://www.blogger.com/profile/12622333572321654094</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://4.bp.blogspot.com/_V8g1rNtmHuM/SdyNdlSb6FI/AAAAAAAAAVQ/E7jkbJ1XPBk/s72-c/hist0.jpg' height='72' width='72'/><thr:total>40</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4375349027250083196.post-5963761128899656891</id><published>2009-03-25T08:11:00.000-07:00</published><updated>2009-03-25T08:54:58.791-07:00</updated><title type='text'>Looking at Hits on my Web Page</title><content type='html'>The next topic is two-way tables.  To illustrate the application of median polish, I took some convenient data, that is, data that was readily available to me.&lt;br /&gt;&lt;br /&gt;A couple of years ago, I wrote a book on Bayesian computing using R.  I have a website that gives resource material for the book and I use Google Analytics to monitor hits on this particular website.  
Each day I observe the number and location of hits; it is interesting data partly since it seems that statisticians from many countries are interested in my book.&lt;br /&gt;&lt;br /&gt;Anyway, here is the data -- a number in the table represents the number of hits for a particular day of the week for a particular week.&lt;br /&gt;&lt;br /&gt; &lt;span style="font-family: courier new;"&gt;          Week 1 Week 2 Week 3 Week 4&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;Sunday        22     12     17     15&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;Monday        23     15     27     17&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;Tuesday       17     26     21     14&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;Wednesday     26     13     18     18&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;Thursday      24     27     28     13&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;Friday        28     17     17     19&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;Saturday      14     11     13     13&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;I am interested in how the website hits vary across days of the week and also how the website hits vary across weeks.  I can explore these patterns by means of an additive fit that I do by the median polish algorithm.&lt;br /&gt;&lt;br /&gt;Since the data is stored as a matrix, a median polish is done by the medpolish function:&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;fit=medpolish(web.hits)&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;The variable "fit" stores the output of medpolish.  
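&lt;br /&gt;&lt;br /&gt;For anyone who wants to reproduce this, here is a minimal sketch, assuming the table above is typed in by hand as a matrix called web.hits.  The matrix call and the check at the end are illustrative assumptions (they are not commands shown in this post); medpolish itself is in the stats package that loads with R.&lt;br /&gt;&lt;pre&gt;
```r
# Hit counts copied from the table above: rows are days, columns are weeks.
web.hits = matrix(
  c(22, 12, 17, 15,
    23, 15, 27, 17,
    17, 26, 21, 14,
    26, 13, 18, 18,
    24, 27, 28, 13,
    28, 17, 17, 19,
    14, 11, 13, 13),
  nrow = 7, byrow = TRUE,
  dimnames = list(
    c("Sunday", "Monday", "Tuesday", "Wednesday",
      "Thursday", "Friday", "Saturday"),
    c("Week 1", "Week 2", "Week 3", "Week 4")))

# Additive fit by median polish (same call as in the post).
fit = medpolish(web.hits)

# Rebuild the table from the pieces of the fit:
# overall + row effect + column effect + residual.
rebuilt = fit$overall + outer(fit$row, fit$col, "+") + fit$residuals
all.equal(rebuilt, web.hits, check.attributes = FALSE)
```
&lt;/pre&gt;The last line should return TRUE, because an additive fit splits each count into overall + row effect + column effect + residual.&lt;br /&gt;&lt;br /&gt;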
Let's look at each component of the medpolish output.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;&gt; fit$overall&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;[1] 18&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;This tells us that a typical number of hits (per day) on my website was 18.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;&gt; fit$row&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;   Sunday    Monday   Tuesday Wednesday  Thursday    Friday &lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;    -2.00      0.50     -0.50      0.00      5.50      1.75 &lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt; Saturday &lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;    -5.25 &lt;/span&gt;&lt;br /&gt;&lt;br /&gt;These are the row effects.  For Sunday, the row effect is -2 -- this means that on this day, the number of hits tends to be 2 smaller than the typical value.  Comparing Sunday and Monday, there tend to be 0.50 - (-2.00) = 2.5 more hits on Monday.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;&gt; fit$col&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;Week 1 Week 2 Week 3 Week 4 &lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;  4.50  -2.75   1.00  -1.00 &lt;/span&gt;&lt;br /&gt;&lt;br /&gt;These are the column effects.  It looks like my website hits across weeks were HIGH, LOW, high, low.  On average, there were 4.50 - (-2.75) = 7.25 more hits in Week 1 than in Week 2.&lt;br /&gt;&lt;br /&gt;The remaining components of the additive fit are the residuals.  
These tell us how the hit values deviate from the fitted values (from the additive model).&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;&gt; fit$residuals&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;           Week 1 Week 2 Week 3 Week 4&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt; Sunday      1.50  -1.25   0.00   0.00&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt; Monday      0.00  -0.75   7.50  -0.50&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt; Tuesday    -5.00  11.25   2.50  -2.50&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt; Wednesday   3.50  -2.25  -1.00   1.00&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt; Thursday   -4.00   6.25   3.50  -9.50&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt; Friday      3.75   0.00  -3.75   0.25&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt; Saturday   -3.25   1.00  -0.75   1.25&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;If the residual values are generally small (small compared to the row and column effects), then the additive model is a good description of the patterns in the data.   Actually, the residuals look large to me, so I'm not sure I'd get that excited about this additive fit.  
Specifically, the residual for Tuesday, Week 2 is 11.25 -- for some reason, this particular day had many hits -- many more than one would expect based on its day of the week and week number.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4375349027250083196-5963761128899656891?l=exploratorydataanalysis.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://exploratorydataanalysis.blogspot.com/feeds/5963761128899656891/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4375349027250083196&amp;postID=5963761128899656891' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4375349027250083196/posts/default/5963761128899656891'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4375349027250083196/posts/default/5963761128899656891'/><link rel='alternate' type='text/html' href='http://exploratorydataanalysis.blogspot.com/2009/03/looking-at-hits-on-my-web-page.html' title='Looking at Hits on my Web Page'/><author><name>Jim Albert</name><uri>http://www.blogger.com/profile/12622333572321654094</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4375349027250083196.post-3269287717269499312</id><published>2009-03-17T04:57:00.000-07:00</published><updated>2009-03-17T05:05:01.872-07:00</updated><title type='text'>Smoothing on R</title><content type='html'>&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_V8g1rNtmHuM/Sb-RsvdkULI/AAAAAAAAAVI/KxmdcfKKQic/s1600-h/plot.jpg"&gt;&lt;/a&gt;&lt;br /&gt;One of you asked how to produce a "3RSSH, twice" smooth on R.  
It seems that my R notes on the web could be clarified.  Here I illustrate a simple function to do the smooth that we want.&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Here is a new function that you can use called smooth.3RSSH.twice.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;smooth.3RSSH.twice=function(data)&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;{&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;SMOOTH=han(smooth(data,kind="3RSS"))   # 3RSSH smooth of the input&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;ROUGH=data-SMOOTH                      # computes the rough&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;SMOOTH+han(smooth(ROUGH,kind="3RSS"))  # twicing operation&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;}&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;This function does three things:&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;1.  First, one computes a 3RSSH smooth using the smooth command in R and the han function from the LearnEDA package.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;2.  Then one computes the rough (the residuals) from this smooth.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;3.  
Then one smooths the rough (using the same 3RSSH smooth) and adds this "smoothed rough" to the first smooth to get a "twiced smooth".&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Here is an illustration of how it works for the Braves attendance data from the notes.&lt;/div&gt;&lt;div&gt;(I am assuming the above function has been read into R.)&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;library(LearnEDA)&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;data(brave.at)&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;attach(brave.at)&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;plot(game,attend)&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;the.smooth=smooth.3RSSH.twice(attend)&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;lines(game,the.smooth,lwd=3,col="green")&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="color: rgb(0, 0, 238); "&gt;&lt;img src="http://4.bp.blogspot.com/_V8g1rNtmHuM/Sb-RsvdkULI/AAAAAAAAAVI/KxmdcfKKQic/s320/plot.jpg" border="0" alt="" id="BLOGGER_PHOTO_ID_5314126283161227442" style="float: left; margin-top: 0px; margin-right: 10px; margin-bottom: 10px; margin-left: 0px; cursor: pointer; width: 320px; height: 319px; " /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4375349027250083196-3269287717269499312?l=exploratorydataanalysis.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' 
href='http://exploratorydataanalysis.blogspot.com/feeds/3269287717269499312/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4375349027250083196&amp;postID=3269287717269499312' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4375349027250083196/posts/default/3269287717269499312'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4375349027250083196/posts/default/3269287717269499312'/><link rel='alternate' type='text/html' href='http://exploratorydataanalysis.blogspot.com/2009/03/smoothing-on-r.html' title='Smoothing on R'/><author><name>Jim Albert</name><uri>http://www.blogger.com/profile/12622333572321654094</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://4.bp.blogspot.com/_V8g1rNtmHuM/Sb-RsvdkULI/AAAAAAAAAVI/KxmdcfKKQic/s72-c/plot.jpg' height='72' width='72'/><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4375349027250083196.post-2587161234942900856</id><published>2009-03-16T07:11:00.000-07:00</published><updated>2009-03-16T07:31:43.771-07:00</updated><title type='text'>Smoothing Free Throw Percentages</title><content type='html'>I hope you had a nice spring break.  My son's tennis team is flying to Florida this week -- I wish I could join him.&lt;br /&gt;&lt;br /&gt;Since March Madness is starting, I thought it would be appropriate to talk about basketball data.  There was an interesting article about free-throw shooting that recently appeared in the New York Times.  
See the article &lt;a href="http://www.nytimes.com/2009/03/04/sports/basketball/04freethrow.html"&gt;here&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;The main message was that free-throw shooting in professional basketball has hovered around 75% for many years.  Unlike other athletic performances such as running and swimming, basketball players don't seem to be getting better in shooting free throws.&lt;br /&gt;&lt;br /&gt;Is that really true?  Has free-throw shooting accuracy remained constant for all of the years of professional basketball?&lt;br /&gt;&lt;br /&gt;To answer this question, I collected some data.  For each of the seasons 1949-50 through the current season 2008-09, I collected the overall free-throw shooting percentage in the NBA.&lt;br /&gt;&lt;br /&gt;Here are the shooting percentages graphed as a function of year.&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_V8g1rNtmHuM/Sb5gvLSvmgI/AAAAAAAAAU4/rhQR6Q3sOGs/s1600-h/plot1.jpg"&gt;&lt;img style="margin: 0pt 10px 10px 0pt; float: left; cursor: pointer; width: 320px; height: 320px;" src="http://4.bp.blogspot.com/_V8g1rNtmHuM/Sb5gvLSvmgI/AAAAAAAAAU4/rhQR6Q3sOGs/s320/plot1.jpg" alt="" id="BLOGGER_PHOTO_ID_5313790973945289218" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;Actually, although the overall free-throw shooting percentage is approximately 75%, there seem to be some interesting patterns in the graph.&lt;br /&gt;&lt;br /&gt;To better see the patterns, I use the command&lt;br /&gt;&lt;br /&gt;han(smooth(FTP,kind="3RSS"))&lt;br /&gt;&lt;br /&gt;to superimpose a 3RSSH smooth on the graph and I get the following:&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" 
href="http://1.bp.blogspot.com/_V8g1rNtmHuM/Sb5hcstCpJI/AAAAAAAAAVA/UV3EvtpQNjk/s1600-h/plot2.jpg"&gt;&lt;img style="margin: 0pt 10px 10px 0pt; float: left; cursor: pointer; width: 320px; height: 320px;" src="http://1.bp.blogspot.com/_V8g1rNtmHuM/Sb5hcstCpJI/AAAAAAAAAVA/UV3EvtpQNjk/s320/plot2.jpg" alt="" id="BLOGGER_PHOTO_ID_5313791756008072338" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;What patterns do we see?&lt;br /&gt;&lt;br /&gt;1.  In the early days between 1950-1970, the shooting percentages were relatively low with a valley around 72% in the late 1960's.&lt;br /&gt;&lt;br /&gt;2.  The shooting percentages increased through the 1970's, had a small valley and hit a peak of about 76% in 1990.&lt;br /&gt;&lt;br /&gt;3.  Then the percentages decreased again and had a local minimum of 74% around 1995.&lt;br /&gt;&lt;br /&gt;4.  In recent years, the percentages are increasing.  It is interesting that the current free-throw shooting percentage 77.2 is the highest in NBA history.&lt;br /&gt;&lt;br /&gt;So in reality, the shooting percentage has not stayed flat across years.  
But it is surprising that NBA players haven't learned to shoot free throws better in the last 60 years.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4375349027250083196-2587161234942900856?l=exploratorydataanalysis.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://exploratorydataanalysis.blogspot.com/feeds/2587161234942900856/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4375349027250083196&amp;postID=2587161234942900856' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4375349027250083196/posts/default/2587161234942900856'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4375349027250083196/posts/default/2587161234942900856'/><link rel='alternate' type='text/html' href='http://exploratorydataanalysis.blogspot.com/2009/03/smoothing-free-throw-percentages.html' title='Smoothing Free Throw Percentages'/><author><name>Jim Albert</name><uri>http://www.blogger.com/profile/12622333572321654094</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://4.bp.blogspot.com/_V8g1rNtmHuM/Sb5gvLSvmgI/AAAAAAAAAU4/rhQR6Q3sOGs/s72-c/plot1.jpg' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4375349027250083196.post-8938800475457280154</id><published>2009-03-02T08:11:00.000-08:00</published><updated>2009-03-02T08:24:15.982-08:00</updated><title type='text'>Fitting a Line by Eye</title><content type='html'>You did fine on the latest Fathom "fitting line" homework.  
But I sensed a little confusion and I should make a few comments about fitting a line by eye.&lt;br /&gt;&lt;br /&gt;Let's return to that homework problem where you are plotting the OBP of the baseball players for two consecutive years.&lt;br /&gt;&lt;br /&gt;Here is a plot that many of you produced.&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/_V8g1rNtmHuM/SawFpqeuQHI/AAAAAAAAAUQ/uf1NITAdqPI/s1600-h/pic1.png"&gt;&lt;img style="margin: 0pt 10px 10px 0pt; float: left; cursor: pointer; width: 309px; height: 261px;" src="http://3.bp.blogspot.com/_V8g1rNtmHuM/SawFpqeuQHI/AAAAAAAAAUQ/uf1NITAdqPI/s320/pic1.png" alt="" id="BLOGGER_PHOTO_ID_5308624274098897010" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;Is this a good graph?  Actually, NO since there is too much white space around the points.&lt;br /&gt;&lt;br /&gt;You can improve this in Fathom by using the hand tool to fill up the space.  Here is a better plot.&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/_V8g1rNtmHuM/SawGAQzrRKI/AAAAAAAAAUY/EMybWyrqSZY/s1600-h/pic2.png"&gt;&lt;img style="margin: 0pt 10px 10px 0pt; float: left; cursor: pointer; width: 310px; height: 256px;" src="http://3.bp.blogspot.com/_V8g1rNtmHuM/SawGAQzrRKI/AAAAAAAAAUY/EMybWyrqSZY/s320/pic2.png" alt="" id="BLOGGER_PHOTO_ID_5308624662344451234" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;Second, it is not easy to fit a movable line.  
To make this process easier, plot the movable line and then add a residual graph.&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/_V8g1rNtmHuM/SawGTyy7MlI/AAAAAAAAAUg/ZHPon4BZAYo/s1600-h/pic3.png"&gt;&lt;img style="margin: 0pt 10px 10px 0pt; float: left; cursor: pointer; width: 320px; height: 308px;" src="http://1.bp.blogspot.com/_V8g1rNtmHuM/SawGTyy7MlI/AAAAAAAAAUg/ZHPon4BZAYo/s320/pic3.png" alt="" id="BLOGGER_PHOTO_ID_5308624997885620818" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;A good line will remove any tilt pattern in the residual graph.  I still have a downward tilt in this graph - this suggests I have to try a little harder.&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/_V8g1rNtmHuM/SawGpzkb4kI/AAAAAAAAAUo/picQziGHVnA/s1600-h/pic4.png"&gt;&lt;img style="margin: 0pt 10px 10px 0pt; float: left; cursor: pointer; width: 320px; height: 310px;" src="http://1.bp.blogspot.com/_V8g1rNtmHuM/SawGpzkb4kI/AAAAAAAAAUo/picQziGHVnA/s320/pic4.png" alt="" id="BLOGGER_PHOTO_ID_5308625376050405954" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;This looks better -- I don't see much of an increasing or decreasing pattern in the residuals.&lt;br /&gt;&lt;br /&gt;Does my best fit correspond to a least-squares or resistant fit?  (By the way, the resistant line is called a median-median line in Fathom.)  I show all three lines below.  
Least-squares is blue, median-median is purple, and my line is brown.&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/_V8g1rNtmHuM/SawHLs-BpuI/AAAAAAAAAUw/_4tR39W3LTc/s1600-h/pic5.png"&gt;&lt;img style="margin: 0pt 10px 10px 0pt; float: left; cursor: pointer; width: 304px; height: 320px;" src="http://2.bp.blogspot.com/_V8g1rNtmHuM/SawHLs-BpuI/AAAAAAAAAUw/_4tR39W3LTc/s320/pic5.png" alt="" id="BLOGGER_PHOTO_ID_5308625958394242786" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;It looks like my line is closer to the resistant fit. &lt;br /&gt;&lt;br /&gt;Generally, when one has outliers, I would anticipate that my line would be closer to the resistant median-median line.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4375349027250083196-8938800475457280154?l=exploratorydataanalysis.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://exploratorydataanalysis.blogspot.com/feeds/8938800475457280154/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4375349027250083196&amp;postID=8938800475457280154' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4375349027250083196/posts/default/8938800475457280154'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4375349027250083196/posts/default/8938800475457280154'/><link rel='alternate' type='text/html' href='http://exploratorydataanalysis.blogspot.com/2009/03/fitting-line-by-eye.html' title='Fitting a Line by Eye'/><author><name>Jim 
Albert</name><uri>http://www.blogger.com/profile/12622333572321654094</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://3.bp.blogspot.com/_V8g1rNtmHuM/SawFpqeuQHI/AAAAAAAAAUQ/uf1NITAdqPI/s72-c/pic1.png' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4375349027250083196.post-4911055510158387977</id><published>2009-02-27T07:47:00.000-08:00</published><updated>2009-02-27T07:57:42.358-08:00</updated><title type='text'>Graphical User Interface for R</title><content type='html'>&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/_V8g1rNtmHuM/SagMHdUTO3I/AAAAAAAAAUI/hGry0WKXhpU/s1600-h/Rcmdr.image.png"&gt;&lt;img style="margin: 0pt 10px 10px 0pt; float: left; cursor: pointer; width: 303px; height: 320px;" src="http://3.bp.blogspot.com/_V8g1rNtmHuM/SagMHdUTO3I/AAAAAAAAAUI/hGry0WKXhpU/s320/Rcmdr.image.png" alt="" id="BLOGGER_PHOTO_ID_5307505483124521842" border="0" /&gt;&lt;/a&gt;Some of you seem to be struggling with R's interface, where you type commands and have to know the R functions.&lt;br /&gt;&lt;br /&gt;There is an attractive package called R Commander that provides a menu interface for R.  You work in a special window and most of the useful R functions are available as options in menus.&lt;br /&gt;&lt;br /&gt;On the left, you see a snapshot of R Commander (on a Macintosh).&lt;br /&gt;&lt;br /&gt;How do you get R Commander?  It is easy -- you just install the package Rcmdr from CRAN.  This package uses a number of other R packages.  
If you don't have these other packages installed yet, then R will automatically install these (it takes a little while -- be patient).&lt;br /&gt;&lt;br /&gt;One of my former students who teaches at Youngstown State uses R Commander in his statistics classes.  He says "Rcmdr is really cool, and the students eat it up."&lt;br /&gt;&lt;br /&gt;Anyway, you might want to try this.  You can run the functions in LearnEDA by first loading it and typing commands (like lval) in the top window.&lt;br /&gt;&lt;br /&gt;Let me know if you find this helpful, since I haven't done much with it.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4375349027250083196-4911055510158387977?l=exploratorydataanalysis.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://exploratorydataanalysis.blogspot.com/feeds/4911055510158387977/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4375349027250083196&amp;postID=4911055510158387977' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4375349027250083196/posts/default/4911055510158387977'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4375349027250083196/posts/default/4911055510158387977'/><link rel='alternate' type='text/html' href='http://exploratorydataanalysis.blogspot.com/2009/02/graphical-user-interface-for-r.html' title='Graphical User Interface for R'/><author><name>Jim Albert</name><uri>http://www.blogger.com/profile/12622333572321654094</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' 
url='http://3.bp.blogspot.com/_V8g1rNtmHuM/SagMHdUTO3I/AAAAAAAAAUI/hGry0WKXhpU/s72-c/Rcmdr.image.png' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4375349027250083196.post-2562882556676695160</id><published>2009-02-26T06:15:00.001-08:00</published><updated>2009-02-26T06:35:05.668-08:00</updated><title type='text'>Plotting and Straightening on R</title><content type='html'>To encourage you to use R for next week's assignment, I put together a new movie showing R in action.&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;As you probably know, one of the fastest growing companies in the U.S. is Google.  I found an interesting graph showing Google's growth (as measured by the number of employees):&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;img src="http://4.bp.blogspot.com/_V8g1rNtmHuM/SaakwhzpRaI/AAAAAAAAAUA/vfnfYx5uk9M/s320/google.graph.jpg" style="float:left; margin:0 10px 10px 0;cursor:pointer; cursor:hand;width: 320px; height: 247px;" border="0" alt="" id="BLOGGER_PHOTO_ID_5307110364518368674" /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Obviously there has been a substantial increase in Google employees over this two year period.  But we wish to say more.  We'd like to describe the size of this growth.  
Also, we'd like to look at the residuals, which may reveal interesting patterns beyond the obvious growth.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Here's the plan.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;1.  We start by plotting the data, fitting a resistant line, and looking at the residuals.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;2.  By looking at the half-slope ratio and the residuals, we decide if the graph is straight.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;3.  If we see curvature in the graph, then we try to reexpress one or more of the variables (by a power transformation) to straighten the graph.  We use half-slope ratios and residual plots to see if we are successful in straightening the graph.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;4.  Once the graph is straight, we summarize the fit (interpret the slope of the resistant line) and examine the residuals.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The key function in the LearnEDA package is rline.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;I illustrate the use of rline in my new movie:&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;a href="http://bayes.bgsu.edu/eda/straightening.google.swf"&gt;http://bayes.bgsu.edu/eda/straightening.google.swf&lt;/a&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The dataset can be found at &lt;a href="http://bayes.bgsu.edu/eda/data/google.txt"&gt;http://bayes.bgsu.edu/eda/data/google.txt &lt;/a&gt;&lt;/div&gt;&lt;div&gt;and my script of R commands for this analysis can be found at&lt;/div&gt;&lt;div&gt;&lt;a href="http://bayes.bgsu.edu/eda/R/Ch5.google.work.R"&gt;http://bayes.bgsu.edu/eda/R/Ch5.google.work.R&lt;/a&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;By the way, I'm using a new version of the LearnEDA package that you can download at&lt;/div&gt;&lt;div&gt;&lt;br 
/&gt;&lt;/div&gt;&lt;div&gt;&lt;a href="http://bayes.bgsu.edu/eda/LearnEDA_1.01.zip"&gt;http://bayes.bgsu.edu/eda/LearnEDA_1.01.zip&lt;/a&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Hope this is helpful in your work next week.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4375349027250083196-2562882556676695160?l=exploratorydataanalysis.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://exploratorydataanalysis.blogspot.com/feeds/2562882556676695160/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4375349027250083196&amp;postID=2562882556676695160' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4375349027250083196/posts/default/2562882556676695160'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4375349027250083196/posts/default/2562882556676695160'/><link rel='alternate' type='text/html' href='http://exploratorydataanalysis.blogspot.com/2009/02/plotting-and-straightening-on-r.html' title='Plotting and Straightening on R'/><author><name>Jim Albert</name><uri>http://www.blogger.com/profile/12622333572321654094</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://4.bp.blogspot.com/_V8g1rNtmHuM/SaakwhzpRaI/AAAAAAAAAUA/vfnfYx5uk9M/s72-c/google.graph.jpg' height='72' 
width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4375349027250083196.post-831915025156794872</id><published>2009-02-25T08:36:00.000-08:00</published><updated>2009-02-25T08:39:21.306-08:00</updated><title type='text'>Sexy Jobs</title><content type='html'>Since many of you will be thinking about jobs soon, here is an interesting posting about desirable jobs.   Many of you are in the right area!&lt;br /&gt;&lt;br /&gt;&lt;span style="color: rgb(255, 0, 0);"&gt;In a recent talk by Google's chief economist Hal Varian, he says this:&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;"I keep saying the sexy job in the next ten years will be statisticians.&lt;br /&gt;People think I’m joking, but who would’ve guessed that computer engineers&lt;br /&gt;would’ve been the sexy job of the 1990s? . . .&lt;br /&gt;       The ability to take data—to be able to understand it, to process it,&lt;br /&gt;to extract value from it, to visualize it, to communicate it—that’s going to&lt;br /&gt;be a hugely important skill in the next decades, not only at the&lt;br /&gt;professional level but even at the educational level for elementary school&lt;br /&gt;kids, for high school kids, for college kids. Because now we really do have&lt;br /&gt;essentially free and ubiquitous data. So the complimentary scarce factor is&lt;br /&gt;the ability to understand that data and extract value from it.&lt;br /&gt;        I think statisticians are part of it, but it’s just a part. You also&lt;br /&gt;want to be able to visualize the data, communicate the data, and utilize it&lt;br /&gt;effectively. But I do think those skills—of being able to access,&lt;br /&gt;understand, and communicate the insights you get from data analysis—are&lt;br /&gt;going to be extremely important. Managers need to be able to access and&lt;br /&gt;understand the data themselves. 
"&lt;br /&gt;&lt;br /&gt;&lt;span style="color: rgb(255, 0, 0);"&gt;The full article is at&lt;/span&gt;&lt;br /&gt; &lt;br /&gt; &lt;a style="color: rgb(255, 0, 0);" href="http://www.mckinseyquarterly.com/Hal_Varian_on_how_the_Web_challenges_managers_2286" target="_blank"&gt;http://www.mckinseyquarterly.&lt;wbr&gt;com/Hal_Varian_on_how_the_Web_&lt;wbr&gt;challenges_managers_2286&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4375349027250083196-831915025156794872?l=exploratorydataanalysis.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://exploratorydataanalysis.blogspot.com/feeds/831915025156794872/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4375349027250083196&amp;postID=831915025156794872' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4375349027250083196/posts/default/831915025156794872'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4375349027250083196/posts/default/831915025156794872'/><link rel='alternate' type='text/html' href='http://exploratorydataanalysis.blogspot.com/2009/02/sexy-jobs.html' title='Sexy Jobs'/><author><name>Jim Albert</name><uri>http://www.blogger.com/profile/12622333572321654094</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4375349027250083196.post-4457843995557287499</id><published>2009-02-23T07:15:00.000-08:00</published><updated>2009-02-23T07:33:09.521-08:00</updated><title type='text'>Reducing Skewness</title><content type='html'>Do you want to see me quickly reduce skewness?&lt;br /&gt;&lt;br /&gt;&lt;div style="text-align: 
center;"&gt;&lt;span style="font-size:78%;"&gt;skewness&lt;/span&gt;&lt;br /&gt;&lt;/div&gt;&lt;br /&gt;More seriously, here are some comments after looking at your homework (the grades are posted and you generally did well on this homework).&lt;br /&gt;&lt;br /&gt;&lt;span style="color: rgb(255, 0, 0);"&gt;1.  Why is symmetry important?&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Some of you may be wondering why it is useful to make datasets symmetric.   I don't think making a dataset symmetric is as important as making datasets have equal spreads, but here are a couple of reasons why symmetry is helpful.&lt;br /&gt;&lt;br /&gt;-- Symmetric datasets are simpler to summarize.  There is an obvious "average".  Also, if the data is bell-shaped, one can use the empirical rule (about 2/3 of the data fall within one standard deviation of the mean).&lt;br /&gt;&lt;br /&gt;-- Many statistical procedures like ANOVA assume normally distributed data.  One might wish to reexpress data before applying these procedures.&lt;br /&gt;&lt;br /&gt;&lt;span style="color: rgb(204, 0, 0);"&gt;2.  Monotone increasing reexpressions.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;-- In our definition of power transformations, recall that when p is negative, we consider&lt;br /&gt;&lt;br /&gt;"minus data raised to a power"&lt;br /&gt;&lt;br /&gt;We did that so that all of the transformations are monotone increasing.   So when you have a right-skewed dataset and you move from p=1 to p=0 to p=-1, you should be moving toward symmetry and then toward left-skewness.&lt;br /&gt;&lt;br /&gt;&lt;span style="color: rgb(255, 0, 0);"&gt;3.  Skewness in the middle and skewness in the tails. &lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Sometimes it is tough to make a dataset symmetric, since there will be different behavior in the middle half and in the tails.  You can detect this by a symmetry plot.  The points to the left may be close to the line and the points to the right may fall off the line.  
This indicates that the middle of the data is symmetric, but there is a long tail to the right (or the left).&lt;br /&gt;&lt;br /&gt;&lt;span style="color: rgb(204, 0, 0);"&gt;4.  When does reexpression work?&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;You need sufficient spread in the data as measured by HI/LO.  If this ratio is not much different from 1, reexpressions won't help much.&lt;br /&gt;&lt;br /&gt;&lt;span style="color: rgb(204, 0, 0);"&gt;5.  Are some of you "R resistant"?&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Many of you seem to prefer using Fathom or Minitab.  That's okay, but the best way to get comfortable using R is to practice using it.    I made the package LearnEDA to make R easier to use, but some of you aren't taking advantage of the special functions.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4375349027250083196-4457843995557287499?l=exploratorydataanalysis.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://exploratorydataanalysis.blogspot.com/feeds/4457843995557287499/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4375349027250083196&amp;postID=4457843995557287499' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4375349027250083196/posts/default/4457843995557287499'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4375349027250083196/posts/default/4457843995557287499'/><link rel='alternate' type='text/html' href='http://exploratorydataanalysis.blogspot.com/2009/02/reducing-skewness.html' title='Reducing Skewness'/><author><name>Jim Albert</name><uri>http://www.blogger.com/profile/12622333572321654094</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4375349027250083196.post-5880893615865644467</id><published>2009-02-16T04:34:00.000-08:00</published><updated>2009-02-16T04:48:52.487-08:00</updated><title type='text'>Last Homework and Dreamtowns</title><content type='html'>&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_V8g1rNtmHuM/SZlf2s_Kp7I/AAAAAAAAAT4/e7E9C3ARaa4/s1600-h/dreamtown.jpg"&gt;&lt;/a&gt;&lt;br /&gt;We've been talking about taking power reexpressions of data to achieve particular objectives, such as equalizing spreads between batches or making a batch symmetric.&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;When do these reexpressions work?  &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;1.  First, for the power transformations to have much effect, you need data with sufficient spread, which we measure by the ratio HI/LO.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;2.  You have to be taking a "significant" power.  In the last Fathom homework, most of you tried p = 0.82 (from a starting value of p = 1), which would typically have little effect on the data.  Typically, we try reexpressions in steps of 0.5, so we move from raw to roots to logs, and so on.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;This week, we are reexpressing data to achieve approximate symmetry.  Here's an example.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Last summer, there was an interesting article posted on bizjournals.com that ranked 140 "dreamtowns" -- these towns offer refuge from big cities and congested traffic.  
I got interested in the article since my town, Findlay, made the list.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Anyway, they collected a number of variables from each city, including the percentage of adults (25 and over) who hold college degrees.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Here's a histogram of these percentages from the 140 towns.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="color: rgb(0, 0, 238); "&gt;&lt;img src="http://4.bp.blogspot.com/_V8g1rNtmHuM/SZlf2s_Kp7I/AAAAAAAAAT4/e7E9C3ARaa4/s320/dreamtown.jpg" border="0" alt="" id="BLOGGER_PHOTO_ID_5303375429599143858" style="float: left; margin-top: 0px; margin-right: 10px; margin-bottom: 10px; margin-left: 0px; cursor: pointer; width: 320px; height: 319px; " /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;This looks right-skewed with one outlier (I never knew that Bozeman, Montana had a lot of highly educated people) and this is a good candidate for reexpression.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;In the notes, I give several methods (plotting the mids, using a symmetry plot, using Hinkley's method) for choosing the "right" reexpression.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;I'd suggest that you try at least 2 of these 
methods in your homework.&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4375349027250083196-5880893615865644467?l=exploratorydataanalysis.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://exploratorydataanalysis.blogspot.com/feeds/5880893615865644467/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4375349027250083196&amp;postID=5880893615865644467' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4375349027250083196/posts/default/5880893615865644467'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4375349027250083196/posts/default/5880893615865644467'/><link rel='alternate' type='text/html' href='http://exploratorydataanalysis.blogspot.com/2009/02/last-homework-and-dreamtowns.html' title='Last Homework and Dreamtowns'/><author><name>Jim Albert</name><uri>http://www.blogger.com/profile/12622333572321654094</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://4.bp.blogspot.com/_V8g1rNtmHuM/SZlf2s_Kp7I/AAAAAAAAAT4/e7E9C3ARaa4/s72-c/dreamtown.jpg' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4375349027250083196.post-4916837294683804628</id><published>2009-02-09T08:52:00.000-08:00</published><updated>2009-02-09T09:05:59.282-08:00</updated><title type='text'>Simple Statistical Comparisons</title><content type='html'>&lt;div&gt;I just finished grading your "comparing batches" homework.  
You did fine in the mechanics (computing the summaries, constructing a spread versus level graph, and deciding on an appropriate reexpression), but it wasn't clear that you understand WHY we are doing all of this work.&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Let's review the main points of this section.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;1.  We wish to &lt;span class="Apple-style-span" style="color: rgb(255, 0, 0);"&gt;compare two batches&lt;/span&gt;.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;2.  What does compare mean?  Well, it could mean many things.  Batch 1 has a larger spread, batch 2 has three more outliers than batch 1, and so on.  You illustrated many types of comparisons in your homework writeups.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Here we wish to compare the general or average locations of the two batches.   A &lt;span class="Apple-style-span" style="color: rgb(255, 0, 0);"&gt;simple statistical comparison&lt;/span&gt; is something like&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;"batch 2 is 10 units larger than batch 1"&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Maybe we should call this an SSC (statisticians love to use acronyms).&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;What this could mean is that if I added 10 units to each value in batch 1, then I would get a new dataset that resembles batch 2.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;3.  Is it &lt;span class="Apple-style-span" style="color: rgb(255, 0, 0);"&gt;always appropriate&lt;/span&gt; to make an SSC?&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;No.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;It won't work if the two batches have unequal spreads.  If they have unequal spreads, then adding a number to each value in batch 1 will NOT give a dataset that resembles batch 2.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;4.  
So if the batches have different spreads, we give up?&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;No.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;It is possible that we can reexpress the batches to a new scale, so that the new batches have approximately equal spreads.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;5.  So the plan is to (1) try to &lt;span class="Apple-style-span" style="color: rgb(255, 0, 0);"&gt;find a suitable reexpression&lt;/span&gt; and (2) &lt;span class="Apple-style-span" style="color: rgb(255, 0, 0);"&gt;do an SSC on the reexpressed data&lt;/span&gt;.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The snowfall data example using Fathom is one example where our strategy works.  But generally, you did a poor job in making an SSC on the reexpressed data.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4375349027250083196-4916837294683804628?l=exploratorydataanalysis.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://exploratorydataanalysis.blogspot.com/feeds/4916837294683804628/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4375349027250083196&amp;postID=4916837294683804628' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4375349027250083196/posts/default/4916837294683804628'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4375349027250083196/posts/default/4916837294683804628'/><link rel='alternate' type='text/html' href='http://exploratorydataanalysis.blogspot.com/2009/02/simple-statistical-comparisons.html' title='Simple Statistical Comparisons'/><author><name>Jim 
Albert</name><uri>http://www.blogger.com/profile/12622333572321654094</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4375349027250083196.post-1718987442639672093</id><published>2009-02-01T20:01:00.001-08:00</published><updated>2009-02-01T20:16:37.628-08:00</updated><title type='text'>A Statistical Salute to the Steelers</title><content type='html'>I just finished watching the Super Bowl and I'm happy that the Steelers won.  The winning quarterback, Ben Roethlisberger, is from my hometown (Findlay), and it was very exciting to see him direct his team to the winning touchdown at the end of the game.&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;As I was watching this game, I wondered: &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;"Which team has the better offense?"&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;To answer this question, I collected the yards gained for every single play of the two teams, the Steelers and the Cardinals, for this Super Bowl.  
I created a datafile with two variables, Yards and Team, and entered this data into R.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Using the R function&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;boxplot(Yards~Team,horizontal=T,xlab="Yards Gained",ylab="Team",col="gold")&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;I created the following boxplot.&lt;/div&gt;&lt;img src="http://2.bp.blogspot.com/_V8g1rNtmHuM/SYZxYLypnPI/AAAAAAAAATw/i1EN5Hx2vp0/s320/superbowl.jpg" style="float:left; margin:0 10px 10px 0;cursor:pointer; cursor:hand;width: 320px; height: 319px;" border="0" alt="" id="BLOGGER_PHOTO_ID_5298046671943998706" /&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;What do we see in this boxplot display?&lt;/div&gt;&lt;div&gt;&lt;ul&gt;&lt;li&gt;Both batches of yards gained look a bit right-skewed.  &lt;/li&gt;&lt;li&gt;There are four outliers for the Steelers and two for the Cardinals.  These correspond to big plays for the teams that gained a lot of yards.&lt;/li&gt;&lt;/ul&gt;&lt;div&gt;Why did I draw this boxplot display?  Several comments:&lt;/div&gt;&lt;div&gt;&lt;ul&gt;&lt;li&gt;We wish to make a comparison between the two batches.&lt;/li&gt;&lt;li&gt;What is a comparison?  Well, we could say that one batch tends to be larger than the second batch.  
For this example, if it were true, then I would say that the Cardinals gained more yards (per play) than the Steelers.&lt;/li&gt;&lt;li&gt;But that isn't really saying much.   When I say "make a comparison", this means that I want to say that one batch is a particular number larger or smaller than the second batch.&lt;/li&gt;&lt;li&gt;When can we say &lt;br /&gt;"batch 1 is 10 larger than batch 2"?&lt;/li&gt;&lt;li&gt;As you'll read in the notes, we can only make this type of comparison "batch 1 is 10 larger than batch 2" when the two batches have similar spreads.&lt;/li&gt;&lt;/ul&gt;&lt;div&gt;Returning to my football example, it doesn't appear that the two teams were very different with respect to yards gained per play.  Both teams had approximately the same median.&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4375349027250083196-1718987442639672093?l=exploratorydataanalysis.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://exploratorydataanalysis.blogspot.com/feeds/1718987442639672093/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4375349027250083196&amp;postID=1718987442639672093' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4375349027250083196/posts/default/1718987442639672093'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4375349027250083196/posts/default/1718987442639672093'/><link rel='alternate' type='text/html' href='http://exploratorydataanalysis.blogspot.com/2009/02/statistical-salute-to-steelers.html' title='A Statistical Salute to the Steelers'/><author><name>Jim Albert</name><uri>http://www.blogger.com/profile/12622333572321654094</uri><email>noreply@blogger.com</email><gd:image 
rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://2.bp.blogspot.com/_V8g1rNtmHuM/SYZxYLypnPI/AAAAAAAAATw/i1EN5Hx2vp0/s72-c/superbowl.jpg' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4375349027250083196.post-3629714255318687838</id><published>2009-01-31T06:08:00.000-08:00</published><updated>2009-01-31T06:33:00.388-08:00</updated><title type='text'>Letter values by group</title><content type='html'>In the next topic, we are talking about comparing groups.  The first step is to construct some graph (such as a stemplot or boxplot or dotplot) across groups and the next step is to compute summaries (such as letter values) for each group.  Here are some comments about using R to do this stuff.&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Let's return to the Boston Marathon data, where there are two variables, time and age.  We are interested in comparing the completion times for different ages.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;data(boston.marathon)&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;attach(boston.marathon)&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;By the way, this data frame is organized with two variables, response (here time) and group (here age).&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;1.  GRAPHS&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;There is no simple way to construct parallel stemplots in R for different groups.  
Parallel boxplots are easy to construct by typing, say, &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;boxplot(time~age,ylab="Time",xlab="Group")&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Also, it is easy to construct parallel dotplots by use of the stripchart function.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;stripchart(time~age,ylab="Group")&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Personally, I prefer dots that are solid black and are stacked:&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;stripchart(time~age,method="stack",pch=19,ylab="Group")&lt;/span&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;2.  SUMMARIES BY GROUP&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;If you call the boxplot function with the plot = FALSE option, it will give you five-number summaries (almost) for each group.  
But the output isn't very descriptive, so I wrote a simple "wrapper" function where the output is easier to follow.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;lval.by.group=function(response,group)&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;{&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;B=boxplot(response~group, plot=FALSE, range=0)&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;S=as.matrix(B$stats)&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;dimnames(S)[[1]]=c("LO","QL","M","QH","HI")&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;dimnames(S)[[2]]=B$names&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;S&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;}&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;To use this in R, you just type the function into the R Console window.  
Or if you store this function in a file called lval.by.group.R, then you read the function into R by typing&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;source("lval.by.group.R")&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Anyway, let me apply this function for the marathon data.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;lval.by.group(time,age)&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;    20  30  40  50  60&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;LO 150 194 163 222 219&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;QL 222 213 224 251 264&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;M  231 235 239 262 274&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;QH 240 259 262 281 279&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;HI 274 330 346 349 338&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;attr(,"class")&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;       20 &lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;"integer" &lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;I think the output is pretty clear.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;3.  
WHAT IF YOUR DATA IS ORGANIZED AS (GROUP1, GROUP2, ...)?&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Sometimes, it is convenient to read data in a matrix format, where the different groups are in different columns.  An example of this is the population densities dataset.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;data(pop.densities.1920.2000)&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;pop.densities.1920.2000[1:4,]&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;  STATE y1920 y1960 y1980 y1990 y2000&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;1    AL  45.8  64.2  76.6  79.6  87.6&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;2    AK   0.1   0.4   0.7   1.0   1.1&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;3    AZ   2.9  11.5  23.9  32.3  45.2&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;4    AR  33.4  34.2  43.9  45.1  51.3&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;You see that the 1920 densities are in the second column, the 1960 densities in the third column, etc.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Anyway, you want to put this in the&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;[Response, Group]&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;format.  You can do this with the stack command.  
The input is the matrix of data (with the first column removed) and the output is what you want.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new'; "&gt;d=stack(pop.densities.1920.2000[,-1])&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new'; "&gt;d[1:5,]&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new'; "&gt;  values   ind&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new'; "&gt;1   45.8 y1920&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new'; "&gt;2    0.1 y1920&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new'; "&gt;3    2.9 y1920&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new'; "&gt;4   33.4 y1920&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new'; "&gt;5   22.0 y1920&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Now we can compute the letter values by group (year) by typing&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;lval.by.group(d$values,d$ind)&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;     y1920   y1960   y1980   y1990   y2000&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;LO    0.10     0.4     0.7    1.00    1.10&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;QL   17.30    22.5    28.4   32.05   41.40&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier 
new';"&gt;M    39.90    67.2    80.8   79.60   88.60&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;QH   62.35   114.6   157.7  181.65  202.85&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;HI 7292.90 12523.9 10132.3 9882.80 9378.00&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Note that since I didn't attach the data frame d, I'm referring to the response variable as d$values and the grouping variable by d$ind.&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4375349027250083196-3629714255318687838?l=exploratorydataanalysis.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://exploratorydataanalysis.blogspot.com/feeds/3629714255318687838/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4375349027250083196&amp;postID=3629714255318687838' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4375349027250083196/posts/default/3629714255318687838'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4375349027250083196/posts/default/3629714255318687838'/><link rel='alternate' type='text/html' href='http://exploratorydataanalysis.blogspot.com/2009/01/letter-values-by-group.html' title='Letter values by group'/><author><name>Jim Albert</name><uri>http://www.blogger.com/profile/12622333572321654094</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4375349027250083196.post-3316275420377718903</id><published>2009-01-28T13:51:00.000-08:00</published><updated>2009-01-28T15:28:57.710-08:00</updated><title type='text'>How to Bin?</title><content type='html'>&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/_V8g1rNtmHuM/SYDcFid9KhI/AAAAAAAAATo/YFOW62v2z5w/s1600-h/hist2.jpg"&gt;&lt;/a&gt;&lt;span class="Apple-style-span" style="font-family: arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;One of you asked me a question about the Fathom assignment on constructing histograms.  Let me describe the issues by considering a histogram for a new dataset.&lt;/span&gt;&lt;/span&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;Today I was thinking about snow and wondering if 6 inches is really a lot of snowfall.  
So I thought I would look for some data showing levels of snowfall for different locations in the United States.&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;At the National Climatic Data Center site http://www.ncdc.noaa.gov/ussc/USSCAppController?action=map, one can download &lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-style-span" style="line-height: 22px; -webkit-border-horizontal-spacing: 2px; -webkit-border-vertical-spacing: 2px; "&gt;&lt;span class="Apple-style-span" style="font-family: arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;a "1.1 MB ASCII text file containing the maximum 1-day, 2-day, and 3-day snowfall for all available stations in the Lower 48 States and Alaska based on data through December 2006."  This sounded interesting so I downloaded this data, created a text datafile, and then read this into R.  
(I had to do some data cleaning -- a little tedious.)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="line-height: 22px; -webkit-border-horizontal-spacing: 2px; -webkit-border-vertical-spacing: 2px;"&gt;&lt;span class="Apple-style-span" style="font-family: arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="line-height: 22px; -webkit-border-horizontal-spacing: 2px; -webkit-border-vertical-spacing: 2px;"&gt;&lt;span class="Apple-style-span" style="font-family: arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;data=read.table("snowclim-fema-sep2006.txt",header=T,sep="\t")&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="line-height: 22px; -webkit-border-horizontal-spacing: 2px; -webkit-border-vertical-spacing: 2px;"&gt;&lt;span class="Apple-style-span" style="font-family: arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="line-height: 22px; -webkit-border-horizontal-spacing: 2px; -webkit-border-vertical-spacing: 2px;"&gt;&lt;span class="Apple-style-span" style="font-family: arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;This dataset contains the maximum 1-day snowfall, the maximum 2-day snowfall, the maximum 3-day snowfall for 9087 weather stations in 49 states in the U.S.&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="line-height: 22px; -webkit-border-horizontal-spacing: 2px; -webkit-border-vertical-spacing: 2px;"&gt;&lt;span class="Apple-style-span" style="font-family: arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;&lt;br 
/&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="line-height: 22px; -webkit-border-horizontal-spacing: 2px; -webkit-border-vertical-spacing: 2px;"&gt;&lt;span class="Apple-style-span" style="font-family: arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;Let's focus on the max 1-day snowfall for all of the stations in Ohio -- remember, I wanted to figure out the extreme nature of 6 inches.&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="line-height: 22px; -webkit-border-horizontal-spacing: 2px; -webkit-border-vertical-spacing: 2px;"&gt;&lt;span class="Apple-style-span" style="font-family: arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="line-height: 22px; -webkit-border-horizontal-spacing: 2px; -webkit-border-vertical-spacing: 2px;"&gt;&lt;span class="Apple-style-span" style="font-family: arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;Using the subset command, I create a new data frame ohio that contains the data for the Ohio stations.&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="line-height: 22px; -webkit-border-horizontal-spacing: 2px; -webkit-border-vertical-spacing: 2px;"&gt;&lt;span class="Apple-style-span" style="font-family: arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="line-height: 22px; -webkit-border-horizontal-spacing: 2px; -webkit-border-vertical-spacing: 2px;"&gt;&lt;span class="Apple-style-span" style="font-family: arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;ohio=subset(data,data$State=="OH")&lt;br 
/&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="line-height: 22px; -webkit-border-horizontal-spacing: 2px; -webkit-border-vertical-spacing: 2px;"&gt;&lt;span class="Apple-style-span" style="font-family: arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="line-height: 22px; -webkit-border-horizontal-spacing: 2px; -webkit-border-vertical-spacing: 2px;"&gt;&lt;span class="Apple-style-span" style="font-family: arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;The variable of interest is called Max.1.Day.&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="line-height: 22px; -webkit-border-horizontal-spacing: 2px; -webkit-border-vertical-spacing: 2px;"&gt;&lt;span class="Apple-style-span" style="font-family: arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="line-height: 22px; -webkit-border-horizontal-spacing: 2px; -webkit-border-vertical-spacing: 2px;"&gt;&lt;span class="Apple-style-span" style="font-family: arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;Let's say that I wish to construct a histogram by hand -- that is, choose the bins by hand. 
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="line-height: 22px; -webkit-border-horizontal-spacing: 2px; -webkit-border-vertical-spacing: 2px;"&gt;&lt;span class="Apple-style-span" style="font-family: arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="line-height: 22px; -webkit-border-horizontal-spacing: 2px; -webkit-border-vertical-spacing: 2px;"&gt;&lt;span class="Apple-style-span" style="font-family: arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;First, I use the summary command to find the extremes.  The output below shows that the variable falls between 6.5 and 21.&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="line-height: 22px; -webkit-border-horizontal-spacing: 2px; -webkit-border-vertical-spacing: 2px;"&gt;&lt;span class="Apple-style-span" style="font-family: arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="line-height: 22px; -webkit-border-horizontal-spacing: 2px; -webkit-border-vertical-spacing: 2px;"&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;  Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;   6.50   11.25   13.00   13.28   15.00   21.00 &lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;Let's try using 7 bins (why not?).  The range of the data is 21 - 6.50 = 14.5.  If I divide this range by 7, I get 14.5/7 = 2.07, which is approximately 2 (a prettier number than 2.07).&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;So let's try using the bins&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;(6, 8),  (8, 10), (10, 12), (12, 14), (14, 16), (16, 18), (18, 20), (20, 22)&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;(Actually this 
is 8 bins, but that's ok -- it is close to 7 bins.)&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;In R, I create a vector that gives these bin endpoints -- we call these my.breaks.&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;my.breaks=seq(6,22,2)&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;I use the hist command with the breaks option.  Also I want to show "6 inches" on the graph.  I do this by adding a vertical line to the graph (abline command) and label it by the text command. 
&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;hist(ohio$Max.1.Day,breaks=my.breaks)&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;abline(v=6,lwd=3,col="red")&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;text(8,45,"6 inches",col="red")&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="color: rgb(0, 0, 238); line-height: normal; -webkit-border-horizontal-spacing: 0px; -webkit-border-vertical-spacing: 0px; "&gt;&lt;img src="http://2.bp.blogspot.com/_V8g1rNtmHuM/SYDcAI3rlcI/AAAAAAAAATg/aj4JTwiL8qY/s320/hist1.jpg" border="0" alt="" id="BLOGGER_PHOTO_ID_5296475056726840770" style="float: left; margin-top: 0px; margin-right: 10px; margin-bottom: 10px; margin-left: 0px; cursor: pointer; width: 320px; height: 319px; " /&gt;&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span 
class="Apple-style-span" style="font-family: arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;Is this a good-looking histogram?  In other words, did I make a reasonable choice of bins?  &lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;This is pretty smooth, and I have a clear idea of the shape of the data (a little right-skewed).  But maybe I need more bins to see finer structure in the histogram.&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 
arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;Let's try using bins of length 1.&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;my.breaks=seq(6,22,1)&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;hist(ohio$Max.1.Day,breaks=my.breaks)&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;abline(v=6,lwd=3,col="red")&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;text(8,30,"6 inches",col="red")&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="color: rgb(0, 0, 238); line-height: normal; -webkit-border-horizontal-spacing: 0px; -webkit-border-vertical-spacing: 0px; "&gt;&lt;img src="http://2.bp.blogspot.com/_V8g1rNtmHuM/SYDcFid9KhI/AAAAAAAAATo/YFOW62v2z5w/s320/hist2.jpg" border="0" alt="" id="BLOGGER_PHOTO_ID_5296475149497608722" style="float: left; margin-top: 0px; margin-right: 10px; margin-bottom: 10px; margin-left: 0px; 
cursor: pointer; width: 320px; height: 319px; " /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;Well, I think we went too far.  The histogram is less smooth -- I see some bumps in the histogram that may be a byproduct of random noise.  
I definitely prefer the first histogram to this one.&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;I am guessing that an "optimal" bin width is between 1 and 2.&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;If you use 
the hist command without a breaks option, it will use the "Sturges" method to determine the optimal bin width for this particular sample size.  This method is similar to the one described in the Fathom lab.&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;I tried it.  Surprisingly (really), it turns out that R used the same breaks as I did in the first histogram.  &lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;Wrap-up:  How does this discussion relate to the Fathom homework?  In Fathom, it is very easy to play with the bin width and see the effect of the choice of bins on the histograms.  In the famous children's tale "Goldilocks and the Three Bears", remember that Goldilocks tried several bowls of porridge -- one bowl was "too hot", a second bowl was "too cold", and a third bowl was "just right".  Similarly, you'll get bad histograms when you choose too few bins or too many bins, and there will be a middle number of bins that will be "just right".  
By playing with different histograms, you will decide (by eye) what choice of bin width is "just right", and then you will see if your choice of best bin width corresponds to the bin choice using the optimal histogram rule.&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;By the way, how does the "optimal rule" &lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span style=""&gt;&lt;span class="Apple-style-span" style="font-family: arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;  bin width = 2 (fourth spread) / n^(1/3)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;perform here?  
Here the quartiles are 11.25 and 15, the fourth spread is 15-11.25 = 3.75, so the optimal bin width is 2 (3.75)/179^(1/3) = 1.33, which is a bit smaller than the bin width of 2 used in the hist command.&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;Final thought -- "6 inches" is really not a lot of snow, although it seemed to cripple the city where I live.&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: arial;"&gt;&lt;span class="Apple-style-span" style="font-size: 
medium;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="line-height: 22px; -webkit-border-horizontal-spacing: 2px; -webkit-border-vertical-spacing: 2px;"&gt;&lt;span class="Apple-style-span" style="font-family: arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="line-height: 22px; -webkit-border-horizontal-spacing: 2px; -webkit-border-vertical-spacing: 2px;"&gt;&lt;span class="Apple-style-span" style="font-family: arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="line-height: 22px; -webkit-border-horizontal-spacing: 2px; -webkit-border-vertical-spacing: 2px;"&gt;&lt;span class="Apple-style-span" style="font-family: arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="line-height: 22px; -webkit-border-horizontal-spacing: 2px; -webkit-border-vertical-spacing: 2px;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4375349027250083196-3316275420377718903?l=exploratorydataanalysis.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://exploratorydataanalysis.blogspot.com/feeds/3316275420377718903/comments/default' title='Post Comments'/><link rel='replies' type='text/html' 
href='http://www.blogger.com/comment.g?blogID=4375349027250083196&amp;postID=3316275420377718903' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4375349027250083196/posts/default/3316275420377718903'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4375349027250083196/posts/default/3316275420377718903'/><link rel='alternate' type='text/html' href='http://exploratorydataanalysis.blogspot.com/2009/01/how-to-bin.html' title='How to Bin?'/><author><name>Jim Albert</name><uri>http://www.blogger.com/profile/12622333572321654094</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://2.bp.blogspot.com/_V8g1rNtmHuM/SYDcAI3rlcI/AAAAAAAAATg/aj4JTwiL8qY/s72-c/hist1.jpg' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4375349027250083196.post-5298735517795058627</id><published>2009-01-21T09:09:00.001-08:00</published><updated>2009-01-21T09:22:22.588-08:00</updated><title type='text'>President Ages</title><content type='html'>As you know, yesterday was a big day in the history of the US as Barack Obama became our 44th president.  
In honor of that event, I collected the ages at inauguration and lifespans of all 44 presidents and read this dataset into R.&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;data=read.table("president.ages.txt",header=TRUE,sep="\t")&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;I attach the data frame to make the variables visible.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;attach(data)&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Then using the stem.leaf function in the aplpack package, I construct a stemplot of the president ages:&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;&gt; stem.leaf(Age)&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;1 | 2: represents 12&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt; leaf unit: 1&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;            n: 44&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;   2     t | 23&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;         f | &lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;   6     s | 6677&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;   9    4. 
| 899&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;  15    5* | 011111&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;  17     t | 22&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;  (9)    f | 444445555&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;  18     s | 6667777&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;  11    5. | 8&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;  10    6* | 0111&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;   6     t | 2&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;   5     f | 445&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;         s | &lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;   2    6. | 89&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;What do I see?  Here are some things I notice:&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;ul&gt;&lt;li&gt;The ages at inauguration look fairly symmetric about 54 or 55.&lt;/li&gt;&lt;li&gt;President ages range from 42 to 69.  Actually, one has to be at least 35 years old to be president, so the youngest age of 42 is not far above the minimum.&lt;/li&gt;&lt;li&gt;Barack Obama is one of the youngest presidents in history at 47.&lt;/li&gt;&lt;/ul&gt;&lt;div&gt;I didn't play with the options for stemplot -- I just used the default settings in the stem.leaf function.  
Could I produce a better stemplot?&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Let's break between the tens and ones with five leaves per stem.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;&gt; stem.leaf(Age,unit=1,m=2)&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;1 | 2: represents 12&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt; leaf unit: 1&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;            n: 44&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;    2    4* | 23&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;    9    4. | 6677899&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;   22    5* | 0111112244444&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;  (12)   5. | 555566677778&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;   10    6* | 0111244&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;    3    6. 
| 589&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;&gt; &lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Or I could go the other way and split between the ones and tenths with 10 leaves per stem.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;&gt; stem.leaf(Age,unit=.1,m=1)&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;1 | 2: represents 1.2&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt; leaf unit: 0.1&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;            n: 44&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;   1    42 | 0&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;   2    43 | 0&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;        44 | &lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;        45 | &lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;   4    46 | 00&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;   6    47 | 00&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;   7    48 | 0&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;   9    49 | 00&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;  10    50 | 0&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span 
class="Apple-style-span" style="font-family: 'courier new';"&gt;  15    51 | 00000&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;  17    52 | 00&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;        53 | &lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;  22    54 | 00000&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;  (4)   55 | 0000&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;  18    56 | 000&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;  15    57 | 0000&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;  11    58 | 0&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;        59 | &lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;  10    60 | 0&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;   9    61 | 000&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;   6    62 | 0&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;        63 | &lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;   5    64 | 00&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;   3    65 | 0&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;HI: 68 69&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span 
class="Apple-style-span" style="font-family: 'courier new';"&gt;&gt; &lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;I think it is pretty obvious that the first stemplot is the best.   If I have too few lines, then I lose some of the structure of the distribution; with too many lines, I don't see any structure at all.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4375349027250083196-5298735517795058627?l=exploratorydataanalysis.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://exploratorydataanalysis.blogspot.com/feeds/5298735517795058627/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4375349027250083196&amp;postID=5298735517795058627' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4375349027250083196/posts/default/5298735517795058627'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4375349027250083196/posts/default/5298735517795058627'/><link rel='alternate' type='text/html' href='http://exploratorydataanalysis.blogspot.com/2009/01/president-ages.html' title='President Ages'/><author><name>Jim Albert</name><uri>http://www.blogger.com/profile/12622333572321654094</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4375349027250083196.post-6368094232350296355</id><published>2009-01-14T06:56:00.000-08:00</published><updated>2009-01-14T07:02:17.723-08:00</updated><title type='text'>Learning R by Viewing Movies</title><content type='html'>As you are learning R, you should try out the movies that I have made that are posted at&lt;br /&gt;&lt;br /&gt;                   http://bayes.bgsu.edu/eda&lt;br /&gt;&lt;br /&gt;(I call them R Encounters.)  You'll see movies that show&lt;br /&gt;&lt;br /&gt;-- how to create a dataset and read it into R&lt;br /&gt;-- how to install the LearnEDA package in R&lt;br /&gt;-- how to carry out a simple data analysis in R&lt;br /&gt;&lt;br /&gt;I'm assuming that most of you are working in Windows.  Let me know if you work on a Macintosh -- R works a little differently on a Mac.&lt;br /&gt;&lt;br /&gt;Students have struggled in the past with creating data files and reading them into R -- let me know if you are experiencing problems with this.&lt;br /&gt;&lt;br /&gt;Also, let me know if you like the R movies.  
They are easy to make and I can make more of them if you find them useful.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4375349027250083196-6368094232350296355?l=exploratorydataanalysis.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://exploratorydataanalysis.blogspot.com/feeds/6368094232350296355/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4375349027250083196&amp;postID=6368094232350296355' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4375349027250083196/posts/default/6368094232350296355'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4375349027250083196/posts/default/6368094232350296355'/><link rel='alternate' type='text/html' href='http://exploratorydataanalysis.blogspot.com/2009/01/learning-r-by-viewing-movies.html' title='Learning R by Viewing Movies'/><author><name>Jim Albert</name><uri>http://www.blogger.com/profile/12622333572321654094</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4375349027250083196.post-4857548482781228112</id><published>2009-01-13T07:31:00.000-08:00</published><updated>2009-01-13T07:35:44.167-08:00</updated><title type='text'>Welcome to Exploratory Data Analysis</title><content type='html'>Hi EDA students.&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;I will be using this blog to provide advice on using R, give you new examples and help on the homework.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;This week, you'll be learning R and Fathom and creating a couple of datasets that you'll be reading into 
R.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;I have provided help on R on my website http://bayes.bgsu.edu/eda&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;You'll find guidance on installing R and the LearnEDA package, and several R lessons that you can work through.  Also I made some movies that illustrate some basic R stuff.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;I know that learning R can be difficult, so let me know if you have specific concerns.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4375349027250083196-4857548482781228112?l=exploratorydataanalysis.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://exploratorydataanalysis.blogspot.com/feeds/4857548482781228112/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4375349027250083196&amp;postID=4857548482781228112' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4375349027250083196/posts/default/4857548482781228112'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4375349027250083196/posts/default/4857548482781228112'/><link rel='alternate' type='text/html' href='http://exploratorydataanalysis.blogspot.com/2009/01/welcome-to-exploratory-data-analysis.html' title='Welcome to Exploratory Data Analysis'/><author><name>Jim Albert</name><uri>http://www.blogger.com/profile/12622333572321654094</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4375349027250083196.post-7911481262665290996</id><published>2008-12-01T05:36:00.000-08:00</published><updated>2008-12-01T05:44:54.233-08:00</updated><title type='text'>Why do we flog?</title><content type='html'>I just posted an example of using flogs to compare two years of placement scores.  But some of you may be confused about why we take flogs in the first place.  Here are the main points.&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;1.  We want to compare two proportions, say p1 and p2.  There are two problems with a direct comparison like p1/p2.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;a)  This measure depends on whether we consider the proportions p1, p2 or the proportions 1-p1, 1-p2, and that's a problem.  The choice of p1, p2 or 1-p1, 1-p2 shouldn't change our comparison.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;b)  Small proportions tend to have smaller variation than proportions close to 0.5.  The ratio p1/p2 = 1.5 is more meaningful (significant) if the p's are close to zero than if the p's are close to 0.5.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;So we want to transform a proportion p so that &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;-- it doesn't matter if we consider p or 1-p&lt;/div&gt;&lt;div&gt;-- the variance of the transformed p will be roughly the same for p near 0 and p near 0.5.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;By using the flog reexpression log(p/(1-p)) we achieve these two goals.  (Switching from p to 1-p just changes the sign of the flog.)&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;By the way, the flog reexpression is the basis for logistic regression models.  
You will likely be using logistic regression in your statistical life, but we are trying here to motivate why we use the flog reexpression.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4375349027250083196-7911481262665290996?l=exploratorydataanalysis.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://exploratorydataanalysis.blogspot.com/feeds/7911481262665290996/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4375349027250083196&amp;postID=7911481262665290996' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4375349027250083196/posts/default/7911481262665290996'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4375349027250083196/posts/default/7911481262665290996'/><link rel='alternate' type='text/html' href='http://exploratorydataanalysis.blogspot.com/2008/12/why-do-we-flog.html' title='Why do we flog?'/><author><name>Jim Albert</name><uri>http://www.blogger.com/profile/12622333572321654094</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4375349027250083196.post-9007811273542904262</id><published>2008-12-01T05:10:00.000-08:00</published><updated>2008-12-01T05:33:47.003-08:00</updated><title type='text'>Flogging placement data</title><content type='html'>Here's an illustration of reexpressing proportion data.&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Every summer, a math placement test is given to over 3000 entering freshmen.  
The score on this test is used to determine which math course they are allowed to take in the fall.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The score on the placement test is a course number that indicates that the student can take that course in the fall.  Here are the counts of the placement scores for the freshmen in the last six years:&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt; &lt;span class="Apple-style-span" style="font-family: 'courier new'; "&gt;    2003 2004 2005 2006 2007 2008&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new'; "&gt;131H   33   54   51   57   44   41&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new'; "&gt;131   196  364  342  361  320  251&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new'; "&gt;130   192  245  236  248  211  208&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new'; "&gt;128   557  707  647  700  603  618&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new'; "&gt;126   428  518  489  480  442  428&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new'; "&gt;122   501  612  580  565  464  498&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new'; "&gt;215   661  912  888  838  792  747&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new'; "&gt;112   207  216  230  208  212  180&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new'; "&gt;095   418  524  545  419  407  335&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new'; "&gt;090    46   78   89   58   62   
46&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;How can we compare the scores for different years?  A first step is to compute the percentages within each column.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;  &lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;   2003 2004 2005 2006 2007 2008&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;131H  1.0  1.3  1.2  1.4  1.2  1.2&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;131   6.1  8.6  8.3  9.2  9.0  7.5&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;130   5.9  5.8  5.8  6.3  5.9  6.2&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;128  17.2 16.7 15.8 17.8 17.0 18.4&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;126  13.2 12.2 11.9 12.2 12.4 12.8&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;122  15.5 14.5 14.2 14.4 13.0 14.9&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;215  20.4 21.6 21.7 21.3 22.3 22.3&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;112   6.4  5.1  5.6  5.3  6.0  5.4&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;095  12.9 12.4 13.3 10.7 11.4 10.0&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;090   1.4  1.8  2.2  1.5  1.7  1.4&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;We see in 2003 that 6.1% of the students placed in MATH 131 and 20.4% placed in MATH 215.&lt;/div&gt;&lt;div&gt;&lt;br 
/&gt;&lt;/div&gt;&lt;div&gt;Let's follow Tukey's strategy for comparing percentage vectors.  To make this simple to explain, let's focus on comparing the percentages of 2006 and 2007.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;     2006 2007&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;131H  1.4  1.2&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;131   9.2  9.0&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;130   6.3  5.9&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;128  17.8 17.0&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;126  12.2 12.4&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;122  14.4 13.0&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;215  21.3 22.3&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;112   5.3  6.0&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;095  10.7 11.4&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;090   1.5  1.7&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;1.  First we cut the data by some row.  
Let's try cutting the data after the second row.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new'; "&gt;     2006 2007&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new'; "&gt;131H  1.4  1.2&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new'; "&gt;131   9.2  9.0&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new'; "&gt;---------------&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new'; "&gt;130   6.3  5.9&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new'; "&gt;128  17.8 17.0&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new'; "&gt;126  12.2 12.4&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new'; "&gt;122  14.4 13.0&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new'; "&gt;215  21.3 22.3&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new'; "&gt;112   5.3  6.0&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new'; "&gt;095  10.7 11.4&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new'; "&gt;090   1.5  1.7&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;2.  We compute a folded log for each year.  
For 2006, we see that 1.4 + 9.2 = 10.6% are above the line and 100 - 10.6 = 89.4% are below the line, so the flog is&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;FLOG for 2006 = log(10.6/89.4) = -2.13&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Likewise the flog for 2007 is given by&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;FLOG for 2007 = log(10.2/89.8) = -2.18&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;3.  To compare the years 2006 and 2007, we look at the difference in flogs:&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Change in FLOG from 2006 to 2007 is -2.18 - (-2.13) = -0.05&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The interpretation is that students did 0.05 worse in 2007 (on the flog scale).&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;What if we cut the table by a different row?  We can repeat the procedure using all possible cuts.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Here is the table of flogs:&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;       2006  2007&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt; [1,] -4.22 -4.38&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt; [2,] -2.13 -2.17&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt; [3,] -1.59 -1.65&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt; [4,] -0.63 -0.70&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt; [5,] -0.12 -0.18&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt; [6,]  0.46  0.35&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span 
class="Apple-style-span" style="font-family: 'courier new';"&gt; [7,]  1.56  1.44&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt; [8,]  1.98  1.88&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt; [9,]  4.20  4.03&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;To understand these values, -4.22 is the flog if we cut after the first row, -2.13 is the flog if we cut after the second row, etc.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;To compare the years 2006 and 2007, we look at the difference in flogs:&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;  &lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;   2007 FLOG - 2006 FLOG&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt; [1,] -0.16&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt; [2,] -0.04&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt; [3,] -0.06&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt; [4,] -0.07&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt; [5,] -0.06&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt; [6,] -0.11&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt; [7,] -0.12&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt; [8,] -0.10&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt; [9,] -0.17&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;What have we 
learned?  Note that all of the flog differences are negative and the median flog difference is -0.10.  So it is clear the 2007 students did a little worse than the 2006 students.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4375349027250083196-9007811273542904262?l=exploratorydataanalysis.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://exploratorydataanalysis.blogspot.com/feeds/9007811273542904262/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4375349027250083196&amp;postID=9007811273542904262' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4375349027250083196/posts/default/9007811273542904262'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4375349027250083196/posts/default/9007811273542904262'/><link rel='alternate' type='text/html' href='http://exploratorydataanalysis.blogspot.com/2008/12/flogging-placement-data.html' title='Flogging placement data'/><author><name>Jim Albert</name><uri>http://www.blogger.com/profile/12622333572321654094</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4375349027250083196.post-3030488321202764210</id><published>2008-11-21T11:20:00.000-08:00</published><updated>2008-11-21T11:40:04.750-08:00</updated><title type='text'>Binning homework</title><content type='html'>Some of you appear to be confused on this "binning" homework.  
Let's outline what you are supposed to do in this assignment.&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;1.  First you bin your data and construct a histogram.  This is easy (I hope).  The histogram gives you counts of each bin.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;By the way, a rootogram is just like a histogram, but you are graphing the root counts against the bins.  (Why do you do this?  Look at the notes.)&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;2.  You find a Gaussian curve that fits your data -- you have a mean and a standard deviation.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;3.  Now you want to find the expected or fitted counts from the Gaussian curve.  You do this by running the function fit.gaussian -- the inputs to this function are your bins, the vector of raw data, and the mean and standard deviation of the Gaussian curve.   Suppose you save the output of this function into the variable s.  Then&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;s$counts are the observed bin counts&lt;/div&gt;&lt;div&gt;s$expected are the expected counts&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;4.  The question at this point is -- does the Gaussian curve provide a good fit to the histogram?  
Well, maybe yes, and maybe no.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;How can you tell?&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;You look at the residuals, which are essentially the deviations of the counts from the expected counts.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;I defined several residuals in the notes.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Simple rootogram residuals (with d the observed count and e the expected count) are &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;r.simple = sqrt(d) - sqrt(e)&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;These are graphed by use of the rootogram function.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The double root residuals are defined by&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;DRR = sqrt(2+4d) - sqrt(1+4e).&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;How do I compute these?  Well, you have already computed the d's and the e's -- you can use R to compute the DRR's.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;5.  Once I have computed the residuals and graphed them, am I done?  Well, the point of graphing the residuals is to see if they show any systematic pattern -- if there is a pattern, then the Gaussian curve is not a good fit.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;6.  Well, we are almost done.  Part 1 (d) asks you to fit a G comparison curve to the root data.  (Boy, your instructor seems to like taking roots.)&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;All I'm asking here is to FIRST transform the data by a root, and then repeat all of the above with the root data.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;7.  Lastly, in #2, I'm asking you to fit a G curve to the women's heights.  
If you have loaded in the studentdata dataset, then to get the female heights, you type&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;library(LearnEDA)&lt;/div&gt;&lt;div&gt;data(studentdata)&lt;/div&gt;&lt;div&gt;&lt;div&gt;f.heights=studentdata$Height[studentdata$Gender=="female"]&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Good luck!&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4375349027250083196-3030488321202764210?l=exploratorydataanalysis.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://exploratorydataanalysis.blogspot.com/feeds/3030488321202764210/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4375349027250083196&amp;postID=3030488321202764210' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4375349027250083196/posts/default/3030488321202764210'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4375349027250083196/posts/default/3030488321202764210'/><link rel='alternate' type='text/html' href='http://exploratorydataanalysis.blogspot.com/2008/11/binning-homework.html' title='Binning homework'/><author><name>Jim Albert</name><uri>http://www.blogger.com/profile/12622333572321654094</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4375349027250083196.post-3692183631584343378</id><published>2008-11-18T11:50:00.000-08:00</published><updated>2008-11-18T12:15:09.732-08:00</updated><title type='text'>Learning about Lengths of 
Baseball Games</title><content type='html'>&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_V8g1rNtmHuM/SSMf1dP5TyI/AAAAAAAAASs/GVWfzsyuxXs/s1600-h/plot4.jpg"&gt;&lt;/a&gt;One complaint about baseball is that it is a relatively long game, compared to American football or basketball.&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;To understand more about the lengths of baseball games, I collected the lengths (in minutes) of all games played during the 2007 season.  Here are my questions:&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;1.  Are the times of baseball games normally distributed?&lt;/div&gt;&lt;div&gt;2.  If not, how do the times differ from a normal curve?&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;I'll illustrate the R work. You'll need a couple of packages: LearnEDA and vcd (for the rootogram function).&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;1.  First I'll read in the baseball game times, construct bins and the rootogram (this is essentially a histogram where you plot the roots of the counts instead of the counts).&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;library(LearnEDA)&lt;/div&gt;&lt;div&gt;data=read.table("game.time.07.txt",header=T)&lt;/div&gt;&lt;div&gt;attach(data)&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;# set up bins and construct a rootogram&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;bins=seq(100,300,by=20)&lt;/div&gt;&lt;div&gt;bin.mids=(bins[-1]+bins[-length(bins)])/2&lt;/div&gt;&lt;div&gt;h=hist(time,breaks=bins)&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;h$counts=sqrt(h$counts)&lt;/div&gt;&lt;div&gt;plot(h,xlab="TIME",ylab="ROOT FREQUENCY",main="")&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="color: rgb(0, 0, 238); "&gt;&lt;img src="http://4.bp.blogspot.com/_V8g1rNtmHuM/SSMeicoF1TI/AAAAAAAAASU/oQK4gJqsAYs/s320/plot1.jpg" border="0" alt="" 
id="BLOGGER_PHOTO_ID_5270089566102345010" style="float: left; margin-top: 0px; margin-right: 10px; margin-bottom: 10px; margin-left: 0px; cursor: pointer; width: 320px; height: 319px; " /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;2.  Now I want to fit a normal comparison curve.  I figure out the mean and standard deviation of the normal curve by using letter values.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;# find mean and standard deviation of matching Gaussian curve&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;S=lval(time)&lt;/div&gt;&lt;div&gt;f=as.vector(S[2,2:3])&lt;/div&gt;&lt;div&gt;m=as.integer((f[1]+f[2])/2)&lt;/div&gt;&lt;div&gt;sd=as.integer((f[2]-f[1])/1.349)&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;I get that the times are approximately N(175, 24).&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;3.  Is this normal approximation ok?  
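&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;One way to check is to compare the observed bin counts with the counts a normal curve would predict.  Here is a rough sketch of that calculation done by hand, assuming the expected count in a bin is the number of games times the normal probability of that bin.  The names n, s, probs and expected.by.hand are only for illustration, and n = 99 is the total of the observed bin counts for this data.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;# by-hand sketch of the expected counts (illustrative names, not package code)&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;bins=seq(100,300,by=20)&lt;/div&gt;&lt;div&gt;n=99; m=175; s=24&lt;/div&gt;&lt;div&gt;probs=diff(pnorm(bins,m,s))   # normal probability of each bin&lt;/div&gt;&lt;div&gt;expected.by.hand=n*probs   # expected count in each bin&lt;/div&gt;&lt;div&gt;round(sqrt(expected.by.hand),2)   # root expected counts&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;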
The function fit.gaussian computes the expected counts for each bin and computes the residuals sqrt(observed) - sqrt(expected).&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;# function fit.gaussian.R fits Gaussian curve to the counts&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;s=fit.gaussian(time,bins,m,sd)&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;# output observed and expected counts&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;output=round(data.frame(s$counts, sqrt(s$counts), s$probs, s$expected, &lt;/div&gt;&lt;div&gt;     sqrt(s$expected), sqrt(s$counts)-sqrt(s$expected)),2)&lt;/div&gt;&lt;div&gt;names(output)=c("count","root","prob","fit","root fit","residual")&lt;/div&gt;&lt;div&gt;output&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt; &lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;  count root prob   fit root fit residual&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;1      1 1.00 0.01  1.00     1.00     0.00&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;2      8 2.83 0.06  6.08     2.47     0.36&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;3     22 4.69 0.19 19.17     4.38     0.31&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;4     27 5.20 0.32 31.34     5.60    -0.40&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;5     20 4.47 0.27 26.60     5.16    -0.69&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;6     12 3.46 0.12 11.72     3.42     0.04&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;7      6 2.45 0.03  2.67     1.64 
    0.81&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;8      1 1.00 0.00  0.32     0.56     0.44&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;9      1 1.00 0.00  0.02     0.14     0.86&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;10     1 1.00 0.00  0.00     0.02     0.98&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;# place root expected counts on top of rootogram&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;lines(bin.mids,sqrt(s$expected),lwd=4,col="red")&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="color: rgb(0, 0, 238); "&gt;&lt;img src="http://2.bp.blogspot.com/_V8g1rNtmHuM/SSMfDakXtxI/AAAAAAAAASc/KkdZYgNXuKA/s320/plot2.jpg" border="0" alt="" id="BLOGGER_PHOTO_ID_5270090132485551890" style="float: left; margin-top: 0px; margin-right: 10px; margin-bottom: 10px; margin-left: 0px; cursor: pointer; width: 320px; height: 319px; " /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;4.  The rootogram function plots the residuals.  
Both plots below show the same info, but in different ways.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;# load library vcd that contains rootogram function&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;library(vcd)&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;# again save histogram object&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;h=hist(time,breaks=bins,plot=F)&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;# illustrate "hanging" style of rootogram&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;r=rootogram(h$counts,s$expected,type="hanging")  # this is the hanging style &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="color: rgb(0, 0, 238); "&gt;&lt;img src="http://4.bp.blogspot.com/_V8g1rNtmHuM/SSMfczhjWoI/AAAAAAAAASk/mmf4jfkBHfY/s320/plot3.jpg" border="0" alt="" id="BLOGGER_PHOTO_ID_5270090568681347714" style="float: left; margin-top: 0px; margin-right: 10px; margin-bottom: 10px; margin-left: 0px; cursor: pointer; width: 320px; height: 319px; " /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;# illustrate "deviation" style of rootogram&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;r=rootogram(h$counts,s$expected,type="deviation") &lt;/div&gt;&lt;div&gt;&lt;br 
/&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="color: rgb(0, 0, 238); "&gt;&lt;img src="http://4.bp.blogspot.com/_V8g1rNtmHuM/SSMf1dP5TyI/AAAAAAAAASs/GVWfzsyuxXs/s320/plot4.jpg" border="0" alt="" id="BLOGGER_PHOTO_ID_5270090992198438690" style="float: left; margin-top: 0px; margin-right: 10px; margin-bottom: 10px; margin-left: 0px; cursor: pointer; width: 320px; height: 319px; " /&gt;&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;What have we learned?  Although the lengths of baseball games are roughly normal in shape, we see that there are some large residuals in the right tail, where the observed counts are well above the fitted counts.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;What does this mean?  Baseball game lengths have a long right tail, which means one tends to see more long games than a normal curve would predict.  
If you know anything about baseball, you know that a game can go into extra innings when it is tied after 9 innings, and these extra-inning games cause the long right tail.&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4375349027250083196-3692183631584343378?l=exploratorydataanalysis.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://exploratorydataanalysis.blogspot.com/feeds/3692183631584343378/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4375349027250083196&amp;postID=3692183631584343378' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4375349027250083196/posts/default/3692183631584343378'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4375349027250083196/posts/default/3692183631584343378'/><link rel='alternate' type='text/html' href='http://exploratorydataanalysis.blogspot.com/2008/11/learning-about-lengths-of-baseball.html' title='Learning about Lengths of Baseball Games'/><author><name>Jim Albert</name><uri>http://www.blogger.com/profile/12622333572321654094</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://4.bp.blogspot.com/_V8g1rNtmHuM/SSMeicoF1TI/AAAAAAAAASU/oQK4gJqsAYs/s72-c/plot1.jpg' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4375349027250083196.post-1642927384537315152</id><published>2008-11-09T04:23:00.000-08:00</published><updated>2008-11-09T05:00:45.592-08:00</updated><title type='text'>Median Polishing Student Grades</title><content 
type='html'>&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_V8g1rNtmHuM/SRbcXKFPm3I/AAAAAAAAASM/Bcv2MvU8UaU/s1600-h/twowayplot.jpeg"&gt;&lt;/a&gt;&lt;br /&gt;Our department recently was interested in exploring the relationship between student class attendance and performance in a number of 100-level math classes.   We all know that missing math classes will have an adverse effect on grades, but we wanted to learn more about this relationship.&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;For many students in four 100-level math classes, we collected the student's attendance (percentage of classes attended) and their performance (a percentage where a 90% corresponds to A work, 80-90% a B, etc).   We decided to categorize attendance using four categories (0-50%, 50-70%, 70-90%, over 90%) and then we found the mean performance for students in each combination of attendance category and course.  We got the following two-way table.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;          Attendance Pct&lt;/div&gt;&lt;div&gt;COURSE   0-50 50-70 70-90 90-100  &lt;/div&gt;&lt;div&gt;---------------------------------       &lt;/div&gt;&lt;div&gt;MATH 112  61.9  68.0  68.8  75.5&lt;/div&gt;&lt;div&gt;MATH 115  54.3  71.8  76.6  83.3&lt;/div&gt;&lt;div&gt;MATH 122  56.1  64.8  73.1  78.1&lt;/div&gt;&lt;div&gt;MATH 126  50.6  66.3  73.9  78.1&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;I'll demonstrate median polish with this data.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;1.  First, I have to get this data into matrix form in R.  
I'll first use the matrix command to form the matrix (by default, one enters data column by column) and then I'll add row and column labels.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;grade=matrix(c(61.9,54.3,56.1,50.6,&lt;/div&gt;&lt;div&gt;               68.0,71.8,64.8,66.3,&lt;/div&gt;&lt;div&gt;               68.8,76.6,73.1,73.9,&lt;/div&gt;&lt;div&gt;               75.5,83.3,78.1,78.1),c(4,4))&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;dimnames(grade)=list(c("MATH 112","MATH 115","MATH 122","MATH 126"),&lt;/div&gt;&lt;div&gt;                     c("0-50","50-70","70-90","90-100"))&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;To check that I've read in the data correctly, I'll display "grade":&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&gt; grade&lt;/div&gt;&lt;div&gt;         0-50 50-70 70-90 90-100&lt;/div&gt;&lt;div&gt;MATH 112 61.9  68.0  68.8   75.5&lt;/div&gt;&lt;div&gt;MATH 115 54.3  71.8  76.6   83.3&lt;/div&gt;&lt;div&gt;MATH 122 56.1  64.8  73.1   78.1&lt;/div&gt;&lt;div&gt;MATH 126 50.6  66.3  73.9   78.1&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;2.  Now I can implement median polish to get an additive fit.  
I'll store the output in the variable "my.fit" and then I'll display the different components.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&gt; my.fit=medpolish(grade)&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Here's the common value.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&gt; my.fit$overall&lt;/div&gt;&lt;div&gt;[1] 69.7&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;Here are the row effects.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&gt; my.fit$row&lt;/div&gt;&lt;div&gt;MATH 112 MATH 115 MATH 122 MATH 126 &lt;/div&gt;&lt;div&gt; -0.7375   4.5000   0.2875  -0.2875 &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;Here are the column effects.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&gt; my.fit$col&lt;/div&gt;&lt;div&gt;     0-50     50-70     70-90    90-100 &lt;/div&gt;&lt;div&gt;-16.35000  -2.75625   2.75625   8.40000 &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;I'll interpret this additive fit.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;ul&gt;&lt;li&gt;The average performance of these students is 69.7%.&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Looking at the row effects, we see that MATH 115 students get grades that are 4.5 - (0.2875) approx 4.2 points higher than MATH 122 students.  MATH 122 students tend to be 0.2875 - (-0.2875) approx 0.57 points higher than MATH 126 students, and MATH 112 students are about a half percentage point lower than MATH 126 students.&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Looking at the column effects, we see a clear relationship between attendance and performance.  
The best (90-100%) attenders do 8.4 - 2.75 = 5.65 points better (on average) than the 70-90 attendance group, do 8.4 - (-2.75) = 11.15 points better than the 50-70 attendance group, and a whopping 8.4 - (-16.35) = 24.75 points better than the "no shows" (the under 50% attending group).&lt;/li&gt;&lt;/ul&gt;&lt;div&gt;3.  It might be helpful to plot this fit.  I wrote a function plot2way in the LearnEDA package:&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&gt; library(LearnEDA)&lt;/div&gt;&lt;div&gt;&gt; plot2way(my.fit$overall+my.fit$row,my.fit$col,dimnames(grade)[[1]],&lt;/div&gt;&lt;div&gt;  dimnames(grade)[[2]])&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Note that the plot2way function has four arguments:  the row part, the column part, the vector of names of the rows, and the vector of names of the columns.  Here's the figure.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="color: rgb(0, 0, 238); "&gt;&lt;img src="http://4.bp.blogspot.com/_V8g1rNtmHuM/SRbcXKFPm3I/AAAAAAAAASM/Bcv2MvU8UaU/s320/twowayplot.jpeg" border="0" alt="" id="BLOGGER_PHOTO_ID_5266639104657824626" style="float: left; margin-top: 0px; margin-right: 10px; margin-bottom: 10px; margin-left: 0px; cursor: pointer; width: 320px; height: 319px; " /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="color: rgb(0, 0, 238);"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="color: rgb(0, 0, 238);"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="color: rgb(0, 0, 238);"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="color: rgb(0, 0, 238);"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="color: rgb(0, 0, 238);"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="color: rgb(0, 0, 
238);"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="color: rgb(0, 0, 238);"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="color: rgb(0, 0, 238);"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="color: rgb(0, 0, 238);"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="color: rgb(0, 0, 238);"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="color: rgb(0, 0, 238);"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="color: rgb(0, 0, 238);"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="color: rgb(0, 0, 238);"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="color: rgb(0, 0, 238);"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="color: rgb(0, 0, 238);"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="color: rgb(0, 0, 238);"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="color: rgb(0, 0, 238);"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="color: rgb(0, 0, 238);"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="color: rgb(51, 51, 51);"&gt;Actually, this plot would look better if I had figured out how to rotate the figure so that the FIT lines are horizontal.  
(I'll give extra credit points to anyone who can fix my function to do that.)&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="color: rgb(51, 51, 51);"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="color: rgb(51, 51, 51);"&gt;From this graph we see that the best performances are the MATH 115 students who attend over 90% of the classes; the worst performances are the MATH 112 students who have under 50% attendance.&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="color: rgb(51, 51, 51);"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="color: rgb(51, 51, 51);"&gt;4.  Are we done?  Not quite.  We have looked at the fit, but have not looked at the residuals -- the differences between the performance and the fit.  They are stored in my.fit$residual -- I will round the values so they are easier to view.&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="color: rgb(51, 51, 51);"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="color: rgb(51, 51, 51);"&gt;&lt;div&gt;&gt; round(my.fit$residual)&lt;/div&gt;&lt;div&gt;         0-50 50-70 70-90 90-100&lt;/div&gt;&lt;div&gt;MATH 112    &lt;span class="Apple-style-span" style="color: rgb(255, 0, 0);"&gt;&lt;span class="Apple-style-span" style="font-weight: bold;"&gt;9&lt;/span&gt;&lt;/span&gt;     2    -3     -2&lt;/div&gt;&lt;div&gt;MATH 115   -4     0     0      1&lt;/div&gt;&lt;div&gt;MATH 122    2    -2     0      0&lt;/div&gt;&lt;div&gt;MATH 126   -2     0     2      0&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;I see one large residual that I have highlighted in red.  MATH 112 students who don't come to class (under 50% attendance) seem to do 9 points better than one would expect based on the additive fit.   
This might deserve further study.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4375349027250083196-1642927384537315152?l=exploratorydataanalysis.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://exploratorydataanalysis.blogspot.com/feeds/1642927384537315152/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4375349027250083196&amp;postID=1642927384537315152' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4375349027250083196/posts/default/1642927384537315152'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4375349027250083196/posts/default/1642927384537315152'/><link rel='alternate' type='text/html' href='http://exploratorydataanalysis.blogspot.com/2008/11/median-polishing-student-grades.html' title='Median Polishing Student Grades'/><author><name>Jim Albert</name><uri>http://www.blogger.com/profile/12622333572321654094</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://4.bp.blogspot.com/_V8g1rNtmHuM/SRbcXKFPm3I/AAAAAAAAASM/Bcv2MvU8UaU/s72-c/twowayplot.jpeg' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4375349027250083196.post-8367083820098098929</id><published>2008-10-30T05:44:00.000-07:00</published><updated>2008-10-30T06:09:33.697-07:00</updated><title type='text'>My Smoothing Tribute to the Phillies</title><content type='html'>&lt;a onblur="try 
{parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/_V8g1rNtmHuM/SQmwGwnrgOI/AAAAAAAAAR8/IET6v0deNM0/s1600-h/phillies.jpg"&gt;&lt;/a&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span" style="color: rgb(255, 0, 0);"&gt;&lt;span class="Apple-style-span"  style="font-size:large;"&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;Last night was an exciting night.  My team, the Philadelphia Phillies, won the World Series for only the second time in a history that dates back to 1883.&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span" style="color: rgb(255, 0, 0);"&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span" style="color: rgb(255, 0, 0);"&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;So it seems appropriate to give a statistical tribute to the Phillies.&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span" style="color: rgb(255, 0, 0);"&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span" style="color: rgb(255, 0, 0);"&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;I collected the Phillies' winning percentage (percentage of games won) for each season in their history.  
Below I plot the winning percentage against the season year.&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;img src="http://4.bp.blogspot.com/_V8g1rNtmHuM/SQmvz8LtnpI/AAAAAAAAAR0/fHT8HVDeVQM/s320/phillies0.jpg" style="float:left; margin:0 10px 10px 0;cursor:pointer; cursor:hand;width: 440px; height: 438px;" border="0" alt="" id="BLOGGER_PHOTO_ID_5262930946422578834" /&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span" style="color: rgb(255, 0, 0);"&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span" style="color: rgb(255, 0, 0);"&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span" style="color: rgb(255, 0, 0);"&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span" style="color: rgb(255, 0, 0);"&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span" style="color: rgb(255, 0, 0);"&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span" style="color: rgb(255, 0, 0);"&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;&lt;br 
/&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span" style="color: rgb(255, 0, 0);"&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span" style="color: rgb(255, 0, 0);"&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span" style="color: rgb(255, 0, 0);"&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span" style="color: rgb(255, 0, 0);"&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span" style="color: rgb(255, 0, 0);"&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span" style="color: rgb(255, 0, 0);"&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span" style="color: rgb(255, 0, 0);"&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  
style="font-family:arial;"&gt;&lt;span class="Apple-style-span" style="color: rgb(255, 0, 0);"&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span" style="color: rgb(255, 0, 0);"&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span" style="color: rgb(255, 0, 0);"&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span" style="color: rgb(255, 0, 0);"&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span" style="color: rgb(255, 0, 0);"&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span" style="color: rgb(255, 0, 0);"&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="color: rgb(255, 0, 0); font-family:arial;"&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="color: rgb(255, 0, 0); font-family:arial;"&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;&lt;br 
/&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="color: rgb(255, 0, 0); font-family:arial;"&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="color: rgb(255, 0, 0); font-family:arial;"&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="color: rgb(255, 0, 0); font-family:arial;"&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span" style="color: rgb(255, 0, 0);"&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;It is hard to see the general pattern in this graph, so it makes sense to smooth.  I could use Tukey's resistant smooth described in the EDA notes, but I'll illustrate the use of an alternative smoothing method called lowess.  
Here is the R code to graph the scatterplot and overlay the lowess smooth.&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span" style="color: rgb(255, 0, 0);"&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span" style="color: rgb(255, 0, 0);"&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;plot(Year,Win.Pct,pch=19,col="red")&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span" style="color: rgb(255, 0, 0);"&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;&lt;span class="Apple-tab-span" style="white-space:pre"&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span" style="color: rgb(255, 0, 0);"&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;lines(lowess(Year,Win.Pct,f=.2),lwd=3,col="red")&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span" style="color: rgb(255, 0, 0);"&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;&lt;span class="Apple-tab-span" style="white-space:pre"&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span" style="color: rgb(255, 0, 0);"&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;title("PHILLIES WINNING PERCENTAGES",col.main="red")&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span 
class="Apple-style-span" style="color: rgb(255, 0, 0);"&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;&lt;span class="Apple-tab-span" style="white-space:pre"&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span" style="color: rgb(255, 0, 0);"&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;abline(h=50,lwd=2)&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span" style="color: rgb(255, 0, 0);"&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span" style="color: rgb(255, 0, 0);"&gt;&lt;span class="Apple-style-span"  style="font-size:large;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style=""&gt;&lt;img src="http://2.bp.blogspot.com/_V8g1rNtmHuM/SQmwGwnrgOI/AAAAAAAAAR8/IET6v0deNM0/s320/phillies.jpg" border="0" alt="" id="BLOGGER_PHOTO_ID_5262931269736169698" style="float: left; margin-top: 0px; margin-right: 10px; margin-bottom: 10px; margin-left: 0px; cursor: pointer; width: 440px; height: 438px; " /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style=""&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span" style="color: rgb(255, 0, 0);"&gt;&lt;span class="Apple-style-span"  style="font-size:large;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style=""&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span" style="color: rgb(255, 0, 0);"&gt;&lt;span 
class="Apple-style-span"  style="font-size:large;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style=""&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span" style="color: rgb(255, 0, 0);"&gt;&lt;span class="Apple-style-span"  style="font-size:large;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style=""&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span" style="color: rgb(255, 0, 0);"&gt;&lt;span class="Apple-style-span"  style="font-size:large;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style=""&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span" style="color: rgb(255, 0, 0);"&gt;&lt;span class="Apple-style-span"  style="font-size:large;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style=""&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span" style="color: rgb(255, 0, 0);"&gt;&lt;span class="Apple-style-span"  style="font-size:large;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style=""&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span" style="color: rgb(255, 0, 0);"&gt;&lt;span class="Apple-style-span"  style="font-size:large;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style=""&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span" style="color: rgb(255, 0, 0);"&gt;&lt;span class="Apple-style-span"  
style="font-size:large;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style=""&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span" style="color: rgb(255, 0, 0);"&gt;&lt;span class="Apple-style-span"  style="font-size:large;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style=""&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span" style="color: rgb(255, 0, 0);"&gt;&lt;span class="Apple-style-span"  style="font-size:large;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style=""&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span" style="color: rgb(255, 0, 0);"&gt;&lt;span class="Apple-style-span"  style="font-size:large;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style=""&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span" style="color: rgb(255, 0, 0);"&gt;&lt;span class="Apple-style-span"  style="font-size:large;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style=""&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span" style="color: rgb(255, 0, 0);"&gt;&lt;span class="Apple-style-span"  style="font-size:large;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style=""&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span" style="color: rgb(255, 0, 0);"&gt;&lt;span class="Apple-style-span"  
style="font-size:large;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style=""&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span" style="color: rgb(255, 0, 0);"&gt;&lt;span class="Apple-style-span"  style="font-size:large;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style=""&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span" style="color: rgb(255, 0, 0);"&gt;&lt;span class="Apple-style-span"  style="font-size:large;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style=""&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span" style="color: rgb(255, 0, 0);"&gt;&lt;span class="Apple-style-span"  style="font-size:large;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style=""&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span" style="color: rgb(255, 0, 0);"&gt;&lt;span class="Apple-style-span"  style="font-size:large;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style=""&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span" style="color: rgb(255, 0, 0);"&gt;&lt;span class="Apple-style-span"  style="font-size:large;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span" style="color: rgb(255, 0, 0);"&gt;&lt;span class="Apple-style-span"  style="font-size:large;"&gt;&lt;br 
/&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span" style="color: rgb(255, 0, 0);"&gt;&lt;span class="Apple-style-span"  style="font-size:large;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span" style="color: rgb(255, 0, 0);"&gt;&lt;span class="Apple-style-span"  style="font-size:large;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span" style="color: rgb(255, 0, 0);"&gt;&lt;span class="Apple-style-span"  style="font-size:large;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="color: rgb(255, 0, 0); font-family:arial;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="color: rgb(255, 0, 0);  font-family:arial;"&gt;I&lt;span class="Apple-style-span" style="font-size: large;"&gt; added the horizontal line at WIN.PCT = 50 that corresponds to an average season.&lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span" style="color: rgb(255, 0, 0);"&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span" style="color: rgb(255, 0, 0);"&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;Looking at the graph, you should see why the Philadelphia fans are so excited about the Phillies winning this 
season.&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span" style="color: rgb(255, 0, 0);"&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span" style="color: rgb(255, 0, 0);"&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;1.  The Phillies teams generally have been crummy, especially between 1920 and 1950.  The team hasn't experienced much success.&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span" style="color: rgb(255, 0, 0);"&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span" style="color: rgb(255, 0, 0);"&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;2.  But there have been two recent periods when the Phillies have been successful.  One period was in the 1970s, and the climax of this success was the Phillies' first World Series win in 1980.  
The second period is in recent history and of course the Phillies won their second World Series in 2008.&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span" style="color: rgb(255, 0, 0);"&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span" style="color: rgb(255, 0, 0);"&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span" style="color: rgb(255, 0, 0);"&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-tab-span" style="white-space:pre"&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span" style="color: rgb(255, 0, 0);"&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt; &lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span" style="color: rgb(255, 0, 0);"&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span" style="color: rgb(255, 0, 0);"&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' 
src='https://blogger.googleusercontent.com/tracker/4375349027250083196-8367083820098098929?l=exploratorydataanalysis.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://exploratorydataanalysis.blogspot.com/feeds/8367083820098098929/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4375349027250083196&amp;postID=8367083820098098929' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4375349027250083196/posts/default/8367083820098098929'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4375349027250083196/posts/default/8367083820098098929'/><link rel='alternate' type='text/html' href='http://exploratorydataanalysis.blogspot.com/2008/10/my-smoothing-tribute-to-phillies.html' title='My Smoothing Tribute to the Phillies'/><author><name>Jim Albert</name><uri>http://www.blogger.com/profile/12622333572321654094</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://4.bp.blogspot.com/_V8g1rNtmHuM/SQmvz8LtnpI/AAAAAAAAAR0/fHT8HVDeVQM/s72-c/phillies0.jpg' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4375349027250083196.post-6265416570566535141</id><published>2008-10-26T17:59:00.000-07:00</published><updated>2008-10-26T18:52:57.122-07:00</updated><title type='text'>Comments on Plotting Homework</title><content type='html'>Here are some general and specific comments on your efforts on Homework 4 on plotting and straightening.&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;First, most of you did well on this homework.  You did a good job of describing the fit and the pattern in the residuals.  
You also made reasonable choices in reexpressing the x and/or the y variable so that the graph looked pretty straight.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;But there were some things that caused you to lose points.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;1.  What are you looking for in a residual plot?  In this assignment, the focus was on looking for nonlinear patterns.  For example, is it appropriate to fit a line to the (year, population) data for the England and Wales dataset?  We can answer this question by looking for a nonlinear pattern in the plot of residuals against year.  There is significant curvature (that is, a quadratic pattern) in the residual plot, which tells you that the population growth is not linear.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;By the way, some of you plotted log population against log year -- why did you take the log of year?  It doesn't make any sense to me.  &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;2.  Always talk about the fit and the residuals in the context of the data.  Someone plotted x against y without telling me the variables.  The fun part of statistics is that you can always talk about the application.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;3.  Remember that funny problem where the scatterplot shows two groups of points with a clear separation?  This is one type of nonlinear pattern that you won't be able to straighten with a single choice of power transformation.  But since the graph clearly divides into two parts, it makes sense to treat this as two independent problems and try to straighten each part.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;4.  Should one fit a line by least-squares or by a resistant line?  In many situations, it won't make a difference -- either fit will work.  But least-squares can give you relatively poor fits when there are outliers.  
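&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Here is a small sketch of that last point in R.  The data are made up for illustration (they are not from the homework), and I am assuming that the line function in base R, which fits Tukey's resistant line, is a reasonable stand-in for the resistant line in the EDA notes.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;x=1:10&lt;/div&gt;&lt;div&gt;y=2*x+1    # made-up points that fall exactly on a line&lt;/div&gt;&lt;div&gt;y[10]=60    # replace the last point with one made-up outlier&lt;/div&gt;&lt;div&gt;ls.fit=lm(y~x)    # least-squares fit&lt;/div&gt;&lt;div&gt;r.fit=line(x,y)    # resistant (Tukey) line&lt;/div&gt;&lt;div&gt;coef(ls.fit)    # intercept and slope from least-squares&lt;/div&gt;&lt;div&gt;coef(r.fit)    # intercept and slope from the resistant line&lt;/div&gt;&lt;div&gt;plot(x,residuals(ls.fit))    # residual plot for the least-squares fit&lt;/div&gt;&lt;div&gt;abline(h=0)&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;If the two sets of coefficients are close, either fit will probably do.  If they differ a lot, that is a hint that a few unusual points are pulling the least-squares line.&lt;/div&gt;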
&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;How can you tell if least squares isn't the best fit?  Look at the residual plot.  If you still see some increasing or decreasing pattern, then this tells you that least-squares hasn't explained all of the "tilt" pattern in the graph.&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4375349027250083196-6265416570566535141?l=exploratorydataanalysis.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://exploratorydataanalysis.blogspot.com/feeds/6265416570566535141/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4375349027250083196&amp;postID=6265416570566535141' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4375349027250083196/posts/default/6265416570566535141'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4375349027250083196/posts/default/6265416570566535141'/><link rel='alternate' type='text/html' href='http://exploratorydataanalysis.blogspot.com/2008/10/comments-on-plotting-homework.html' title='Comments on Plotting Homework'/><author><name>Jim Albert</name><uri>http://www.blogger.com/profile/12622333572321654094</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4375349027250083196.post-5662279098034230282</id><published>2008-10-20T09:38:00.000-07:00</published><updated>2008-10-20T09:45:17.060-07:00</updated><title type='text'>Straightening</title><content type='html'>This week, the main topic is reexpressing either the x or y variable to straighten a nonlinear pattern that we see in a scatterplot.  
Although the manipulation may be straightforward, it is possible to miss the main message in this material.  Here are some questions that may help in your explanations in the homework.&lt;br /&gt;&lt;br /&gt;Question 1:  What is a simple description of the pattern in a scatterplot?&lt;br /&gt;&lt;br /&gt;Question 2:  Why do we prefer to fit lines instead of more complicated curves like quadratic or cubic?&lt;br /&gt;&lt;br /&gt;Question 3:  Is it possible to straighten all nonlinear patterns by power transformations?&lt;br /&gt;&lt;br /&gt;Question 4:  Can you think of a situation or an example where it is not possible to straighten by power transformations?&lt;br /&gt;&lt;br /&gt;Don't forget to look at your data first -- you may see right away that it is not possible to straighten the graph.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4375349027250083196-5662279098034230282?l=exploratorydataanalysis.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://exploratorydataanalysis.blogspot.com/feeds/5662279098034230282/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4375349027250083196&amp;postID=5662279098034230282' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4375349027250083196/posts/default/5662279098034230282'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4375349027250083196/posts/default/5662279098034230282'/><link rel='alternate' type='text/html' href='http://exploratorydataanalysis.blogspot.com/2008/10/straightening.html' title='Straightening'/><author><name>Jim Albert</name><uri>http://www.blogger.com/profile/12622333572321654094</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4375349027250083196.post-8474166036650580657</id><published>2008-10-15T11:06:00.000-07:00</published><updated>2008-10-15T11:30:12.699-07:00</updated><title type='text'>Example of Resistant Fitting</title><content type='html'>In baseball, the objective is to win games and a team wins a game by scoring more runs than its opponent.  An interesting question is "how important is a single run" towards the goal of winning a game?  Suppose one collects the runs scored, the runs allowed, the number of wins and the number of losses for a group of teams.  Bill James (a famous guy who works on baseball data) discovered the empirical relationship&lt;br /&gt;&lt;br /&gt;log(wins/losses) = 2 log(runs scored/runs allowed)&lt;br /&gt;&lt;br /&gt;He called this the Pythagorean Relationship.&lt;br /&gt;&lt;br /&gt;Let's try to demonstrate this relationship by use of a resistant fit.&lt;br /&gt;&lt;br /&gt;1.  First, I collected data for all baseball teams in the 2008 season.  The dataset teams2008.txt contains for each of the 30 teams ...&lt;br /&gt;&lt;br /&gt;Team -- the name of the team&lt;br /&gt;Wins -- the number of wins&lt;br /&gt;Losses -- the number of losses&lt;br /&gt;Runs.Scored -- the total number of runs scored&lt;br /&gt;Runs.Allowed -- the total number of runs allowed&lt;br /&gt;&lt;br /&gt;2.  I read this dataset into R and compute the variables log.RR and log.WL.&lt;br /&gt;&lt;br /&gt;data=read.table("http://bayes.bgsu.edu/eda/data/teams2008.txt",header=TRUE)&lt;br /&gt;attach(data)&lt;br /&gt;&lt;br /&gt;log.RR=log(Runs.Scored/Runs.Allowed)&lt;br /&gt;log.WL=log(Wins/Losses)&lt;br /&gt;&lt;br /&gt;3.  I graph log.WL against log.RR and add team labels to the graph.  
As we hoped, the relationship looks pretty linear.&lt;br /&gt;&lt;br /&gt;plot(log.RR,log.WL,pch=19)&lt;br /&gt;text(log.RR,log.WL,Team,pos=2)&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/_V8g1rNtmHuM/SPY2RyzhcBI/AAAAAAAAAMk/9FB9gpkE184/s1600-h/plot1.jpg"&gt;&lt;img style="margin: 0pt 10px 10px 0pt; float: left; cursor: pointer;" src="http://1.bp.blogspot.com/_V8g1rNtmHuM/SPY2RyzhcBI/AAAAAAAAAMk/9FB9gpkE184/s320/plot1.jpg" alt="" id="BLOGGER_PHOTO_ID_5257449294325182482" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;4.  I next fit a resistant line using the rline function in the LearnEDA package.  I add the&lt;br /&gt;fitted line to the graph.&lt;br /&gt;&lt;br /&gt;the.fit=rline(log.RR,log.WL,iter=4)&lt;br /&gt;curve(the.fit$a+the.fit$b*(x-the.fit$xC),add=TRUE)&lt;br /&gt;&lt;br /&gt;5.  If Bill James' relationship holds, the slope of the resistant line should be close to 2.&lt;br /&gt;&lt;br /&gt;the.fit&lt;br /&gt;$a&lt;br /&gt;[1] 0.01006079&lt;br /&gt;$b&lt;br /&gt;[1] 1.801718&lt;br /&gt;&lt;br /&gt;It approximately holds since the slope of 1.8 is close to 2.&lt;br /&gt;&lt;br /&gt;6.  
To see if this is a reasonable fit, we compute the fit and the residuals.&lt;br /&gt;&lt;br /&gt;FIT=the.fit$a+the.fit$b*(log.RR-the.fit$xC)&lt;br /&gt;RESIDUAL=log.WL-FIT&lt;br /&gt;&lt;br /&gt;and then plot the residuals, adding the team labels.&lt;br /&gt;&lt;br /&gt;plot(log.RR,RESIDUAL,pch=19)&lt;br /&gt;abline(h=0)&lt;br /&gt;text(log.RR,RESIDUAL,Team,pos=2)&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_V8g1rNtmHuM/SPY2lIiXnwI/AAAAAAAAAMs/frO7WhxFHic/s1600-h/plot2.jpg"&gt;&lt;img style="margin: 0pt 10px 10px 0pt; float: left; cursor: pointer;" src="http://4.bp.blogspot.com/_V8g1rNtmHuM/SPY2lIiXnwI/AAAAAAAAAMs/frO7WhxFHic/s320/plot2.jpg" alt="" id="BLOGGER_PHOTO_ID_5257449626576330498" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;7.  What are we looking for in the residual plot?  First, we look for general patterns that we didn't see earlier in the first plot.  I don't see any trend, so it appears that we removed the tilt by fitting a line.&lt;br /&gt;&lt;br /&gt;Also we are looking for unusually small or large residuals.  Here a large positive residual corresponds to a "lucky team" -- a team who seemed to win more games than one would expect based on their runs scored and runs allowed.&lt;br /&gt;&lt;br /&gt;Which team was unusually lucky in the 2008 season?  
A hint:  they were a "heavenly" team from the west coast.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4375349027250083196-8474166036650580657?l=exploratorydataanalysis.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://exploratorydataanalysis.blogspot.com/feeds/8474166036650580657/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4375349027250083196&amp;postID=8474166036650580657' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4375349027250083196/posts/default/8474166036650580657'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4375349027250083196/posts/default/8474166036650580657'/><link rel='alternate' type='text/html' href='http://exploratorydataanalysis.blogspot.com/2008/10/example-of-resistant-fitting.html' title='Example of Resistant Fitting'/><author><name>Jim Albert</name><uri>http://www.blogger.com/profile/12622333572321654094</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://1.bp.blogspot.com/_V8g1rNtmHuM/SPY2RyzhcBI/AAAAAAAAAMk/9FB9gpkE184/s72-c/plot1.jpg' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4375349027250083196.post-4215372833337743908</id><published>2008-10-06T05:31:00.000-07:00</published><updated>2008-10-06T05:54:13.771-07:00</updated><title type='text'>Reexpressing for Symmetry</title><content type='html'>&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/_V8g1rNtmHuM/SOoJ80nWhbI/AAAAAAAAAMc/ZcG_kY0bxLM/s1600-h/phillies.jpg"&gt;&lt;img style="float:left; 
margin:0 10px 10px 0;cursor:pointer; cursor:hand;" src="http://2.bp.blogspot.com/_V8g1rNtmHuM/SOoJ80nWhbI/AAAAAAAAAMc/ZcG_kY0bxLM/s320/phillies.jpg" border="0" alt="" id="BLOGGER_PHOTO_ID_5254022855801603506" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Your instructor is currently in "Phillies Heaven".  His baseball team rarely has a chance to win the World Series (they have won one World Series in over 120 seasons of competing) and they are currently in the National League Championship against the Dodgers!&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Oh right -- I'm supposed to talk about the EDA class.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;I finished the grading on your "symmetry" homework and Fathom assignment.    You generally did fine, but I'll explain some issues that may have caused you to lose points.&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-weight: bold;"&gt;&lt;span class="Apple-style-span" style="color: rgb(255, 0, 0);"&gt;What was I looking for?&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;In the notes, we talked about several methods for assessing symmetry of a batch, including looking at the sequence of midsummaries, using a symmetry plot, and Hinkley's quick method.   For each dataset you consider, here's an outline of what you should do:&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;1.  &lt;span class="Apple-style-span" style="font-weight: bold;"&gt;&lt;span class="Apple-style-span" style="color: rgb(255, 0, 0);"&gt;Demonstrate &lt;/span&gt;&lt;/span&gt;(using some method) that your raw data looks nonsymmetric.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;2.  
&lt;span class="Apple-style-span" style="font-weight: bold;"&gt;&lt;span class="Apple-style-span" style="color: rgb(255, 0, 0);"&gt;Experiment&lt;/span&gt;&lt;/span&gt; with power transformations with different choices of p to try to make the reexpressed data approximately symmetric.  Use one of our methods to see if the "p-reexpression" works.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;3.  &lt;span class="Apple-style-span" style="font-weight: bold;"&gt;&lt;span class="Apple-style-span" style="color: rgb(255, 0, 0);"&gt;Convince me&lt;/span&gt;&lt;/span&gt; that you have found a reasonable transformation by graphing the reexpressed data (say by a histogram or a stemplot).&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;It is best if you explain (with words) your process of finding the best reexpression.  I'm much more interested in your thought process than your computer output.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-weight: bold;"&gt;&lt;span class="Apple-style-span" style="color: rgb(255, 0, 0);"&gt;Here are a few other pitfalls.&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;1.  Some of you were confused when you considered reexpressions with negative values of p.&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;p = 1  -- data is really right-skewed (big positive value of Hinkley's d)&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;p = 0 -- data is slightly right-skewed (smaller positive value of d)&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;p = -1 -- data is right skewed (positive value of d)&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;What is going on?  
The problem is that you were defining a reexpression like p = -1/2 as&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;data^(-1/2)&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;when you should have used&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;-data^(-1/2)&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;By adding the negative sign, all of your transformations are increasing functions, and then&lt;/div&gt;&lt;div&gt;you can make better sense of the reexpressed graphs and the methods.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;2.  Reexpression only works when there is sufficient spread in your data.  Suppose you have data that ranges from 50 to 60 -- here the ratio of the largest value to the smallest is only 60/50 = 1.2.  Reexpression using any value of p won't work -- that is, it won't noticeably change the shape of the data.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;3.  One of you used a normal probability plot to determine if the graph was symmetric.  What's wrong with this?  First, it is not one of the methods we discussed in the notes.  Second, checking for normality is different from (but related to) checking for symmetry.  
Data can be symmetric but not normally distributed.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4375349027250083196-4215372833337743908?l=exploratorydataanalysis.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://exploratorydataanalysis.blogspot.com/feeds/4215372833337743908/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4375349027250083196&amp;postID=4215372833337743908' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4375349027250083196/posts/default/4215372833337743908'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4375349027250083196/posts/default/4215372833337743908'/><link rel='alternate' type='text/html' href='http://exploratorydataanalysis.blogspot.com/2008/10/reexpressing-for-symmetry.html' title='Reexpressing for Symmetry'/><author><name>Jim Albert</name><uri>http://www.blogger.com/profile/12622333572321654094</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://2.bp.blogspot.com/_V8g1rNtmHuM/SOoJ80nWhbI/AAAAAAAAAMc/ZcG_kY0bxLM/s72-c/phillies.jpg' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4375349027250083196.post-6072726659921697318</id><published>2008-09-29T12:32:00.000-07:00</published><updated>2008-09-29T12:53:15.806-07:00</updated><title type='text'>Did reexpression work?</title><content type='html'>I finished grading your Fathom spread vs level plot assignment.  
I've posted all of the grades on Blackboard.&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;When I grade your homework, I'm more interested in your explanation and comments rather than the mechanics.  This homework is a good illustration of this.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;You started by constructing a spread vs level plot of the weights by the supplements.  Here's the graph produced by the spread.level.plot function in the LearnEDA package.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;img src="http://4.bp.blogspot.com/_V8g1rNtmHuM/SOEumffTriI/AAAAAAAAAMU/-5h6ardF70A/s320/spread.leve.plot.jpeg" style="float:left; margin:0 10px 10px 0;cursor:pointer; cursor:hand;" border="0" alt="" id="BLOGGER_PHOTO_ID_5251529879313428002" /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Here are the main questions:&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="color: rgb(255, 0, 0);"&gt;&lt;span class="Apple-style-span" style="font-weight: bold;"&gt;1.  Is there a dependence between spread and level? &lt;/span&gt;&lt;/span&gt; &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Yes, but the pattern in the graph is a bit confused with the outlying point in the lower-right section of the plot.  
If we removed this point, there would appear to be a stronger relationship.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="color: rgb(255, 0, 0);"&gt;&lt;span class="Apple-style-span" style="font-weight: bold;"&gt;2.  Can we improve by a suitable reexpression?&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;There is a line of slope 0.18 drawn on the graph.  This would suggest the use of a power transformation with power p = 1 - 0.18 = 0.82.  &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="color: rgb(255, 0, 0);"&gt;&lt;span class="Apple-style-span" style="font-weight: bold;"&gt;3.  Is this a reasonable strategy?&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;If we transform by a 0.82 power, this won't really change things.  It is almost equivalent to taking the power p = 1, which is no change.  &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;But looking more carefully at the graph, we would fit a different line if we ignored that one outlier.  Then one would get a line with a larger slope, like 0.50, and this would suggest the use of a root transformation (p = 1 - 0.50 = 0.5).  This would really be nontrivial and would help reduce the general dependence between spread and level.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;If you made some comments that were similar in spirit to the ones I've made above, then you got full credit.  
You could have lost points if you went through the mechanics without commenting on what you actually did.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4375349027250083196-6072726659921697318?l=exploratorydataanalysis.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://exploratorydataanalysis.blogspot.com/feeds/6072726659921697318/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4375349027250083196&amp;postID=6072726659921697318' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4375349027250083196/posts/default/6072726659921697318'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4375349027250083196/posts/default/6072726659921697318'/><link rel='alternate' type='text/html' href='http://exploratorydataanalysis.blogspot.com/2008/09/did-reexpression-work.html' title='Did reexpression work?'/><author><name>Jim Albert</name><uri>http://www.blogger.com/profile/12622333572321654094</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://4.bp.blogspot.com/_V8g1rNtmHuM/SOEumffTriI/AAAAAAAAAMU/-5h6ardF70A/s72-c/spread.leve.plot.jpeg' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4375349027250083196.post-4286894716651761539</id><published>2008-09-23T12:02:00.000-07:00</published><updated>2008-09-23T12:11:59.412-07:00</updated><title type='text'>Missing the Forest for the Trees</title><content type='html'>I just completed grading your "comparison" homework and have posted the grades.&lt;br /&gt;&lt;br 
/&gt;There is a figure of speech called "missing the forest for the trees."  This means that one can get caught up in the details of a problem without really understanding the real issues that are involved.&lt;br /&gt;&lt;br /&gt;An example of "missing the forest for the trees" is your homework. &lt;br /&gt;&lt;ul style="color: rgb(255, 0, 0);"&gt;&lt;li&gt;Why do we reexpress data?&lt;/li&gt;&lt;/ul&gt;We reexpress to equalize the spreads between batches.&lt;br /&gt;&lt;ul style="color: rgb(255, 0, 0);"&gt;&lt;li&gt;But why do we care about equalizing spreads?&lt;/li&gt;&lt;/ul&gt;We care about equalizing spreads since we wish to make a simple comparison between batches.&lt;br /&gt;&lt;ul style="color: rgb(255, 0, 0);"&gt;&lt;li&gt;What is a simple comparison?&lt;/li&gt;&lt;/ul&gt;A simple comparison is saying that one batch is, say, 5 units larger than another batch.&lt;br /&gt;&lt;br /&gt;If you don't conclude your analysis with a simple comparison, then all of your work (finding 5-number summaries, constructing a spread vs level plot, reexpressing, etc.) is for &lt;span style="color: rgb(255, 0, 0);"&gt;NOTHING&lt;/span&gt;.&lt;br /&gt;&lt;br /&gt;In other words, &lt;span style="color: rgb(255, 0, 0);"&gt;MAKING A USEFUL INTERPRETATION&lt;/span&gt; (in this case, a useful comparison) is &lt;span style="color: rgb(255, 0, 0);"&gt;EVERYTHING&lt;/span&gt;.&lt;br /&gt;&lt;br /&gt;A few of you were successful in making simple comparisons in your homework and I congratulate you if you got a 30/30 on the comparing batches homework.&lt;br /&gt;&lt;br /&gt;Remember, don't forget to look for the forest.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4375349027250083196-4286894716651761539?l=exploratorydataanalysis.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://exploratorydataanalysis.blogspot.com/feeds/4286894716651761539/comments/default' 
title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4375349027250083196&amp;postID=4286894716651761539' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4375349027250083196/posts/default/4286894716651761539'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4375349027250083196/posts/default/4286894716651761539'/><link rel='alternate' type='text/html' href='http://exploratorydataanalysis.blogspot.com/2008/09/missing-forest-for-trees.html' title='Missing the Forest for the Trees'/><author><name>Jim Albert</name><uri>http://www.blogger.com/profile/12622333572321654094</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4375349027250083196.post-7212298375670582715</id><published>2008-09-23T07:08:00.000-07:00</published><updated>2008-09-23T07:20:16.248-07:00</updated><title type='text'>A Exploratory Data Analysis Story</title><content type='html'>When I was grading your homework this week, I thought of this story.&lt;br /&gt;&lt;br /&gt;--------------------------------------------------------------------------&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: arial;"&gt;One day, a boy was interested in taking a course in exploratory data analysis, but he didn't have the money to pay for it.  He decided on asking his grandmother for the money for the course.  She decided to help, but she said "I hope this is a worthwhile class for you."&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: arial;"&gt;Anyway, the boy visited his grandmother recently and told her that the course was going well.  
The grandmother asked what he was learning and the boy responded:&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: arial;"&gt;Last week, we learned how to compare groups of data.  We compared the yearly snowfall of Buffalo with Cairo.  It was hard to compare the two groups since there was a dependence between level and spread that I learned by constructing a spread versus level graph.  But this graph suggested the use of a p = 0.5 power transformation and when I did a spread versus level graph of the transformed data, the dependence between spread and level was reduced.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: arial;"&gt;The grandmother, listening intently, responded "So what did you conclude from your data analysis?"&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: arial;"&gt;The boy said proudly "There is more snow in Buffalo than Cairo."&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: arial;"&gt;The grandmother then with a heavy sigh said "Can we get our money back?"&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;------------------------------------------------------------------------&lt;br /&gt;&lt;br /&gt;What is the message in this story?  
(It relates to the work that you did on your homework.)&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4375349027250083196-7212298375670582715?l=exploratorydataanalysis.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://exploratorydataanalysis.blogspot.com/feeds/7212298375670582715/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4375349027250083196&amp;postID=7212298375670582715' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4375349027250083196/posts/default/7212298375670582715'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4375349027250083196/posts/default/7212298375670582715'/><link rel='alternate' type='text/html' href='http://exploratorydataanalysis.blogspot.com/2008/09/exploratory-data-analysis-story.html' title='A Exploratory Data Analysis Story'/><author><name>Jim Albert</name><uri>http://www.blogger.com/profile/12622333572321654094</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4375349027250083196.post-910076185911285385</id><published>2008-09-17T07:58:00.000-07:00</published><updated>2008-09-17T08:07:05.907-07:00</updated><title type='text'>Boxplots in R</title><content type='html'>Here is a simple example of constructing boxplots and summary stats in R.&lt;br /&gt;&lt;br /&gt;I'm interested in comparing the team statistics for baseball teams this year -- I've heard&lt;br /&gt;that American League teams score more runs.  
Is that true?&lt;br /&gt;&lt;br /&gt;Using data from baseball-reference.com, I created a dataset 2008teamstats.txt that contains current statistics for all 30 baseball teams.&lt;br /&gt;&lt;br /&gt;Here's my R script.  I'll paste in a horizontal-style boxplot display at the end.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;&gt; b.data=read.table("http://bayes.bgsu.edu/eda/data/2008teamstats.txt",header=T)&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;&gt; b.data[1:5,1:5]&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;   Tm   League  R.G   R   G&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;1 TEX American 5.49 835 152&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;2 BOS American 5.29 799 151&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;3 MIN American 5.16 779 151&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;4 DET American 5.06 759 150&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;5 BAL American 5.03 750 149&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;&gt; # I am interested in comparing the runs scored per game&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;&gt; # (variable R.G) for the American and National league teams&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;&gt; attach(b.data)&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;&gt; # Here are the boxplots:&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;&gt; boxplot(R.G~League)&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;&gt;&lt;/span&gt;&lt;br /&gt;&lt;span 
style="font-family: courier new;"&gt;&gt; # boxplot has many options -- if you prefer horizontal style ...&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;&gt; boxplot(R.G~League, horizontal=TRUE)&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;&gt; # To get summary stats for each group, just assign boxplot()&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;&gt; # to a variable, and then display the variable.&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;&gt; b=boxplot(R.G~League, horizontal=TRUE)&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;&gt; b&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;$stats&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;     [,1]  [,2]&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;[1,] 3.99 3.910&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;[2,] 4.43 4.295&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;[3,] 4.85 4.575&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;[4,] 5.06 4.695&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;[5,] 5.49 4.910&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;$n&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;[1] 14 16&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;$conf&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;         [,1]  [,2]&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;[1,] 4.583968 
4.417&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;[2,] 5.116032 4.733&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;$out&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;[1] 5.34&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;$group&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;[1] 2&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;$names&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;[1] "American" "National"&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_V8g1rNtmHuM/SNEc9Rtzr7I/AAAAAAAAAMM/QG3YNvMyOCU/s1600-h/boxplots.jpg"&gt;&lt;img style="margin: 0pt 10px 10px 0pt; float: left; cursor: pointer;" src="http://4.bp.blogspot.com/_V8g1rNtmHuM/SNEc9Rtzr7I/AAAAAAAAAMM/QG3YNvMyOCU/s320/boxplots.jpg" alt="" id="BLOGGER_PHOTO_ID_5247006879916470194" border="0" /&gt;&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4375349027250083196-910076185911285385?l=exploratorydataanalysis.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://exploratorydataanalysis.blogspot.com/feeds/910076185911285385/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4375349027250083196&amp;postID=910076185911285385' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4375349027250083196/posts/default/910076185911285385'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4375349027250083196/posts/default/910076185911285385'/><link rel='alternate' type='text/html' 
href='http://exploratorydataanalysis.blogspot.com/2008/09/boxplots-in-r.html' title='Boxplots in R'/><author><name>Jim Albert</name><uri>http://www.blogger.com/profile/12622333572321654094</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://4.bp.blogspot.com/_V8g1rNtmHuM/SNEc9Rtzr7I/AAAAAAAAAMM/QG3YNvMyOCU/s72-c/boxplots.jpg' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4375349027250083196.post-9114082296486881171</id><published>2008-09-16T11:05:00.000-07:00</published><updated>2008-09-16T11:13:29.741-07:00</updated><title type='text'>Working with subgroups in R</title><content type='html'>Since we are comparing groups in EDA, I thought I would give some guidance on how to subset data in R.&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Suppose we want to construct stemplots for the areas of the islands in each continent in the homework.  Here is some R work for constructing a stemplot of the island areas in the Arctic Ocean.  
The key command is "subset".&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;By the way, I don't think there is a simple way of constructing parallel stemplots in R.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;&gt; data(island.areas)&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;&gt; names(island.areas)&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;[1] "Ocean" "Name"  "Area" &lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;&gt; attach(island.areas)&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;&gt; arctic.areas=subset(Area,Ocean=="Arctic")&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;&gt; arctic.areas&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt; [1]  16671 195928  27038   6194  21331  75767  16274  12872   9570  15913  83896&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;[12]   8000  35000   2800  23940&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;&gt; library(aplpack)&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;&gt; stem.leaf(arctic.areas)&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;1 | 2: represents 12000&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt; leaf unit: 1000&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;            
n: 15&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;   1    0* | 2&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;   4    0. | 689&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;   5    1* | 2&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;  (3)   1. | 566&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;   7    2* | 13&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;   5    2. | 7&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;        3* | &lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;   4    3. | 5&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;HI: 75767 83896 195928&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4375349027250083196-9114082296486881171?l=exploratorydataanalysis.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://exploratorydataanalysis.blogspot.com/feeds/9114082296486881171/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4375349027250083196&amp;postID=9114082296486881171' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4375349027250083196/posts/default/9114082296486881171'/><link rel='self' type='application/atom+xml' 
href='http://www.blogger.com/feeds/4375349027250083196/posts/default/9114082296486881171'/><link rel='alternate' type='text/html' href='http://exploratorydataanalysis.blogspot.com/2008/09/working-with-subgroups-in-r.html' title='Working with subgroups in R'/><author><name>Jim Albert</name><uri>http://www.blogger.com/profile/12622333572321654094</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4375349027250083196.post-5599017538018330935</id><published>2008-09-14T11:07:00.000-07:00</published><updated>2008-09-14T11:17:14.629-07:00</updated><title type='text'>Bins in a histogram and looking ahead</title><content type='html'>Hi EDA folks:&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;I finished grading your Fathom assignment on the number of bins.  Generally, you all did well on this, but there are a couple of things I should mention.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;1.  The moral of this assignment is that as you have more data (bigger n), you should use a smaller bin width and have more bins.  It seemed that your best histograms by eye were similar to the ones chosen by the "optimal rule" formula.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;2.  I think the rule wasn't that effective for constructing a histogram of the Old Faithful data.  With only a small number of bins, you couldn't see any structure within each of the two humps.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;3.  If you lost points, it probably was due to some confusion in your calculations or maybe not the best answer to a question -- like the one about the histogram of the Old Faithful data.  
If you don't know why you lost points, just email me.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Looking ahead, the next assignment is on EFFECTIVE COMPARISON.  You'll learn a specific method for equalizing spreads between batches.   Although you might understand the method (spread vs. level plot, reexpressing, etc.), it is important not to lose sight of what we are trying to accomplish.  We want to make a reasonable comparison between groups.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;So, when you do your homework this week, don't forget to think about the BIG PICTURE.  Conclude your work by making a comparison.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Last, we'll be using some new R commands.  Don't forget to look at the "Chapter 3 work" file that illustrates the use of these new commands.&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4375349027250083196-5599017538018330935?l=exploratorydataanalysis.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://exploratorydataanalysis.blogspot.com/feeds/5599017538018330935/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4375349027250083196&amp;postID=5599017538018330935' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4375349027250083196/posts/default/5599017538018330935'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4375349027250083196/posts/default/5599017538018330935'/><link rel='alternate' type='text/html' href='http://exploratorydataanalysis.blogspot.com/2008/09/bins-in-histogram-and-looking-ahead.html' title='Bins in a histogram and looking ahead'/><author><name>Jim 
Albert</name><uri>http://www.blogger.com/profile/12622333572321654094</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4375349027250083196.post-8969027629409650081</id><published>2008-09-07T08:20:00.000-07:00</published><updated>2008-09-07T08:30:34.375-07:00</updated><title type='text'>EDA Grading</title><content type='html'>Most of you are doing great on the homework so far.  But I thought I should explain how I grade and why you may be losing points.&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Generally I am more interested in your explanations and how you are answering the main questions of interest.  For example, in the graphs and summaries homework, I am not as interested in your R work and your computation.  Most of you are doing ok in getting R to produce stemplots and compute letter values.  But the BIG questions are ...&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;-- what is the best choice of stemplot?&lt;/div&gt;&lt;div&gt;-- what have we learned about the data in terms of shape, average, and spread?&lt;/div&gt;&lt;div&gt;-- are there observations that deviate from the rest and why are these observations unusual?&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;You should be addressing these BIG questions in the first R homework.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;In the Fathom activity, we were looking at the number of outliers one would expect for samples from different population distributions.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;For normal data, we don't see many outliers.  But if the data comes from a heavy-tailed distribution (like the t distribution), outliers are more common.   
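&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The same comparison can be sketched in R for anyone who wants to check it outside of Fathom.  This is only a rough illustration under stated assumptions: the sample size of 50, the 1000 replications, the 3 degrees of freedom, and the helper name count.out are all made up for this example, and the outlier rule is the one built into boxplot.stats, which may not match the exact rule used in the Fathom activity.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;&gt; # hypothetical helper: number of points boxplot.stats flags as outliers&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;&gt; count.out=function(x) length(boxplot.stats(x)$out)&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;&gt; set.seed(1)  # arbitrary seed so the runs are repeatable&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;&gt; # average number of outliers in 1000 normal samples of size 50&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;&gt; mean(replicate(1000, count.out(rnorm(50))))&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;&gt; # same for samples from a t distribution with 3 degrees of freedom&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;&gt; mean(replicate(1000, count.out(rt(50, df=3))))&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The second average should come out noticeably larger than the first.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;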
If this general conclusion wasn't obvious from your work, then you may have lost points.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Here is a final quibble (small point).  Most of the stemplots you showed me were hard to read and certainly you wouldn't want to use them for any presentation.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Which stemplot do you prefer?&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Stemplot A:&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;1 | 2: represents 1.2&lt;/div&gt;&lt;div&gt; leaf unit: 0.1&lt;/div&gt;&lt;div&gt;            n: 50&lt;/div&gt;&lt;div&gt;   2    -1. | 55&lt;/div&gt;&lt;div&gt;   6    -1* | 0233&lt;/div&gt;&lt;div&gt;  11    -0. | 67899&lt;/div&gt;&lt;div&gt;  22    -0* | 01111113344&lt;/div&gt;&lt;div&gt;  (9)    0* | 112223334&lt;/div&gt;&lt;div&gt;  19     0. | 789999&lt;/div&gt;&lt;div&gt;  13     1* | 0011233&lt;/div&gt;&lt;div&gt;   6     1. | 68&lt;/div&gt;&lt;div&gt;   4     2* | 023&lt;/div&gt;&lt;div&gt;   1     2. | 9&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;Stemplot B:&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;1 | 2: represents 1.2&lt;/div&gt;&lt;div&gt; leaf unit: 0.1&lt;/div&gt;&lt;div&gt; &lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;           n: 50&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;   2    -1. | 55&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;   6    -1* | 0233&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;  11    -0. 
| 67899&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;  22    -0* | 01111113344&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;  (9)    0* | 112223334&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;  19     0. | 789999&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;  13     1* | 0011233&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;   6     1. | 68&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;   4     2* | 023&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;   1     2. | 9&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;The message here is that you should use a &lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: arial; font-size: 13px; font-weight: bold; "&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;monospaced &lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: Georgia; font-weight: normal; "&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;font where each character takes the same space like Couri&lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: Georgia; font-size: 16px; font-weight: normal; "&gt;er.&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4375349027250083196-8969027629409650081?l=exploratorydataanalysis.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' 
type='application/atom+xml' href='http://exploratorydataanalysis.blogspot.com/feeds/8969027629409650081/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4375349027250083196&amp;postID=8969027629409650081' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4375349027250083196/posts/default/8969027629409650081'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4375349027250083196/posts/default/8969027629409650081'/><link rel='alternate' type='text/html' href='http://exploratorydataanalysis.blogspot.com/2008/09/eda-grading.html' title='EDA Grading'/><author><name>Jim Albert</name><uri>http://www.blogger.com/profile/12622333572321654094</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4375349027250083196.post-3660332058766406179</id><published>2008-09-06T07:53:00.001-07:00</published><updated>2008-09-06T07:59:54.519-07:00</updated><title type='text'>Using the LearnEDA package</title><content type='html'>I'm starting to grade your first R homework.   I wrote the LearnEDA package to make it easier for you to read in datasets and do some basic calculations.  &lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;If you look at the R folder in the Course Documents section of Blackboard, you'll see the appropriate R commands for each topic. &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;FOR EACH HOMEWORK, MAKE SURE YOU LOOK AT THE R FOLDER SO YOU KNOW&lt;/div&gt;&lt;div&gt;THE COMMANDS YOU NEED TO USE.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;In the first homework, you were supposed to read in the baseball attendance data and compute some letter values. 
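&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;As a reminder of what the first few letter values are, here is a small sketch in base R that needs no extra package.  It uses the built-in rivers data purely as a stand-in (it is not the homework data), and the variable name five is just for illustration.  The fivenum function returns the low extreme, lower fourth, median, upper fourth, and high extreme, and the difference of the two fourths is the fourth-spread.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;&gt; # rivers ships with R; a stand-in here, not the homework data&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;&gt; five=fivenum(rivers)&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;&gt; five&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;&gt; # fourth-spread = upper fourth minus lower fourth&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;&gt; five[4]-five[2]&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The lval function from LearnEDA goes further into the tails (eighths and so on) and also reports the mids and spreads.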
&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Here's how you do this in R using the LearnEDA package.  (I'm assuming you have already installed this package.)&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;This loads the package.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;&gt; library(LearnEDA)&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;Read in the dataset:&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;&gt; data(baseball.attendance)&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;Attach the data to make the variable names available:&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;&gt; attach(baseball.attendance)&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;Compute letter values:&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;&gt; lval(Home.Attendance)&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;  depth      lo      hi     mids spreads&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span 
class="Apple-style-span" style="font-family: 'courier new';"&gt;1  15.5 32783.5 32783.5 32783.50     0.0&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;2   8.0 23704.0 36164.0 29934.00 12460.0&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;3   4.5 21614.5 40166.0 30890.25 18551.5&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;4   2.5 16574.0 41010.0 28792.00 24436.0&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;5   1.0  8651.0 42067.0 25359.00 33416.0&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;&gt; &lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4375349027250083196-3660332058766406179?l=exploratorydataanalysis.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://exploratorydataanalysis.blogspot.com/feeds/3660332058766406179/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4375349027250083196&amp;postID=3660332058766406179' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4375349027250083196/posts/default/3660332058766406179'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4375349027250083196/posts/default/3660332058766406179'/><link rel='alternate' type='text/html' 
href='http://exploratorydataanalysis.blogspot.com/2008/09/using-learneda-package.html' title='Using the LearnEDA package'/><author><name>Jim Albert</name><uri>http://www.blogger.com/profile/12622333572321654094</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4375349027250083196.post-5005684530398745567</id><published>2008-09-04T12:09:00.000-07:00</published><updated>2008-09-04T12:16:28.176-07:00</updated><title type='text'>Some common problems in R and Fathom</title><content type='html'>Here are some common questions I've heard recently about R and Fathom.&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;1.  Some of you are having problems reading in data files, which is a big concern.  There are two ways you can mess up.  &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;(a)  First, it is important that R can find your files.   Put all of your R work in a particular folder, say EDA, and then choose the menu item File -&gt; Change dir ... and select the folder EDA.  To check if the working directory really has changed, type&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;dir()&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;and you should see your data files.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;(b)  A general form to read in a text data file is&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;data=read.table(file.name, header=T, sep="\t")&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;where file.name is in double-quotes.  The header option says that the first line in the file contains the variable names and the sep option says that columns are separated by the tab character.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;2.  How do you plot curves on Fathom?  
&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Suppose you have created a scatterplot and wish to add a curve.  You select the graph and choose the menu item Graph -&gt; Plot Function.  Then you just type the function (using the variable name on the x axis) in the box.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4375349027250083196-5005684530398745567?l=exploratorydataanalysis.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://exploratorydataanalysis.blogspot.com/feeds/5005684530398745567/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4375349027250083196&amp;postID=5005684530398745567' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4375349027250083196/posts/default/5005684530398745567'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4375349027250083196/posts/default/5005684530398745567'/><link rel='alternate' type='text/html' href='http://exploratorydataanalysis.blogspot.com/2008/09/some-common-problems-in-r-and-fathom.html' title='Some common problems in R and Fathom'/><author><name>Jim Albert</name><uri>http://www.blogger.com/profile/12622333572321654094</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4375349027250083196.post-1735103048826555354</id><published>2008-08-26T17:35:00.000-07:00</published><updated>2008-08-26T17:56:56.004-07:00</updated><title type='text'>Example of data analysis on R</title><content type='html'>&lt;a onblur="try {parent.deselectBloggerImageGracefully();} 
catch(e) {}" href="http://4.bp.blogspot.com/_V8g1rNtmHuM/SLSlHY0wb6I/AAAAAAAAAME/HPtNey6w3dk/s1600-h/baseball.jpg"&gt;&lt;img style="margin: 0pt 10px 10px 0pt; float: left; cursor: pointer;" src="http://4.bp.blogspot.com/_V8g1rNtmHuM/SLSlHY0wb6I/AAAAAAAAAME/HPtNey6w3dk/s320/baseball.jpg" alt="" id="BLOGGER_PHOTO_ID_5238993812880125858" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;Here is a typical data analysis on R.   I'm a baseball fan and I'm interested in the wins and losses for the current Major League teams and how these wins and losses are related to the runs scored and runs allowed.&lt;br /&gt;&lt;br /&gt;In Excel, I create a dataset that contains the wins, losses, runs scored, and runs allowed for all 30 teams.  Here are the first 4 rows of the dataset.&lt;br /&gt;&lt;br /&gt;&lt;table str="" style="border-collapse: collapse; width: 216pt;" width="290" border="0" cellpadding="0" cellspacing="0"&gt;&lt;col style="width: 70pt;" span="2" width="94"&gt;  &lt;col style="width: 16pt;" span="2" width="22"&gt;  &lt;col style="width: 22pt;" span="2" width="29"&gt;  &lt;tbody&gt;&lt;tr style="height: 13.2pt;" height="18"&gt;   &lt;td style="height: 13.2pt; width: 70pt;" width="94" height="18"&gt;Team&lt;/td&gt;   &lt;td style="width: 70pt;" width="94"&gt;League&lt;/td&gt;   &lt;td style="width: 16pt;" width="22"&gt;W&lt;/td&gt;   &lt;td style="width: 16pt;" width="22"&gt;L&lt;/td&gt;   &lt;td style="width: 22pt;" width="29"&gt;RS&lt;/td&gt;   &lt;td style="width: 22pt;" width="29"&gt;RA&lt;/td&gt;  &lt;/tr&gt; &lt;!--seasonType=2--&gt;&lt;!--startDate=20080826--&gt;&lt;!--StartDate is currentDate--&gt;  &lt;tr style="height: 13.2pt;" height="18"&gt;   &lt;td style="height: 13.2pt;" height="18"&gt;Tampa Bay&lt;/td&gt;   &lt;td&gt;American&lt;/td&gt;   &lt;td num="" align="right"&gt;79&lt;/td&gt;   &lt;td num="" align="right"&gt;50&lt;/td&gt;   &lt;td num="" align="right"&gt;597&lt;/td&gt;   &lt;td num="" align="right"&gt;515&lt;/td&gt;  &lt;/tr&gt;  &lt;tr 
style="height: 13.2pt;" height="18"&gt;   &lt;td style="height: 13.2pt;" height="18"&gt;Boston&lt;/td&gt;   &lt;td&gt;American&lt;/td&gt;   &lt;td num="" align="right"&gt;75&lt;/td&gt;   &lt;td num="" align="right"&gt;55&lt;/td&gt;   &lt;td num="" align="right"&gt;670&lt;/td&gt;   &lt;td num="" align="right"&gt;559&lt;/td&gt;  &lt;/tr&gt;  &lt;tr style="height: 13.2pt;" height="18"&gt;   &lt;td style="height: 13.2pt;" height="18"&gt;NY Yankees&lt;/td&gt;   &lt;td&gt;American&lt;/td&gt;   &lt;td num="" align="right"&gt;70&lt;/td&gt;   &lt;td num="" align="right"&gt;60&lt;/td&gt;   &lt;td num="" align="right"&gt;632&lt;/td&gt;   &lt;td num="" align="right"&gt;585&lt;/td&gt;  &lt;/tr&gt;  &lt;tr style="height: 13.2pt;" height="18"&gt;   &lt;td style="height: 13.2pt;" height="18"&gt;Toronto&lt;/td&gt;   &lt;td&gt;American&lt;/td&gt;   &lt;td num="" align="right"&gt;67&lt;/td&gt;   &lt;td num="" align="right"&gt;63&lt;/td&gt;   &lt;td num="" align="right"&gt;574&lt;/td&gt;   &lt;td num="" align="right"&gt;510&lt;/td&gt;  &lt;/tr&gt; &lt;/tbody&gt;&lt;/table&gt;&lt;br /&gt;I save this dataset as "baseball2008.txt" -- text, tab-delimited format.  
I save this file in a folder called "eda" and I make this the current R working directory so R will find this file.&lt;br /&gt;&lt;br /&gt;Here is my analysis:&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;# read the datafile into R&lt;br /&gt;&lt;br /&gt;data=read.table("baseball2008.txt",header=T,sep="\t")&lt;br /&gt;&lt;br /&gt;# attach the data to make the variable names available in R&lt;br /&gt;&lt;br /&gt;attach(data)&lt;br /&gt;&lt;br /&gt;# compute the winning proportion for all teams&lt;br /&gt;&lt;br /&gt;win.prop=W/(W+L)&lt;br /&gt;&lt;br /&gt;# what was the largest winning proportion?&lt;br /&gt;&lt;br /&gt;max(win.prop)&lt;br /&gt;&lt;br /&gt;[1] 0.6183206&lt;br /&gt;&lt;br /&gt;# which team had the largest winning proportion?&lt;br /&gt;&lt;br /&gt;Team[win.prop==max(win.prop)]&lt;br /&gt;&lt;br /&gt;[1] Chicago Cubs&lt;br /&gt;&lt;br /&gt;# for each team, compute the number of runs scored per game&lt;br /&gt;&lt;br /&gt;runs.game=RS/(W+L)&lt;br /&gt;&lt;br /&gt;# construct a stemplot of the runs scored per game&lt;br /&gt;# (I'm assuming you have the aplpack package installed)&lt;br /&gt;&lt;br /&gt;library(aplpack)&lt;br /&gt;stem.leaf(runs.game)&lt;br /&gt;&lt;br /&gt;1 | 2: represents 1.2&lt;br /&gt;leaf unit: 0.1&lt;br /&gt;      n: 30&lt;br /&gt;1        s | 7&lt;br /&gt;4       3. | 889&lt;br /&gt;7       4* | 011&lt;br /&gt;8    t | 2&lt;br /&gt;12        f | 4555&lt;br /&gt;(6)  s | 666667&lt;br /&gt;12     4. 
| 88899&lt;br /&gt;7       5* | 00011&lt;br /&gt;   t          |&lt;br /&gt;2         f | 45&lt;br /&gt;&lt;br /&gt;# (sorry the display got messed up when I put it on this blog)&lt;br /&gt;# you get a similar display by using a histogram&lt;br /&gt;&lt;br /&gt;hist(runs.game)&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/_V8g1rNtmHuM/SLSjwefy8RI/AAAAAAAAALw/R78yB-OtU_8/s1600-h/histogram.jpg"&gt;&lt;img style="margin: 0pt 10px 10px 0pt; float: left; cursor: pointer;" src="http://2.bp.blogspot.com/_V8g1rNtmHuM/SLSjwefy8RI/AAAAAAAAALw/R78yB-OtU_8/s400/histogram.jpg" alt="" id="BLOGGER_PHOTO_ID_5238992319754203410" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;# there are two teams who stand out with respect to runs scored&lt;br /&gt;&lt;br /&gt;Team[runs.game&gt;5.3]&lt;br /&gt;&lt;br /&gt;[1] Texas        Chicago Cubs&lt;br /&gt;&lt;br /&gt;# what is the average number of runs per game scored by a team?&lt;br /&gt;&lt;br /&gt;mean(runs.game)&lt;br /&gt;&lt;br /&gt;[1] 4.621464&lt;br /&gt;&lt;br /&gt;# do American League teams score more than National League teams?&lt;br /&gt;&lt;br /&gt;mean(runs.game[League=="American"])&lt;br /&gt;&lt;br /&gt;[1] 4.756021&lt;br /&gt;&lt;br /&gt;mean(runs.game[League=="National"])&lt;br /&gt;&lt;br /&gt;[1] 4.503728&lt;br /&gt;&lt;br /&gt;# there is an interesting relationship between a team's winning&lt;br /&gt;# proportion and the number of runs scored and runs allowed&lt;br /&gt;#&lt;br /&gt;# the relationship is  log(W/L) = k log(RS/RA)&lt;br /&gt;# where k is typically near 2&lt;br /&gt;&lt;br /&gt;# we'll compute log(W/L), 
log(RS/RA)&lt;br /&gt;# and fit a least-squares line -- we'll see if the slope is close to 2&lt;br /&gt;&lt;br /&gt;log.WL=log(W/L)&lt;br /&gt;log.RR=log(RS/RA)&lt;br /&gt;&lt;br /&gt;lm(log.WL ~ log.RR)&lt;br /&gt;&lt;br /&gt;Coefficients:&lt;br /&gt;(Intercept)       log.RR&lt;br /&gt;-0.0003097    1.8377065&lt;br /&gt;&lt;br /&gt;# here we see the slope is 1.84, which is relatively close to 2&lt;/span&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4375349027250083196-1735103048826555354?l=exploratorydataanalysis.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://exploratorydataanalysis.blogspot.com/feeds/1735103048826555354/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4375349027250083196&amp;postID=1735103048826555354' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4375349027250083196/posts/default/1735103048826555354'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4375349027250083196/posts/default/1735103048826555354'/><link rel='alternate' type='text/html' href='http://exploratorydataanalysis.blogspot.com/2008/08/example-of-data-analysis-on-r.html' title='Example of data analysis on R'/><author><name>Jim Albert</name><uri>http://www.blogger.com/profile/12622333572321654094</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://4.bp.blogspot.com/_V8g1rNtmHuM/SLSlHY0wb6I/AAAAAAAAAME/HPtNey6w3dk/s72-c/baseball.jpg' height='72' 
width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4375349027250083196.post-7165630767730686215</id><published>2008-08-26T13:53:00.000-07:00</published><updated>2008-08-26T13:55:32.577-07:00</updated><title type='text'>accessing Fathom</title><content type='html'>As a student who is a couple of hours away from BGSU, is there any way I can install Fathom without coming to campus?&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4375349027250083196-7165630767730686215?l=exploratorydataanalysis.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://exploratorydataanalysis.blogspot.com/feeds/7165630767730686215/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4375349027250083196&amp;postID=7165630767730686215' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4375349027250083196/posts/default/7165630767730686215'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4375349027250083196/posts/default/7165630767730686215'/><link rel='alternate' type='text/html' href='http://exploratorydataanalysis.blogspot.com/2008/08/accessing-fathom.html' title='accessing Fathom'/><author><name>learner</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4375349027250083196.post-6687510692923058166</id><published>2008-08-25T08:46:00.000-07:00</published><updated>2008-08-25T08:55:07.738-07:00</updated><title type='text'>Learning R</title><content type='html'>R is a great program but it has a relatively steep learning curve.  
To help you get started, I have three help sessions on R planned this week -- you need only attend &lt;span style="font-weight: bold;"&gt;one&lt;/span&gt; of the sessions.&lt;br /&gt;&lt;br /&gt;I have four documents R_INTRO_PART_I, R_INTRO_PART_II, R_INTRO_PART_III, and R_INTRO_PART_IV in the Course Documents section that describe different aspects of R. &lt;br /&gt;&lt;br /&gt;1. Manipulating vectors.   A basic object in R is a vector.   R_INTRO_PART_I discusses how to create and work on vectors.&lt;br /&gt;&lt;br /&gt;2.  Input and output.   One typically wants to read datafiles into R -- read.table is useful for doing this.   Data is typically stored in an R object called a data frame.  Also you'll want to save R output including graphs and paste this material into a Word document.&lt;br /&gt;&lt;br /&gt;3.  Matrices.  You should be comfortable working with and manipulating matrices in R.&lt;br /&gt;&lt;br /&gt;4.  Plotting.  You should be familiar with basic plotting commands and understand how one can add things (like labels and titles) to graphs.&lt;br /&gt;&lt;br /&gt;I hope you have R installed on your laptop.  
You may find it helpful to bring your laptop to the help session.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4375349027250083196-6687510692923058166?l=exploratorydataanalysis.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://exploratorydataanalysis.blogspot.com/feeds/6687510692923058166/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4375349027250083196&amp;postID=6687510692923058166' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4375349027250083196/posts/default/6687510692923058166'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4375349027250083196/posts/default/6687510692923058166'/><link rel='alternate' type='text/html' href='http://exploratorydataanalysis.blogspot.com/2008/08/learning-r.html' title='Learning R'/><author><name>Jim Albert</name><uri>http://www.blogger.com/profile/12622333572321654094</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4375349027250083196.post-8807988783940213878</id><published>2008-08-22T08:23:00.000-07:00</published><updated>2008-08-22T08:44:42.635-07:00</updated><title type='text'>Installing R Packages</title><content type='html'>Now that you have successfully installed R, you have access to many functions in the R "base" package.  But we'll want to add packages that will give us additional functions helpful for the class.  I'll illustrate adding the package "aplpack" that we'll need to get the "stem.leaf" function.  
Also, I'll show you how to add the "LearnEDA" package that I wrote for the class.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold; color: rgb(255, 0, 0);"&gt; Installing the package aplpack from CRAN.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;1.  In R, choose the menu item Packages -&gt; Install Packages.&lt;br /&gt;&lt;br /&gt;2.  You'll be asked to choose a CRAN mirror site -- choose one in the United States.&lt;br /&gt;&lt;br /&gt;3.  Then you'll see a list of all available Packages -- choose aplpack.&lt;br /&gt;&lt;br /&gt;4.  At this point, the package will be downloaded and installed -- if you see the message&lt;br /&gt;&lt;br /&gt;package 'aplpack' successfully unpacked and MD5 sums checked&lt;br /&gt;&lt;br /&gt;you're in good shape.&lt;br /&gt;&lt;br /&gt;5.  The package aplpack is installed but not yet loaded into R.  To load the package, you type&lt;br /&gt;&lt;br /&gt;library(aplpack)&lt;br /&gt;&lt;br /&gt;6.  To check to see if you have the aplpack commands, try constructing a stemplot on a random sample of values from a standard normal distribution:&lt;br /&gt;&lt;br /&gt;stem.leaf(rnorm(50))&lt;br /&gt;&lt;br /&gt;If you get some display, you're set.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold; color: rgb(255, 0, 0);"&gt;Installing the class package LearnEDA.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;This is a package that I wrote that has some special functions and all the datasets we'll use.  This is not available on CRAN yet, since it is still in development.  But it is available from my website.&lt;br /&gt;&lt;br /&gt;1.  Go to the class web folder http://bayes.bgsu.edu/EDA/ and find the zip file&lt;br /&gt;&lt;br /&gt;&lt;a href="http://bayes.bgsu.edu/EDA/LearnEDA_1.0.zip"&gt;LearnEDA_1.0.zip&lt;/a&gt;      &lt;br /&gt;&lt;br /&gt;Download this file to the Windows desktop (or some other convenient place).&lt;br /&gt;&lt;br /&gt;2.  
To install this package, select Packages -&gt; Install Package(s) from local zip files&lt;br /&gt;&lt;br /&gt;3.  Select the LearnEDA_1.0.zip file.  In a short time, R will install this package.&lt;br /&gt;&lt;br /&gt;4.  To see if you have successfully installed this package, load it and check to see if you can read in one of the class datasets.&lt;br /&gt;&lt;br /&gt;library(LearnEDA)&lt;br /&gt;data(football)&lt;br /&gt;football&lt;br /&gt;&lt;br /&gt;If you see a lot of football scores, then you have succeeded in loading this package.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4375349027250083196-8807988783940213878?l=exploratorydataanalysis.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://exploratorydataanalysis.blogspot.com/feeds/8807988783940213878/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4375349027250083196&amp;postID=8807988783940213878' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4375349027250083196/posts/default/8807988783940213878'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4375349027250083196/posts/default/8807988783940213878'/><link rel='alternate' type='text/html' href='http://exploratorydataanalysis.blogspot.com/2008/08/installing-r-packages.html' title='Installing R Packages'/><author><name>Jim Albert</name><uri>http://www.blogger.com/profile/12622333572321654094</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4375349027250083196.post-6038805502813210429</id><published>2008-08-22T08:09:00.000-07:00</published><updated>2008-08-22T08:22:41.570-07:00</updated><title type='text'>Installing R</title><content type='html'>Welcome to the EDA blog.  I thought this would be a useful way of communicating and giving you advice that will help you succeed in this class.&lt;br /&gt;&lt;br /&gt;My advice will apply to a Windows user, which I think will apply to most of you.  R also works fine on a Macintosh, but the precise instructions are a little different.&lt;br /&gt;&lt;br /&gt;1.  Go to the Comprehensive R Archive Network (CRAN) at http://cran.r-project.org/&lt;br /&gt;&lt;br /&gt;2.  First choose a mirror site (click on Mirrors on the left) somewhere in the United States -- this will shorten your download time.&lt;br /&gt;&lt;br /&gt;3.  Click on Windows in the Download and Install R section.  Click on base and then click on the R-2.7.1-win32.exe link.&lt;br /&gt;&lt;br /&gt;4.  When this is downloaded on your machine, click on the R-2.7.1-win32.exe file.  At this point, you'll just follow the instructions and the most recent version of R (2.7.1) will be installed.&lt;br /&gt;&lt;br /&gt;5.  
To launch R, just double-click on the R icon on your desktop and you'll shortly see the following screen.&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/_V8g1rNtmHuM/SK7Ziv4jzyI/AAAAAAAAALQ/TcjdBmNpBuo/s1600-h/Rscreen1.jpg"&gt;&lt;img style="margin: 0pt 10px 10px 0pt; float: left; cursor: pointer;" src="http://2.bp.blogspot.com/_V8g1rNtmHuM/SK7Ziv4jzyI/AAAAAAAAALQ/TcjdBmNpBuo/s400/Rscreen1.jpg" alt="" id="BLOGGER_PHOTO_ID_5237362607670939426" border="0" /&gt;&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4375349027250083196-6038805502813210429?l=exploratorydataanalysis.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://exploratorydataanalysis.blogspot.com/feeds/6038805502813210429/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4375349027250083196&amp;postID=6038805502813210429' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4375349027250083196/posts/default/6038805502813210429'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4375349027250083196/posts/default/6038805502813210429'/><link rel='alternate' type='text/html' href='http://exploratorydataanalysis.blogspot.com/2008/08/installing-r-and-learneda-package.html' title='Installing R'/><author><name>Jim Albert</name><uri>http://www.blogger.com/profile/12622333572321654094</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://2.bp.blogspot.com/_V8g1rNtmHuM/SK7Ziv4jzyI/AAAAAAAAALQ/TcjdBmNpBuo/s72-c/Rscreen1.jpg' height='72' 
width='72'/><thr:total>0</thr:total></entry></feed>
