Notes on Data Analysis
From StatsJam
In doing the initial data analyses, we've found the following rules of thumb useful to keep in mind:
- Build complex queries step by step, showing intermediate results. Use "LIMIT" if necessary to limit the amount of data returned.
- Try to show one or more graphs of the data. Scatter plots (e.g., plot_date) and histograms (hist, bihist) are really useful tools to help you understand the data. In most cases, these types of plots should accompany summary statistics, such as mean/median.
- Some of your queries may time out. The query may genuinely take a long time to run, or it may contain a poorly formed join (for example, one that is missing a join condition).
- Define the concepts you are attempting to explain. For example, what does the concept "most frequently used command" mean? Is it the most frequently logged command? The command used by the most people? The command that is used heavily by the people who use it at all? Be specific about how you are attempting to summarize the data.
- Use an external editor to format your wiki text and queries. Both are unwieldy to edit in browser text editors.
- Watch your use of "median". Median is supplied by R, not the DB, and can be really, really slow for thousands of data points.
- Include an "n" (number of data points) and standard deviation for any analysis that calculates a mean or median. These are good form and also good sanity checks to make sure you got the query right.
- A discussion section is a useful place to cover the implications of the data.
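
To illustrate the points above about building queries step by step and defining concepts precisely, here is a minimal sketch in Python using the standard sqlite3 module. The command_log table, its columns, and the sample rows are all hypothetical; they are not the actual StatsJam schema or data. The sketch first inspects an intermediate result with LIMIT, then answers "most frequently used command" under three different definitions.

```python
import sqlite3

# Hypothetical log: one row per command invocation (user_id, command).
ROWS = (
    [(u, "save") for u in "abcd"]                    # 4 users, once each
    + [(u, "undo") for u in "ab" for _ in range(3)]  # 2 users, 3 times each
    + [("c", "brush")] * 5                           # 1 user, 5 times
)

# Intermediate step: how many times each user ran each command.
PER_USER = """
    SELECT command, user_id, COUNT(*) AS uses
    FROM command_log
    GROUP BY command, user_id
"""


def build_db():
    """Create an in-memory database holding the hypothetical log."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE command_log (user_id TEXT, command TEXT)")
    conn.executemany("INSERT INTO command_log VALUES (?, ?)", ROWS)
    return conn


def peek_per_user(conn, limit=5):
    """Show a few rows of the intermediate result before building on it."""
    sql = PER_USER + " ORDER BY command, user_id LIMIT ?"
    return conn.execute(sql, (limit,)).fetchall()


def most_logged(conn):
    """Definition 1: the command with the most log rows."""
    return conn.execute(
        "SELECT command, COUNT(*) AS n FROM command_log "
        "GROUP BY command ORDER BY n DESC LIMIT 1"
    ).fetchone()


def most_users(conn):
    """Definition 2: the command used by the most distinct users."""
    return conn.execute(
        "SELECT command, COUNT(DISTINCT user_id) AS n_users FROM command_log "
        "GROUP BY command ORDER BY n_users DESC LIMIT 1"
    ).fetchone()


def heaviest_per_user(conn):
    """Definition 3: the command with the highest mean uses per user."""
    return conn.execute(
        "SELECT command, AVG(uses) AS avg_uses FROM (" + PER_USER + ") "
        "GROUP BY command ORDER BY avg_uses DESC LIMIT 1"
    ).fetchone()


if __name__ == "__main__":
    conn = build_db()
    print(peek_per_user(conn))
    print(most_logged(conn), most_users(conn), heaviest_per_user(conn))
```

With this made-up data the three definitions pick three different commands (undo, save, and brush respectively), which is why the definition should be stated alongside the result.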
Mterry 23:21, 9 May 2008 (UTC)
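
The advice above about reporting "n" and standard deviation next to a mean or median could look something like the following sketch. It uses Python's standard statistics module rather than R or the DB, and the summarize helper and the sample counts are hypothetical, chosen only to show how one outlier separates the mean from the median.

```python
import statistics


def summarize(values):
    """Return n, mean, median, and sample standard deviation for a list of numbers."""
    if len(values) < 2:
        raise ValueError("need at least two data points for a standard deviation")
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),  # sample (n - 1) standard deviation
    }


if __name__ == "__main__":
    # Hypothetical per-user command counts; one heavy user skews the mean.
    counts = [1, 2, 3, 4, 100]
    for name, value in summarize(counts).items():
        print(name, value)
```

Here the mean (22) sits far from the median (3) and the standard deviation is large, which is a hint to look at a histogram before trusting either summary. An unexpected "n" is likewise a quick signal that the underlying query returned the wrong set of rows.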

