Analyzing Datagame data in R
- September 12, 2016
Analyzing Datagame data in R
Digging in to the data after a study is vital to uncovering useful insights about your data. Sometimes, the default views provided in the HTML report are not quite enough and it is necessary to dig deeper. Datagame allows you to export your data to an Excel file. But sometimes, not even Excel is enough. You want to transform your data in R.
The data export file is easy to prepare and import into R for deeper analysis. Here is how to prepare and read Datagame data in R, and a few basic things you can do once you have your data in R.
Preparing the CSV File
Before you read an Excel data file into R, the file needs to be formatted properly. It should be clean, with the same number of columns in each row. The first row should be a header row with column names, and each row should be complete.
For a Datagame export file, the data of interest is found on the worksheet “RAW.” Also, CSV files do not support multiple worksheets, so it is necessary to copy the RAW worksheet to a separate self-contained file.
To prepare this data to be read in R:
- In Datagame, export the data file for the desired project. See the Datagame Knowledge Base.
- Open the exported data file in Microsoft Excel, OpenOffice, or other spreadsheet application that supports XLSX files.
- Right-click on the RAW worksheet tab and select Move or Copy.
- In the Move or Copy dialog, select Create a copy to copy the worksheet.
- In To book, select (new book) to copy the worksheet to a new workbook.
- Delete Rows 1-6 of the RAW worksheet. This is extra information about the project that is not necessary for performing data analysis in R. For a MaxDiff data file, also delete Row 8. This row represents the aggregate total for all responses, and is not needed for analysis in R.
- If the Respondent ID column (Column C) is empty, delete it. Empty fields or missing values can sometimes create errors when importing into R.
- Save the modified worksheet as a CSV file in your R working directory.
Reading the CSV File in R
Now that the Datagame raw data is in an easily consumable CSV file, you can read it in R as a data frame using the read.csv command.
To read Datagame raw data in R:
- Enter MyDG <- read.csv(“filename.csv”, header=TRUE).
In this example, MyDG is the name given to the data frame created by reading the file, and filename.csv is the filename of the file saved in the R working directory. If the file is not in the working directory, you would need to enter the full path and file name for the data file. This reads the CSV file as a data frame with names in the header row.
Referencing Data in R
To reference your data in R, use the data frame name you assigned it when reading the file (in this example, MyDG). You can refer to parts of the file using basic R syntax.
- To reference a column by name, use MyDG$ColumnName. For example, MyDG$PlayerID returns a list of all of the player IDs in the data frame. You can also refer to a column by its index number (leaving the row index empty). For example, MyDG[,4] will return the value from the fourth column of every row.
- You can also reference a data frame by row, using the row index and leaving the column index empty. For example, MyDG[7,] would return the values from the 7th row for all columns, and includes the column names.
- When using indexes, a range of rows or columns can be returned by using the semicolon (:) character. For example, MyDG[1:10,] would return the first 10 rows of the data frame.
- You can refer to a specific data point using its row and column index. For example, MyDG[7,4] would return the value of the 4th column in the 7th row.
Using these references and combining them with commands in R, you can perform nearly any data analytics you desire. Perform basic computations such as the sums, means, or counts for a column. Following are a few examples of things you can do with your Datagame data once it is in a data frame in R.
For these example, we will use data from the Datagame Beer Tasting Room at this past June’s IIeX conference. You can see that, overall, the 420 Extra Pale Ale and Laughing Skull Amber Ale were the most liked. But what else can you see from a deeper dive look at the data in R?
After preparing the data file and reading it into R as a data frame named iiex, we are ready to do some manipulations to the data. Here is what the first five rows of the data frame look like:
The columns X154-X157 are the rank order data for the four attributes, while X154.1-X157.1 are the rating scale data.
Rank order frequency
Suppose you want a count of how many times each unique rank order was selected. What was the most common rank order for all four attributes? R can quickly count the frequency of each unique rank order using the count command. By specifying multiple columns, it counts the frequency of each combination.
count(iiex, c("X154", "X155", "X156", "X157"))
This produces a listing of each combination of rank orders and the frequency count for each combination:
X154 X155 X156 X157 freq 1 1 2 3 4 5 2 1 2 4 3 2 3 1 3 2 4 1 4 1 3 4 2 6 5 1 4 2 3 4 6 1 4 3 2 5 7 2 1 3 4 5 8 2 1 4 3 3 9 2 3 1 4 2 10 2 3 4 1 8 11 2 4 1 3 2 12 2 4 3 1 5 13 3 1 2 4 1 14 3 1 4 2 5 15 3 2 1 4 1 16 3 2 4 1 7 17 3 4 1 2 4 18 3 4 2 1 4 19 4 1 2 3 16 20 4 1 3 2 5 21 4 2 1 3 1 22 4 2 3 1 5 23 4 3 1 2 7 24 4 3 2 1 8
From this, you can quickly see that the most common rank order was combination 19, which was picked 16 times. You could use this data to find segments of respondents that ranked items similarly to size clusters, or identify trends in the rankings.
Stat testing with R
As a data analytics package, naturally R has several functions for various statistical testing, such as calculating standard deviations, standard error, and confidence intervals.
For example, you can use the t.test function to calculate a 95% confidence interval for an attribute. It automatically calculates t, degrees of freedom, p-value, the 95% confidence interval, and the mean value for the specified variable.
In our iiex data, we can calculate the 95% confidence interval for the rating scale score of an attribute as follows:
One Sample t-test data: MyDG$X155.1 t = 37.069, df = 111, p-value < 2.2e-16 alternative hypothesis: true mean is not equal to 0 95 percent confidence interval: 3.329805 3.705909 sample estimates: mean of x 3.517857
Here, you can see that the mean rating scale score for this attribute was 3.5, and the 95% confidence interval puts the mean somewhere between 3.3 and 3.7.
You can also specify two attributes to perform either a paired or unpaired t-test. These test the difference between the means and give a 95% confidence interval for the mean difference. If the confidence interval spans 0 (i.e., the lower end is negative and the higher end is positive), the difference in means is not significant.
For this data, since we are testing rating scale data from the same sample sets, we want to run a paired t-test:
> t.test(iiex$X155.1, iiex$X156.1, paired=TRUE) Paired t-test data: iiex$X155.1 and iiex$X156.1 t = 1.4088, df = 111, p-value = 0.1617 alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: -0.07985641 0.47271355 sample estimates: mean of the differences 0.1964286
The high p-value and confidence interval range that includes 0 (-0.08 to 0.47) would indicate that the difference in the means between these two attributes is not significant.
Datagame and R
Datagame produces data sets that are fully compatible with analytics tools such as R. Export your data and use it in R to discover new insights. Sign up for a free account and start collecting data today.