Pulse Pipeline Construction

Introduction

This document suggests modifications to the College Pulse archives and data collection practices. These changes will make future quantitative data analysis easier (and in some cases possible at all), and because they record data more efficiently, they will also reduce data storage requirements.

Record Numerical Answers

The primary issue facing the Pulse dataset is the use of raw text to store user responses. All predictive analysis and statistics are fundamentally mathematical, so any such application requires a numerical representation of student responses. It is far easier to build this into the app than to apply it retroactively to the archives. Thus, while the choices presented to a user can still display as text:

Very Likely
Somewhat Likely
Somewhat Unlikely
Very Unlikely

The backend of the app should record a number: 1, 2, 3, or 4. This should be simple to implement for future questions, and will not affect user experience at all.
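As a minimal sketch of this idea, the snippet below keeps the displayed text and the stored number in one lookup table. The names LIKELIHOOD_CODES, encode_response, and decode_response are hypothetical and not part of the existing Pulse code; the actual backend may organize this differently.

```python
# Hypothetical lookup table: text shown to the user -> number stored in the backend.
LIKELIHOOD_CODES = {
    "Very Likely": 1,
    "Somewhat Likely": 2,
    "Somewhat Unlikely": 3,
    "Very Unlikely": 4,
}


def encode_response(choice_text, codes=LIKELIHOOD_CODES):
    """Return the integer code to store for a displayed choice."""
    return codes[choice_text]


def decode_response(code, codes=LIKELIHOOD_CODES):
    """Return the displayed text for a stored integer code."""
    inverse = {value: text for text, value in codes.items()}
    return inverse[code]


stored = encode_response("Somewhat Likely")   # the backend records 2
shown = decode_response(stored)               # the app can still display the text
```

The user only ever sees the text; the number exists purely on the storage side.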

Separate Ordinal and Categorical Questions

When answers are recorded numerically, attention needs to be paid to whether a question's choices are ordinal or categorical. An ordinal list is one like the above, or like:
$0 -- $20,000
$20,000 -- $50,000
$50,000 -- $100,000
$100,000+

where the answers clearly lie along a conceptual sequence. When options are ordinal, they must be numerically encoded in order if predictive algorithms are to make sensible predictions. "$0 -- $20,000" must be encoded as a 1, "$20,000 -- $50,000" must be encoded as a 2, etc.

Categorical lists do not form an intrinsic conceptual sequence, so the precise numerical representation is arbitrary. It does not matter which item in the list below is 1, which is 2, etc.

Cherry Pie
Apple Pie
Blueberry Pie
Key Lime Pie

While constructing a numerical encoding scheme for user responses, it is vital to choose a consistent, global representation for missing data. I recommend using "0" to represent missing data and using 1 through 4 (or whatever highest value is appropriate) for actual answers.
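The encoding rules in this section can be sketched as follows. This is an illustration under stated assumptions, not the existing pipeline: the names MISSING, build_codes, and encode are hypothetical, and a skipped question is assumed to arrive as None.

```python
MISSING = 0  # single global code for "no answer", per the recommendation above

# Ordinal question: list order is meaningful and must be preserved.
INCOME_BRACKETS = [
    "$0 -- $20,000",
    "$20,000 -- $50,000",
    "$50,000 -- $100,000",
    "$100,000+",
]

# Categorical question: any order works, as long as it never changes.
PIE_FLAVORS = ["Cherry Pie", "Apple Pie", "Blueberry Pie", "Key Lime Pie"]


def build_codes(choices):
    """Assign codes 1..N in list order, leaving 0 free for missing data."""
    return {choice: index for index, choice in enumerate(choices, start=1)}


def encode(answer, codes):
    """Return the stored code for an answer, or MISSING if the user skipped it."""
    if answer is None:  # assumption: a skipped question is represented as None
        return MISSING
    return codes[answer]


income_codes = build_codes(INCOME_BRACKETS)
pie_codes = build_codes(PIE_FLAVORS)

low_income = encode("$0 -- $20,000", income_codes)
skipped = encode(None, pie_codes)
```

For the ordinal list, the order of INCOME_BRACKETS carries the meaning, so it has to be written out by a human. For the categorical list, the only requirement is that the same mapping is used everywhere.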

Distinguishing between categorical and ordinal questions is trivial while implementing new questions, but will be very labor-intensive to apply to the archives. Unfortunately, it must be done. A human can easily see that the first two lists in this document form a natural sequence, but a machine cannot easily determine that sequence. There are approximately 2,000 questions in the current Pulse dataset that are either ordinal or categorical, and they must all be identified and correctly encoded before a statistical prediction algorithm can be deployed. This is the single largest obstacle to predicting user responses using the existing dataset.

Binary Summations for "all that apply" Questions

Finally, many questions in the current dataset allow users to select "all that apply." For example, question ID 5b2165535fc890000376de3e (Have you ever skipped class for any of the following reasons?) allows a user to select any combination of 13 choices. Since each choice is independently either selected or not, this yields 2^13 = 8192 possible unique answers. I recommend these questions be stored using a binary flag system.

Each possible option is assigned a value which is a power of 2, and the answer recorded for a given user is the sum of all their choices. A simple example would be:

Option	Number	Binary
Apple Pie	1 (2^0)	0001
Cherry Pie	2 (2^1)	0010
Blueberry Pie	4 (2^2)	0100
Key Lime Pie	8 (2^3)	1000

If a user selects cherry pie and key lime pie, the sum of their answers is 2 + 8 = 10, which is recorded in the database. Looking at this number in binary (1010), it is trivial to see that they selected the 4th and 2nd options. Using powers of 2 to represent each choice guarantees that every possible combination has a unique sum and binary representation. In addition to being numerically convenient, this also requires considerably less data storage, as each user requires only a single integer instead of a string which could be dozens of characters long.
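The flag scheme above can be sketched in a few lines. The function names encode_selection and decode_selection are hypothetical; the option order is taken from the table above, with the first option assigned 2^0.

```python
# Option order from the table above: position i is worth 2**i.
PIE_OPTIONS = ["Apple Pie", "Cherry Pie", "Blueberry Pie", "Key Lime Pie"]


def encode_selection(selected, options):
    """Combine the selected options into a single integer of bit flags."""
    total = 0
    for choice in selected:
        # Bitwise OR equals the sum described above, and is safe if a
        # choice is accidentally listed twice.
        total |= 1 << options.index(choice)
    return total


def decode_selection(value, options):
    """Recover the list of selected options from the stored integer."""
    return [
        option
        for position, option in enumerate(options)
        if value & (1 << position)
    ]


answer = encode_selection(["Cherry Pie", "Key Lime Pie"], PIE_OPTIONS)
bits = format(answer, "04b")            # binary string, 4 digits wide
choices = decode_selection(answer, PIE_OPTIONS)
```

Here answer is 10, bits is "1010", and decoding returns the original two choices. An empty selection encodes to 0, so a question that the user never saw would need a separate marker if the two cases must be told apart.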

Accurate metadata needs to be recorded about whether a question is "choose all that apply" or "select only one from this list." These two types must be processed differently, and thus must be identified correctly. Currently, they are both classified as "mc." The pipeline provided attempts to separate them by comparing the number of answer "tokens" to the number of unique answers provided by all users. The assumption is that with a few hundred or thousand responses, users will select all available choices from a "restricted options" list, and thus the number of unique answers will be equal to the total number of tokens. It is possible that there are questions in the database that have

Thursday, June 28, 2018