Shuffling methodology for sanitizing Afghanistan TCAPF microdata: a working paper

Back in February 2010 I started a working paper titled "Shuffling_Methodology_for_Sanitizing_TCAPF_Microdata" (click to download as PDF), which outlines the methodology I used to sanitize TCAPF data.  The approach I discuss applies to cases where it is desired to share unclassified data while still protecting the privacy (and operational security) concerns inherent in that data.

The data was shared with us by USAID. Although unclassified, it carried distribution restrictions because of the sensitive nature of the data, which was collected by the 24th MEU and other units in Afghanistan.  We felt compelled to publish the results of a Bayesian analysis we performed on the data, and thought it best to sanitize the data first and then publish the results obtained from the cleansed data.  To do so, we had to maintain the analytical value of the data by preserving the distributional properties of the dataset, so that the results obtained would remain valid.  We had to balance this need to preserve analytical value against the privacy need to withhold or obfuscate data fields deemed too sensitive to disclose.
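
To make the trade-off concrete, here is a minimal Python sketch of one generic form of shuffling: permuting the values of a single sensitive column across records. It is an illustration only and not necessarily the procedure described in the paper. The function name `shuffle_column`, the field names `district` and `top_concern`, and the records themselves are all made-up placeholders, not TCAPF data.

```python
import random
from collections import Counter

def shuffle_column(rows, column, seed=None):
    """Return copies of rows with the values of one column permuted.

    Hypothetical helper for illustration. The column keeps exactly the
    same set of values (so its marginal distribution is unchanged), but
    each value is detached from the record it originally belonged to.
    """
    rng = random.Random(seed)
    values = [row[column] for row in rows]
    rng.shuffle(values)  # in-place permutation of the extracted values
    return [{**row, column: value} for row, value in zip(rows, values)]

# Made-up placeholder records; the field names are assumptions.
records = [
    {"district": "D1", "top_concern": "water"},
    {"district": "D1", "top_concern": "security"},
    {"district": "D2", "top_concern": "water"},
    {"district": "D3", "top_concern": "roads"},
]

sanitized = shuffle_column(records, "district", seed=0)

# The shuffled column has the same value counts as before.
assert Counter(r["district"] for r in sanitized) == Counter(
    r["district"] for r in records
)
```

Note what this does and does not preserve: the counts within the shuffled column survive, but the relationship between that column and every other column is deliberately broken, so any analysis that depends on that relationship would no longer be valid on the shuffled data.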

The part of the paper where I walk through what could go wrong should, at the very least, get you thinking.  I welcome your feedback and ideas in the comments below.