Randomly Select Rows of Data from a List
I was recently asked to provide a script that could randomly select rows of data from a list. The
script needed to select a row as many times as the user specified, making it possible, for
instance, to randomly select 40 names out of a list of several hundred.
The original use of this script was to be able to randomly select a sample portion of usernames
from a database for emailing in a survey...
My solution was to write this PowerShell script.
The script expects to find a text file called "Possible.txt" in the same directory as the script.
It is this text file that contains the lines of data from which the script will select at random.
The script is run with one argument: the number of lines to be randomly selected. Once run,
the script creates a second file called "Selection.txt" which will contain the selected lines of
data in the order they were selected. (There are 3 lines of code at the end of the script that can
be "uncommented" to effect a sort of the output if desired.)
The script assigns the lines of data from the input file into a hashtable using numbers for the
index values. By using PowerShell's random number generator to select a number in the range of the
index values in the hashtable, the script extracts the value that is indexed by the random number.
The process is repeated until all the required lines have been selected.
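The approach described above can be sketched roughly as follows. This is a minimal illustration, not the original script: the function name Select-RandomLines and its parameters are hypothetical, System.Random stands in for "PowerShell's random number generator", and the sketch assumes that a line should not be selected twice (the description above does not say either way).

```powershell
# Hypothetical sketch of hashtable-based random selection.
# Assumption: each line may be selected at most once.
function Select-RandomLines {
    param(
        [string[]]$Lines,   # lines read from the input file
        [int]$Count         # number of lines to select
    )

    if ($Count -gt $Lines.Count) {
        throw "Cannot select $Count lines from a list of $($Lines.Count)."
    }

    # Load the lines into a hashtable keyed 1..N.
    $table = @{}
    for ($i = 0; $i -lt $Lines.Count; $i++) {
        $table[$i + 1] = $Lines[$i]
    }

    $rng = New-Object System.Random
    $remaining = $table.Count
    $selected = @()

    for ($n = 0; $n -lt $Count; $n++) {
        # Random.Next(min, max) excludes max, so add 1 to include the last key.
        $key = $rng.Next(1, $remaining + 1)
        $selected += $table[$key]

        # Move the last entry into the vacated slot so the keys stay
        # contiguous and the chosen line cannot be picked again.
        $table[$key] = $table[$remaining]
        $table.Remove($remaining)
        $remaining--
    }

    return $selected
}
```

Under the same assumptions, it could be used along these lines: $lines = Get-Content .\Possible.txt, followed by Select-RandomLines -Lines $lines -Count 40 | Set-Content .\Selection.txt. Keeping the keys contiguous means every random number maps to a line that is still available, so no draw is wasted on an already selected entry.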
Used with large lists of data, this process can take some time to run, so I developed a few
improvements to track progress and optimise performance (details below). Some example run times:
randomly extracting 20% of the lines (that is, 634 lines) from a list of 3170 lines took just
45 seconds; with a larger sample list, extracting 2213 lines from a list of 11068 lines took
13 minutes 16 seconds. Speed here is of course relative, as the amount of data on each line
affects the time to write it to disk, and the power of the workstation also plays a part. I used a
P4 3.2GHz with 1GB of RAM to run these tests. Others may well do much better...
Improvements
27/07/2007 Inserted a marker to display progress on the console. The display shows the total number
of lines processed, a time stamp, and a progress value expressed as a percentage.
28/07/2007 Adjusted processing to use hashtables instead of writing to a temporary file
- decreasing the processing time significantly.
30/07/2007 Adjusted processing to use a single hashtable - decreasing the processing time further,
beyond the figures quoted above!
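As an illustration of the progress marker from the 27/07/2007 entry, a line showing the number of lines processed, a time stamp and a percentage might be built something like this. The function name Format-Progress, its parameters and the exact wording of the message are hypothetical; the original script's output format is not shown here.

```powershell
# Hypothetical sketch of a console progress marker.
function Format-Progress {
    param(
        [int]$Done,    # lines processed so far
        [int]$Total    # lines to be processed in all
    )

    # Percentage complete, rounded to one decimal place.
    $percent = [math]::Round([double](100 * $Done / $Total), 1)

    # Current time of day as the time stamp.
    $stamp = Get-Date -Format 'HH:mm:ss'

    return "$Done of $Total lines processed at $stamp ($percent%)"
}
```

Inside the selection loop, this could be called every so often with something like Write-Host (Format-Progress -Done $n -Total $Count), so that the console output does not itself slow the run down.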