Date:       Wed, 12 Oct 94 13:05:59 EST
Errors-To:  Comp-privacy Error Handler <owner-comp-privacy@uwm.edu>
From:       Computer Privacy Digest Moderator  <comp-privacy@uwm.edu>
To:         Comp-privacy@uwm.edu
Subject:    Computer Privacy Digest V5#047

Computer Privacy Digest Wed, 12 Oct 94              Volume 5 : Issue: 047

Today's Topics:			       Moderator: Leonard P. Levine

              Responses to Medical Data Security Questions

----------------------------------------------------------------------

From: Richard Goldstein <richgold@netcom.com>
Date: Wed, 12 Oct 1994 09:13:52 -0700 (PDT)
Subject: Responses to Medical Data Security Questions

Following is a summary of the responses I received to my request for help on
the issue of allowing an outside University-based computer research group
access to an HMO's medical record; responses were received from one person who
wished me not to post their name and from Tom Lincoln, Vicki Rosenzweig,
Rogier Wolff, Richard Threadgill, Matthew Elvey, Bill Ellett, Jeff Hupp, Mark
Durst, Carl Ellison, David Stodolsky, Grant Grundler, Peter Sherwood, and
David Harvey.  My sincerest thanks to all for your responses, which were very
helpful.  In a few cases, I have added a comment of my own to the summary;
these added comments appear inside curly brackets ("{ }").

General responses not necessarily tied to any particular part of my original
note:

1. Systems operators of the computer system will almost surely have access to
   any information contained on the system.  This group probably includes
   student operators.  You might want to express concern to the university in
   how they protect your data from these employees.

   It is, however, likely that operators will have no interest in your data,
   and may well be unaware of exactly what it is.

2. You probably want an advance list and complete veto power over exactly who
   the "researchers designated as members of the [HMO] project" are.  Also,
   you may want a written statement from each of them stating that they are
   aware that the records may contain confidential information, and that they
   will respect that confidentiality.

   Also, I assume that the university researchers will include faculty.  This
   probably means that they will consider their student assistants to be
   extensions of themselves, and that they will give complete access to their
   records (i.e. your records) to those assistants.  They will have faith in
   those students' confidentiality, but do you?  {It turns out that there are
   faculty on the University team so this was very helpful.}

   You may also want to control how the information is used by professors in
   classes or papers.  You may also want to limit whether they can say that
   the data on which they based their study came from your HMO, by name (or
   maybe require that they give you credit).

3. You will want to know what computers have access to your data.  Is the data
   put onto disks on the main university computer system, or is it put onto
   PC's in professors' offices?

   If the former, then university hackers may gain access.  If the latter, then
   professors may not safeguard access as thoroughly as you might wish.

4. When and for how long will your data be mounted on the system?  Frequently,
   data may be loaded onto a system, and at the conclusion of the project,
   unless it is a large file, may be "forgotten".

   When will it be removed?  Once it has been removed, who will have access to
   the tapes?  Will the university retain a copy of the tapes, or will all
   copies be either returned to you or destroyed?

5. You say that the university will be provided with sample records where the
   records have been sanitized.

   Who is doing the "sanitizing"?  The HMO staff?  The university researchers
   themselves, or others?  {Currently the HMO staff is doing the
   "sanitizing".}

   I couldn't tell if they want your records or if you want their help.  If
   you are doing them the favor, then you might suggest that the university
   provide programs to be run on YOUR computer which will sanitize the records
   before they leave your control.  Otherwise, you want to be very clear on
   exactly who has access to the records before they are sanitized, and on
   exactly what happens to ALL backup tapes and other electronic copies, and
   all paper copies of the original records.  The tapes should be specifically
   erased, not simply re-used.  {Apologies for the unclarity--the HMO wants
   the University's help, but also the University wants our records.}

   If done at the university, you may want to require that the sanitizing not
   be done on the university's main computer system, but instead on a
   stand-alone system.  You are obviously very concerned about any access to
   the original records.

   Also, if you are not doing the sanitizing by HMO staff, you may want to
   require that a random sample of, say, 100 "sanitized" records, be provided
   to you in paper so that you can review whether the sanitizing actually was
   complete.  You may also want to require that the records not be loaded onto
   the system for the researchers until you have had the opportunity to
   review the sample, and until you have approved that the sanitizing was
   complete.

   Also require that if at any point the HMO comes to believe that the records
   are not sufficiently sanitized, for any reason, then the university will
   "immediately" remove the data from the system and will stop all access
   until you again agree that it has been sufficiently cleaned up.
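
The random-sample review suggested in item 5 could be drawn along these lines.
This is only a sketch in Python: the function name `draw_review_sample`, the
seed value, and the placeholder records are illustrative assumptions, not
anything taken from the HMO or university systems.

```python
import random

def draw_review_sample(records, sample_size=100, seed=None):
    """Return a simple random sample of records for manual review."""
    rng = random.Random(seed)
    # If fewer records exist than requested, review all of them.
    if len(records) <= sample_size:
        return list(records)
    return rng.sample(records, sample_size)

# Placeholder data standing in for the sanitized extract.
sanitized = [f"record {i}" for i in range(5000)]
review_batch = draw_review_sample(sanitized, sample_size=100, seed=1994)
print(len(review_batch))
```

Recording the seed lets the HMO reproduce exactly which records were drawn if
the review is later questioned.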

6. As you say, you may have problems in names which appear in the text
   portions of the records.  One manner to search for these would be to
      a. Take each word in the patient's name.
      b. Look throughout the text for any occurrence of this text.
      c. Remove the matching text, perhaps replacing it with "XXXXX".

   A problem is that you may remove some non-name text.  For example, if the
   name is "Rich Goldstein", then you'd remove "rich" from a sentence like
   "rich and rosy cheeks".  But you certainly don't want to manually review
   every record, unless they are very few in number. {They are potentially
   voluminous--certainly not "very few in number".}
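
Steps a through c in item 6 can be sketched as follows.  This is a minimal
Python illustration, not the HMO's actual masking program; the function name
`mask_name` and the sample note are made up.  It matches whole words without
regard to case, so it reproduces the over-masking problem described above.

```python
import re

def mask_name(text, full_name, mask="XXXXX"):
    """Replace every whole-word, case-insensitive occurrence of each
    word in full_name with the mask string."""
    for part in full_name.split():
        pattern = r"\b" + re.escape(part) + r"\b"
        text = re.sub(pattern, mask, text, flags=re.IGNORECASE)
    return text

note = "Rich Goldstein seen today. Patient has rich and rosy cheeks."
print(mask_name(note, "Rich Goldstein"))
# prints: XXXXX XXXXX seen today. Patient has XXXXX and rosy cheeks.
```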

7. Since they are looking to provide access, I assume that the data will be
   indexed.  To verify that the data has been sanitized, you might search for
   common names such as "smith" and "bill", and verify that any occurrences
   are appropriate.

   You may want to require that once they get some simple indexes built on the
   data, that you will be provided an opportunity to review the data again to
   ensure that it is sanitized.  At that point, you could do these searches.
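
Even without an index, the spot check in item 7 can be approximated by a plain
scan.  The sketch below is an assumption about how one might do it in Python;
`COMMON_NAMES` and `find_name_hits` are hypothetical names, and the two-entry
name list is only a starting point.  It prints each hit with some surrounding
text so a reviewer can judge whether the occurrence is appropriate.

```python
import re

COMMON_NAMES = ["smith", "bill"]  # illustrative list, extend as needed

def find_name_hits(records, names=COMMON_NAMES, context=20):
    """Yield (record_index, name, snippet) for each whole-word match."""
    alternatives = "|".join(re.escape(n) for n in names)
    pattern = re.compile(r"\b(" + alternatives + r")\b", re.IGNORECASE)
    for i, text in enumerate(records):
        for m in pattern.finditer(text):
            start = max(0, m.start() - context)
            end = min(len(text), m.end() + context)
            yield i, m.group(1).lower(), text[start:end]

sample = ["Sent bill to insurer.", "No names here.", "Referred by Dr. Smith."]
for index, name, snippet in find_name_hits(sample):
    print(index, name, repr(snippet))
```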

8. This is probably obvious, but:  ask them whether the system administrator
   for this computer is a member of the project.  Ask about system staff in
   general--they are likely to have access to all sorts of information.  Ask
   about backup procedures:  are tapes left around in unlocked drawers, or
   taken offsite to unsecured locations?  (At my site, the offsite backup is
   my boss's living room, but we have no secure data.)  Is the system
   connected to any kind of network, and what operating system is it on?  For
   that matter, who designates members of the project?

9. One of the rules I'd propose would be to consider the data provided to the
   university as confidential, and under strict orders not to be proliferated,
   as if it were the real data.  Thus the rules for the researchers should be
   the same as for the doctors that need to work with the data.  Of course
   this is not completely feasible, but it should give a nice guideline to
   work towards....

10.  If you have time, I would recommend that rather than sanitizing some
     large number of records from your existing database, that you
     *manufacture* a set of plausible records entirely.  While the university
     group is likely to assume that all of these records pertain to real
     patients somewhere, that doesn't need to be the case.  I would
     particularly recommend this because even if the university's security
     precautions are strong (which I'm willing to accept), the researchers are
     almost certainly going to discuss amongst themselves any particularly
     entertaining cases they come upon in the course of developing and testing
     their technology.  I strongly doubt that any of your patients would like
     to become urban legends, even without their real names attached.

     In the (likely) event that you can't generate an entirely fictitious set
     of data for them to work with, I'd guess that you can probably place a
     high degree of faith in the automatic masking process, but I'd recommend
     that you only give them a large subset of your patient data.  Ideally,
     you'd add to the auto-mask stripping out entirely any records which
     contain the strings 'Mr.', 'Mrs.', or 'Ms.'  - I think that will
     dramatically reduce the number of real proper names which end up in the
     data set you hand them.
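
The extra filter proposed in item 10 is simple to express.  The following
Python sketch assumes each record is available as a single text string; the
name `drop_titled_records` is hypothetical.  The match is a case-sensitive
substring test, which is the literal reading of the suggestion.

```python
TITLES = ("Mr.", "Mrs.", "Ms.")

def drop_titled_records(records, titles=TITLES):
    """Return (kept_records, number_dropped), discarding any record whose
    text contains one of the title strings."""
    kept = [r for r in records if not any(t in r for t in titles)]
    return kept, len(records) - len(kept)

records = [
    "Pt c/o cough.",
    "Spoke with Mrs. XXXXX re: meds.",
    "Mr. XXXXX called.",
]
kept, dropped = drop_titled_records(records)
print(kept, dropped)
# prints: ['Pt c/o cough.'] 2
```

Reporting the number dropped shows how much of the data the filter costs.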

12.  If it's a networked unix computer, it's pretty much guaranteed insecure.

13.  I don't have a lot of original material to add, but your comments don't
     make it clear whether you have a copy of:  "Report on Statistical
     Disclosure Limitation Methodology", Subcommittee on Disclosure Limitation
     Methodology, Federal Committee on Statistical Methodology.  Statistical
     Policy Working Paper 22 of the Statistical Policy Office, Office of
     Information and Regulatory Affairs, Office of Management and Budget.
     NTIS Document Sales, PB94-165305.

     Much of the Census work on this question revolves on selecting summaries
     that are not too revealing, with only the summaries being released to the
     public.  By contrast, this report includes considerable material on
     disclosure risk in microdata.  In addition, an Appendix entitled
     "Research Agenda" should make it quite clear what is NOT known in this
     area.

     In your place I would inquire why the University must have genuine data
     records, even if masked.  In a similar case here at LBL we scrambled each
     record so as to have the right marginals, but so that no output record
     was precisely the set of attributes from an input record.  Is the search
     software under consideration so sophisticated that it could not be
     developed using such scrambled records?

     You are also right to look closely into the university computer's access
     control.  If they do not have an active program to deter unauthorized
     access (e.g., running "crack" programs against passwords, having
     automatic timeouts on user terminals) and to detect it (active monitoring
     by a human of system logs and accounting to spot unusual patterns,
     hopefully with pieces different from that supplied by the system
     manufacturer), they should not be trusted with any confidential data.
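
The response in item 13 does not say how the LBL scrambling was done.  One
simple way to keep the marginals while breaking up records, offered here only
as an assumption, is to shuffle each column independently and then count how
many output rows still coincide with an input row.  The names
`scramble_columns` and `count_intact` and the sample rows are invented.

```python
import random
from collections import Counter

def scramble_columns(rows, seed=None):
    """Shuffle each column independently across rows.  Every column keeps
    exactly the same multiset of values, so the marginals are unchanged."""
    rng = random.Random(seed)
    columns = [list(col) for col in zip(*rows)]
    for col in columns:
        rng.shuffle(col)
    return [tuple(r) for r in zip(*columns)]

def count_intact(original, scrambled):
    """Count scrambled rows that still match some original row exactly."""
    originals = set(map(tuple, original))
    return sum(1 for row in scrambled if row in originals)

rows = [(34, "F", "asthma"), (71, "M", "diabetes"),
        (52, "F", "fracture"), (29, "M", "migraine")]
out = scramble_columns(rows, seed=7)
same_marginals = all(Counter(a) == Counter(b)
                     for a, b in zip(zip(*rows), zip(*out)))
print(same_marginals, count_intact(rows, out))
```

Independent shuffling alone does not guarantee that no input record survives,
so the count should be checked and the shuffle repeated (or the surviving rows
dropped) until it reaches zero.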

14.  Is it an option to have the University ask you for statistical
     information and for you to sell them statistical results?  This sounds
     like a source of income for you and a masking like that used for the
     Census. {This comment, the only one in one particular response, raises
     more questions than it answers.}

15.  I will first address your questions on the providing of HMO data to the
     University *strictly* on a computer basis.  No matter how much they say
     the data is off limits from all but approved personnel, I just can't
     believe they have all of the necessary tools to restrict access.  A
     university by definition is an *open* institution.  Rather than moving
     the data off-site, I would be inclined to allow network hookup to
     specified machines with Kerberos encryption.  That way, if anything goes
     awry, you can shut down the machine which is providing the information at
     your site.  It goes without saying that almost any querying of the data
     on that machine *MUST* be logged.  Any attempts at illegal access to the
     data should be followed up by swift and appropriate legal action.

16.  As for the ability to combine the information to obtain a complete
     profile of the person, you have noted it is indeed possible.  No matter
     how much you automate the stripping process, I am inclined to believe the
     information can't always be totally removed.  Thus the confidentiality of
     the patient doctor privilege is compromised.  One rule I would have is
     that no patient's information should be made available without the
     consent of the patient.  By the way, I would *NOT* sign such a consent
     form!  But then I also have caller ID on my phone, and do other things to
     monitor when and how people can have access to me.

Responses tied to particular parts of the original posting:

>I am not aware of any literature dealing specifically with this question
>for medical records (except that I do have a copy of the 9/93 publication
>from the Office of Technology Assessment entitled _Protecting Privacy in
>Computerized Medical Information_; however, this is not a technical
>publication).

Your best and most recent overall review is "Health Data in the Information
Age:  Use, Disclosure and Privacy" by Molla S. Donaldson and Kathleen N. Lohr,
Editors, The Institute of Medicine, 1994, ISBN 0-309-04995-4.  It is an even
handed discussion, fully documented, with an extensive literature.

By and large, access to well secured individual records is gained by
confidence procedures from individuals who have a legitimate access to them,
generally over the phone.  The major financial gain to be derived is to
harvest mailing lists of individuals with a particular illness or a particular
anxiety.  (Imagine what Preparation H could do with a list of those with
active hemorrhoids!)  Inferential knowledge is clearly a major issue.  I have
written a number of papers on the subject and would be glad to send them to
you. {I sent my address and received a number of helpful papers.}

>1. automated masking of identifiers such as addresses and
>   telephone numbers in ... extract headers as created [at the
>   HMO]
>2. automated masking of medical record numbers
>3. automated masking of each segment of each member's name
>   everywhere these segments occur in the ... extract"

Don't count on this.  Spelling and keying errors alone will leave tracks that
can be followed very easily.  One thing you might consider is the use of a
spell checker, but the overhead this will add is almost as heavy as having
some human read and edit all the records.
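
A cheaper alternative to a full spell checker is to look for words that are
merely close to a name segment.  The sketch below uses `get_close_matches`
from Python's standard `difflib` module; the function name `find_near_names`,
the 0.8 cutoff, and the sample text are assumptions chosen for illustration.
Hits would still need a human decision.

```python
import difflib
import re

def find_near_names(text, name_parts, cutoff=0.8):
    """Return (token, name_part) pairs for words that resemble a name part,
    including misspellings an exact-match mask would miss."""
    targets = [p.lower() for p in name_parts]
    hits = []
    for token in re.findall(r"[A-Za-z']+", text):
        match = difflib.get_close_matches(token.lower(), targets,
                                          n=1, cutoff=cutoff)
        if match:
            hits.append((token, match[0]))
    return hits

print(find_near_names("Pt Goldstien reports knee pain", ["Rich", "Goldstein"]))
# prints: [('Goldstien', 'goldstein')]
```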

>
>There are some known problems with this masking (e.g., regarding the
>occurrence of names in the record other than that of the particular
>patient).  My problem is that I have no idea how much faith, trust,
>etc. to put into the "automated masking" process.  Of particular help
>would be guidance on what questions to ask about this process to help
>make decisions about whether it is sufficient (guidance on literature
>would also be appreciated).

Have them run the production programs on a statistically significant portion
of the data and review the results by hand.

>
>Another question relates to what we should be asking about the security
>of the university computer; we have been told that the center "has
>implemented data access security by granting electronic access to [HMO]
>data only to researchers designated as members of the [HMO] project."
>However, we have been provided with NO details; again, what questions
>should we be asking and how do we interpret the responses.
>

Universities have both advantages and disadvantages when it comes to security.
Lots of people trying to break in, but that also trains the sysadmins on what
the problems are.

>I should mention that our committee very strongly opposes any movement
>of hmo data outside the hmo, but in rare circumstances we have agreed
>when we were satisfied with the security situation (usually a
>stand-alone computer in a room that could easily be locked).
>

I have to agree here.  That would be the only 'real' security you could count
on.  Be sure that there isn't an internet or dial up connection to that
machine.  Might as well have it in a public area if you do.

> especially with respect to Census information.  However, I am not familiar
> with recent literature on this question or with computer algorithms; further,
> I am not aware of any literature dealing specifically with this question for
> medical records (except that I do have a copy of the 9/93 publication from the
>

Fellegi and Sunter.  A theory for record linkage, JASA 64, 1183-1210, 1969.

Jaro.  Advances in Record-Linkage Methodology...  JASA 84, 414-420, 1989.

| The process includes providing the university with example records (size of
| sample not known), where the records have been 'sanitized'.  "The sanitization
| process has three stages:
|
| 1. automated masking of identifiers such as addresses and
|    telephone numbers in ... extract headers as created [at the HMO]
| 2. automated masking of medical record numbers
| 3. automated masking of each segment of each member's name
|    everywhere these segments occur in the ... extract"
|
| There are some known problems with this masking (e.g., regarding the
| occurrence of names in the record other than that of the particular patient).
| My problem is that I have no idea how much faith, trust, etc. to put into the
| "automated masking" process.  Of particular help would be guidance on what
| questions to ask about this process to help make decisions about whether it is
| sufficient (guidance on literature would also be appreciated).

Also the occurrence of:
- the patient's name in fields other than the field one is masking;
- a caregiver's name (MD/RN/OT/PT etc.) in reports;
- other personal info (e.g. phone numbers to call) in reports.
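
Phone numbers buried in report text are one item in the list above that a
pattern match can catch.  The Python sketch below assumes North American
number formats only; `PHONE` and `mask_phones` are hypothetical names, and the
pattern will miss numbers written in other ways.

```python
import re

# Optional area code, then NNN-NNNN or NNN.NNNN, not embedded in longer digits.
PHONE = re.compile(r"(?<!\d)(?:\(\d{3}\)\s*|\d{3}[-.\s])?\d{3}[-.]\d{4}(?!\d)")

def mask_phones(text, mask="[PHONE]"):
    """Replace anything that looks like a phone number with the mask."""
    return PHONE.sub(mask, text)

print(mask_phones("Call daughter at (414) 555-0123 or 555-0199 with results."))
# prints: Call daughter at [PHONE] or [PHONE] with results.
```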

| I note also that the people on the project appear to be unaware of the
| possibility of identifying patients via combinations of coded information.  As
| a statistician, I am aware of some of the large literature on this question,
| especially with respect to Census information.  However, I am not familiar
| with recent literature on this question or with computer algorithms; further,
| I am not aware of any literature dealing specifically with this question for
| medical records (except that I do have a copy of the 9/93 publication from the
| Office of Technology Assessment entitled _Protecting Privacy in Computerized
| Medical Information_; however, this is not a technical publication).

This takes some luck, good insight, and leg work.  I'm not sure this is what
the university is after, or that it has time to verify everything.
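
One way to ask how exposed the coded fields are is to count how many records
share each combination of values; a combination held by a single record
points at one patient.  This Python sketch is an illustration only: the field
names (`zip3`, `birth_year`, `sex`), the threshold `k`, and the function name
`rare_combinations` are assumptions, not fields known to be in the extract.

```python
from collections import Counter

def rare_combinations(records, fields, k=2):
    """Return each combination of field values shared by fewer than k
    records, mapped to the number of records that have it."""
    counts = Counter(tuple(rec[f] for f in fields) for rec in records)
    return {combo: n for combo, n in counts.items() if n < k}

records = [
    {"zip3": "532", "birth_year": 1950, "sex": "F"},
    {"zip3": "532", "birth_year": 1950, "sex": "F"},
    {"zip3": "532", "birth_year": 1921, "sex": "M"},
    {"zip3": "537", "birth_year": 1950, "sex": "F"},
]
print(rare_combinations(records, ["zip3", "birth_year", "sex"]))
# prints: {('532', 1921, 'M'): 1, ('537', 1950, 'F'): 1}
```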

| Another question relates to what we should be asking about the security of the
| university computer; we have been told that the center "has implemented data
| access security by granting electronic access to [HMO] data only to
| researchers designated as members of the [HMO] project."  However, we have
| been provided with NO details; again, what questions should we be asking and
| how do we interpret the responses.
|
| I should mention that our committee very strongly opposes any movement of HMO
| data outside the HMO, but in rare circumstances we have agreed when we were
| satisfied with the security situation (usually a stand-alone computer in a
| room that could easily be locked).

The moment you let someone else have a copy of any portion of the data base,
you lose total control of that data.  Some student or bystander will get
access.  Or a professor takes it home over the weekend...IMHO, processes like
this just don't work because they require *everyone* be trustworthy.

| Any help or advice would be greatly appreciated and should, preferably, be
| sent directly to me at "richgold@netcom.com".  If desired, I could post a
| summary of the resulting responses to this group.

Please do - I'm curious what kind of systems people use to share sensitive
data and how they protect it.

IMHO, I would only let the University install stand alone HW to access your
systems and set up queries to generate statistics.  No external network or
removable media.  If data needs to migrate back to the university, set up an
operation to verify the data is cumulative in nature and does not contain any
personal info.  That way you control who accesses the data and how.

: The HMO has entered into an agreement with a 'local'
: university (about 90 miles away) to attempt to develop tools for
: exploiting clinical text data (e.g., access, search, extract,
: manipulate the text portion of the record).

: The process includes providing the university with example records
: (size of sample not known), where the records have been 'sanitized'.

In my experience, it is impossible to sanitize databases, for just the reasons
you mention, and also because someone on the project may recognize a specific
case.  You are also correct to be skeptical of university security.

For this reason, a different procedure should be followed for testing.

1) Write a program to generate artificial records.  This takes about the same
   amount of thought as "sanitizing" the database.  It's not trivial, but not
   overwhelmingly difficult.

2) Provide the University group with the artificial records for testing.  When
   the University is satisfied with the results, let them provide you with a
   test release of the software (or whatever portion they are working on).
   HMO personnel then test the software on live data, at the HMO.  This may
   require a loan or rental of hardware.

3) If problems are found (likely), the artificial record generator may need to
   be modified to create records of the problem type.

This method of testing has another advantage besides protection of patient
records:  by creating a random selection of records with characteristics of
real records, you can create a more diverse database and catch more problems.
And, of course, you can make the database as large as you want, include dates
later than 12/31/99, >100-year old patients, patients with large numbers of
visits or diagnoses, and in general stress the system.
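
A generator of the kind described in step 1 might start out like the sketch
below.  Everything in it is invented for illustration: the record layout, the
`DIAGNOSES` list, the date ranges, and the function names `make_record` and
`generate_records`.  It deliberately produces the stress cases mentioned
above: visit dates after 12/31/99, patients over 100 years old, and patients
with very many visits.

```python
import random
from datetime import date, timedelta

DIAGNOSES = ["asthma", "fracture", "hypertension", "migraine"]  # made up

def make_record(rng, record_id):
    """Build one artificial patient record as a dictionary."""
    # Birth dates from 1880 onward, so some patients are over 100 years old.
    birth = date(1880, 1, 1) + timedelta(days=rng.randrange(365 * 114))
    # Visits start no earlier than birth and run past 12/31/99.
    start = max(birth, date(1990, 1, 1))
    n_visits = rng.choice([1, 2, 3, 5, 50, 500])
    days = sorted(rng.randrange(365 * 20) for _ in range(n_visits))
    visits = [(start + timedelta(days=d), rng.choice(DIAGNOSES)) for d in days]
    return {"id": f"SYN{record_id:06d}", "birth_date": birth, "visits": visits}

def generate_records(n, seed=None):
    """Return n artificial records; a seed makes the set reproducible."""
    rng = random.Random(seed)
    return [make_record(rng, i) for i in range(n)]

records = generate_records(200, seed=1)
print(records[0]["id"], records[0]["birth_date"], len(records[0]["visits"]))
```

When a problem turns up on live data (step 3), the matching case can be added
to the generator so it stays covered in later test sets.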

{While no decision has been made yet, our Human Studies Committee (IRB)
decided to ask a whole range of questions, largely based on the material
above; I also note that it turns out that our state is one of the many states
that has a law requiring that informed consent from patients be obtained prior
to giving any data with identifiable information to anyone outside the medical
provider group; it is not clear what the criteria are for deciding whether
data contains identifiable information.  Again, thank you all very much. Rich
Goldstein}


------------------------------

End of Computer Privacy Digest V5 #047
******************************