Latanya Sweeney
birth:
place:
ALB (1995) in computer science, summa cum laude, Harvard University; SM (1997) in electrical engineering and computer science, MIT
Ph.D. (2001) in computer science, MIT
Assistant Professor of Computer Science, Technology and Policy, Institute for Software Research International, School of Computer Science, Carnegie Mellon University.
Director, Laboratory for International Data Privacy (also known as the "Data Privacy Lab"), School of Computer Science, Carnegie Mellon University.
Co-Director, PhD Program in Computation, Organizations and Society, School of Computer Science, Carnegie Mellon University.
Director, Privacy Technology Center, School of Computer Science, Carnegie Mellon University.
URL: http://privacy.cs.cmu.edu/people/sweeney/index.html
e-mail: latanya@privacy.cs.cmu.edu
Founder, Journal of Privacy Technology.
Faculty, Center for Automated Learning and Discovery, School of
Computer Science, Carnegie Mellon University.
Faculty, Aladdin Center, School of Computer Science, Carnegie
Mellon University.
Core Semantic Learning Technologies Invented
* Sprees: a finite state orthographic learning
system that recognizes and generates phonologically similar spellings.
This was my Master's thesis. [1995]
* Scout: an algorithm and software program that profiles a dataset
to learn what fields are present and the kind of information contained
within, thereby providing a semantic description of an unknown
dataset. [1995]
* Database Profiling Server: an algorithm and software program
that relates fields across datasets and tables, thereby identifying
fields containing duplicate information across tables and fields
in one table that can be reliably linked to those in another table.
Learning is based on the actual values that appear in the tables.
[1996]
* Identifiability Server: a process and related models for determining
how identifiable individuals and other entities within data may
be by utilizing summary and aggregate data, e.g., Census data.
A finding now cited in hundreds of articles is that "87%
of the US population is uniquely identified by {date of birth,
gender, 5-digit ZIP}". This work is the source of
that finding when used with Census data. [1997]
* Risk Assessment Server: a system for determining how identifiable
individuals within data may be by utilizing an inference engine,
a taxonomy and ontology of fields, specific domain knowledge,
knowledge about available datasets, and population specifics.
This server is licensed commercially to perform HIPAA Certifications
(see www.privacert.com) and can be used in privacy-preserving
bio-terrorism surveillance (see my AAAS Presentation and TAPAC
testimony). It has also been the basis, in part, of my numerous
expert witness consultations and testimonies. [1998]
* Donor Profiling and Solicitation (DPS): an algorithm and software
program that learns a psychological profile of a person based
on giving history, and then uses compiled information to determine
an optimal personalized solicitation strategy for a given solicitation
attempt. Previously licensed to CESS, Inc. and Share Systems,
Inc., who used it with more than 30 leading non-profit organizations
including the Democratic National Party, the National Organization
for Women, the Harlem Boys Choir, Bishop Tutu's South African
Freedom Campaign, and Greenpeace. [1987]
* DPS Extractor: an algorithm and software program that learns
titles, given names, surnames, suffixes, street names, and other
information about a household by extracting and interpreting information
from mailing labels. This was licensed to CESS, Inc. and to Share
Systems, Inc. [1987]
* Biblio: an algorithm and software program that extracts publication
references from raw text bibliographies, identifying constituent
parts (author names, title, publication, etc.), and then uses
learned information to remove duplicates and to construct a searchable
database. This was licensed to CESS, Inc. and Chitin, Inc. [1988]
* Scrub Extractor: an algorithm and software program that automatically
extracts names, addresses, and other identifying information from
letters, notes, articles, and other free text documents. (Basis,
in part, for Scrub System, which is a privacy de-identification
tool described below.) [1996]
* Iterative Profiler: an algorithm and process by which increasing
amounts of information are related to individuals using inferential
linkages of data fragments across various kinds of data sources.
This has been the basis, in part, of some of my expert witness
experiments and testimonies. [1997]
* RosterFinder: an algorithm and program that uses the Google
API to allow searches for web pages containing rosters (lists
of names) of people. A sample use was to locate web pages containing
rosters of undergraduate students in computer science; about 18,725
students were found in 39 schools. Results are available in a
searchable database on-line (thanks in part to Marshall Warfield).
[2003]
* CameraWatch: a suite of algorithms and programs for locating
IP addresses of live webcams and webpages showing images from
live webcams. A sample of the results is available in an on-line
searchable database (thanks in part to Kishore Madhava). The work
has been highlighted on ABC News, CNN, USA Today, and SlashDot,
and has had more than a million hits to the website. [2003]
* SSNwatch: a method and process by which public information about
Social Security number (SSN) allocations is used to learn past
residential information and current age inferences about the person
to whom the SSN was assigned. This provides a means to match a
person presenting an SSN to the demographics learned about the
SSN, which is useful in combating identity theft. [2004]
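The identifiability measure described in the Identifiability Server item above (the share of a population that is unique on a combination of fields) can be illustrated with a minimal Python sketch. The helper name `fraction_unique`, the field names, and the four toy records are all hypothetical and chosen for illustration; this is not the Identifiability Server itself and uses no Census data.

```python
from collections import Counter


def fraction_unique(records, fields):
    """Return the fraction of records whose values on `fields` occur exactly once.

    Hypothetical helper for illustration only.
    """
    if not records:
        return 0.0
    # Build one key per record from the chosen fields.
    keys = [tuple(record[f] for f in fields) for record in records]
    counts = Counter(keys)
    # A record is "unique" when no other record shares its key.
    unique = sum(1 for key in keys if counts[key] == 1)
    return unique / len(records)


# Hypothetical toy population (not real data).
people = [
    {"dob": "1970-01-01", "gender": "F", "zip": "15213"},
    {"dob": "1970-01-01", "gender": "F", "zip": "15213"},
    {"dob": "1982-06-30", "gender": "M", "zip": "15213"},
    {"dob": "1990-11-12", "gender": "F", "zip": "02139"},
]

print(fraction_unique(people, ["dob", "gender", "zip"]))  # 0.5
print(fraction_unique(people, ["gender"]))                # 0.25
```

In this toy population, half of the records are unique on {date of birth, gender, ZIP}, while only one in four is unique on gender alone, which shows how combining fields increases identifiability.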
Data Privacy Technologies Invented
* k-Anonymity: any algorithm or process
that anonymizes data by ensuring each entity in the data is indistinguishable
from at least a specific number of other such entities in the
data. Provisional patent (US and International). Received a recognition
award from the 2004 Workshop on Privacy Enhancing Technologies.
It has been cited, discussed, and extended in academic work
by others across numerous communities, applied to many kinds
of data, and has inspired other uses (e.g., k-anonymous
messaging). [1997]
* Datafly: an algorithm and general-purpose software program that
anonymizes field-structured data so that the released data adheres
to a k-anonymity requirement. Specifically, every released record
is indistinct from at least k-1 others over the fields sensitive to re-identification.
Received a recognition award from the American Medical Informatics
Association. A license was provided to Datanon, LLC and subsequently
to Privacert, Inc. [1997]
* k-Similar: a general-purpose clustering algorithm that
groups the closest information together with the guarantee that
there are at least k members to each cluster. The number
of clusters is not fixed, making the algorithm the converse of
the well-known k-means clustering algorithm, which fixes the
number of clusters at k, each having any number of members. Like
Datafly, k-Similar can be used to anonymize field-structured data
so that the released data adheres to a k-anonymity requirement.
But unlike Datafly, results from k-Similar maintain the
maximum detail possible. A license was provided to Datanon, LLC
and subsequently to Privacert, Inc. [1998]
* k-Same (with Elaine Newton and Bradley Malin): an algorithm
for de-identifying faces in video surveillance data such that
no face recognition software (no matter how good the software
may get) can reliably recognize the resulting images even though
most facial details are preserved. This is done by averaging image
components, which may be the original image pixels (k-Same-Pixel)
or eigenvectors (k-Same-Eigen) so that k-anonymity is assured.
[2003]
* Scrub: an algorithm and software program that anonymizes unrestricted
text such that the identities of individuals and other entities
in the data cannot be re-identified. (Converse of Scrub Extractor
mentioned earlier.) Based on its use with clinical notes and letters,
it received a recognition award from the American Medical Informatics
Association in 1996. [1996]
* Policy Explorer: an algorithm and process that characterizes
and quantifies data sharing practices by estimating how much information
is made available about each entity that is the subject of the
data and reporting related measures of risks. Provides a tool
for comparing competing policies and performing "what if"
analyses. [1997]
* Privacert: a rule-based system with related language ("PrivaCert
Editing Language") for expressing and enforcing anonymity
requirements to render a specific dataset sufficiently de-identified.
Results satisfy privacy standards established by the Risk Assessment
Server (mentioned above) and not necessarily k-anonymity. This
technology is licensed to Privacert, Inc. for use in rendering
health data sufficiently de-identified in accordance with HIPAA (U.S.
medical privacy regulation). [1997]
* PrivaSum (with Samuel Edo-Eket): a real-world protocol that
allows parties to jointly compute an aggregate statistic over
a network such that the result is known by all but the contribution
by each party remains confidential. Performance is improved over
the traditional secret sharing approach, which is deterministic,
by providing probabilistic assurances even in the face of widespread
collusion by some malicious parties. [2004]
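The k-anonymity requirement described in the list above can be checked mechanically. The following is a minimal Python sketch under stated assumptions: the helpers `is_k_anonymous` and `generalize_zip`, the field names, and the toy rows are hypothetical, and the ZIP truncation is only a simple stand-in for generalization, not the Datafly or k-Similar algorithm.

```python
from collections import Counter


def is_k_anonymous(records, quasi_identifiers, k):
    """True if every combination of quasi-identifier values appears at least k times."""
    keys = [tuple(r[f] for f in quasi_identifiers) for r in records]
    counts = Counter(keys)
    return all(count >= k for count in counts.values())


def generalize_zip(records, digits):
    """Return copies of the records keeping only the first `digits` characters of the ZIP.

    The remaining characters are masked with '*'. Hypothetical helper for illustration.
    """
    return [
        {**r, "zip": r["zip"][:digits] + "*" * (len(r["zip"]) - digits)}
        for r in records
    ]


# Hypothetical toy table (not real data).
rows = [
    {"zip": "15213", "gender": "F", "diagnosis": "flu"},
    {"zip": "15217", "gender": "F", "diagnosis": "asthma"},
    {"zip": "15232", "gender": "M", "diagnosis": "flu"},
    {"zip": "15238", "gender": "M", "diagnosis": "diabetes"},
]

quasi = ["zip", "gender"]
print(is_k_anonymous(rows, quasi, 2))                     # False
print(is_k_anonymous(generalize_zip(rows, 3), quasi, 2))  # True
```

Masking the last two ZIP digits merges the four distinct rows into two groups of two, so the generalized table meets the requirement for k = 2 at the cost of some geographic detail.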
Other Semantic Learning Technologies Invented
* Collaboration Wheel: an on-line tool for
supporting group joint work on a common document in a networked
environment. Java prototype built by Charles Shelton in 1997.
[1997]
* Active Tutor (aka Power Learning): an on-line teaching-learning
environment, where the computer plays the role of an expert teacher.
Instructions and practice are seamlessly integrated and adapted
to the personal needs of each student. On-line demonstration available.
[1983-2002]
* Virtual lectures: By taking the expected path of an "average
student" through an Active Tutor (described above), the teaching-learning
material can be condensed into carefully crafted on-line lectures.
Teaching-learning materials for 3 full semesters of the Java Programming
language are available on-line (Java1, Java2, Java3) and have
been used to teach hundreds of students in courses at Carnegie
Mellon (e.g., 15-100) and Harvard University, officially, and
many other schools unofficially. Numerous non-traditional students
have reported using the on-line materials for self-directed learning.
[1998-2002]
* GridCity: a visual programming environment in which students
write Java programs to control vehicles in a GridWorld. This is
an adaptation of Karel the Robot by Richard Pattis, but
this adaptation exploits the fundamental programming constructs
made possible by Java that were not available to Karel in Pascal.
On-line demonstrations and software download available. Used by
hundreds of students at Carnegie Mellon and Harvard University.
[1999-2002]
* Bebe: an algorithm and software program for learning the basic
sounds of a human language based on automated detection of phonemes
in the analog waveform. No high-level knowledge of the language
is used, so it performs as well on the Korean sound system
as on the American English sound system. Early implementations done
jointly with Patrick Thompson. [1997]
* Iris Expert System shell: a method and software program for
developing on-line expert systems for diagnostic tasks. A sampler
was built for interviewing people about personal behaviors that
may place them at risk for AIDS; the program received regional
print and radio news coverage in New England. [1989]
* CompuFix: a software program that steps a person through the
repair of a personal computer, assuming no prior knowledge of
computer repairs. Uses the Iris Expert System shell (mentioned
above). Was licensed to American Information Technologies, which
sold copies commercially. [1990]
This website was created and is maintained
by Dr. Scott Williams, Professor of Mathematics, State University of New York at Buffalo.