Latanya Sweeney



ALB (1995) in computer science, Summa Cum Laude, Harvard University; SM (1997) in electrical engineering and computer science, MIT

PhD (2001) in computer science, MIT

* Assistant Professor of Computer Science, Technology and Policy, Institute for Software Research International, School of Computer Science, Carnegie Mellon University.
* Director, Laboratory for International Data Privacy, also known as the "Data Privacy Lab", School of Computer Science, Carnegie Mellon University.
* Co-Director, PhD Program in Computation, Organizations and Society, School of Computer Science, Carnegie Mellon University.
* Director, Privacy Technology Center, School of Computer Science, Carnegie Mellon University.
* Founder, Journal of Privacy Technology.
* Faculty, Center for Automated Learning and Discovery, School of Computer Science, Carnegie Mellon University.
* Faculty, Aladdin Center, School of Computer Science, Carnegie Mellon University.

Core Semantic Learning Technologies Invented

* Sprees: a finite-state orthographic learning system that recognizes and generates phonologically similar spellings. This was my Master's thesis. [1995]
* Scout: an algorithm and software program that profiles a dataset to learn what fields are present and the kind of information contained within, thereby providing a semantic description of an unknown dataset. [1995]
* Database Profiling Server: an algorithm and software program that relates fields across datasets and tables, thereby identifying fields containing duplicate information across tables and fields in one table that can be reliably linked to those in another table. Learning is based on the actual values that appear in the tables. [1996]
* Identifiability Server: a process and related models for determining how identifiable individuals and other entities within data may be by utilizing summary and aggregate data, e.g., Census data. A finding now cited in hundreds of articles is that "87% of the US population is uniquely identified by {date of birth, gender, 5-digit ZIP}". This work is the source of that finding when used with Census data. [1997]
* Risk Assessment Server: a system for determining how identifiable individuals within data may be by utilizing an inference engine, a taxonomy and ontology of fields, specific domain knowledge, knowledge about available datasets, and population specifics. This server is licensed commercially to perform HIPAA certifications and can be used in privacy-preserving bio-terrorism surveillance (see my AAAS Presentation and TAPAC testimony). It has also been the basis, in part, of my numerous expert witness consultations and testimonies. [1998]
* Donor Profiling and Solicitation (DPS): an algorithm and software program that learns a psychological profile of a person based on giving history, and then uses compiled information to determine an optimal personalized solicitation strategy for a given solicitation attempt. Previously licensed to CESS, Inc. and Share Systems, Inc., who used it with more than 30 leading non-profit organizations including the Democratic National Party, the National Organization for Women, the Harlem Boy's Choir, Bishop Tutu's South African Freedom Campaign, and Greenpeace. [1987]
* DPS Extractor: an algorithm and software program that learns titles, given names, surnames, suffixes, street names, and other information about a household by extracting and interpreting information from mailing labels. This was licensed to CESS, Inc. and to Share Systems, Inc. [1987]
* Biblio: an algorithm and software program that extracts publication references from raw text bibliographies, identifying constituent parts (author names, title, publication, etc.), and then uses learned information to remove duplicates and to construct a searchable database. This was licensed to CESS, Inc. and Chitin, Inc. [1988]
* Scrub Extractor: an algorithm and software program that automatically extracts names, addresses, and other identifying information from letters, notes, articles, and other free text documents. (Basis, in part, for Scrub System, which is a privacy de-identification tool described below.) [1996]
* Iterative Profiler: an algorithm and process by which increasing amounts of information are related to individuals using inferential linkages of data fragments across various kinds of data sources. This has been the basis, in part, of some of my expert witness experiments and testimonies. [1997]
* RosterFinder: an algorithm and program that uses the Google API to allow searches for web pages containing rosters (lists of names) of people. A sample use was to locate web pages containing rosters of undergraduate students in computer science; about 18,725 students were found in 39 schools. Results are available in a searchable database on-line (thanks in part to Marshall Warfield). [2003]
* CameraWatch: a suite of algorithms and programs for locating IP addresses of live webcams and webpages showing images from live webcams. A sample of the results is available in an on-line searchable database (thanks in part to Kishore Madhava). The work has been highlighted on ABC News, CNN, USA Today, and SlashDot, and has had more than a million hits to the website. [2003]
* SSNwatch: a method and process by which public information about Social Security number (SSN) allocations is used to learn past residential information and current age inferences about the person to whom the SSN was assigned. This provides a means to match a person presenting an SSN to the demographics learned about the SSN, which is useful in combating identity theft. [2004]
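
The Identifiability Server entry above turns on one measurement: what share of a population is unique on a combination of fields such as {date of birth, gender, 5-digit ZIP}. The following is a minimal sketch of that measurement on individual records, not the server's actual models (which, as described, work from summary and aggregate data). The function name `fraction_unique`, the field names, and the toy records are all assumptions made for illustration.

```python
from collections import Counter


def fraction_unique(records, fields):
    """Return the fraction of records whose values on `fields` occur exactly once."""
    keys = [tuple(r[f] for f in fields) for r in records]
    if not keys:
        return 0.0
    counts = Counter(keys)
    # A record is "unique" when no other record shares its field combination.
    unique = sum(1 for key in keys if counts[key] == 1)
    return unique / len(keys)


# Hypothetical toy records; these are not Census data.
people = [
    {"dob": "1960-07-04", "gender": "F", "zip": "15213"},
    {"dob": "1960-07-04", "gender": "F", "zip": "15213"},
    {"dob": "1971-01-15", "gender": "M", "zip": "15213"},
    {"dob": "1985-11-30", "gender": "F", "zip": "02138"},
]

share = fraction_unique(people, ("dob", "gender", "zip"))
print(f"{share:.0%} of the toy records are unique on (dob, gender, zip)")
```

The first two toy records share the same field combination, so only the last two count as unique.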

Data Privacy Technologies Invented

* k-Anonymity: any algorithm or process that anonymizes data by ensuring each entity in the data is indistinguishable from at least a specific number of other such entities in the data. Provisional patent (US and International). Received a recognition award from the 2004 Workshop on Privacy Enhancing Technologies. Has been widely cited, discussed, and extended in academic work by others across numerous communities, has been applied to many kinds of data, and has inspired other kinds of uses (e.g., k-anonymous messaging). [1997]
* Datafly: an algorithm and general-purpose software program that anonymizes field-structured data so that the released data adheres to a k-anonymity requirement. Specifically, each released record is indistinct from at least k-1 other records over the fields sensitive to re-identification. Received a recognition award from the American Medical Informatics Association. A license was provided to Datanon, LLC and now subsequently to Privacert, Inc. [1997]
* k-Similar: a general-purpose clustering algorithm that groups the closest information together with the guarantee that there are at least k members to each cluster. The number of clusters is not fixed, making the algorithm the converse of the very well-known k-means clustering algorithm, which fixes the number of clusters at k and lets each cluster have any number of members. Like Datafly, k-Similar can be used to anonymize field-structured data so that the released data adheres to a k-anonymity requirement. But unlike Datafly, results from k-Similar maintain the maximum detail possible. A license was provided to Datanon, LLC and now subsequently to Privacert, Inc. [1998]
* k-Same (with Elaine Newton and Bradley Malin): an algorithm for de-identifying faces in video surveillance data such that no face recognition software (no matter how good the software may get) can reliably recognize the resulting images even though most facial details are preserved. This is done by averaging image components, which may be the original image pixels (k-Same-Pixel) or eigenvectors (k-Same-Eigen) so that k-anonymity is assured. [2003]
* Scrub: an algorithm and software program that anonymizes unrestricted text such that the identities of individuals and other entities in the data cannot be re-identified. (Converse of Scrub Extractor mentioned earlier.) Based on its use with clinical notes and letters, it received a recognition award from the American Medical Informatics Association in 1996. [1996]
* Policy Explorer: an algorithm and process that characterizes and quantifies data sharing practices by estimating how much information is made available about each entity that is the subject of the data and reporting related measures of risks. Provides a tool for comparing competing policies and performing "what if" analyses. [1997]
* Privacert: a rule-based system with related language ("PrivaCert Editing Language") for expressing and enforcing anonymity requirements to render a specific dataset sufficiently de-identified. Results satisfy privacy standards established by the Risk Assessment Server (mentioned above) and not necessarily k-anonymity. This technology is licensed to Privacert, Inc. for use in rendering health data sufficiently de-identified in accordance with HIPAA (U.S. medical privacy regulation). [1997]
* PrivaSum (with Samuel Edo-Eket): a real-world protocol that allows parties to jointly compute an aggregate statistic over a network such that the result is known by all but the contribution by each party remains confidential. Performance is improved over the traditional secret sharing approach, which is deterministic, by providing probabilistic assurances even in the face of widespread collusion by some malicious parties. [2004]
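
The k-anonymity requirement that several entries above refer to can be checked mechanically: group the records by the fields sensitive to re-identification and confirm that every group has at least k members. The following is a minimal sketch under stated assumptions. It is not Datafly or k-Similar; the function names, the field names, and the single ZIP-truncation step are hypothetical and stand in for the much richer generalization those programs perform.

```python
from collections import Counter


def smallest_group(records, quasi_identifiers):
    """Size of the smallest group of records sharing the same quasi-identifier values."""
    counts = Counter(tuple(r[f] for f in quasi_identifiers) for r in records)
    return min(counts.values()) if counts else 0


def is_k_anonymous(records, quasi_identifiers, k):
    """True when every quasi-identifier combination appears in at least k records."""
    return smallest_group(records, quasi_identifiers) >= k


def generalize_zip(records, digits):
    """Keep the first `digits` characters of each ZIP and mask the rest with '*'."""
    return [
        dict(r, zip=r["zip"][:digits] + "*" * (len(r["zip"]) - digits))
        for r in records
    ]


# Hypothetical toy table; "diagnosis" is the payload, not a quasi-identifier.
rows = [
    {"birth_year": 1960, "zip": "15213", "diagnosis": "asthma"},
    {"birth_year": 1960, "zip": "15217", "diagnosis": "flu"},
    {"birth_year": 1971, "zip": "02138", "diagnosis": "flu"},
    {"birth_year": 1971, "zip": "02139", "diagnosis": "diabetes"},
]
qi = ("birth_year", "zip")

before = is_k_anonymous(rows, qi, 2)                    # every row is unique
after = is_k_anonymous(generalize_zip(rows, 3), qi, 2)  # rows pair up on 3-digit ZIP
print(before, after)
```

Truncating the ZIP to three digits trades detail for anonymity; choosing which field to generalize, and by how much, is the design question that the anonymization programs listed above address.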

Other Semantic Learning Technologies Invented

* Collaboration Wheel: an on-line tool for supporting group joint work on a common document in a networked environment. Java prototype built by Charles Shelton in 1997. [1997]
* Active Tutor (aka Power Learning): an on-line teaching-learning environment, where the computer plays the role of an expert teacher. Instructions and practice are seamlessly integrated and adapted to the personal needs of each student. On-line demonstration available. [1983-2002]
* Virtual lectures: By taking the expected path of an "average student" through an Active Tutor (described above), the teaching-learning material can be condensed into carefully crafted on-line lectures. Teaching-learning materials for 3 full semesters of the Java Programming language are available on-line (Java1, Java2, Java3) and have been used to teach hundreds of students in courses at Carnegie Mellon (e.g., 15-100) and Harvard University, officially, and many other schools unofficially. Numerous non-traditional students have reported using the on-line materials for self-directed learning. [1998-2002]
* GridCity: a visual programming environment in which students write Java programs to control vehicles in a GridWorld. This is an adaptation of Karel the Robot by Richard Pattis, but this adaptation exploits the fundamental programming constructs made possible by Java that were not available to Karel in Pascal. On-line demonstrations and software download available. Used by hundreds of students at Carnegie Mellon and Harvard University. [1999-2002]
* Bebe: an algorithm and software program for learning the basic sounds of a human language based on automated detection of phonemes in the analog waveform. No high-level knowledge of the language is used, so it performs as well learning the Korean sound system as the American English sound system. Early implementations done jointly with Patrick Thompson. [1997]
* Iris Expert System shell: a method and software program for developing on-line expert systems for diagnostic tasks. A sampler was built for interviewing people about personal behaviors that may place them at risk for AIDS; the program received regional print and radio news coverage in New England. [1989]
* CompuFix: a software program that steps a person through the repair of a personal computer, assuming no prior knowledge of computer repairs. Uses the Iris Expert System shell (mentioned above). Was licensed to American Information Technologies, which sold copies commercially. [1990]

