Text Similarity
Ozlem Uzuner, Randall Davis & Boris Katz
Artificial Intelligence Laboratory
Massachusetts Institute of Technology
Cambridge, Massachusetts 02139
http://www.ai.mit.edu
The Problem: There are many circumstances in which it would be useful to determine that two documents contain similar text, including detecting plagiarism and copyright infringement, and filtering and organizing the documents a search engine returns as matches to a query. The vast amount of digital information available on the Web makes it necessary to deal with all of these issues. The ease of copying facilitates both plagiarism and copyright infringement, while the volume of information available increases the difficulty of finding the right information quickly.

Motivation: Automatic text similarity detectors can help identify plagiarism and copyright infringement and help reduce the abuse and misuse of electronic content. In addition, they can make information discovery more intuitive and less time-consuming.

Related Work in Text Similarity Recognition: Existing text similarity detection systems recognize verbatim similarities between documents but do not pay attention to similarity in expression. SCAM [4, 5], developed at the Stanford Digital Library, looks for verbatim copies of text documents by fingerprinting documents and checking these fingerprints against a repository of previously known fingerprints. SCAM looks for overlaps between verbatim text strings to identify partial similarity. In contrast, we want to detect non-verbatim similarity by measuring similarity of expression. We are particularly interested in identifying documents that are paraphrases of each other and that express the same content in the same way.

Related Work in Rhetorical Structure Theory: The main idea of Rhetorical Structure Theory (RST) [2] is to model the discourse structure of a text with a hierarchical tree diagram that uses rhetorical relations such as sequence,