Business FAQs
What is a near-duplicate?
How common are near-duplicates?
How does Equivio's near-duping fit into the discovery and review business process?
As a law firm, or as a corporate legal department, what are the benefits of finding near-duplicates?
Does the product have any applicability in OCR-based discovery situations?
Technology FAQs
What is the difference between Equivio and standard de-duping technology?
How can Equivio be integrated in existing system environments and third-party products?
Is Equivio another "redline" or "compare" utility?
Is Equivio another litigation support package?
Is Equivio another conceptual search engine or classification utility?
What file formats does Equivio support?
How does Equivio support email?
What languages does Equivio support?
How is file resemblance measured?
Business FAQs
What is a near-duplicate?
Duplicates are exact copies of a file. Near-duplicates are files that are almost identical but differ in small ways. Examples of near-duplicate files include:
- Files with a few different words – this is the most common form of near-duplicates, and the most pressing business need
- Files with the same content but different formatting – for example, the documents might have the same text, but use different fonts, bold type or italics
- Files with the same content but different file type – for example, Microsoft Word and PDF copies of a file
From the technological viewpoint, the first scenario – small differences in content – is the most challenging. The ability to handle these cases is the unique innovation of Equivio's technology. Near-duplicates are especially common in email, in business templates (such as proposals, customer letters and contracts) and in forms (such as purchase or travel requests).
How common are near-duplicates?
In live cases, Equivio typically finds that 30-50% of the documents are near-duplicates. This is in addition to exact duplicates.
Near-duplicate percentages tend to vary widely between paper-sourced and electronic file discovery situations. Results from recent cases conducted with Equivio software include:
- Electronic discovery case A: 58% near-duplicates
- Electronic discovery case B: 35% near-duplicates
- Electronic discovery case C: 40% near-duplicates
- Electronic discovery case D: 67% near-duplicates
- Large government agency: 53% near-duplicates
- OCR case: 16% near-duplicates
How does Equivio's near-duping fit into the discovery and review business process?
Equivio detects and groups near-duplicate files. This reduces the time and effort required to review a collection of documents:
- Stage 1: We start out with an unstructured collection of documents that we encounter at the outset of a discovery process. Typically, 30-50% of the documents are near-duplicates.
- Stage 2: Equivio identifies the near-duplicates and arranges them into sets.
- Stage 3: The lawyer (or paralegal) is presented with a set of near-duplicates and can deal with them together, in a coherent systematic manner.
- Stage 4: Equivio identifies the pivot document, which is the most representative document of the near-duplicate set. The lawyer can choose to review just the pivot document. In many cases, after reading the pivot document, the lawyer will decide that the rest of the documents in the near-duplicate set can be skipped.
- Stage 5: If, however, the set is interesting, the lawyer can zoom in to review the remaining documents in the set. Using a compare utility, such as DeltaView, the lawyer can simply review the differences between each document and the pivot document. This is a lot faster than reading each document from beginning to end. It is also more effective, because critical differences are far less likely to be missed.
- Stage 6: Equivio ensures that near-duplicates can be treated consistently – for example, when coding documents as privileged, responsive and so on.
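The grouping and pivot selection in stages 2 and 4 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Equivio's algorithm: the similarity score is a stand-in (word-level `difflib.SequenceMatcher` ratio), the 0.7 threshold is arbitrary, the greedy grouping rule and the "highest total similarity" pivot rule are guesses at what "most representative" could mean, and the document names are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Stand-in score in [0, 1]: word-level SequenceMatcher ratio (not Equivio's measure)."""
    return SequenceMatcher(None, a.lower().split(), b.lower().split()).ratio()


def group_near_duplicates(docs, threshold=0.7):
    """Stage 2 (assumed rule): put each document in the first set whose first member it resembles."""
    sets = []
    for name, text in docs.items():
        for members in sets:
            if similarity(docs[members[0]], text) >= threshold:
                members.append(name)
                break
        else:
            # No existing set is similar enough, so start a new one.
            sets.append([name])
    return sets


def pick_pivot(members, docs):
    """Stage 4 (assumed rule): the member with the highest total similarity to the others."""
    def total(name):
        return sum(similarity(docs[name], docs[other]) for other in members if other != name)
    return max(members, key=total)


# Hypothetical collection: three versions of one sentence plus an unrelated document.
docs = {
    "v1.txt": "the contract starts on monday",
    "v2.txt": "the contract starts on tuesday",
    "v3.txt": "the contract starts on tuesday morning",
    "news.txt": "hurricane katrina makes landfall",
}

near_duplicate_sets = group_near_duplicates(docs)
pivots = [pick_pivot(members, docs) for members in near_duplicate_sets]
print(near_duplicate_sets)
print(pivots)
```

In this toy collection the three contract versions land in one set and the unrelated document in another, so a reviewer would read one pivot per set and then decide whether the remaining versions need a closer look.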
As a law firm, or as a corporate legal department, what are the benefits of finding near-duplicates?
Equivio generates immediate, concrete benefits:
- Less cost
  - Enables a more efficient, systematic review process – case studies show that near-duping can reduce a firm's review costs by 40%
  - For law firms, this facilitates a more competitive, reduced cost offering to your customers
- Less time
  - Prioritised review to cover more data in time window
  - Focus on high value-add activities versus low value-add
- Less risk
  - Coherent assignment enhances review quality
  - Focus on differences reduces errors and oversights
  - Consistent treatment of very similar documents
Does the product have any applicability in OCR-based discovery situations?
In addition to standard electronic discovery processes, Equivio is also invaluable in OCR situations.
Due to OCR errors, exact paper duplicates will no longer be duplicates once they have been scanned and OCR’d. OCR errors are typically in the range of 1-3%. As a result, standard de-duping, using CRC or the MD5 hash algorithm, does not work in OCR situations. In fact, customers report that in many OCR situations they obtain zero exact duplicates.
To overcome the OCR problem, firms are using Equivio's near-duping as a substitute for de-duping in paper-based discovery scenarios. Equivio groups the near-duplicates, and then it is a very simple matter of using a "show differences" utility, such as DeltaView, to identify which differences are OCR errors and which are meaningful content differences. In this way, using Equivio, the reviewer is able to identify the duplicates from a paper collection.
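As a rough illustration of the "show differences" step, the sketch below uses Python's standard `difflib` module as a stand-in for a compare utility such as DeltaView. The two sentences and the OCR error ("m" read as "rn") are invented for the example.

```python
import difflib


def word_differences(a, b):
    """Return (old, new) pairs for every run of words that differs between two texts."""
    a_words, b_words = a.split(), b.split()
    matcher = difflib.SequenceMatcher(None, a_words, b_words)
    return [
        (" ".join(a_words[i1:i2]), " ".join(b_words[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# Hypothetical pair: the same paper page scanned twice, with one OCR error.
first_scan = "The parties agree to the terms of this contract."
second_scan = "The parties agree to the terrns of this contract."

differences = word_differences(first_scan, second_scan)
print(differences)  # [('terms', 'terrns')]
```

A reviewer who sees only `terms` versus `terrns` can tell at a glance that the two scans are the same document, which is the judgment hash-based de-duping cannot make.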
Technology FAQs
What is the difference between Equivio and standard de-duping technology?
Existing de-duplication software is able to detect exact duplicates. This technology uses the MD5 hash algorithm or CRC, and has been available for many years. It is a very efficient way of finding exact duplicates.
Equivio, which also includes exact duplicate detection, solves the far more complex problem of detecting near-duplicates.
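For comparison, hash-based exact de-duplication is simple to sketch with Python's standard `hashlib` module. The file names and contents below are hypothetical. Note how the third file, which differs by a single word, gets a different digest and is not detected.

```python
import hashlib
from collections import defaultdict


def group_exact_duplicates(files):
    """Return lists of file names whose contents share the same MD5 digest."""
    groups = defaultdict(list)
    for name, content in files.items():
        groups[hashlib.md5(content).hexdigest()].append(name)
    # Only digests shared by two or more files indicate duplicates.
    return [names for names in groups.values() if len(names) > 1]


# Hypothetical collection: two identical files and one near-duplicate.
files = {
    "a.txt": b"jack lives here and works",
    "b.txt": b"jack lives here and works",
    "c.txt": b"jack lives here and works hard",
}

duplicate_sets = group_exact_duplicates(files)
print(duplicate_sets)  # [['a.txt', 'b.txt']]
```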
How can Equivio be integrated in existing system environments and third-party products?
Equivio can be integrated via:
- Equivio extract utility – For each file in a collection of documents, Equivio generates near-duping metadata, such as a list of the documents which are near-duplicates of any given document. This data is stored in a standard ODBC-compliant relational database. Equivio provides an extract utility which builds a file or database of this data, structured for straightforward loading into the standard litigation support environments. Equivio is fully integrated with the leading litigation review tools, such as Summation, Ringtail and Concordance.
- SDK – The Equivio>SDK allows Equivio's near-duplicate processing capability to be embedded in a solution or product.
Using these tools, we have three key options for integration in litigation support environments:
- Option 1: export the files from the litigation support system, run Equivio and use the extract utility to create a load file which is then imported back into the litigation support system
- Option 2: as in option 1, export the files and then run Equivio. However, rather than using the extract utility, use Equivio>SDK for the load
- Option 3: use Equivio>SDK for both inserting the files for processing by Equivio and for the load
Is Equivio another "redline" or "compare" utility?
No. Equivio discovers documents that are near-duplicates. Standard redline utilities, such as DeltaView, identify the differences between the documents.
Once you know that document A resembles document B, compare software can be used to identify the differences between them. However, in legal discovery situations, the problem is you usually don't know that document A is similar to document B. This is the problem addressed by Equivio. Once Equivio has discovered that A and B are near-duplicates, you can then go ahead and use compare software to identify the specific differences between the two documents.
Is Equivio another litigation support package?
No. Equivio is a near-duplicate detection utility. It is designed to be used as a component, integrated within a discovery processing solution and accessible via the standard litigation support packages.
Is Equivio another conceptual search engine or classification utility?
No. We need to distinguish between two models of "like" documents:
- Related documents model – conceptual search software can group documents which are similar in the sense that they relate to the same subject.
- Similar documents model – Equivio groups similar documents with small differences in content or formatting, e.g. documents that differ by a few words, such as form letters, document versions and boilerplate
To illustrate this, we will consider three documents on Hurricane Katrina. For simplicity, we are using very short documents, each containing just a few words of text:
- Doc1: President Bush visits New Orleans
- Doc2: George Bush visits New Orleans
- Doc3: Hurricane Katrina makes landfall
Docs 1, 2 and 3 are all related, in that they relate to Hurricane Katrina. As such, they would all be retrieved by a conceptual search engine with the search parameter “Hurricane Katrina”. Similarly, a classification tool would group these three documents together.
However, only Docs 1 and 2 qualify as near-duplicates. They are similar in the sense that they differ by only one word. Within the search results or the classification category, Equivio adds an additional layer of structure by grouping Docs 1 and 2 together.
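The distinction can be checked numerically with a small Python sketch. It applies a word-level (1-gram) version of the resemblance measure described under "How is file resemblance measured?", averaging the shared-word fraction over both documents as in that section's worked example. The function names are illustrative and are not part of any Equivio API.

```python
def words(text):
    """Lower-cased set of the words (1-grams) in a text."""
    return set(text.lower().split())


def resemblance_1gram(a, b):
    """Average, over both documents, of the fraction of words that are shared."""
    wa, wb = words(a), words(b)
    shared = len(wa & wb)
    return (shared / len(wa) + shared / len(wb)) / 2


doc1 = "President Bush visits New Orleans"
doc2 = "George Bush visits New Orleans"
doc3 = "Hurricane Katrina makes landfall"

print(resemblance_1gram(doc1, doc2))  # 0.8
print(resemblance_1gram(doc1, doc3))  # 0.0
```

Docs 1 and 2 share four of their five words, while Doc 3 shares none with Doc 1, even though all three concern the same subject.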
Search and near-duping represent complementary technologies. In fact, all of Equivio's customers are using Equivio in conjunction with their conceptual search tools to provide additional structure in their search results. This is exceptionally valuable for reviewers: they can skip the near-duplicates (using the above example, read Doc1 and skip Doc2), or review differences (review Doc2 by invoking a document compare utility, rather than reading the entire document).
What file formats does Equivio support?
Equivio supports de-duplication on a broad range of file types, including, for example:
- Microsoft Word
- Microsoft PowerPoint
- Microsoft Excel
- PST
- EML
- MSG
- ZIP
- PDF
- HTML
- TXT
- OCR-generated files
For compound files, such as emails, Equivio deconstructs the file into its granular elements, such as the email body and attachments, enabling detection of near-duplicates for each component. Similarly, for ZIP files containing multiple files, Equivio breaks out the individual component files for separate analysis.
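Breaking a ZIP archive into its component files can be sketched with Python's standard `zipfile` module. This only illustrates the idea of analysing each member separately and says nothing about how Equivio does it internally. The archive and its member names are built in memory purely for the example.

```python
import io
import zipfile


def expand_zip(data):
    """Return {member name: content bytes} for every file inside a ZIP archive."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        return {
            info.filename: archive.read(info)
            for info in archive.infolist()
            if not info.is_dir()
        }


# Build a small archive in memory so the sketch is self-contained.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("letter_v1.txt", "jack lives here and works")
    archive.writestr("letter_v2.txt", "jack lives here and works hard")

components = expand_zip(buffer.getvalue())
print(sorted(components))  # ['letter_v1.txt', 'letter_v2.txt']
```

Each entry in `components` can then be compared against the rest of the collection as if it were a standalone file.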
How does Equivio support email?
Equivio analyses multiple dimensions of near-duplication in emails. Each email message is deconstructed into its component parts: the email body and any attachments. This enables the identification of three modes of near-duplicates in emails:
- Near-duplicate Email Body: This means that the body of the emails is similar
- Near-duplicate Email Attachment: This means that the file attached to the email is similar to other attachments or other files in the input data
- Near-duplicate Email Sum: This means that each component of the email is similar to each of the components of another email – that is, the body is similar, as is each one of the attachments.
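Deconstructing a message into body and attachments can be sketched with Python's standard `email` package. The message below is invented, and the function name `split_email` is illustrative only. Once split, the body and each attachment can be compared independently, which is what makes the three modes above possible.

```python
from email.message import EmailMessage


def split_email(message):
    """Return (body text, {attachment file name: attachment content})."""
    body = message.get_body(preferencelist=("plain",))
    body_text = body.get_content() if body is not None else ""
    attachments = {
        part.get_filename(): part.get_content()
        for part in message.iter_attachments()
    }
    return body_text, attachments


# Hypothetical message with one plain-text attachment.
message = EmailMessage()
message["Subject"] = "Travel request"
message.set_content("Please approve the attached travel request.")
message.add_attachment("Destination: Boston\nDates: 3-5 May\n", filename="request.txt")

body_text, attachments = split_email(message)
print(body_text.strip())
print(list(attachments))
```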
What languages does Equivio support?
Equivio can be used to detect near-duplicate documents in any language.
How is file resemblance measured?
Equivio calculates the level of resemblance between two documents. The calculation uses the grams method, which is based either on individual words (the 1-grams method) or on groups of successive words (the multi-grams methods). For example, the 2-grams method uses groups (or shingles) of two successive words. Consider the sentence:
(Jack lives in town and works)
This sentence comprises the following 2-grams:
("jack lives", "lives in", "in town", "town and", "and works")
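Extracting shingles is straightforward to sketch in Python. The sketch assumes words are separated by whitespace and compared case-insensitively. The source does not say how Equivio tokenises text, so both are assumptions.

```python
def shingles(text, size=2):
    """Return the word n-grams (shingles) of the given size, in order of appearance."""
    words = text.lower().split()
    return [" ".join(words[i:i + size]) for i in range(len(words) - size + 1)]


print(shingles("Jack lives in town and works"))
# ['jack lives', 'lives in', 'in town', 'town and', 'and works']
```

With `size=1` the same function returns the individual words, which corresponds to the 1-grams method.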
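For a given shingle size, the resemblance r of two documents A and B is defined as:

$$ r(A, B) = \frac{1}{2} \left( \frac{|S(A) \cap S(B)|}{|S(A)|} + \frac{|S(A) \cap S(B)|}{|S(B)|} \right) $$

where S(A) and S(B) are the sets of shingles of A and B.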
The resemblance is a number between 0 and 1. It is always true that r(A, A) = 1 (that is, a document resembles itself 100%).
Consider the following example:
- Document A: (jack lives here and works)
- Document B: (jack lives here and works hard)
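Document A contains 4 2-grams and Document B contains 5. The two documents have 4 2-grams in common.
Similarity (2-grams) = (shared 2-grams / 2-grams in A + shared 2-grams / 2-grams in B) / 2
= (4/4 + 4/5)/2
= 90%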
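The worked example can be reproduced with a short Python sketch. The averaging of the two shared-shingle fractions follows the arithmetic in the example. Whitespace tokenisation, lower-casing and the handling of documents too short to contain any shingle are assumptions.

```python
def shingle_set(text, size=2):
    """Set of word n-grams (shingles) of the given size."""
    words = text.lower().split()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def resemblance(a, b, size=2):
    """Average, over both documents, of the fraction of shingles that are shared."""
    sa, sb = shingle_set(a, size), shingle_set(b, size)
    if not sa or not sb:
        # Assumption: texts too short to form a shingle only resemble each other.
        return 1.0 if sa == sb else 0.0
    shared = len(sa & sb)
    return (shared / len(sa) + shared / len(sb)) / 2


document_a = "jack lives here and works"
document_b = "jack lives here and works hard"

print(resemblance(document_a, document_b))  # 0.9
print(resemblance(document_a, document_a))  # 1.0
```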