World Library  
Flag as Inappropriate
Email this Article

W-shingling

Article Id: WHEBN0005548352
Reproduction Date:

Title: W-shingling  
Author: World Heritage Encyclopedia
Language: English
Subject: Vector space model, Rabin fingerprint, Bag-of-words model, Rolling hash, Data synchronization
Collection: Natural Language Processing
Publisher: World Heritage Encyclopedia
Publication
Date:
 

W-shingling

In natural language processing a w-shingling is a set of unique "shingles" (n-grams, contiguous subsequences of tokens in a document) that can be used to gauge the similarity of two documents. The w denotes the number of tokens in each shingle in the set.

The document, "a rose is a rose is a rose" can be tokenized as follows:

(a,rose,is,a,rose,is,a,rose)

The set of all contiguous sequences of 4 tokens (4-grams) is

{ (a,rose,is,a), (rose,is,a,rose), (is,a,rose,is), (a,rose,is,a), (rose,is,a,rose) } = { (a,rose,is,a), (rose,is,a,rose), (is,a,rose,is) }

Contents

  • Resemblance 1
  • See also 2
  • References 3
  • External links 4

Resemblance

For a given shingle size, the degree to which two documents A and B resemble each other can be expressed as the ratio of the magnitudes of their shinglings' intersection and union, or

r(A,B)=

where |A| is the size of set A. The resemblance is a number in the range [0,1], where 1 indicates that two documents are identical. This definition is identical with the Jaccard coefficient describing similarity and diversity of sample sets.

See also

References

  • (Manber 1993) Finding Similar Files in a Large File System. Does not yet use the term "shingling". Available as PDF
  • (Broder, Glassman, Manasse, and Zweig 1997) Syntactic Clustering of the Web. SRC Technical Note #1997-015. Available at HTML here

External links

  • An implementation of the shingling algorithm in C++
  • An implementation of the shingling algorithm in Erlang
This article was sourced from Creative Commons Attribution-ShareAlike License; additional terms may apply. World Heritage Encyclopedia content is assembled from numerous content providers, Open Access Publishing, and in compliance with The Fair Access to Science and Technology Research Act (FASTR), Wikimedia Foundation, Inc., Public Library of Science, The Encyclopedia of Life, Open Book Publishers (OBP), PubMed, U.S. National Library of Medicine, National Center for Biotechnology Information, U.S. National Library of Medicine, National Institutes of Health (NIH), U.S. Department of Health & Human Services, and USA.gov, which sources content from all federal, state, local, tribal, and territorial government publication portals (.gov, .mil, .edu). Funding for USA.gov and content contributors is made possible from the U.S. Congress, E-Government Act of 2002.
 
Crowd sourced content that is contributed to World Heritage Encyclopedia is peer reviewed and edited by our editorial staff to ensure quality scholarly research articles.
 
By using this site, you agree to the Terms of Use and Privacy Policy. World Heritage Encyclopedia™ is a registered trademark of the World Public Library Association, a non-profit organization.
 


Copyright © World Library Foundation. All rights reserved. eBooks from Project Gutenberg are sponsored by the World Library Foundation,
a 501c(4) Member's Support Non-Profit Organization, and is NOT affiliated with any governmental agency or department.