1/4/2023

Similarity measure

The web snippets dataset contains 10,060 short documents (fewer than 40 words each), each being the text that represents one result of a web search, but the dataset is unbalanced in terms of cluster size. The two datasets differ a lot, which lets us see under which conditions each model performs better.

We first retrieve the ng20 (20 Newsgroups) dataset with the scikit-learn library and preprocess it with the nltk library to remove meaningless words (called stopwords). Each document is represented as a bag of words, so the corpus is a list of bags of words.

Here is an illustration of the first three results returned for the snippets. The first line is the query document, and the terms in bold are those that appear in the query. The index of the snippet is given in brackets. A lemmatization step has been applied, and duplicates have been removed to make the table readable.

- china manufactur product cost design dfma oversea paper redesign save true truecost
- manufactur product autodesk catalog competit custom design global industri mass mfg outsourc partner partnerproduct reshap servic
- directori product supplier buyer com commod provid servic sourc trader wand
- directori export manufactur product supplier allproduct buyer com databas global import marketplac volum wholesal
- china direcori directori directory- export manufactur product supplier taiwan

Number of documents (out of 20 results) in a different category than the query (categories are given in parentheses):

- 6 = 2 (politics: guns and mideast) + 4 (motorcycles)
- 4 (religion, electronics, operating systems, sport)
- 6 = 3 (engineering) + 2 (computers) + 1 (health)

We can see from these results on a single query that LSI seems to be better than the word2vec-based methods on the 20 Newsgroups dataset, because more retrieved documents belong to the same category as the query. On the snippets dataset, the word2vec-based methods perform well because, even if they do not always find results from the same category as the query, the results are still related to it.

In order to confirm (or not) what we have observed in this section, a global evaluation using clustering is performed in the following part. The clustering task consists in grouping documents into clusters: documents belonging to the same cluster should be more similar to each other than to documents belonging to different clusters. We want the number of clusters to equal the number of categories in order to evaluate the results: a cluster should correspond to a category. Since we have the similarity between pairs of documents (except for WMD on the 20 Newsgroups corpus), we use the spectral clustering algorithm.
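The stopword-removal and bag-of-words step described above can be sketched as follows. This is a minimal illustration, not the post's actual code: the tiny inline stopword list and sample documents are stand-ins (in the post's setting the corpus would come from scikit-learn's `fetch_20newsgroups` and the stopword list from nltk's `stopwords` corpus, both of which require a download).

```python
# Sketch of the preprocessing step: lowercase, tokenize, drop stopwords,
# and keep each document as a bag (list) of words.
import re

# Stand-in for nltk.corpus.stopwords.words("english").
STOPWORDS = {"the", "a", "an", "is", "of", "on", "and", "to", "in"}

def to_bag_of_words(doc):
    """Lowercase, tokenize on alphabetic runs, and drop stopwords."""
    tokens = re.findall(r"[a-z]+", doc.lower())
    return [t for t in tokens if t not in STOPWORDS]

# Stand-in for ng20.data from fetch_20newsgroups(subset="all").
docs = [
    "The quick brown fox is on the mat.",
    "An overview of spectral clustering and similarity measures.",
]

# The corpus is then a list of bags of words.
corpus = [to_bag_of_words(d) for d in docs]
print(corpus[0])  # ['quick', 'brown', 'fox', 'mat']
```

A further lemmatization step (e.g. with nltk's `WordNetLemmatizer`) would collapse inflected forms, as mentioned for the snippet table above.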
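The clustering evaluation described above can be sketched with scikit-learn's `SpectralClustering`, which accepts a precomputed affinity matrix: the number of clusters is set to the number of categories, and the pairwise document similarities are passed directly. The 6x6 similarity matrix below is a toy stand-in, not data from the post.

```python
# Sketch: spectral clustering on a precomputed document-similarity matrix,
# with one cluster per category, as in the evaluation setup.
import numpy as np
from sklearn.cluster import SpectralClustering

# Toy pairwise similarities for 6 "documents" forming two obvious groups.
sim = np.array([
    [1.0, 0.9, 0.8, 0.1, 0.2, 0.1],
    [0.9, 1.0, 0.9, 0.2, 0.1, 0.1],
    [0.8, 0.9, 1.0, 0.1, 0.1, 0.2],
    [0.1, 0.2, 0.1, 1.0, 0.9, 0.8],
    [0.2, 0.1, 0.1, 0.9, 1.0, 0.9],
    [0.1, 0.1, 0.2, 0.8, 0.9, 1.0],
])

n_categories = 2  # number of clusters = number of categories
model = SpectralClustering(
    n_clusters=n_categories,
    affinity="precomputed",  # sim is already a similarity matrix
    random_state=0,
)
labels = model.fit_predict(sim)
print(labels)  # documents 0-2 and 3-5 should fall into two different clusters
```

Using `affinity="precomputed"` is what makes this fit the setup described in the post: the clustering works directly from pairwise similarities, so any similarity measure (LSI cosine, word2vec-based, WMD-derived) can be plugged in without re-embedding the documents.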