1/4/2023

Similarity measure

The web snippets dataset contains 10,060 short documents (fewer than 40 words each), each being the text that represents one result of a web search, but the dataset is unbalanced in terms of cluster size. The two datasets differ a lot, which lets us see under which conditions each model performs better.

We first retrieve the ng20 (20 Newsgroups) dataset with the scikit-learn library and preprocess it with the nltk library to remove meaningless words (called stopwords). Each document is represented as a bag of words, so the corpus is a list of bags of words.

Here is an illustration of the first three results returned for the snippets. The first line is the query document, and the terms in bold are those that appear in the query. The index of the snippet is given in brackets. A lemmatization step has been applied, and duplicates have been removed to make the table readable.

- china manufactur product cost design dfma oversea paper redesign save true truecost
- manufactur product autodesk catalog competit custom design global industri mass mfg outsourc partner partnerproduct reshap servic
- directori product supplier buyer com commod provid servic sourc trader wand
- directori export manufactur product supplier allproduct buyer com databas global import marketplac volum wholesal
- china direcori directori directory- export manufactur product supplier taiwan

Number of documents (out of 20 results) in a different category than the query (categories are given in parentheses):

- 6 = 2 (politics: guns and mideast) + 4 (motorcycles)
- 4 (religion, electronics, operating systems, sport)
- 6 = 3 (engineering) + 2 (computers) + 1 (health)

We can see from these results on a single query that LSI seems to be better than the word2vec-based methods on the 20 Newsgroups dataset, because more retrieved documents belong to the same category as the query. On the snippets dataset, the word2vec-based methods perform well because, even if they do not always find results from the same category as the query, the results are still related to it.

In order to confirm (or not) what we have observed in this section, a global evaluation using clustering is performed in the following part. The clustering task consists in grouping documents into clusters: documents belonging to the same cluster should be more similar to each other than to documents belonging to different clusters. We want the number of clusters to equal the number of categories in order to evaluate the results: a cluster should correspond to a category. Since we have the similarity between pairs of documents (except for WMD on the 20 Newsgroups corpus), we use the spectral clustering algorithm.
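The stopword-removal and bag-of-words step described above can be sketched as follows. This is a minimal illustration, not the post's actual code: the tiny inline stopword list and sample documents are stand-ins (in the post's setting the corpus would come from scikit-learn's `fetch_20newsgroups` and the stopword list from nltk's `stopwords` corpus, both of which require a download).

```python
# Sketch of the preprocessing step: lowercase, tokenize, drop stopwords,
# and keep each document as a bag (list) of words.
import re

# Stand-in for nltk.corpus.stopwords.words("english").
STOPWORDS = {"the", "a", "an", "is", "of", "on", "and", "to", "in"}

def to_bag_of_words(doc):
    """Lowercase, tokenize on alphabetic runs, and drop stopwords."""
    tokens = re.findall(r"[a-z]+", doc.lower())
    return [t for t in tokens if t not in STOPWORDS]

# Stand-in for ng20.data from fetch_20newsgroups(subset="all").
docs = [
    "The quick brown fox is on the mat.",
    "An overview of spectral clustering and similarity measures.",
]

# The corpus is then a list of bags of words.
corpus = [to_bag_of_words(d) for d in docs]
print(corpus[0])  # ['quick', 'brown', 'fox', 'mat']
```

A further lemmatization step (e.g. with nltk's `WordNetLemmatizer`) would collapse inflected forms, as mentioned for the snippet table above.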
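The clustering evaluation described above can be sketched with scikit-learn's `SpectralClustering`, which accepts a precomputed affinity matrix: the number of clusters is set to the number of categories, and the pairwise document similarities are passed directly. The 6x6 similarity matrix below is a toy stand-in, not data from the post.

```python
# Sketch: spectral clustering on a precomputed document-similarity matrix,
# with one cluster per category, as in the evaluation setup.
import numpy as np
from sklearn.cluster import SpectralClustering

# Toy pairwise similarities for 6 "documents" forming two obvious groups.
sim = np.array([
    [1.0, 0.9, 0.8, 0.1, 0.2, 0.1],
    [0.9, 1.0, 0.9, 0.2, 0.1, 0.1],
    [0.8, 0.9, 1.0, 0.1, 0.1, 0.2],
    [0.1, 0.2, 0.1, 1.0, 0.9, 0.8],
    [0.2, 0.1, 0.1, 0.9, 1.0, 0.9],
    [0.1, 0.1, 0.2, 0.8, 0.9, 1.0],
])

n_categories = 2  # number of clusters = number of categories
model = SpectralClustering(
    n_clusters=n_categories,
    affinity="precomputed",  # sim is already a similarity matrix
    random_state=0,
)
labels = model.fit_predict(sim)
print(labels)  # documents 0-2 and 3-5 should fall into two different clusters
```

Using `affinity="precomputed"` is what makes this fit the setup described in the post: the clustering works directly from pairwise similarities, so any similarity measure (LSI cosine, word2vec-based, WMD-derived) can be plugged in without re-embedding the documents.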