119 | );
120 | }
121 |
122 | export default App;
123 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # semantic-autocomplete
2 |
3 | semantic-autocomplete is a React component that extends [v6 MUI's autocomplete](https://v6.mui.com/material-ui/react-autocomplete/) and performs **semantic similarity search** using a small, quantized machine learning (ML) model that runs client-side. The model is downloaded once and subsequently served from the browser's cache. The full functionality is provided within this React component!
4 |
5 | ## Demo
6 |
7 | **Sort paragraphs of a webpage by meaning:**
8 |
9 | https://mihaiii.github.io/semantic-autocomplete/
10 |
11 | 
12 |
13 | ## v5 MUI support
14 | This component works with both v5 and v6 MUI. It has not been tested by the author on earlier MUI versions.
15 |
16 | ## How to install
17 | Install:
18 |
19 | `npm install --save semantic-autocomplete`
20 |
21 | Then import:
22 |
23 | `import SemanticAutocomplete from "semantic-autocomplete";`
24 |
25 | ## Run on local from source code
26 |
27 | ```
28 | npm install
29 | npm run dev
30 | ```
31 |
32 | ## Usage
33 |
34 | Since semantic-autocomplete extends [MUI's autocomplete](https://v6.mui.com/material-ui/react-autocomplete/), the entire [v6 MUI's autocomplete API](https://v6.mui.com/material-ui/api/autocomplete/) will also work with semantic-autocomplete. The only exception is the [filterOptions property](https://mui.com/material-ui/react-autocomplete/#custom-filter).
35 |
36 | **If you're already using `Autocomplete` in your project, just replace the tag name and you're done.** 🙌
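
For instance, a minimal drop-in sketch (the options array and labels below are made up for illustration, not taken from the demos):

```jsx
import SemanticAutocomplete from "semantic-autocomplete";
import { TextField } from "@mui/material";

// Hypothetical options; by default each option's `label` field is embedded.
const options = [
  { label: "The quick brown fox", value: 1 },
  { label: "A fast auburn canine", value: 2 },
];

export default function Demo() {
  return (
    <SemanticAutocomplete
      options={options}
      renderInput={(params) => <TextField {...params} label="Search" />}
    />
  );
}
```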
37 |
38 | You can see the component being used in code [here](https://github.com/Mihaiii/semantic-autocomplete/blob/6d312a6264b7c3b79d053e23d3cdb4cf226196a1/demos/paragraphs_as_options/App.jsx#L26-L34) and [here](https://github.com/Mihaiii/semantic-autocomplete/blob/6d312a6264b7c3b79d053e23d3cdb4cf226196a1/demos/simple_autocomplete/App.jsx#L107-L112).
39 |
40 |
41 | [See this page for how you can use MUI's autocomplete and therefore semantic-autocomplete too](https://v6.mui.com/material-ui/react-autocomplete/).
42 |
43 | Besides MUI's autocomplete API, the following props are provided:
44 |
45 | - `threshold`: if set, the component filters out options whose cosine similarity to the input falls below this value. Defaults to unset (no filtering, only sorting). [Click for code example](https://github.com/Mihaiii/semantic-autocomplete/blob/6d312a6264b7c3b79d053e23d3cdb4cf226196a1/demos/simple_autocomplete/App.jsx#L110).
46 |
47 | - `onResult`: callback invoked once the sorting/filtering of the options is done, receiving the resulting options array as its first parameter. [Click for code example](https://github.com/Mihaiii/semantic-autocomplete/blob/6d312a6264b7c3b79d053e23d3cdb4cf226196a1/demos/paragraphs_as_options/App.jsx#L29).
48 |
49 | - `model`: the name of the Hugging Face ML model repo. The repo must contain an ONNX embeddings model and follow the standard folder structure used by transformers.js. If you're interested in changing the default model, you might find [this filter](https://huggingface.co/models?pipeline_tag=sentence-similarity&library=onnx&sort=trending) useful. [I made a bunch of small models for this component. Try them out and see what works best for your use case](https://huggingface.co/collections/Mihaiii/pokemons-662ce912d64b8a3bee518b7f). Default value: `Mihaiii/Venusaur` (pointing to [this repo](https://huggingface.co/Mihaiii/Venusaur)), which loads a quantized ONNX model of **~15 MB**. [Click here for code example](https://github.com/Mihaiii/semantic-autocomplete/blob/b16115492466eb1502107cf4581a804cb1dcbbe4/demos/simple_autocomplete/App.jsx#L115).
50 |
51 | - `pipelineParams`: the params to be passed to [transformers.js](https://github.com/xenova/transformers.js) when loading the model. Default value: `{ pooling: "mean", normalize: true }`. For more info, please [see this page](https://huggingface.co/docs/transformers.js/api/pipelines#module_pipelines.FeatureExtractionPipeline).
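
To make the `threshold` prop concrete, here is a hand-rolled sketch of cosine similarity (the component itself uses `cos_sim` from `@xenova/transformers`; the vectors below are made-up toy embeddings, not real model output):

```javascript
// Cosine similarity: dot(a, b) / (|a| * |b|). Returns 1 for identical
// directions, 0 for orthogonal vectors, -1 for opposite directions.
function cosSim(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Toy illustration of what `threshold` does: options whose similarity to
// the query embedding falls below the threshold are filtered out.
const queryEmbedding = [0.9, 0.1, 0.0];
const options = [
  { label: "close match", embedding: [1.0, 0.0, 0.0] },
  { label: "unrelated", embedding: [0.0, 0.0, 1.0] },
];
const threshold = 0.5;
const kept = options.filter(
  (op) => cosSim(op.embedding, queryEmbedding) >= threshold
);
console.log(kept.map((op) => op.label)); // → [ 'close match' ]
```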
52 |
53 | ## Thanks / credit
54 | - [Xenova](https://x.com/xenovacom?t=Mw1h_1joKgfrUXR_wl9Wrg&s=09) for building [transformers.js](https://github.com/xenova/transformers.js), providing clear and detailed documentation, always being willing to help out, and having [lots of demos](https://github.com/xenova/transformers.js/tree/main/examples) on [his HF account](https://huggingface.co/Xenova). The work for this component is based on his tutorial on [how to build a React component using transformers.js](https://huggingface.co/docs/transformers.js/en/tutorials/react).
55 | - [andersonbcdefg](https://x.com/andersonbcdefg?t=0Nkr_SRk-fMUrU_Kp0Wm5w&s=09) for building many small models like [gte-tiny](https://huggingface.co/TaylorAI/gte-tiny) and [bge-micro-v2](https://huggingface.co/TaylorAI/bge-micro-v2), and for providing me with guidance prior to making [Venusaur](https://huggingface.co/Mihaiii/Venusaur).
56 |
--------------------------------------------------------------------------------
/src/SemanticAutocomplete.jsx:
--------------------------------------------------------------------------------
1 | import React, { useState, useRef, useEffect } from "react";
2 | import { Autocomplete, TextField, CircularProgress } from "@mui/material";
3 | import { cos_sim } from "@xenova/transformers";
4 | import EmbeddingsWorker from "./worker?worker&inline";
5 |
6 | const SemanticAutocomplete = React.forwardRef((props, ref) => {
7 | const {
8 | loading: userLoading,
9 | onInputChange: userOnInputChange,
10 | onOpen: userOnOpen,
11 | onClose: userOnClose,
12 | } = props;
13 | const { onResult, threshold, pipelineParams, model, ...restOfProps } = props;
14 | const [options, setOptions] = useState([]);
15 | const [isOpen, setIsOpen] = useState(false);
16 | const [isLoading, setLoading] = useState(true);
17 | const [parentSize, setParentSize] = useState(0);
18 | const worker = useRef(null);
19 | const optionsWithEmbeddings = useRef([]);
20 | const userInput = useRef("");
21 | const loading = userLoading ? true : isOpen && isLoading;
22 | const getOptionLabel = props.getOptionLabel || ((option) => option.label);
23 |
24 | useEffect(() => {
25 | if (!worker.current) {
26 | worker.current = new EmbeddingsWorker();
27 |
28 | worker.current.postMessage({
29 | type: "init",
30 | pipelineParams: pipelineParams,
31 | model: model,
32 | });
33 | }
34 |
35 | const onMessageReceived = (e) => {
36 | switch (e.data.status) {
37 | case "completeOptions":
38 | optionsWithEmbeddings.current = e.data.optionsWithEmbeddings;
39 | setOptions(props.options);
40 | setLoading(false);
41 | //if user writes text before the embeddings are computed
42 | if (userInput.current) {
43 | worker.current.postMessage({
44 | type: "computeInputText",
45 | text: userInput.current,
46 | });
47 | }
48 | break;
49 |
50 | case "completeInputText":
51 | var sortedOptions = optionsWithEmbeddings.current
52 | .map((option) => ({
53 | ...option,
54 | sim: cos_sim(option.embeddings, e.data.inputTextEmbeddings),
55 | }))
56 | .sort((optionA, optionB) => {
57 | const containsA = includesCaseInsensitive(
58 | optionA.labelSemAutoCom,
59 | e.data.inputText
60 | );
61 | const containsB = includesCaseInsensitive(
62 | optionB.labelSemAutoCom,
63 | e.data.inputText
64 | );
65 |
66 | if (containsA == containsB) {
67 | return optionB.sim - optionA.sim;
68 | }
69 | return containsA ? -1 : 1;
70 | });
71 |
72 | if (threshold && e.data.inputText) {
73 |   const index = sortedOptions.findIndex(
74 |     (op) =>
75 |       !includesCaseInsensitive(op.labelSemAutoCom, e.data.inputText) &&
76 |       op.sim < threshold // first option below threshold that is not a literal match
77 |   );
78 |   if (index !== -1) {
79 |     sortedOptions = sortedOptions.slice(0, index); // index of -1 would wrongly drop the last option
80 |   }
81 | }
82 | setOptions(sortedOptions);
83 | if (onResult) {
84 | onResult(sortedOptions);
85 | }
86 | break;
87 | }
88 | };
89 |
90 | worker.current.addEventListener("message", onMessageReceived);
91 | return () =>
92 | worker.current.removeEventListener("message", onMessageReceived);
93 | });
94 |
95 | useEffect(() => {
96 | setLoading(true);
97 | worker.current.postMessage({
98 | type: "computeOptions",
99 | options: props.options.map((op) => ({
100 | ...op,
101 | labelSemAutoCom: getOptionLabel(op),
102 | })),
103 | });
104 | }, [props.options]);
105 |
106 | const includesCaseInsensitive = (fullText, lookupValue) => {
107 | return fullText.toLowerCase().includes(lookupValue.toLowerCase());
108 | };
109 |
110 | const handleInputChange = (event, value, reason) => {
111 | userInput.current = value;
112 |
113 | worker.current.postMessage({
114 | type: "computeInputText",
115 | text: value,
116 | });
117 |
118 | if (userOnInputChange) {
119 | userOnInputChange(event, value, reason);
120 | }
121 | };
122 |
123 | const handleOnOpen = (event) => {
124 | setIsOpen(true);
125 |
126 | if (userOnOpen) {
127 | userOnOpen(event);
128 | }
129 | };
130 |
131 | const handleOnClose = (event) => {
132 | setIsOpen(false);
133 |
134 | if (userOnClose) {
135 | userOnClose(event);
136 | }
137 | };
138 |
139 | const renderLoadingInput = (params) => (
140 |   <TextField
141 |     {...params}
142 |     InputProps={{
143 |       ...params.InputProps,
144 |       endAdornment: (
145 |         <>
146 |           <CircularProgress color="inherit" size={parentSize} />
147 |           {params.InputProps.endAdornment}
148 |         </>
149 |       ),
150 |     }}
151 |     ref={(node) => {
152 |       if (node && parentSize === 0) {
153 |         const inputElement = node.querySelector("input");
154 |         if (inputElement) {
155 |           //https://stackoverflow.com/a/62721389
156 |           const { clientHeight, clientWidth } = inputElement;
157 |           setParentSize(Math.min(clientHeight, clientWidth));
158 |         }
159 |       }
160 |     }}
161 |   />
162 | );
163 |
164 | return (
165 |   <Autocomplete
166 |     {...restOfProps}
167 |     options={options}
168 |     filterOptions={(x) => x}
169 |     onInputChange={handleInputChange}
170 |     loading={loading}
171 |     onOpen={handleOnOpen}
172 |     onClose={handleOnClose}
173 |     ref={ref}
174 |     {...(loading ? { renderInput: renderLoadingInput } : {})}
175 |   />
176 | );
177 | });
178 |
179 | export default SemanticAutocomplete;
180 |
--------------------------------------------------------------------------------
/demos/paragraphs_as_options/data.json:
--------------------------------------------------------------------------------
1 | [
2 | {
3 | "label": "Word embeddings are a type of word representation that allows words to be represented as vectors in a continuous vector space. The primary goal is to capture the semantic meaning of words so that words with similar meanings are located close to each other in this space. This is achieved by transforming sparse, high-dimensional word vectors into lower-dimensional spaces while preserving semantic relationships.",
4 | "value": 1
5 | },
6 | {
7 | "label": "Embeddings are used extensively across various NLP tasks. Some common applications include text classification, sentiment analysis, language modeling, and machine translation. They are also integral to more complex tasks like question-answering systems, chatbots, and content recommendation systems. Beyond NLP, embeddings find applications in image and video analysis, where they help in tasks like image classification and facial recognition.",
8 | "value": 2
9 | },
10 | {
11 | "label": "Embeddings are used because they provide a dense and efficient representation of words, capturing complex patterns in language that are not apparent at the surface level. Unlike one-hot encoding, which treats words as isolated units without any notion of similarity, embeddings map words into a vector space based on their usage and context. This allows models to understand synonyms, analogies, and the overall semantics of text, leading to more nuanced and intelligent processing.",
12 | "value": 3
13 | },
14 | {
15 | "label": "Embeddings are typically created using models like Word2Vec, GloVe, or FastText, which learn representations by analyzing word co-occurrences and relationships in large corpora of text. These models apply algorithms to adjust the position of each word in the vector space, such that the distance between vectors captures semantic relationships between words. For example, similar words are placed closer together, whereas unrelated words are positioned farther apart.",
16 | "value": 4
17 | },
18 | {
19 | "label": "While embeddings are powerful, they also present challenges. One major concern is bias, as embeddings can perpetuate and amplify biases present in the training data. This requires careful consideration and mitigation strategies during model development and deployment. Additionally, creating and fine-tuning embeddings for specific domains or languages with limited resources can be challenging, necessitating innovative approaches to leverage embeddings effectively across diverse contexts.",
20 | "value": 5
21 | },
22 | {
23 | "label": "Traditional word embeddings, like Word2Vec and GloVe, generate a single representation for each word, regardless of its context. This means that words with multiple meanings are represented by the same vector across different uses. Contextual embeddings, introduced by models such as BERT and ELMo, represent words as vectors that vary depending on the word's context within a sentence. This allows these models to capture the nuances of language more effectively, distinguishing between different meanings of a word based on its usage.",
24 | "value": 6
25 | },
26 | {
27 | "label": "While primarily designed to capture semantic relationships between words, embeddings can also encode aspects of syntax and grammar to a certain extent. For example, embeddings can reflect syntactic categories like part of speech, and models trained on sentence-level tasks can learn representations that implicitly encode grammatical structures. However, explicit modeling of syntax and grammar often requires architectures designed specifically for these aspects, such as syntactic parsing models.",
28 | "value": 7
29 | },
30 | {
31 | "label": "Embeddings are a cornerstone of transfer learning in NLP. Pre-trained embeddings, generated from large-scale language models on extensive corpora, can be used as the starting point for training on specific tasks. This approach allows models to leverage general linguistic knowledge learned from the broader language use, significantly improving performance on tasks with limited training data. Transfer learning with embeddings accelerates model development and enhances capabilities in domain-specific applications.",
32 | "value": 8
33 | },
34 | {
35 | "label": "Evaluating the quality of embeddings involves assessing how well they capture semantic and syntactic relationships. This is often done through intrinsic methods, like analogy solving (e.g., \"king\" is to \"man\" as \"queen\" is to \"woman\") and similarity assessments, or through extrinsic methods, where embeddings are evaluated based on their performance in downstream tasks like text classification or sentiment analysis. Both approaches provide insights into the effectiveness of embeddings in encoding linguistic properties.",
36 | "value": 9
37 | },
38 | {
39 | "label": "Significant efforts are underway to develop and refine embeddings for a wide range of languages beyond English. This includes both multilingual models, which learn embeddings capable of representing multiple languages in a single vector space, and language-specific models that cater to the unique characteristics of individual languages. Challenges in this area include dealing with low-resource languages and adapting models to capture linguistic features unique to each language.",
40 | "value": 10
41 | },
42 | {
43 | "label": "Future developments in embeddings may focus on several areas, including improving the handling of polysemy and context, reducing biases in embeddings, and enhancing the efficiency and scalability of embedding models for large-scale applications. Additionally, there's a growing interest in cross-modal embeddings, which can represent data from different modalities (e.g., text and images) in a unified vector space, opening up new possibilities for multimodal applications and AI systems.",
44 | "value": 11
45 | },
46 | {
47 | "label": "Graph embeddings aim to represent nodes, edges, and possibly whole subgraphs of a graph in a continuous vector space. These embeddings capture the structure of the graph as well as node-level and edge-level properties. Applications of graph embeddings include social network analysis, where they can predict connections or recommend content; knowledge graph completion, where they can infer missing relations; and in bioinformatics, for example, to predict protein interactions.",
48 | "value": 12
49 | },
50 | {
51 | "label": "Embeddings can be adapted for time-series data by creating representations that capture temporal dynamics in addition to the underlying patterns. This involves training embeddings not just on the static features of data points but also on their changes over time, enabling models to understand periodic trends, anomalies, and long-term shifts in data. Applications include financial market analysis, weather forecasting, and predictive maintenance, where understanding the temporal dimension is crucial.",
52 | "value": 13
53 | },
54 | {
55 | "label": "Scaling embedding models presents several challenges, including computational demands, memory requirements, and maintaining the quality of embeddings as the size of the data and the model increases. Solutions to these challenges include more efficient model architectures, quantization techniques to reduce the size of embeddings, and distributed computing strategies. Addressing these issues is key to enabling the application of embeddings to ever-larger datasets and more complex problems.",
56 | "value": 14
57 | }
58 | ]
--------------------------------------------------------------------------------