
Use Generative AI Effectively: How does AI work?

How does AI work?

When using generative AI (GenAI), you will use it more effectively if you have a basic understanding of what it does, what it does not do, and how it works. Watch this short video for a simplified, non-computer-science explanation of how generative AI works and why it sometimes makes mistakes:

More details...

If you'd like a thorough but still non-computer-science explanation of how AI works, we highly recommend Large language models, explained with a minimum of math and jargon by Timothy Lee.


What does generative AI do?

Basically, generative AI produces text, sounds, images, etc. by identifying patterns in the existing data it was trained on. What do we mean by "patterns"? A very simple example: if you were asked to complete the sentence, "I would like a cup of _______," you might complete it in a number of different ways depending on the context. In a restaurant, you might say "coffee" or "tea." In a recipe book, you might expect "flour" or "sugar." The ability to predict which word comes next in a given context is the sort of pattern AI learns.
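If you're curious what that looks like in practice, here is a toy sketch in Python (the word counts are made up for illustration; real models learn from billions of examples):

    # A toy next-word predictor with made-up counts of which words have
    # been seen after each context. A real model learns billions of such
    # statistics instead of using a hand-written table.
    counts = {
        "restaurant": {"coffee": 7, "tea": 5},
        "recipe": {"flour": 6, "sugar": 4},
    }

    def complete(context):
        # Pick the word most often observed in this context.
        options = counts[context]
        return max(options, key=options.get)

    print(complete("restaurant"))  # -> coffee
    print(complete("recipe"))      # -> flour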

Let's break that down further.

How is generative AI "trained"?

The creators of a generative AI model have to "train" the model to generate output. They do this by providing it with as many samples of input as possible; input might consist of any sort of data, such as text, images, audio, or video. The model then uses its training algorithm to identify relationships and patterns within that data, a process often referred to as learning. Generative AI models may be trained with both unsupervised and supervised learning.

Unsupervised learning: Involves training the model on unlabeled data, allowing it to discover patterns and structures within the data without explicit guidance. Generative AI often starts with unsupervised learning to understand the underlying structure of the data.

Supervised learning: Involves providing the AI with labeled data, where the output is known, allowing the model to learn to map inputs to specific outputs.
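In code terms, the difference between the two is mainly the shape of the training data. Here is a minimal sketch (the sentences and labels are invented for illustration):

    # Unsupervised: raw examples only; the model must find structure itself.
    unlabeled_data = [
        "Cows graze on the ranch.",
        "Wheat grows on the farm.",
    ]

    # Supervised: each input is paired with a known, human-provided label.
    labeled_data = [
        ("Cows graze on the ranch.", "ranching"),
        ("Wheat grows on the farm.", "farming"),
    ]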

Where does the training data come from?

Companies creating generative AI models may train them on company-specific data, but most commonly available generative AI models, like ChatGPT, are trained on "public data," or, in other words, data freely available on the Internet. To gather it, they use web crawlers to find sources and web scrapers to download those sources for use in training. Web crawlers and scrapers access data from any site that is not behind a login page, including blogs, social media pages that aren't set to private, personal web pages, and government and company sites. This includes, for example, anything on Flickr, Instagram, X, Facebook, Reddit, or Wikipedia; research repositories (like arXiv); public scholarly sources (like PubMed Central); news outlet webpages; voter-registration and property databases on government sites; and academic institution websites. There are numerous ethical considerations related to the data generative AI is trained on; these are discussed on the AI ethics tab.

What does the generative AI model do with its training input?

To "learn" patterns and relationships, AI breaks up the input it receives into tokens, which it then converts into vectors and embeddings. For text input, tokens might be:

  • words (ex. farm)
  • parts of words (ex. farm and -er for farmer)
  • characters (ex. f, a, !, ?)
  • special computer notations (ex. <|endoftext|>)
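If you'd like to see tokenization in action, here is a small Python sketch using OpenAI's open-source tiktoken tokenizer (one of many tokenizers; the exact token IDs you get depend on the vocabulary used):

    # Tokenizing text with OpenAI's open-source tiktoken library
    # (pip install tiktoken). Exact token IDs depend on the vocabulary;
    # "cl100k_base" is the vocabulary used by several OpenAI models.
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")

    for text in ["farm", "farmer", "I would like a cup of"]:
        token_ids = enc.encode(text)
        pieces = [enc.decode([t]) for t in token_ids]
        print(f"{text!r} -> {token_ids} -> {pieces}")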

Once the input is tokenized, each token is converted to a numerical representation, or vector, that the computer can understand (since computers natively work with numbers, not words). For example, the word farm might be converted to the vector 101.

Embeddings then capture the relationships between vectors. For example, the words farm and ranch are related: farms are for crops and ranches are for animals, but both terms are related to the production of food. Problematically for English speakers, ranch may also refer to a salad dressing. That is not a problem for computers: they simply give ranch, the place to raise food animals, and ranch, the salad dressing, very different numbers, so the two meanings are not confused.
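Here is a toy Python sketch of that idea. The three-number "embeddings" below are invented for illustration (real embeddings have hundreds or thousands of dimensions), but cosine similarity is a standard way to measure how close two vectors are:

    import math

    # Hand-made toy embeddings (invented numbers, not from a real model).
    # Related meanings get nearby vectors; unrelated meanings do not.
    embeddings = {
        "farm": [1.0, 0.9, 0.1],
        "ranch (place)": [1.0, 1.0, 0.0],
        "ranch (dressing)": [0.1, 0.0, 1.0],
    }

    def cosine_similarity(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(y * y for y in b))
        return dot / (norm_a * norm_b)

    for other in ["ranch (place)", "ranch (dressing)"]:
        score = cosine_similarity(embeddings["farm"], embeddings[other])
        print(f"farm vs {other}: {score:.2f}")
    # farm vs ranch (place): ~0.99 (close); farm vs ranch (dressing): ~0.15 (far)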

Generative AI may create an embedding that looks like this (as represented for human viewers):

(Note: Real vectors are much longer and more complicated. These are an extreme oversimplification to make human understanding easier.)

    ranch (place to raise animals): 100
        cows: 100.1 | pigs: 100.2 | chickens: 100.3
    farm (place to grow crops): 101
        crops: 101.1, 101.2, ...
    ranch (salad dressing): 1000
        other salad dressings: 1000.1, 1000.2, ...

In the representation above, the ranch (100) and its associated animals, cows (100.1), pigs (100.2), and chickens (100.3), all have very similar numerical (or vector) representations. A little less closely related, but still very similar numerically, are the farm (101) and its crops. But the different types of salad dressings, including ranch dressing (1000), have very different numerical representations (the numbers 100 and 1000 are much further apart than 100 and 101). Using the similarities and differences between these assigned numbers, a generative AI model using contextual embeddings knows both that ranches are closely related to animal production and that ranch is a type of salad dressing.

How does generative AI use vectors and embeddings to respond to prompts?

Embeddings are stored in complex tables within a generative AI model. As the model receives more and more training input, these tables are adjusted to reflect the patterns and relationships the model has "learned" between words (represented by vectors/numbers). Using these tables, the generative AI model can do many things, including:

Generate text by predicting sentence patterns

ex. Generative AI is basically a super-powered autocomplete machine. Look at this sentence: Cows are raised on _______. AI will complete it by predicting "ranches," due to the similarity between cows (100.1) and ranch (100) and the patterns it learned during its training.
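Using the over-simplified single-number vectors from the example above, that prediction might look something like this sketch (not how real models actually compute, but the same idea):

    # Toy autocomplete using the over-simplified single-number "vectors"
    # from the example above. Real models use long vectors and learned
    # probabilities, not distance on a single number.
    places = {"ranch": 100.0, "farm": 101.0, "ranch dressing": 1000.0}
    cows = 100.1

    # Complete "Cows are raised on ___" with the numerically closest place.
    best = min(places, key=lambda word: abs(places[word] - cows))
    print(f"Cows are raised on {best}es.")  # -> Cows are raised on ranches.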

Conduct semantic searches instead of index searches

ex. With index searches, articles are assigned keywords. An article about the largest cattle ranch might be assigned the keywords "cattle" and "ranch." To find that article using an index search, you must use exactly those keywords. If you instead search with the question "What is the largest cow farm?", your search will fail if the index has no connection between cattle/ranch and cow/farm. A semantic search uses vectors to understand that cow/cattle are related and farm/ranch are related, so it makes the connection and returns the result you need.
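Here is a toy sketch contrasting the two kinds of search (the vectors are invented; a real system would get them from an embedding model):

    import math

    # Toy contrast between an index (keyword) search and a semantic search.
    article = {"keywords": {"cattle", "ranch"}, "vector": [0.9, 0.8]}

    query_words = {"cow", "farm"}
    query_vector = [0.85, 0.75]  # near the article's vector

    # Index search: succeeds only on exact keyword overlap.
    index_hit = bool(article["keywords"] & query_words)

    # Semantic search: nearby vectors count as a match.
    def cosine_similarity(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

    semantic_hit = cosine_similarity(article["vector"], query_vector) > 0.95

    print(index_hit)     # False: "cow"/"farm" never match "cattle"/"ranch"
    print(semantic_hit)  # True: the meanings are close in vector space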

Why does generative AI hallucinate?

When generative AI confidently produces false or fabricated information, that is called a hallucination. AI may hallucinate for many reasons, including:

  • incomplete training data
  • biased training data
  • inadequate model complexity (i.e. the model is not sufficiently sophisticated)
  • flawed embeddings (i.e. lack of contextual understanding of a term or incorrect assumptions based on patterns learned)
  • the fact that AI is not designed to search for facts; it is designed to produce a response based on the connections between words it has learned, and it will do so whether or not that response is accurate

For example, imagine the representation above, related to ranches and farms, represents all the training data a model received. If it were asked, "Are ducks raised on farms or ranches?" it might respond, "Ducks are not raised on farms or ranches," because it never learned about ducks. That would be an example of incomplete training data.
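In code terms, that gap might look like this toy lookup (reusing the made-up numbers from above):

    # Toy lookup showing incomplete training data: "duck" never appeared
    # in training, so the model has no vector for it at all.
    vectors = {"ranch": 100.0, "cows": 100.1, "farm": 101.0}
    print(vectors.get("duck", "no data for 'duck'; the model never saw one"))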

In another example, if an employer used an AI-based recruiting tool trained on historical employee data from a predominantly male industry, the tool would likely penalize resumes containing female-associated keywords or language. (This actually happened at Amazon.)

In a final example, if an attorney asks generative AI to produce a brief citing case law on a particular topic, the AI will write the requested content, potentially including hallucinated sources and quotations. (This actually happened in Mata v. Avianca.)

How do designers of generative AI prevent hallucinations?

Some chatbots hallucinate less because they employ techniques like Retrieval Augmented Generation (RAG), which allows them to search beyond their training data and pull information from defined, verified sources to answer user queries. This enhances their ability to identify gaps in knowledge and admit uncertainty when they don't have data (rather than confidently making up an answer).
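In outline, RAG might look something like this sketch (the helper names are hypothetical, not any specific product's implementation):

    # A minimal RAG sketch. `generate` stands in for whatever chat model
    # you call; retrieval here uses naive word overlap, where a real
    # system would compare embeddings against a verified source collection.

    def retrieve(query, knowledge_base, top_k=3):
        """Return up to top_k passages that best overlap with the query."""
        query_words = set(query.lower().split())
        scored = [(len(query_words & set(p.lower().split())), p)
                  for p in knowledge_base]
        scored = [pair for pair in scored if pair[0] > 0]
        scored.sort(reverse=True)
        return [p for _, p in scored[:top_k]]

    def answer_with_rag(query, knowledge_base, generate):
        passages = retrieve(query, knowledge_base)
        if not passages:
            # Admit uncertainty instead of letting the model guess.
            return "I couldn't find verified sources for that question."
        prompt = ("Answer using ONLY the sources below. If they don't cover "
                  "the question, say you don't know.\n\nSources:\n"
                  + "\n".join(f"- {p}" for p in passages)
                  + f"\n\nQuestion: {query}")
        return generate(prompt)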

So what does all this mean for us?

Remember the following when using AI:

  • Even when it is using RAG, it is designed to generate a response, not to find information or sources
  • Even when it is using RAG, the response is greatly influenced by its training data, which may be incomplete or biased

So when using AI always:

  • Try a variety of chatbots to identify one that is best for your needs.
  • Follow any links it provides to verify they are real and that AI hasn't misinterpreted the source.
  • Verify any statements AI provides by finding additional, reliable sources on the same topic using traditional search methods.

Do be aware that when AI does search for information (such as when using a semantic search tool):

  • If the AI is a public tool (like Perplexity), it can only access public data, so it can't retrieve articles behind a paywall or login.
  • If its training model is not adequately sophisticated, your results may be poor.
  • For both these reasons, you still must search in traditional library databases to find most peer-reviewed articles.

Sources:

Microsoft. (n.d.). How does generative AI work? Microsoft AI. https://www.microsoft.com/en-us/ai/ai-101/how-does-generative-ai-work

Leffer, L. (2025, February 19). Your personal information is probably being used to train generative AI models. Scientific American. https://www.scientificamerican.com/article/your-personal-information-is-probably-being-used-to-train-generative-ai-models/

Kashyap, P. (2025, January 18). Tokens and Embeddings. Medium. https://medium.com/@piyushkashyap045/tokens-and-embeddings-5d65c7543dea

Bergmann, D., & Stryker, C. (2025, April 17). Vector embedding. IBM. https://www.ibm.com/think/topics/vector-embedding

DiFabrizio, E. (2024, July 3). Demystifying Vectors and Embeddings in AI: A Beginner’s Guide. Sidecar. https://sidecar.ai/blog/demystifying-vectors-and-embeddings-in-ai-a-beginners-guide

Alkhaldi, N. (2024, August 27). What is AI bias really, and how can you combat it? Itrex. https://www.telusdigital.com/insights/ai-data/article/generative-ai-hallucinations

Jonker, A., & Rogers, J. (2025, April 17). Algorithmic bias. IBM. https://www.ibm.com/think/topics/algorithmic-bias

 

Try & Reflect

As you develop your knowledge of how AI works, think critically! Here is a worksheet with some prompts and reflection questions you might use as you begin to explore how generative AI works: