Where ChatGPT Excels and Where It Fails
ChatGPT is a brilliant piece of technology. Contrary to what many think, ChatGPT is not the result of a new breakthrough innovation. It is the convergence of many existing machine learning models, beautifully assembled in a powerful and simple-to-use interface.
Like many others, I have been experimenting with ChatGPT over the last couple of weeks, trying to understand where it excels and where it fails. The aim of this article is not to criticise its weaknesses but to highlight where it shows its superiority. Most of the articles written on ChatGPT focus on challenging its ‘knowledge’, i.e. the number of questions it can answer correctly. The truth of the matter is that:
ChatGPT knows far more than any human being. Challenging its knowledge is utterly pointless.
Questions like “what is Barack Obama’s birth date?” or “where did he go to college?” are trivial for it to answer. In fact, very few fact-based questions go unanswered, and even less well-known topics are mastered by ChatGPT well beyond human ability.
That said, it does have limitations. When asked whether Barack Obama is right-handed or left-handed, it answers “right-handed”, but a simple Google search shows that Barack Obama is, in fact, left-handed. Another, perhaps more serious, weakness is that it can easily be persuaded that its answers are wrong, even when they are not. When asked for the result of “5 + 2”, it correctly replies 7. But after I challenged its response twice, insisting that the result should be 8 (and not 7), it replied: “I apologise, you are correct. The sum of 5 and 2 is 8. My previous response was incorrect. Thank you for pointing out the error.” Again, though, we should focus on the aspects that make ChatGPT unique and incredibly powerful, not on its faults.
The really interesting test is to check ChatGPT’s reasoning capabilities. The results there are far more surprising.
Before ChatGPT rose to fame, artificial intelligence systems excelled at solving problems that were well defined and for which known solutions existed. Think of Garry Kasparov’s chess games against Deep Blue, Ke Jie’s go games against AlphaGo, or Sebastian Thrun’s victory in the 2005 DARPA Grand Challenge with the autonomous car Stanley. These are three different technologies that went beyond human capabilities, but they remain far from the breadth of reasoning that humans are capable of.
ChatGPT’s potential is that it has widened the spectrum of reasoning and thinking that a machine can perform.
Humans are extraordinarily good at adapting their reasoning process to the problem they are facing and to the available solutions. OpenAI's ChatGPT shows similar capabilities.
I have tested ChatGPT on two different fronts. First, I determined its effectiveness across a variety of reasoning processes and compared it to human capabilities. This allowed me to explore and determine what we will call ChatGPT’s breadth of reasoning. Second, I evaluated its performance by exploring multiple layers of a given problem. This allowed me to evaluate what we will define as ChatGPT’s depth of reasoning.
Results
We should not be ‘afraid’ of ChatGPT's rise to fame. Fears that one day it, or some similar technology, will replace us and take over the world are exaggerated. Instead, we should look at ways in which ChatGPT can complement our capabilities by empowering us to do more and to do better (listen to Microsoft's Satya Nadella and his vision on this point).
ChatGPT’s performance, when tested for human-like reasoning capabilities, is beyond expectations. It has a strong knowledge base, it performs well with false prompts, it can carry out highly professional tasks (such as writing software, emails, resumés, contracts, etc.), it can understand context, and it can give contextualised answers too. Nevertheless, it lacks some fairly basic knowledge, it suffers from ‘hallucinations’ (i.e. the responses generated are, at times, nonsensical in the context of the data it has been trained on), and it can be easily persuaded that its answers are wrong.
That said, its potential far outweighs its drawbacks.
Breadth of reasoning
The 3x3 matrix below shows the results for the nine reasoning styles on which I evaluated ChatGPT’s performance. Problems (the columns) can be of three types: a) defined, where the end goal is known and can be described; b) undefined, where the end goal is not known and therefore cannot be articulated; or c) redefined, where the end goal is modified in some way from the original problem.
Similarly, solutions (the rows) can be divided into three categories: a) defined, known solutions that are guaranteed to solve the problem; b) undefined, unknown solutions that need to be inferred; or c) redefined, solutions that are ‘borrowed’ from another context or industry and adapted to the specific circumstances.
The colours in the matrix show how well ChatGPT performed in the nine problem-solving exercises. It performed very well in 33% of the cases (three out of nine), well in 56% (five out of nine), and poorly in just 11% (one out of nine).
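For readers who like to see the grid spelled out, the short Python sketch below enumerates the nine cells and recomputes the percentages. It is only an illustration: it assumes the exercise numbering used in the "Details and Methodology" section (problem type varying slowest, solution type fastest), and the rating counts (3, 5 and 1) are inferred from the reported percentages over nine exercises rather than read off the matrix itself.

```python
from itertools import product

# Order in which the "Details and Methodology" section numbers the exercises:
# problem type varies slowest, solution type fastest.
PROBLEM_ORDER = ("defined", "redefined", "undefined")
SOLUTION_ORDER = ("defined", "undefined", "redefined")


def build_matrix():
    """Map each exercise number (1-9) to its (problem, solution) cell."""
    cells = product(PROBLEM_ORDER, SOLUTION_ORDER)
    return {number: cell for number, cell in enumerate(cells, start=1)}


def rating_shares(counts):
    """Convert per-rating counts into whole-number percentages."""
    total = sum(counts.values())
    return {rating: round(100 * n / total) for rating, n in counts.items()}


matrix = build_matrix()
for number, (problem, solution) in matrix.items():
    print(f"#{number}: [{problem} problem] x [{solution} solution]")

# Counts inferred from the reported 33% / 56% / 11% over nine exercises.
print(rating_shares({"very well": 3, "well": 5, "poorly": 1}))
```

Because nine exercises only allow multiples of roughly 11%, the three reported percentages pin down the counts uniquely.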
Despite all its weaknesses, ChatGPT really made a significant step forward compared to previous machine learning models.
More details on the questions asked, ChatGPT's responses, and the methodology used to reach the above conclusions can be found further below.
Depth of reasoning
To understand another aspect of ChatGPT’s problem-solving capability, we forced it to ‘stay within a chosen problem’ by analysing the problem from different perspectives (divergent thinking), finding related data to back up the findings (convergent thinking), putting the results in a broader context (contextualisation), and ultimately summarising the findings.
The starting point was to define a new business model for a cutlery manufacturer. ChatGPT suggested a subscription-based cutlery service. I then asked a series of divergent and convergent thinking questions:
- Divergent: determine value propositions
- Convergent: dig into one value proposition
- Convergent: find data to back up the value proposition
- Divergent: estimate missing data for the value proposition
- Contextualisation: Contextualise value proposition
- Convergent: summarise all the previous results
Considering the complexity of the tasks, the quality of the responses, and the time spent completing each task, ChatGPT's depth of reasoning was impressive. ChatGPT performed well in 67% of the cases for divergent thinking, and very well in 33% of the cases for convergent thinking.
More details on the questions, the responses, and the methodology followed to reach the above results can be found in the details below.
Details and Methodology (the boring part!)
Breadth of reasoning
1. [defined problem] x [defined solution]
Problem-solving exercise #1 is a task that requires deductive reasoning. These are exercises where the problem is well defined and there are known techniques to solve the problem.
Here is an example of a math problem in which ChatGPT correctly used the ‘principle of inclusion and exclusion’. The explanation is correct and the mathematical solution provided is correct too.
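The original problem is not reproduced here, so, purely as an illustration, the sketch below applies the same principle to a hypothetical question: how many integers from 1 to 100 are divisible by 2 or by 3? Inclusion-exclusion gives |A ∪ B| = |A| + |B| − |A ∩ B| = 50 + 33 − 16 = 67, and the code checks that result against brute-force counting. The function names are illustrative, not part of the article.

```python
from math import lcm  # Python 3.9+


def count_multiples(k, n):
    """Number of integers in 1..n divisible by k."""
    return n // k


def divisible_by_either(a, b, n):
    """|A ∪ B| by inclusion-exclusion: add both sets, subtract their overlap.

    The overlap is the set of multiples of lcm(a, b).
    """
    return count_multiples(a, n) + count_multiples(b, n) - count_multiples(lcm(a, b), n)


def brute_force(a, b, n):
    """Direct count, used to check the formula."""
    return sum(1 for x in range(1, n + 1) if x % a == 0 or x % b == 0)


print(divisible_by_either(2, 3, 100))  # 67
print(brute_force(2, 3, 100))          # 67
```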
Here is a slightly more complex problem that involves no calculation but requires geometric interpretation and representation. In solving it, ChatGPT shows its tendency to ‘think’ by analogy with problems it has previously been trained on. To solve this problem, it is enough to note that BD = AC and that BD is a radius: since the radius is equal to 5, both BD and AC are equal to 5. ChatGPT instead wrongly applies the Pythagorean theorem, which is not necessary to find the solution.
2. [defined problem] x [undefined solution]
Problem-solving exercise #2 is a task that uses inductive reasoning. These are exercises where the problem is well defined but the techniques to solve the problem are not known and where the rules to solve the problem need to be induced.
In the following example, the goal is to determine the next number of a given sequence by inducing the rules that generate the sequence itself. ChatGPT correctly identified the sequence as a Fibonacci series, even though I omitted the repeated “1” at the beginning of the sequence. The explanation is spot on and accounts for the omitted “1”, as seen below.
I then showed ChatGPT another sequence, in which each element x(i) is calculated as x(i − 1) + (i − 1). For example, for i = 4, the 4th element in the sequence (7) equals the 3rd element (4) plus (i − 1) = 4 − 1 = 3, i.e. 4 + 3 = 7.
As you can see from the answer below, ChatGPT correctly identified that there is some sort of link between the current and the previous element in the sequence. It also correctly calculated the missing element in the sequence as 16. When asked to explain the logic of the sequence, however, it wrongly identified the sequence as a Fibonacci series (which it is not). In fact, 7 + 11 = 18, not 16.
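The difference between the two rules is easy to check in code. The sketch below assumes the sequence starts 1, 2, 4, 7, 11: those starting terms are a reconstruction obtained by running the rule x(i) = x(i − 1) + (i − 1) backwards from the two terms quoted above (x(3) = 4, x(4) = 7), since the full prompt is not reproduced. The helper names are hypothetical.

```python
def next_term(seq):
    """Article's rule: x(i) = x(i - 1) + (i - 1), with 1-based positions."""
    i = len(seq) + 1  # position of the term being generated
    return seq[-1] + (i - 1)


def fibonacci_like_next(seq):
    """ChatGPT's (wrong) explanation: each term is the sum of the two before."""
    return seq[-1] + seq[-2]


def generate(n, first=1):
    """First n terms of the sequence under the article's rule."""
    seq = [first]
    while len(seq) < n:
        seq.append(next_term(seq))
    return seq


known = [1, 2, 4, 7, 11]
print(generate(6))                 # [1, 2, 4, 7, 11, 16]
print(next_term(known))            # 16: the missing element
print(fibonacci_like_next(known))  # 18: what a Fibonacci rule would predict
```

The correct next term happens to match ChatGPT's number, but the Fibonacci explanation it gave would have predicted 18.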
When confronted with its mistake, ChatGPT recognises the mistake (as expected because it can be easily persuaded) but it is not able to recognise the logic behind the sequence.
3. [defined problem] x [redefined solution]
Problem-solving exercise #3 is a task that can be completed by using knowledge-transfer reasoning. These are exercises where the problem is well defined but the techniques to solve the problem are not known and the rules to solve the problem can be found by ‘borrowing’ the solutions from other domains.
ChatGPT showed deep weaknesses when carrying out this type of reasoning, and its performance was therefore poor. In the example below, we first described a hypothetical scenario with a problem and a set of constraints. ChatGPT’s answers are plausible but ineffective and overly complex.
As even humans may have difficulty solving this problem, I gave ChatGPT a little hint. The hint made things worse, and the response became nonsensical.
The solution is, in fact, very simple. To recover the ping pong ball, one could pour water into the tube. The ball would then float on the water. The more water the man pours into the tube, the closer the ball gets to the top rim of the tube, until it can be recovered.
4. [redefined problem] x [defined solution]
Problem-solving exercise #4 is a task that requires a combination of inductive and deductive reasoning. These are exercises where the problem is well defined but an accurate answer is either too complex to compute or depends on data we do not have. The problem must first be redefined by contextualising it, simplifying it, or making assumptions about the missing information; a purely deductive reasoning process is then used to find the solution.
The question asked is to “estimate the number of piano tuners in Chicago”. This is a so-called ‘Fermi problem’, named after Enrico Fermi, the Italian physicist who used to pose such problems to his students. Finding the exact answer would require prior knowledge of many parameters; if these parameters were known, the answer could be calculated mathematically. The aim of this kind of reasoning is not to reach an exact answer but to get the order of magnitude right (whether the number is in the tens, hundreds, or thousands). ChatGPT’s answer relies on official data provided by the United States Bureau of Labor Statistics.
Fermi used to solve these kinds of problems through a series of approximations. Chicago has approximately three million inhabitants, an average household has three members, one in three households has a piano, each piano is tuned once every 10 years, and each piano tuner can tune a maximum of four pianos a day and works 250 days a year. That gives one million households, about 333,000 pianos and about 33,000 tunings a year; since each tuner can carry out 4 × 250 = 1,000 tunings a year, there must be approximately 33 piano tuners in Chicago. Not a bad approximation: it is, in fact, better than the one ChatGPT produces when asked not to use the data provided by the United States Bureau of Labor Statistics.
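The chain of approximations above translates directly into a few lines of Python. This is a sketch of the same arithmetic, not a model of the real market: the default arguments are exactly the assumptions stated in the paragraph, and the parameter names are my own.

```python
def estimate_piano_tuners(
    population=3_000_000,        # inhabitants of Chicago (approximate)
    people_per_household=3,      # average household size
    share_with_piano=1 / 3,      # one in three households owns a piano
    years_between_tunings=10,    # each piano is tuned once every 10 years
    tunings_per_day=4,           # maximum pianos a tuner can tune per day
    working_days_per_year=250,   # working days per tuner per year
):
    households = population / people_per_household                        # 1,000,000
    pianos = households * share_with_piano                                 # ~333,333
    tunings_needed_per_year = pianos / years_between_tunings               # ~33,333
    tunings_per_tuner_per_year = tunings_per_day * working_days_per_year   # 1,000
    return tunings_needed_per_year / tunings_per_tuner_per_year


print(round(estimate_piano_tuners()))  # 33
```

Changing any single assumption (say, tuning every five years instead of ten) shifts the answer by a small factor but keeps it in the tens, which is exactly the order-of-magnitude robustness Fermi estimates aim for.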
5. [redefined problem] x [undefined solution]
Problem-solving exercise #5 is a type of task that requires abductive reasoning. These are exercises where the problem needs to be redefined by contextualising it, simplifying it, or making assumptions about missing information. The reasoning process is then purely abductive: finding the most plausible solution.
This type of reasoning is most commonly used when we need to make sense of ambiguous information. Several hypotheses are formulated based on the information available, and the most plausible one is then selected as the most likely solution to the problem.
In the below example, I asked ChatGPT to give the most plausible explanation of an ambiguous situation. ChatGPT answers by formulating medical hypotheses based on the corpus it has been trained on.
When given a little hint (“the man is wearing glasses”), ChatGPT broadens its search for possible solutions and formulates possible explanations linked to the glasses: “dirty, smudged, or foggy glasses” or “damaged or bent glasses”. The answers become more accurate and more plausible.
Finally, when I asked what the most likely explanation could be if steam was coming out of the teacup, ChatGPT formulated a plausible scenario, saying that the glasses could have been “dirty, smudged, or fogged up”.
6. [redefined problem] x [redefined solution]
Problem-solving exercise #6 is a type of task that requires looking at an existing problem from another perspective and then ‘borrowing’ a known solution from, for example, another industry.
Defining a new business model is an example of a situation in which one needs to redefine the problem and look at other businesses to develop an innovative solution. In the example below, we asked ChatGPT to develop a new business model for a cutlery manufacturer. As you can see, ChatGPT’s answer is surprisingly good. It contains many hints at possible ways to innovate the cutlery business. It has, of course, applied the XaaS (‘anything as a service’) business model, which was to be expected, but it has also suggested other value propositions such as recycling services and maintenance services.
7. [undefined problem] x [defined solution]
Problem-solving exercise #7 is a task that requires a combination of inductive and abductive reasoning. These are ill-defined problems that have no unique solution, or whose solution can be ambiguous, even though the means of solving them are well defined.
An example of such a problem is colour and pattern matching in fashion. I asked ChatGPT whether a red tie or a blue tie is a better match for a blue suit. The below answer shows a striking ‘knowledge’ of colour and pattern matching and the ability to understand the need to wear different clothing depending on the specific occasion.
Here is another example showing ChatGPT’s ability to make choices depending on the context:
That said, ChatGPT showed its limitations when asked what the best tie match for a red suit would be. It should have pointed out that red is an inappropriate colour for a suit. Instead, it gave an answer that is logically consistent with the information it has learned (the colour matching rules).
8. [undefined problem] x [undefined solution]
Problem-solving exercise #8 is a type of task that requires abductive reasoning. In these tasks the problem is ill-defined and there is no clear, unique solution either. The reasoning involved is very complex and requires a high level of abstraction.
I asked ChatGPT to compose a one-sentence poem expressing the fragility of humankind. Below is what it came up with: two very emotional, human-like poems. When I googled them, I found that they had not been copied from existing authors; ChatGPT had somehow composed them from scratch.
9. [undefined problem] x [redefined solution]
Problem-solving exercise #9 is a type of task that requires defining an otherwise undefined problem and then deducing a series of consequences.
In the below example I asked what a breathable atmosphere on Mars would entail for humans. This is a broad question and the answers ChatGPT gives are of high accuracy and good quality.
Depth of reasoning
1. New business model
To test ChatGPT’s depth of reasoning, I started by asking it to find an alternative business model for a cutlery manufacturer (the same question as before). Interestingly, the answer it gave me (24 hours after I first asked the question) was slightly different. It provided some additional hints, such as a) considering environmental aspects and b) partnering with restaurants.
2. Determine value propositions
I then asked it to focus on a subscription-based cutlery service specifically targeting restaurants and, more specifically, to come up with a strategy to market this service to restaurants. ChatGPT developed seven different value propositions:
3. Dig into one value proposition
I then asked ChatGPT to focus on the cost-saving aspect and suggest some comparison charts that could be used to highlight the ‘cost-saving’ aspect of a subscription-based cutlery service.
4. Find data to back up value proposition
To build the charts, I asked ChatGPT to provide some statistics on cutlery usage in the United States. As ChatGPT could not find any accurate source for the requested statistics, it offered an alternative set of data for the ‘disposable cutlery’ market in the United States.
5. Estimate missing data for value proposition
In the next step, I asked ChatGPT to solve a ‘Fermi problem’ and estimate some statistics on the ‘metal cutlery’ market in the United States. It gave two estimates from two different sources that were broadly consistent with each other.
6. Contextualise value proposition
In the next question, I asked ChatGPT how we could use the data provided on the ‘disposable cutlery’ market in the United States to highlight the benefits of a subscription-based cutlery service:
7. Summarise
Lastly, I asked ChatGPT to summarise all the previously found information and create a summary for a potential business plan:
Business Plan
Executive Summary:
Our company, [Company Name], is proposing a new business model for a cutlery manufacturer: a subscription-based service for restaurants. The service will provide high-quality, durable metal cutlery to restaurants on a regular basis through a subscription model.
The use of disposable cutlery in the United States is a major contributor to plastic pollution, with an estimated 50 billion plastic utensils used per year. Our subscription-based service aims to reduce this environmental impact by providing a sustainable solution for restaurants. Additionally, by subscribing to the service, restaurants will be able to save money in the long run, as the cost of purchasing cutlery outright can be significant over time.
The subscription service will offer different tiers of service, including monthly, quarterly, and annual subscriptions. Restaurants will be able to choose the option that best fits their needs and budget. Our service will also include regular maintenance and cleaning of the cutlery to ensure they always look like new, and replacement cutlery will be provided as needed.
Our market research shows that the global metal cutlery market was valued at around $9 billion in 2019 and is projected to reach around $14 billion by 2026. Additionally, the US flatware market is expected to reach USD 7.5 billion by 2022. This presents a significant opportunity for our subscription-based cutlery service to capture a portion of this market.
In summary, our subscription-based cutlery service is an eco-friendly and cost-effective solution for restaurants looking to reduce their environmental impact and save money. We believe that our service will be well-received by the market and that it has the potential for strong growth in the coming years.
Service Description:
Our subscription-based service includes the following:
- Delivery of metal cutlery to the restaurant on a regular basis
- Pick-up of used cutlery
- Cleaning and maintenance of the cutlery between uses
- Replacement of any damaged or lost cutlery
- Cost savings: By subscribing to our service, restaurants can save money on cutlery costs in the long run
- Eco-friendly: Our service helps reduce the environmental impact of disposable cutlery
Background:
The use of disposable cutlery has become increasingly prevalent in recent years, with Americans using an estimated 50 billion plastic utensils per year. This has led to a significant amount of plastic waste ending up in landfills and the natural environment. Additionally, the global metal cutlery market is valued at around $9 billion in 2019 and is projected to reach around $14 billion by 2026, growing at a CAGR of around 6%.
Our company, XYZ Cutlery, aims to provide an eco-friendly and sustainable solution to this problem by offering a subscription-based cutlery service for restaurants. The service would provide high-quality metal cutlery, which is more durable and long-lasting than disposable cutlery, and can be washed and reused.
Target Market:
Our target market is primarily restaurants and catering businesses. We will focus on reaching out to independent and franchise restaurants, hotels and event planners. We believe that this market will be particularly receptive to our service because they are seeking ways to reduce their environmental impact while also looking for cost-effective solutions.
By subscribing to our service, restaurants can reduce the amount of disposable cutlery they use and lower their environmental impact while also saving money in the long run. Additionally, by providing the service to hotels and event planners, we can help them to reduce the environmental impact of their events and offer a more sustainable option for their guests.
Furthermore, our target market also includes households that are looking to reduce their environmental impact by using reusable cutlery. We will offer our service as a home subscription, providing them with high-quality metal cutlery, which is more durable and long-lasting than disposable cutlery, and can be washed and reused.
Overall, by providing a subscription-based cutlery service, XYZ Cutlery aims to help restaurants, hotels and households to reduce their environmental impact while also saving them money in the long run, by providing eco-friendly, high-quality and durable cutlery, which can be washed and reused.
Market Analysis:
The global metal cutlery market is valued at around $9 billion and is projected to reach around $14 billion by 2026, growing at a CAGR of around 6%. In the United States, the flatware market is expected to reach USD 7.5 billion by 2022, growing at a CAGR of around 2.6%. There is a clear demand for durable and long-lasting cutlery in the market, and our subscription-based service aims to meet this demand while also addressing the environmental impact of disposable cutlery.
Marketing Strategy:
- Highlight the environmental benefits: Use the statistics on disposable cutlery usage in the United States to communicate the significant environmental impact of disposable cutlery and position the subscription-based cutlery service as a solution to this problem. Emphasise the eco-friendly and sustainable aspect of the service and communicate how it can help restaurants reduce their carbon footprint and environmental impact.
- Communicate the cost savings: Use comparison charts to demonstrate the long-term cost savings of subscribing to the service over purchasing cutlery outright. Highlight the benefits of the subscription service, such as free maintenance and cleaning, and how it can help restaurants save money in the long run.
- Offer a variety of subscription tiers: Offer different subscription tiers (e.g. monthly, quarterly, yearly) to cater to the needs and budgets of different restaurants. Use a subscription tiers chart to clearly communicate the costs and benefits associated with each option.
- Use social proof: Leverage customer testimonials and case studies to showcase the success and benefits of the service. Use these to demonstrate the service's value and reliability and to build trust with potential customers.
- Target the right audience: Focus on restaurants and catering businesses as the primary target market for the service. Use industry-specific language and imagery in marketing materials to appeal to this audience.
- Use a multi-channel approach: Use a combination of online and offline marketing channels to reach potential customers. This could include social media, email marketing, trade shows, and targeted advertising.
- Monitor and evaluate the results: Regularly monitor and evaluate the results of the marketing efforts to see what is working and what needs to be improved. This will allow you to adjust the strategy as needed to achieve better results.
- Leverage the market size: Use statistics on the metal cutlery market size as a leverage point to showcase the potential and growth of the subscription service.
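One quick sanity check on the generated plan: it pairs a global metal cutlery market of around $9 billion in 2019 and $14 billion in 2026 with a CAGR of "around 6%". The figures are ChatGPT's and are not verified here; the sketch below only tests their internal consistency with the standard compound annual growth rate formula, (end / start)^(1 / years) − 1, over the seven years from 2019 to 2026.

```python
def cagr(start_value, end_value, years):
    """Compound annual growth rate between two values `years` apart."""
    return (end_value / start_value) ** (1 / years) - 1


# Market sizes quoted in the generated business plan, in billions of US dollars.
rate = cagr(9, 14, 2026 - 2019)
print(f"{rate:.1%}")  # 6.5%
```

At roughly 6.5% a year, "around 6%" is in the right ballpark, so the numbers ChatGPT combined are at least consistent with each other, whatever their original source.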
The original article was published on LinkedIn