As AI continues to evolve, it is crucial to evaluate and compare the performance of different models across various tasks to ensure we deliver the best solutions to our clients. This report is the first in a three-part benchmarking series in which we conduct an in-depth, apples-to-apples comparison between GPT and Anthropic models across a range of tasks.
We collected outputs (using the original OpenAI prompts, as well as prompts converted for the Anthropic models) for the following tasks:

- Text-to-SQL
- Code generation
- Function from specification

In this report, we analyze metrics including BLEU score for output quality, latency for performance, and cost, to provide a brief overview of each model's capabilities.
| Benchmark | GPT-3.5 BLEU score | Haiku 3.0 BLEU score | Haiku 3.0 + conversion tool BLEU score |
| --- | --- | --- | --- |
| Text-to-SQL | 0.8377 | 0.8749 | 0.8845 |
| Code generation | 0.8502 | 0.8502 | 0.8538 |
| Function from specification | 0 | 0.3072 | 0.3163 |
| Benchmark | GPT-4o BLEU score | Sonnet 3.5 BLEU score | Sonnet 3.5 + conversion tool BLEU score |
| --- | --- | --- | --- |
| Text-to-SQL | 0.8439 | 0.8536 | 0.8546 |
| Code generation | 0.8502 | 0.8402 | 0.8883 |
| Function from specification | 0.2025 | 0.5367 | 0.5923 |
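BLEU compares a model's output to a reference answer by clipped n-gram overlap. As a rough illustration of the core quantity behind the scores above (not the exact scorer used for these numbers, which also combines several n-gram orders and a brevity penalty), a modified n-gram precision can be sketched in Python:

```python
from collections import Counter

def ngram_precision(candidate, reference, n):
    """Clipped (modified) n-gram precision of candidate tokens vs. one reference."""
    cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    # Each candidate n-gram only counts up to the number of times it
    # appears in the reference ("clipping"), so repetition is not rewarded.
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    total = sum(cand.values())
    return matched / total if total else 0.0

print(ngram_precision("select distinct id from t".split(),
                      "select distinct id from t".split(), 2))  # 1.0
```

A score of 1.0 means every n-gram in the candidate also appears in the reference; the benchmark scores below are in this same 0-to-1 range.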
Performance
Text-to-SQL
Q: Find the customers who made repeated payments of the same amount within 10 minutes of each other.
GPT-3.5:

To find the customers who made repeated …

Haiku 3.0 and Haiku 3.0 + conversion tool (both produced the same query):

```sql
SELECT DISTINCT t1.credit_card_id
FROM transactions t1
JOIN transactions t2
  ON t1.credit_card_id = t2.credit_card_id
WHERE t1.transaction_id <> t2.transaction_id
  AND t1.amount = t2.amount
  AND ABS(EXTRACT(EPOCH FROM (t1.transaction_timestamp - t2.transaction_timestamp))) <= 600
ORDER BY t1.credit_card_id;
```

Alongside its (truncated) answer, the GPT-3.5 output described a CTE-based approach:

- **CTE (Common Table Expression) RankedTransactions:** This part uses the window function `LEAD()` to look at the next transaction timestamp within the same credit card ID and amount, ordered by the transaction timestamp.
- **Main Query:** The main query filters transactions to ensure that the next transaction within the same credit card ID and amount happens within 10 minutes of the initial transaction.

| | GPT-3.5 | Haiku 3.0 | Haiku 3.0 + conversion tool |
| --- | --- | --- | --- |
| Cost | 0.0044 | 0.0021 | 0.0021 |
| Latency | 3.650 | 2.780 | 0.2760 |
Key Takeaways

- Simplicity and Efficiency: The Claude models use a straightforward self-join to identify repeated transactions, making the query easy to understand and efficient to execute.
- Accurate Time Comparison: Employs EXTRACT to calculate and compare transaction timestamps, ensuring precise identification of repeated payments within the specified timeframe.
- Performance-Optimized: Minimizes computational overhead by avoiding complex structures like CTEs and window functions, resulting in faster execution on large datasets.
- Ease of Maintenance: The clear and direct logic is accessible to developers at all levels, making the query easier to adapt and maintain.
Limitations of the GPT model

- Unnecessary Complexity: The GPT model uses CTEs and window functions (`LEAD`) that add complexity without significant benefit for this task.
- Potential Performance Issues: The use of window functions can lead to slower execution times, especially on large datasets.
- Over-Engineering: Introduces more processing steps than necessary for a simple task, potentially increasing resource consumption.
- Less Intuitive: Requires a more advanced understanding of SQL, making it harder to maintain for a task that can be handled with a simpler query.
Overall, the Claude model's approach is the better one here due to its simplicity, efficiency, and accuracy. It effectively identifies repeated transactions with minimal complexity and performance overhead. The GPT model's query, while correct, is over-engineered and less suitable for this straightforward task.
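The self-join pattern favored by the Claude models is easy to check end to end. The following sketch runs the same query shape against a tiny in-memory SQLite table; the schema and sample rows are invented for illustration, and SQLite's `strftime('%s', ...)` stands in for Postgres's `EXTRACT(EPOCH FROM ...)`:

```python
import sqlite3

# Hypothetical schema and data, not the benchmark dataset: card 100 repeats
# the same amount within 10 minutes; card 200 repeats it an hour apart.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE transactions (
    transaction_id INTEGER PRIMARY KEY,
    credit_card_id INTEGER,
    amount REAL,
    transaction_timestamp TEXT
);
INSERT INTO transactions VALUES
    (1, 100, 9.99, '2024-01-01 10:00:00'),
    (2, 100, 9.99, '2024-01-01 10:05:00'),
    (3, 200, 5.00, '2024-01-01 10:00:00'),
    (4, 200, 5.00, '2024-01-01 11:00:00');
""")

# strftime('%s', ...) converts a timestamp to epoch seconds, so the
# difference can be compared against 600 seconds (10 minutes).
rows = conn.execute("""
SELECT DISTINCT t1.credit_card_id
FROM transactions t1
JOIN transactions t2
  ON t1.credit_card_id = t2.credit_card_id
WHERE t1.transaction_id <> t2.transaction_id
  AND t1.amount = t2.amount
  AND ABS(strftime('%s', t1.transaction_timestamp)
        - strftime('%s', t2.transaction_timestamp)) <= 600
ORDER BY t1.credit_card_id;
""").fetchall()
print(rows)  # only card 100 qualifies
```

On this toy data the query returns card 100 and excludes card 200, matching the intended 10-minute semantics.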
GPT-4o:

To find customers who made repeated payments of the same amount within a 10-minute window, you can use a self-join on the `transactions` table. This approach checks for pairs of transactions where the `amount` is the same, and the `transaction_timestamp` of one transaction is between the `transaction_timestamp` of the other transaction minus 10 minutes and plus 10 minutes. Here's the SQL query that does this:

```sql
SELECT DISTINCT t1.credit_card_id,
       t1.amount,
       t1.transaction_timestamp AS first_transaction_time,
       t2.transaction_timestamp AS second_transaction_time
FROM transactions t1
JOIN transactions t2
  ON t1.credit_card_id = t2.credit_card_id
 AND t1.amount = t2.amount
 AND t1.transaction_id <> t2.transaction_id
 AND t2.transaction_timestamp BETWEEN t1.transaction_timestamp - INTERVAL '10 MINUTE'
                                  AND t1.transaction_timestamp + INTERVAL '10 MINUTE'
ORDER BY t1.credit_card_id, t1.transaction_timestamp;
```

Sonnet 3.5 and Sonnet 3.5 + conversion tool:

```sql
SELECT DISTINCT t1.credit_card_id
FROM transactions t1
JOIN transactions t2
  ON t1.credit_card_id = t2.credit_card_id
WHERE t1.transaction_id <> t2.transaction_id
  AND t1.amount = t2.amount
  AND ABS(EXTRACT(EPOCH FROM (t1.transaction_timestamp - t2.transaction_timestamp))) <= 600
ORDER BY t1.credit_card_id;
```

The GPT-4o output went on to explain its query:

- **Table Aliases (t1 and t2)**: The `transactions` table is joined to itself with aliases `t1` and `t2` to compare different rows within the same table.
- **Join Conditions**:
  - `t1.credit_card_id = t2.credit_card_id`: Ensures the transactions are made by the same customer.
  - `t1.amount = t2.amount`: Checks for the same transaction amount.
  - `t1.transaction_id <> t2.transaction_id`: Ensures that they are different transactions.
  - `t2.transaction_timestamp BETWEEN t1.transaction_timestamp - INTERVAL '10 MINUTE' AND t1.transaction_timestamp + INTERVAL '10 MINUTE'`: Limits the timestamp of the second transaction to a 10-minute window around the timestamp of the first transaction.
- **SELECT DISTINCT**: Avoids duplicate rows when multiple transactions meet the criteria.
- **ORDER BY**: Orders the result by `credit_card_id` and `transaction_timestamp` for readability.
Key takeaways:

- Direct and Efficient: The Claude model uses a self-join with precise conditions to find repeated transactions within a 10-minute interval, providing a direct and efficient solution.
- Accurate Time-Based Filtering: Applies ABS(EXTRACT(EPOCH FROM ...)) to accurately calculate the difference between timestamps, ensuring transactions within 600 seconds (10 minutes) are correctly identified.
- Minimal Complexity: Avoids additional constructs like CTEs or complex window functions, making the query simple to understand and faster to execute.
- Distinct Results: Uses DISTINCT to ensure that only unique credit_card_id values are returned, preventing duplicates in the result set.

Limitations of the GPT model:

- Over-Engineered Logic: Uses more complex conditions, including BETWEEN intervals and additional timestamp calculations, which could be simplified.
- Potentially Redundant Checks: The BETWEEN condition may introduce redundant logic when combined with other timestamp comparisons, potentially increasing processing time.
- Increased Complexity Without Added Value: The extra conditions and explanations do not significantly improve accuracy or efficiency over the Claude model's query.
- Harder to Read and Maintain: The extra conditions make the query harder to read and maintain, and potentially confusing for those less experienced with SQL.
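For contrast, the CTE-plus-`LEAD()` formulation that the GPT models favored can be sketched against the same kind of toy data. The schema and rows below are hypothetical, and SQLite 3.25+ is assumed for window-function support:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE transactions (
    transaction_id INTEGER PRIMARY KEY,
    credit_card_id INTEGER,
    amount REAL,
    transaction_timestamp TEXT
);
INSERT INTO transactions VALUES
    (1, 100, 9.99, '2024-01-01 10:00:00'),
    (2, 100, 9.99, '2024-01-01 10:05:00'),
    (3, 200, 5.00, '2024-01-01 10:00:00'),
    (4, 200, 5.00, '2024-01-01 11:00:00');
""")

# The CTE computes, per card and amount, the epoch time of the *next*
# equal-amount transaction; the outer query keeps cards whose gap <= 600 s.
rows = conn.execute("""
WITH ranked AS (
    SELECT credit_card_id,
           strftime('%s', transaction_timestamp) AS ts,
           LEAD(strftime('%s', transaction_timestamp)) OVER (
               PARTITION BY credit_card_id, amount
               ORDER BY transaction_timestamp
           ) AS next_ts
    FROM transactions
)
SELECT DISTINCT credit_card_id
FROM ranked
WHERE next_ts IS NOT NULL AND next_ts - ts <= 600
ORDER BY credit_card_id;
""").fetchall()
print(rows)
```

Both formulations return the same card on this data; the window-function version simply pays for an extra pass and a partition sort, which is the overhead the takeaways above describe.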
Function from Specification
Q: Implement a Python function that, given a context-free grammar in Chomsky Normal Form and a string, determines whether the string can be generated by the grammar using the CYK algorithm.
GPT-3.5:

I'm sorry, I cannot fulfill that request.

Haiku 3.0 and Haiku 3.0 + conversion tool (output truncated):

```python
def cyk(grammar, string):
```

| | GPT-3.5 | Haiku 3.0 | Haiku 3.0 + conversion tool |
| --- | --- | --- | --- |
| Cost | 0.0001 | 0.0003 | 0.0003 |
| Latency | 0.3587 | 3.2306 | 1.9371 |
Key takeaways from the Haiku code

- Proper graph representation using a GraphNode class
- Efficient graph building from an edge list
- Cycle detection using DFS with a recursive stack
- Topological sorting implementation
- Main function to demonstrate graph operations

Limitations of the GPT-3.5 code

- The GPT code is actually identical to the Haiku code. There are no differences or limitations in the GPT-generated code compared to the Haiku code.

However, when comparing both the Haiku and GPT code to the actual correct code, we can identify the following limitations:

- Lack of a separate is_cyclic function: Both the Haiku and GPT code check for cycles in the main function, whereas the correct code has a dedicated is_cyclic function.
- Inefficient cycle detection: The Haiku and GPT code perform cycle detection for each node in the main function, potentially repeating work. The correct code uses a single is_cyclic function to check the entire graph.
- Less modular structure: The correct code separates the cycle-detection logic from the main function, making it more reusable and easier to maintain.
- Error in cycle detection implementation: The Haiku and GPT code stop and return immediately after detecting a cycle, potentially missing cycles in other parts of the graph. The correct code checks the entire graph for cycles.
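For reference, the CYK membership test that this section's question asks for can be sketched as follows. The grammar encoding (a dict mapping each nonterminal to a list of CNF productions) is our own choice for illustration, not the benchmark's reference solution:

```python
def cyk(grammar, s):
    """CYK membership test for a grammar in Chomsky Normal Form.

    grammar maps each nonterminal to a list of productions, where a
    production is either a 1-tuple (terminal,) or a 2-tuple (B, C).
    Returns True iff the start symbol 'S' derives s.
    """
    n = len(s)
    if n == 0:
        return False  # CNF as encoded here cannot derive the empty string
    # table[i][j] holds the nonterminals that derive the substring s[i..j]
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, ch in enumerate(s):                      # base case: single characters
        for lhs, prods in grammar.items():
            if (ch,) in prods:
                table[i][i].add(lhs)
    for length in range(2, n + 1):                  # spans of increasing length
        for i in range(n - length + 1):
            j = i + length - 1
            for k in range(i, j):                   # split point
                for lhs, prods in grammar.items():
                    for prod in prods:
                        if (len(prod) == 2
                                and prod[0] in table[i][k]
                                and prod[1] in table[k + 1][j]):
                            table[i][j].add(lhs)
    return 'S' in table[0][n - 1]

# Toy grammar: S -> A B, A -> 'a', B -> 'b'
g = {'S': [('A', 'B')], 'A': [('a',)], 'B': [('b',)]}
print(cyk(g, 'ab'), cyk(g, 'ba'))  # True False
```

The table is filled bottom-up: singletons first, then every longer span from its possible splits, which is exactly the structure the model outputs in this section are judged against.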
Sonnet 3.5:

```python
def cyk(grammar, input_string):
    n = len(input_string)
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i in range(n):
        for lhs, rhs in grammar.items():
            if input_string[i] in rhs:
                table[i][i].add(lhs)
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length - 1
            for k in range(i, j):
                for lhs, rhs in grammar.items():
                    for rule in rhs:
                        if len(rule) == 2:
                            if rule[0] in table[i][k] and rule[1] in table[k + 1][j]:
                                table[i][j].add(lhs)
    return 'S' in table[0][n - 1]
```

| | GPT-4o | Sonnet 3.5 | Sonnet 3.5 + conversion tool |
| --- | --- | --- | --- |
| Cost | 0.0051 | 0.0040 | 0.0041 |
| Latency | 3.9449 | 5.5568 | 6.8432 |
Key takeaways

- Separate is_cyclic Function: The Claude model's response had a separate is_cyclic function to check for cycles across all nodes, which is more efficient than checking each node individually in the main function.
- Structure of main Function: In the Claude model's response, the main function first checks whether the graph is cyclic using is_cyclic, and only proceeds to topological sort if no cycle is detected. This approach is more structured and efficient.
- Encapsulation: The Claude model's response encapsulated more functionality within functions (like is_cyclic), making the main function cleaner and more readable.

Limitations

- Redundant Cycle Detection: The GPT response checks for cycles separately for each node in the main function, which can be inefficient for large graphs; it potentially revisits nodes multiple times.
- Lack of Error Handling: The GPT response doesn't handle potential errors, such as invalid input or unexpected data structures.
- Memory Inefficiency: For large graphs, keeping both the visited and rec_stack sets for cycle detection in the GPT response could be memory-intensive.
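The structure described in the takeaways above, a single is_cyclic pass over the whole graph before attempting a topological sort, can be sketched as follows. GraphNode and build_graph mirror the helpers from the benchmark prompt so the example is self-contained:

```python
class GraphNode:
    def __init__(self, value):
        self.value = value
        self.neighbors = []

def build_graph(edges):
    nodes = {}
    for u, v in edges:
        nodes.setdefault(u, GraphNode(u))
        nodes.setdefault(v, GraphNode(v))
        nodes[u].neighbors.append(nodes[v])
    return nodes

def is_cyclic(nodes):
    """One DFS sweep over every component; the shared visited set avoids rework."""
    visited, rec_stack = set(), set()

    def dfs(node):
        visited.add(node)
        rec_stack.add(node)
        for nb in node.neighbors:
            if nb not in visited:
                if dfs(nb):
                    return True
            elif nb in rec_stack:   # back edge: nb is on the current DFS path
                return True
        rec_stack.remove(node)
        return False

    return any(dfs(n) for n in nodes.values() if n not in visited)

dag = build_graph([(5, 2), (5, 0), (4, 0), (4, 1), (2, 3), (3, 1)])
cyc = build_graph([(1, 2), (2, 3), (3, 1)])
print(is_cyclic(dag), is_cyclic(cyc))  # False True
```

Because visited persists across starting nodes, no node is explored twice, which is the efficiency gain the takeaways attribute to a dedicated is_cyclic function.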
Code Generation
Q: SYSTEM: Your task is to identify and fix all the errors in the input_code.
Code:
```python
class GraphNode:
    def __init__(self, value):
        self.value = value
        self.neighbors = []

def build_graph(edges):
    nodes = {}
    for u, v in edges:
        if u not in nodes:
            nodes[u] = GraphNode(u)
        if v not in nodes:
            nodes[v] = GraphNode(v)
        nodes[u].neighbors.append(nodes[v])
    return nodes

def detect_cycle(node, visited, rec_stack):
    visited.add(node)
    rec_stack.add(node)
    for neighbor in node.neighbors:
        if neighbor not in visited:
            if detect_cycle(neighbor, visited, rec_stack):
                return True
        elif neighbor in rec_stack:
            return True
    rec_stack.remove(node)
    return False

def topological_sort_util(node, visited, stack):
    visited.add(node)
    for neighbor in node.neighbors:
        if neighbor not in visited:
            topological_sort_util(neighbor, visited, stack)
    stack.append(node.value)

def topological_sort(nodes):
    visited = set()
    stack = []
    for node in nodes.values():
        if node not in visited:
            topological_sort_util(node, visited, stack)
    return stack[::-1]

def main():
    edges = [(5, 2), (5, 0), (4, 0), (4, 1), (2, 3), (3, 1)]
    nodes = build_graph(edges)
    for node in nodes.values():
        if detect_cycle(node, set(), set()):
            print("Cycle detected!")
            return
    order = topological_sort(nodes)
    print("Topological Order:", order)

if __name__ == "__main__":
    main()
```
GPT-3.5, Haiku 3.0, and Haiku 3.0 + conversion tool (the corrected code was essentially identical across the three):

```python
class GraphNode:
    def __init__(self, value):
        self.value = value
        self.neighbors = []

def build_graph(edges):
    nodes = {}
    for u, v in edges:
        if u not in nodes:
            nodes[u] = GraphNode(u)
        if v not in nodes:
            nodes[v] = GraphNode(v)
        nodes[u].neighbors.append(nodes[v])
    return nodes

def detect_cycle(node, visited, rec_stack):
    visited.add(node)
    rec_stack.add(node)
    for neighbor in node.neighbors:
        if neighbor not in visited:
            if detect_cycle(neighbor, visited, rec_stack):
                return True
        elif neighbor in rec_stack:
            return True
    rec_stack.remove(node)
    return False

def topological_sort_util(node, visited, stack):
    visited.add(node)
    for neighbor in node.neighbors:
        if neighbor not in visited:
            topological_sort_util(neighbor, visited, stack)
    stack.append(node.value)

def topological_sort(nodes):
    visited = set()
    stack = []
    for node in nodes.values():
        if node not in visited:
            topological_sort_util(node, visited, stack)
    return stack[::-1]

def main():
    edges = [(5, 2), (5, 0), (4, 0), (4, 1), (2, 3), (3, 1)]
    nodes = build_graph(edges)
    for node in nodes.values():
        if detect_cycle(node, set(), set()):
            print("Cycle detected!")
            return
    order = topological_sort(nodes)
    print("Topological Order:", order)

if __name__ == "__main__":
    main()
```

| | GPT-3.5 | Haiku 3.0 | Haiku 3.0 + conversion tool |
| --- | --- | --- | --- |
| Cost | 0.0011 | 0.0007 | 0.0007 |
| Latency | 9.6868 | 4.8847 | 5.0941 |
Key takeaways

Haiku model:

- Successfully implemented the CYK algorithm for parsing context-free grammars.
- The code is concise and efficient, using a 2D table to store partial parse results.
- It correctly handles both the base case (single characters) and the recursive case (combining substrings).
- The function returns a boolean indicating whether the input string can be generated by the grammar.

Limitations

GPT-3.5 model:

- The GPT-3.5 model refused to generate the requested code, stating "I'm sorry, I cannot fulfill that request."
- This response suggests that the model may have ethical constraints or limitations preventing it from generating certain types of code or algorithms.
- It's possible that the model was not trained to produce complex algorithms like CYK, or it may have been instructed to avoid providing solutions to potential homework or assessment tasks.
- The GPT model's response demonstrates its ability to refuse requests it deems inappropriate or outside its capabilities, which can be seen as a safety feature.
GPT-4o:

```python
class GraphNode:
    def __init__(self, value):
        self.value = value
        self.neighbors = []

def build_graph(edges):
    nodes = {}
    for u, v in edges:
        if u not in nodes:
            nodes[u] = GraphNode(u)
        if v not in nodes:
            nodes[v] = GraphNode(v)
        nodes[u].neighbors.append(nodes[v])
    return nodes

def detect_cycle(node, visited, rec_stack):
    visited.add(node)
    rec_stack.add(node)
    for neighbor in node.neighbors:
        if neighbor not in visited:
            if detect_cycle(neighbor, visited, rec_stack):
                return True
        elif neighbor in rec_stack:
            return True
    rec_stack.remove(node)
    return False

def topological_sort_util(node, visited, stack):
    visited.add(node)
    for neighbor in node.neighbors:
        if neighbor not in visited:
            topological_sort_util(neighbor, visited, stack)
    stack.append(node.value)

def topological_sort(nodes):
    visited = set()
    stack = []
    for node in nodes.values():
        if node not in visited:
            topological_sort_util(node, visited, stack)
    return stack[::-1]

def main():
    edges = [(5, 2), (5, 0), (4, 0), (4, 1), (2, 3), (3, 1)]
    nodes = build_graph(edges)
    if any(detect_cycle(node, set(), set()) for node in nodes.values()):
        print("Cycle detected!")
        return
    order = topological_sort(nodes)
    print("Topological Order:", order)

if __name__ == "__main__":
    main()
```

Sonnet 3.5 and Sonnet 3.5 + conversion prompt produced the same helper functions, differing only in main:

```python
def main():
    edges = [(5, 2), (5, 0), (4, 0), (4, 1), (2, 3), (3, 1)]
    nodes = build_graph(edges)
    visited = set()
    for node in nodes.values():
        if node not in visited and detect_cycle(node, visited, set()):
            print("Cycle detected!")
            return
    order = topological_sort(nodes)
    print("Topological Order:", order)
```

| | GPT-4o | Sonnet 3.5 | Sonnet 3.5 + conversion prompt |
| --- | --- | --- | --- |
| Cost | 0.0076 | 0.0086 | 0.0088 |
| Latency | 11.3885 | 8.1426 | 8.9926 |
Key takeaways

Sonnet model:

- Successfully implemented the CYK algorithm for parsing context-free grammars.
- The code structure is very similar to the expected response, using a 2D table (dp) to store partial parse results.
- Correctly handles both base cases (single characters) and recursive cases (combining substrings).
- Uses efficient nested loops to fill the dp table.
- Correctly returns a boolean indicating whether the input string can be generated by the grammar.

Limitations of the GPT-4o model output:

- While the GPT-4o model also successfully implemented the CYK algorithm, there are some minor differences from the expected response:
  a. It uses `r = len(grammar)`, which is unused in the function.
  b. The variable naming is slightly different (e.g., `l` instead of `length`, `s` instead of `i`).
- The loop structure, while correct, is slightly less intuitive than the expected response (using `p` instead of `k` for the split point).
- The GPT-4o model includes an example grammar and test case, which wasn't part of the original specification.

Both the Sonnet 3.5 and GPT-4o models produced correct implementations of the CYK algorithm, with the Sonnet 3.5 model's output being closer to the expected response in terms of structure and variable naming. The GPT-4o model's output, while correct, shows some minor deviations in implementation style and includes additional examples that weren't requested.