Programming Challenge: English Speaking Calculator

ProgrammingThis forum is for all programming questions.
The question does not have to be directly related to Linux and any language is fair game.

Notices

Welcome to LinuxQuestions.org, a friendly and active Linux Community.

You are currently viewing LQ as a guest. By joining our community you will have the ability to post topics, receive our newsletter, use the advanced search, subscribe to threads and access many other special features. Registration is quick, simple and absolutely free. Join our community today!

Note that registered members see fewer ads, and ContentLink is completely disabled once you log in.

If you have any problems with the registration process or your account login, please contact us. If you need to reset your password, click here.

Having a problem logging in? Please visit this page to clear all LQ-related cookies.

Get a virtual cloud desktop with the Linux distro that you want in less than five minutes with Shells! With over 10 pre-installed distros to choose from, the worry-free installation life is here! Whether you are a digital nomad or just looking for flexibility, Shells can put your Linux machine on the device that you want to use.

Exclusive for LQ members, get up to 45% off per month. Click here for more info.

Programming Challenge: English Speaking Calculator

Things have been relatively quiet in the Programming forum of late! So it is a good time for a fresh Programming Challenge!

While outfitting a new laptop over recent months I have revisited an exercise from my personal archives which had its start some years ago, but has been interesting enough to explore and extend more than once over the intervening time: a simple calculator which accepts english language descriptions of arithmetic expressions, symbolic notation, or both as input, and writes an english language result. For example:

Code:

> twelve times seven plus three
eighty seven
> twelve times 7 + 3
eighty seven
> 12 * seven plus 3
eighty seven
> eighty four million three hundred twenty one thousand seven hundred thirty six plus negative eighty four million seven hundred thirty six
three hundred twenty one thousand

I have explored several ways of doing this using different programming languages and approaches to the problem, all of which have been enjoyable and educational, but I have never actually produced what I consider a full and complete implementation - which I am now of a mind to do. But most things, especially ideas, are always more enjoyable when shared so I invite everyone to share the fun and their own ideas in the...

English Speaking Calculator Programming Challenge

The challenge here is to write an application which evaluates simple signed integer arithmetic expressions, whether expressed as numeric symbols, english words or both, and shows the result as a properly formed english word description and optionally as the numeric representation.

I do not want to constrain others by the methods I have already explored so I suggest the following only as a minimal program requirement, subject to change and agreement by participants:

* Must accept input expressed as english words, common math symbols, or both using infix notation
* Must reject invalid or nonsensical input with a suitable error message
* Must perform signed integer addition, subtraction, multiplication and division operations
* Must observe correct order of operations, grouping and associativity in all evaluations
* Must produce the result as properly formed english word descriptions, optionally as numeric values
* Must specify and enforce the range of values over which it will produce valid results
* All results must be mathematically correct, or must produce an error message

Additionaly, and optionally ...

* May support parenthesized expressions
* May support additional operations, exponentiation, etc.
* May support logical operators
* May support common english variations such as 'eleven hundred' for 1,100 (but not 'one hundred two hundred' for 10,200!)
* May be interactive or operate as a filter
* May translate input from, or output to a language other than english

I suggest we limit to infix input notation mostly to make it easier for us to compare examples between implementations.

Also, an internet search for 'english language calculator' does produce at least a few hits, but I have strictly avoided following any of them because the whole point of the original exercise was to work out my own solutions. You may choose your own starting points and references, but please refrain from simply posting links to other online solutions here - in other words, let's constrain this thread to our own ideas and variations, please!

I also suggest use of each member's LQ blog space as the place to post complete code, and limit forum posts to specific implementation ideas and examples of interest.

The main idea here is to have fun and share ideas and methods as we explore this particular problem space - and see where it leads! Let's have fun and learn some new tricks!
(Note: As I have a bit of a head start I'll refrain from posting my own code until everyone has a chance to form their own ideas and join the discussion!)

Last edited by astrogeek; 09-21-2021 at 08:51 PM.
Reason: Added forgotten multiple language in/out option

I guess we need something like a translator to be able to feed bc (or a similar tool).

That would be one possible way to model the problem. Using that view, and substituting "math engine" for bc so as to not lock anyone into thinking this is bc-centric, you might expand that idea into something like this:

I will point out that the stated math processing requirements are intentionally very simple - addition/subtraction and multiplication/division, and easily handled by the native math functions of any programming language, so bc may be overkill unless you intend to support additional operations (encouraged as an optional exercise).

The enduring appeal of this problem for myself has been the richness of the modeling <--> implementation exercise, which is why I have posted this challenge. As you explore understanding of the full requirements of the "translator" functions in your model above and transform those into code in a chosen implementation language you may find a few, hopefully pleasant, surprises, as I have! Some of the surprises have been of simplicity, others of subtlety still others of complexity, but all rewarding.

The point is the journey more than the destination. Good first step!

Let's consider the second part of the translator function, output translation. A little thought will show it to be a relatively simple task in isolation.

First, note that its input will always be an integer value - nothing else (our input translator and math engine trap out any upstream errors, right?). So we only have to imagine a single function to which we pass an integer value and which returns as a character string or prints out that value in words, and maybe checks for range errors.

So what is required to perform that function? (Note: Your chosen implementation language may provide some library or module which performs that function for you, and feel free to use it - but like the Ferengi you will not learn anything if you do!).

Think about how you would do this for small numbers, single digits first. One that should come quickly to mind is as easy as using the integer, the digit to be translated, as index into an array of strings representing the digit names, arr[0]="zero", arr[1]="one",... arr[9]="nine".

Now ask...
Can you extend that to larger numbers?
Teens?
Tens?
Hundreds?
How many array elements do you actually need?
What are the limits, and does it become more difficult with larger numbers, or become simpler?
Are there any natural groupings which might allow reuse of some code?
Is this a good place to use recursion?

Finding answers to some of those, and a little flow control should be sufficient seed to get your thoughts started. Choose an implementation language and see how simply you can put together a small program that will translate integer values into words and do nothing else. You will then have 1/3 of the model above ready to go!

Code:

>1
one
>0
zero
>-1
negative one
>117
one hundred seventeen
>1000001
one million one
>8736
eight thousand seven hundred thirty six
>1000000000
Out of range error!

My current code has a range of +/-999,999,999 ... why might that be?

Can you think of better ways to perform this function?

Last edited by astrogeek; 09-15-2021 at 10:53 PM.
Reason: better example, comment

I have not looked at that, but as noted in my previous post number to text is a relatively simple conversion so reasonable to do in a browser applet. Perhaps you could have a look and give us a summary?

Indeed, the opposite direction is a much different, but very interesting animal!

no, unfortunately I could not get the sources of those pages, but here is a python equivalent: https://github.com/jaraco/inflect
that inflect module has a number_to_words function, it looks quite good for me. And also there is a num2words module where you can even specify the language. https://pypi.org/project/num2words/. I guess something similar is available in C (or java) too.

We need to parse the input text somehow (tokenizer?). That would be a more interesting question I think.

no, unfortunately I could not get the sources of those pages, but here is a python equivalent: https://github.com/jaraco/inflect
that inflect module has a number_to_words function, it looks quite good for me. And also there is a num2words module where you can even specify the language. https://pypi.org/project/num2words/. I guess something similar is available in C (or java) too.

We need to parse the input text somehow (tokenizer?). That would be a more interesting question I think.

Well, as noted in the original challenge post, finding existing solutions is not the point of the exercise anyway.

Using prewritten modules is certainly one option, at least for the output translation function, but see my comment about the Ferengi above (Star Trek reference in case you do not recognize it) and I will repeat, this challenge is more about discovering the path to a solution and sharing in that discovery, than it is about the solution itself. Hopefully, those not familiar with the landscape will enjoy the discovery of something new, and those already familiar will find it an enjoyable refresher!

Discovering the actual requirements of a solution will lead us through the subtask of tokenizing for sure, but we will also see that tokenizing is not enough! A complete solution will require, in one form or another, all the functions performed by a modern compiler which may be modeled in a simple way as:

Collectively, these tasks are usually referred to as a parser or compiler, or by a name more specific to the application which they support, such as a translator or as in our current application, a calculator. But the tasks they perform and the ideas on which they are based are no less amazing for being mostly well understood, and shared across every aspect of computing as we know it, from the most simple to the most complex.

As already noted, the output translation function for our calculator is relatively simple as it must only translate an integer value (the intermediate representation of input expressions) into words. But note that the numeric value may be translated into any other human language just as easily as into english by adding one or more such output language options, without any changes to the preceeding stages! In this way we may easily turn our calculator into an english to any language numerical translator! (This same idea is what allows a C compiler to generate code for multiple very different computer architectures from the same input source - all that need change, conceptually at least, is the output translation function.)

Working backward, the intermediate representation for our calculator is just the numerical result of evaluating a simple input integer arithmetic expression - a single integer value. The calculator function itself is almost trivial in any programming language and will be performed by the semantic action code of the parser in most implementations, so there really isn't much to see here unless you want to implement additional math capabilities!

Which brings us back to the parser itself, and the input sent to it from the tokenizer or lexical analyzer, also called a lexer or scanner. These two functions are at the heart of this programming challenge, and the real object of our discovery! My own complete implementation will be based on Flex as the lexer or scanner generator and GNU Bison as the parser generator, as I might expect others to also use for theirs - both should be included with or available for your Linux distro. But the the ideas to be used are independent of any application and it is often useful to write simple examples from scratch or explore ideas using other applications - whatever helps you understand!

And behind it all, almost invisible until now is the language being parsed and its specification, or grammar. Many texts about parsing begin by implementing an example programming language then writing a compiler for it. To follow the examples you must first learn the new imaginary programming language then how to parse it. I have chosen instead to use a subset of the english language which describes integers and simple arithmetic operations on them. This gives us an almost universally known language to work with, something we already understand so we can get immediately down to the business of specifying the grammar from which both our lexer and parser will derive their specifications.

The input language I have used corresponds to spoken english descriptions of base ten integer numbers and the four basic operations of arithmetic working on them. The words of this language are:

Code:

The digits: zero, one, two, ... nine
Named multiples of ten: ten, twenty, ... ninety
Named numbers, the "teens": eleven, twelve, ... nineteen
Named powers of ten: hundred, thousand, million, billion
Symbolic number representations: [0-9]+ 0, 1, 12, 3782, etc.
Sign names: positive, negative
Named operators: plus, minus, times, divided by, over
Symbolic operators: +, -, *, /, (, )

The newline character is used as the end terminator of an expression, so one complete expression per line.

The sentences of this langauge are the descriptions of any valid math expression as would normally be spoken (we may find minor regional or cultural differences here, use your own), which implies infix ordering. The range of expressions is currently +/-999,999,999 in my most recent implementation, but I intend to extend that to the full range of signed 32-bit integers for this challenge, be sure to specify your own range.

We may begin by listing a few examples of both valid and invalid numeric expressions in this language. I consider these valid:

Code:

zero
one
seven hundred
seven hundred twenty eight
eleven hundred seventy seven
one thousand one hundred seventy seven
one hundred thousand nine hundred
seventeen hundred thousand twelve
one million seven hundred thousand twelve
negative ten thousand five

These are examples of a few invalid numeric expressions in this language:

Code:

zero hundred
ten hundred
one hundred thousand eleven hundred
twenty ten
seventeen hundred thousand twelve hundred

Math operations are expected to be evaluated according to commonly used rules of precedence and associativity, not shown here.

From this I will derive an initial grammar description (specification) and begin by implementing and testing a lexer - in a post to come! Meanwhile, add your own ideas and questions - share in the fun of discovery!

We may begin by listing a few examples of both valid and invalid numeric expressions in this language. I consider these valid:

Code:

seventeen hundred thousand twelve
one million seven hundred thousand twelve

I'm curious about this one. In the US, at what point is the convention to stop using "hundred" and start using "thousand"? In British English we'd generally only use hundred for 1-9. "Eleven hundred" would be seen as an Americanism¹ — at least by older generations.

We'd also always include an " and " between the the final multiplier and the final component value.

As a Brit, the wording I've highlighted above grates on my ears something terrible.

---
1) though we do seem to use hundred the American way in some contexts, such as when referring to the cc of an engine, though even there, it's more common to say 1.8 litre than "eighteen hundred" these days.

seventeen hundred thousand twelve
one million seven hundred thousand twelve

I'm curious about this one. In the US, at what point is the convention to stop using "hundred" and start using "thousand"? In British English we'd generally only use hundred for 1-9. "Eleven hundred" would be seen as an Americanism¹ — at least by older generations....

As a Brit, the wording I've highlighted above grates on my ears something terrible.

I do not think I can give a definitive answer. In my experience the convention is much more variable, or shoud we say less conventional, than we might at first expect. (Raising these questions is one reason I selected the examples above.)

I clearly remember encountering what seemed to me at the time unconventional usage in the first years of transition from school and local usage, to professional environments and in other regions. It is hard to say whether the differences in any case originated by region, or by discipline (i.e. engineering or scientific vs commercial or logistical) or even by the personal preferences of those involved.

Usage of a **qualified hundreds multiplier as prefix to the leftmost "third power of ten" (thousand, million, etc.) is one I am sure I picked up after school, but has been common enough in my experience and does not seem awkward to me...

Code:

seventeen hundred thousand
eleven hundred million three hundred thousand two hundred

Such usage in any position other than the leftmost seems ambiguous, eleven hundred thousand eleven hundred for example. I think the problem here can be seen to result from splitting the thousands position into two parts - eleven hundred thousand plus one thousand which is shifted into the hundreds multiplier. This is fixed by further restricting the hundreds multiplier to a single digit giving eleven hundred one thousand one hundred.

I have seen the difference between eleven hundred and one thousand one hundred described as informal vs formal usage, respectively, but that seems like nonsense to me - what is a formal or informal number? Canonical might be a better word to use, but whether one is actually correct and the other incorrect in some sense, I do not know.

**By qualified hundreds multiplier I mean having the same restriction that causes eleven hundred and twenty one hundred to be generally acceptable but excludes ten hundred or twenty hundred. The best way I can express that is to say the hundreds digit itself must be non-zero, although I yield to any better thought out definition!

Quote:

Originally Posted by GazL

We'd also always include an " and " between the the final multiplier and the final component value.

I assume always means that you consider this usage is required, is that correct? Is it optional, or required, in any other position?

Here I clearly remember from my school days being taught that same usage of "and", and I also know where I learned to not use it in a particular professional group where some preference or other discouraged its use, and it has mostly stuck with me in technical settings. But I am happy either way and probably more inclined to adapt to the usage of others in any given interaction.

But a question comes to mind for my evolving grammar definition: Would you say the more acceptable usage is to insert "and" before any trailing component after the hundreds position, or to insert it before the trailing component of any hundreds multiplier? For example would either of these be preferred over the other in your experience:

Code:

one hundred three thousand one hundred and three
one hundred and three thousand one hundred and three

All of that said, I did not intend to present my examples as being authoratative, even for the purpose of this challenge. Only that they represent the grammar definition currently used in my own implementation. The use of what I have called qualified hundreds multiplier (is there another name for that?) was introduced as a way to force a variation into the language from the start as it leads to a more interesting grammar description and more opportunity for discovery.

I will now add the trailing component "and" into my grammar as well, whether optionally or required as yet undecided. Doing so also forces a change on the output generation to give it the same usage as the input language.

I guess we would need to specify and use only one dialect. A general solution would be extremely hard (including the translation back at the end).
It looks like the " and " can be simply ignored or replaced by + .

Yes, the 'and' is required. "six hundred three" just doesn't sound right to british ears. I'm not sure how this works in other non-US english speaking nations like Canada or AUS though.

It's interesting, this sort of stuff you tend to just take for granted and use without thinking about or understanding the actual rules.

I wasn't very clear in that post. Brits seem to follow this pattern.

Three hundred and seven.

Six hundred thousand two hundred and twenty one.

Two hundred and thirty eight thousand four hundred and six.

307
600,221
238,406

So, it seems to be used to combine the additive components on each part that would be separated by a comma when written in numerals, except that "twenty and one" seems to have become an atomic "twenty one" somewhere down the line.

LinuxQuestions.org is looking for people interested in writing
Editorials, Articles, Reviews, and more. If you'd like to contribute
content, let us know.