Code base here:


Back in 2018–2019 I worked on training machine learning models to predict which sentences in company conference call transcripts were likely to be “guidance”, i.e. forward-looking statements.

The most challenging part of the project was editing Excel files with 50,000 lines of text and flagging each sentence as “guidance” or “not guidance”, to build a training set that could then be used to train the ML model.

It was a mind numbing job that took several weeks. In general, data tagging for machine learning is both incredibly important to get right, and incredibly expensive in terms of time, particularly if domain knowledge is required.

Recently I ran across an article by Piero Paialunga, who used ChatGPT to generate substitute data – i.e. substitute tagging – for sentiment analysis:

Often in the real world we’re trying to train on unlabeled data, particularly in specialized fields like investing. There are techniques for dealing with this, like zero-shot classification and data augmentation, but I found the ChatGPT angle of interest – could it live up to the hype?

So I set about trying to use ChatGPT to skip the 50,000 lines of tagging part of the “is it guidance” project, and see if the process Piero outlined could render useful results in a more specific problem than simple sentiment scoring.

User Story:

AS A financial analyst
I WANT TO score guidance sentences in transcripts 
SO THAT I can save time reading the whole transcript and don’t miss market moving statements 

As an example, the system we built at Sentieo is still in use today:


I used two sentences from a recent Discover call (DFS) … one guidance and one not.


I then used

completion = openai.Completion.create(engine="text-curie-001", prompt=guidance_example, max_tokens=240)

to generate 500 “guidance” sentences and 500 “not guidance” sentences, using the two sentences as seeds for ChatGPT.
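A minimal sketch of the generation loop, assuming the pre-1.0 openai Python client (which exposes openai.Completion.create with an engine argument, as in the line above). The seed strings, prompt wording, and the not_guidance_example name are my own stand-ins – the original used two real sentences from the DFS transcript:

```python
# Stand-in seeds; the original used one real guidance sentence and one
# real non-guidance sentence from a Discover (DFS) call transcript.
guidance_example = "We expect full-year net interest margin to be modestly higher than last year."
not_guidance_example = "Thanks, and good morning, everyone, and welcome to the call."

def make_prompt(seed: str) -> str:
    """Wrap a seed sentence in a generation instruction (wording is an assumption)."""
    return f"Write a new sentence similar in style and content to:\n{seed}\n"

def generate_snippets(seed: str, n: int = 500, model: str = "text-curie-001"):
    """Generate n snippets via the legacy Completions API (requires an API key)."""
    import openai  # deferred import so the sketch runs without the package
    snippets = []
    for _ in range(n):
        completion = openai.Completion.create(
            engine=model, prompt=make_prompt(seed), max_tokens=240
        )
        snippets.append(completion.choices[0].text.strip())
    return snippets
```

Running generate_snippets once per seed and writing the two lists out would produce something like the generated_snippets.csv file in the repo.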

This took about a half hour to run, and repeated runs ended up costing me about $17.00 for my API developer key usage for the day.
But you can save $17.00 by skipping this step and using the generated_snippets.csv file in this repo!

Interestingly, the results were, I thought, fairly realistic. ChatGPT is impressive. All of the snippets below are generated text.


Next I trained two classifiers, one Random Forest and one Logistic Regression. Piero used RF, which is probably more accurate in training, but I find that in the real world the predict_proba() ability of LR to assign a probability to each sentence is of use – binary yes/no labels often offer less value than a gradient of probabilities.

As you might expect from ChatGPT-generated data, the fits were good for both models:
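A sketch of the training step, assuming scikit-learn and a TF-IDF text representation (the original feature pipeline isn’t shown, so this is one plausible setup; the snippet lists below are tiny stand-ins for the 500 + 500 generated sentences):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Stand-in snippets; the real run used the ChatGPT-generated sentences
# from generated_snippets.csv.
guidance = [
    "We expect revenue growth of 8 to 10 percent next year.",
    "Full-year margins should expand modestly in 2024.",
    "We anticipate loan losses will normalize over the coming quarters.",
]
not_guidance = [
    "Thanks, operator, and good morning, everyone.",
    "Our next question comes from the line of an analyst.",
    "Last quarter we repurchased two million shares.",
]
X = guidance + not_guidance
y = [1] * len(guidance) + [0] * len(not_guidance)  # 1 = guidance

# One pipeline per model: vectorize text, then classify.
rf = make_pipeline(TfidfVectorizer(), RandomForestClassifier(random_state=0))
lr = make_pipeline(TfidfVectorizer(), LogisticRegression())
rf.fit(X, y)
lr.fit(X, y)
```

Both pipelines expose predict() for hard labels, and the LR pipeline’s predict_proba() gives the probability gradient discussed above.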


Next I wanted to try these models on raw data.  Into the wild we go:


I used another credit card company’s – Capital One (COF) – most recent conference call transcript, filtered (by hand) down to just management talking, to see if our DFS-based classifier would carry over.


As I feared, the RF model classified a majority of management statements as “guidance” – likely more than a domain expert looking at the text would validate.


Therefore, I then used the LR model to rank-order the snippets, and the results were more useful.
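The rank-ordering step can be sketched as follows – a self-contained example with a tiny two-sentence stand-in training set (the real model was fit on the generated snippets), sorting new sentences by the LR model’s probability of “guidance”:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny stand-in training set; 1 = guidance, 0 = not guidance.
train = [
    "We expect margins to expand next year.",
    "Good morning and welcome to the call.",
]
lr = make_pipeline(TfidfVectorizer(), LogisticRegression()).fit(train, [1, 0])

def rank_by_guidance_prob(model, sentences):
    """Return (sentence, P(guidance)) pairs sorted highest-probability first."""
    probs = model.predict_proba(sentences)[:, 1]
    return sorted(zip(sentences, probs), key=lambda pair: -pair[1])

ranked = rank_by_guidance_prob(lr, [
    "We anticipate charge-offs will rise modestly.",
    "Thanks, operator.",
])
```

An analyst can then read from the top of the ranked list down, instead of relying on a hard yes/no cut.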


The results filtered via Logistic Regression were OK, though I’d want to spend a LOT more time thinking about the problem.

Is this really a SUMMARIZATION problem not a CLASSIFICATION problem per se?

Might it make sense to evaluate statements in context rather than as individual snippets? Might a Hugging Face transformer auto-generate text better than ChatGPT?

All these questions to be explored in future!

Note: the bias towards Type I errors probably falls out of the domain-specific nature of “financial guidance” versus ChatGPT being a more generally trained language model. That is, while this approach is probably good for broad sweeps like “sentiment”, something more domain-specific is challenging. Also, the English language doesn’t actually have an inflected future tense – a strange quirk that bedeviled my work in this area in prior years, and that makes simple “tense” classifiers (if that’s what this process was yielding under the hood) inaccurate in practice.