WebWord has a good write-up about Professor Feng Li, who data-mined the annual reports of 34,180 companies with some interesting results. Li counted the number of times words like “risk” and “uncertain” showed up in the reports and compared the counts to previous years.

Professor Li discovered that a “big jump in words related to risk is usually followed by poor share performance” which makes a ton of sense. He built a model portfolio based on this data. The punch line is that he would have outperformed the S&P 500 index by 6% per year since 1995. Smashing!

Has anyone seen other analyses like this one? What else can we “data mine” to read the market?

We linked to WebWord before, for John Rhodes’ opinions on Microsoft and Web 2.0 (our article).

The business idea is pretty simple. I’m giving it away! 😉

First, you need your data set. Use RSS.

Second, you need an aggregation device. Tweak an open source RSS aggregator.

Third, filter the streams of data coming in based on two factors. Factor one is what you care about, for example, MSFT stock price. Factor two is a set of relevant valence words.

Fourth, build a basic neural network to pattern-match the inputs described above with an output. In our case, the output is stock price.

Fifth, use said mechanism to pull in recent news, filter it, gain insight, and find out how the news in the world impacts or reflects the price of a stock.
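Here’s a minimal sketch of steps one through three, assuming the Python feedparser library; the feed URL, ticker, and valence-word list are placeholders you’d swap for your own:

```python
import feedparser  # third-party: pip install feedparser

FEEDS = ["http://example.com/finance.rss"]        # placeholder feed URLs
TICKER = "MSFT"                                   # factor one: what you care about
VALENCE_WORDS = {"risk", "uncertain", "lawsuit"}  # factor two: valence words

def filtered_entries():
    """Yield (title, matched valence words) for entries mentioning the ticker."""
    for url in FEEDS:
        for entry in feedparser.parse(url).entries:
            text = (entry.get("title", "") + " " + entry.get("summary", "")).lower()
            if TICKER.lower() in text:
                hits = {w for w in VALENCE_WORDS if w in text}
                if hits:
                    yield entry.get("title", ""), hits

for title, hits in filtered_entries():
    print(title, "->", sorted(hits))
```

The counts this spits out become the raw inputs for step four.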

If you do this mechanically and programmatically, you can now understand the market before the vast majority of people making trades. Everyone has access to the same knowledge, but the balance tips in your favor.

I have more to say on this but I’m tired …

(For a primer in neural nets, check out http://en.wikipedia.org/wiki/Neural_nets. Refer back to that page if you ever feel your eyes gloss over while reading below.)

Ok, this sounds like fun. Any ideas on what would be good keywords to look for? We could start out with Li’s words.

Here is my neural net setup:

[input: keyword1, … keyword9, timespan1, timespan2, stock price]
[hidden: 3 nodes?] (would need to tweak this)
[output: ending stock price]
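For the curious, here’s a rough sketch of that topology in plain numpy. The 12-input/3-hidden/1-output sizes come from the setup above; the tanh hidden layer, linear output, and random (untrained) weights are my assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.1, size=(3, 12))  # input layer (12) -> hidden layer (3)
b1 = np.zeros(3)
W2 = rng.normal(scale=0.1, size=(1, 3))   # hidden layer (3) -> output (1)
b2 = np.zeros(1)

def predict(x):
    """Forward pass; x = 9 keyword counts, 2 time-spans, current price."""
    hidden = np.tanh(W1 @ x + b1)
    return (W2 @ hidden + b2)[0]          # predicted ending stock price

# Hypothetical example: keyword1..keyword9 counts, lookback and horizon
# in days, and the current price.
x = np.array([3, 0, 1, 5, 0, 0, 2, 1, 0, 365, 365, 27.50])
print(predict(x))
```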

keyword1-keyword9 would be integers representing the number of times that word appeared in blogs associated with the ticker over the selected time-span.

I feed the time-span into the input layer too. The network would likely learn to use it as a multiplier, reflecting the fact that the change in a stock’s price is greater after 1 year than after 1 day.

Okay, now I have two time-spans in there. One is the time-span of blog posts that were searched. The other is how far into the future to predict the stock. I think we need to make this distinction, because otherwise we’ll end up with a NN that is good at predicting the price of a stock today based on the past year’s blog posts. Something that will work better, I think, is to predict the price of a stock 1 year from now based on the past year’s blog posts. So one time-span is how far into the past to search blog posts; the other is how far into the future we are predicting. These numbers are something else that can be tweaked.
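To make the distinction concrete, here’s a sketch of how the two time-spans would shape a single training example; the posts/prices data structures are hypothetical:

```python
from datetime import timedelta

def make_example(posts, prices, keywords, asof, lookback_days, horizon_days):
    """posts: list of (date, text) pairs; prices: dict mapping date -> close.

    Returns one (features, target) training pair for the NN above.
    """
    start = asof - timedelta(days=lookback_days)
    window = [text.lower() for d, text in posts if start <= d <= asof]
    counts = [sum(text.count(kw) for text in window) for kw in keywords]
    features = counts + [lookback_days, horizon_days, prices[asof]]
    target = prices[asof + timedelta(days=horizon_days)]  # the price we predict
    return features, target
```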

The output would be the (predicted) stock price after the (future) time-span.

Some other notes.
– I would think stocks with a lot of coverage would be more predictable this way than stocks with little coverage.

– Stocks/companies with unique names would be easier to analyze. For example, a search for “Apple” would also return blog posts about the fruit. To combat this, you could narrow your blog search to finance blogs, which would help focus your results.

– In general, how do you tell when a blog is talking about a stock? And what if a blog post talks about multiple stocks? Your analysis can take into account keywords that are only locally associated with the stock (e.g. “MSFT sucks” vs. a post about MSFT with the word “sucks” in it somewhere). So search for keywords in the same sentence, rather than the same blog post. The fluctuations caused by false positives like this may balance out with a large enough data set, but I’m guessing we won’t often have such a large data set.
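A quick sketch of that sentence-level matching; the naive regex splitter is an assumption, and real text would need something smarter:

```python
import re

def sentence_hits(text, ticker, valence_words):
    """Count valence words only in sentences that also mention the ticker."""
    hits = 0
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        s = sentence.lower()
        if ticker.lower() in s:
            hits += sum(1 for w in valence_words if w in s)
    return hits

print(sentence_hits("MSFT sucks. AAPL is fine.", "MSFT", {"sucks"}))                  # 1
print(sentence_hits("Post about MSFT. It sucks somewhere else.", "MSFT", {"sucks"}))  # 0
```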

Now, use a genetic algorithm to breed the best neural net. Your GA would try to breed NNs that predict a stock’s price best. (Try not to be tempted into pruning by the predicted returns. You want good predictions in the output layer, not good returns.)
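A toy version of that GA, with the NN weights flattened into a genome and fitness scored as (negative) prediction error rather than returns, per the caveat above; population size, mutation rate, and generation count are guesses:

```python
import numpy as np

rng = np.random.default_rng(1)
N_WEIGHTS = 3 * 12 + 3 + 1 * 3 + 1  # flattened W1, b1, W2, b2 from the net above

def predict(genome, x):
    """Forward pass using a flat weight vector (genome)."""
    W1 = genome[:36].reshape(3, 12); b1 = genome[36:39]
    W2 = genome[39:42].reshape(1, 3); b2 = genome[42:]
    return (W2 @ np.tanh(W1 @ x + b1) + b2)[0]

def fitness(genome, examples):
    """Negative mean squared error over (features, target) pairs."""
    return -np.mean([(predict(genome, np.asarray(x)) - y) ** 2 for x, y in examples])

def evolve(examples, pop_size=50, generations=100, mut=0.05):
    pop = [rng.normal(scale=0.1, size=N_WEIGHTS) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda g: fitness(g, examples), reverse=True)  # best first
        parents = pop[: pop_size // 2]                              # selection
        children = []
        while len(parents) + len(children) < pop_size:
            i, j = rng.choice(len(parents), size=2, replace=False)
            mask = rng.random(N_WEIGHTS) < 0.5                      # uniform crossover
            child = np.where(mask, parents[i], parents[j])
            children.append(child + rng.normal(scale=mut, size=N_WEIGHTS))  # mutation
        pop = parents + children
    return max(pop, key=lambda g: fitness(g, examples))  # best net bred
```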

Once you had a NN that predicted well, you would run it against a bunch of stocks/tickers and make note of the stocks with the highest predicted returns. Or the highest predicted losses if you’re open to shorting stocks.
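That last step might look like this, where predict_fn is whatever net the GA bred and the feature/price helpers are the hypothetical pieces sketched above:

```python
def rank_tickers(predict_fn, tickers, features_for, current_price):
    """Sort tickers by predicted return; both helper callables are hypothetical."""
    scored = []
    for t in tickers:
        predicted = predict_fn(features_for(t))
        ret = (predicted - current_price(t)) / current_price(t)
        scored.append((ret, t))
    scored.sort(reverse=True)  # biggest predicted gains first...
    return scored              # ...and the tail is your shorting candidates
```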

Jason, you’ve got it mostly nailed. The weakness is in the language, words, and parsing. In other words, in the data set itself. You can’t programmatically attack what you don’t fully understand.

Fortunately, some level of clarity is possible. There is research I’ve seen on text analysis, valence and word affect, and related topics. This stuff is a mix of sociology, psychology, linguistics, and the like. I’ll have to dig up my research material for a better answer!

Naturally, the search engines are also interested in this, but to my knowledge, no one is doing what we’ve discussed here.

I can dig up some research stuff too. I studied automated text summarization senior year of undergrad for a little while before I switched my thesis topic to some more obscure areas of linguistics.

http://www.summarization.com/ is always a good place to start for this stuff.

This kind of analysis is very difficult (if you’re trying to simulate an expert human), but it’s also possible that we can get decent results using just the simplest of techniques. If we can build the infrastructure, we can always plug in smarter parsing algorithms later.
