Episode 72: The Social Media Hacker's Guide to Better Data Science
[00:00:00] Dr Genevieve Hayes: Hello and welcome to Value Driven Data Science, the podcast that helps data scientists transform their technical expertise into tangible business value, career autonomy, and financial reward. I'm Dr. Genevieve Hayes, and today I'm joined by Tim O'Hearn. Tim is a software engineer who spent years gaining millions of social media followers for clients by circumventing bot-detection measures on social networks.
[00:00:30] He is also the author of the new book Framed: A Villain's Perspective on Social Media. However, we're not here to promote questionable data science tactics. Instead, in this episode, we'll explore what ethical data scientists can learn from understanding the dark side of social media. So get ready to boost your impact, earn what you're worth, and rewrite your career algorithm.
[00:00:55] Tim, welcome to the show.
[00:00:57] Tim O'Hearn: Hey Genevieve, thanks for having me.
[00:00:59] Dr Genevieve Hayes: Social media platforms have become an integral part of modern life and business with algorithms silently shaping what we see and how we interact. For example, 58% of Australians now use social media platforms to research brands and products, and 43% engage with brand content before making purchasing decisions.
[00:01:22] Social media data represents a goldmine for data scientists looking to extract insights and create value through algorithm optimization. However, while most data scientists work to optimize business value within the rules and constraints of these platforms, there's much to learn from those who've explored their vulnerabilities and limitations.
[00:01:47] And this is something you've done, Tim, and your account of this is detailed in your new book, Framed. What initially drew you to the world of social media manipulation?
[00:01:59] Tim O'Hearn: My journey began with playing video games. Many people in my generation, I think, were introduced to the digital world, and maybe to software engineering, through this childhood obsession with video games. And every video game has something resembling a high score table. So maybe we can think of arcade games, and then we can think of more modern types of RPGs and other games that people play today.
[00:02:27] There's always this concept of keeping score. And what I found was that these same things that we were looking at in video games started to permeate through other aspects of the internet. So for me, playing video games, I was thinking maybe one day I'll design my own game. In the same way, I was thinking, well, what is a cheat code, and how might that cheat code help me find these advantages to get onto the high score table, or to beat people in a way that is akin to taking a shortcut?
[00:02:50] I found that a lot of these things did transfer to my career as a software engineer when I was first learning about it, and then also to the opportunities that were available specifically on social media.
[00:03:08] A lot of it began with wanting to play games and wanting to beat other people at the game. When we look at social media today, I think that notion of high scores, as it relates to follower count or views or likes, is quite easy to understand.
[00:03:21] Dr Genevieve Hayes: Basically social media platforms have gamified making friends.
[00:03:25] Tim O'Hearn: Absolutely.
[00:03:27] Dr Genevieve Hayes: So you took your love of video games, and then that translated into what you pursued as a career. How did you learn all the stuff about how to beat social media platforms? Because that's not the sort of thing they teach at university.
[00:03:43] Tim O'Hearn: There's no college course on how to beat the algorithm or how to violate terms of service. And in many cases, some of these topics are more advanced things that you might find in master's level, graduate level type courses. For me, I went pretty deep into the internet, into communities that were dedicated to rule breaking.
[00:04:06] In those communities, if you have a technical background, it's at least somewhat easier to differentiate between what's a scam or just complete rubbish, and what actually might be a pathway for exploitation. A huge part of it for me was looking through these communities and thinking, okay, well, what's actually going on behind the scenes on Instagram?
[00:04:29] A lot of my experience was from the web scraping background. So someone who was saying, hey, as you said, there's value in this data. And of course, if we can scrape all the pages on a website and we can do it every day, we might be able to create a product based around that. I had been a web developer, and then naturally during web development you might learn how to scrape, because that actually helps you test some of your own creations.
[00:04:52] And then eventually I found that there was maybe some profit to be had there, some value in learning these skills. And eventually one thing leads to another and another, where you're able to use web scraping, or this client-side domain type of traversal, to scrape sites like Instagram. Then additionally you have the cybersecurity angle, which is based on API hacking. So actually looking into what some of the web services underpinning services like Instagram are doing.
[00:05:21] Dr Genevieve Hayes: So what are some of the things that you've used your social media manipulation skills to achieve?
[00:05:27] Tim O'Hearn: I would say a lot of it started on MySpace, where at first I was finding ways to get more friends in a way that was just social engineering, maybe in its infancy. And then on Instagram, being involved with this business called Shark Social, building this industrial-grade software-as-a-service business where
[00:05:48] I generated millions of followers for customers. And really what that achieved was it helped them get social proof. So it helped them show that they were maybe somebody worth following, that they were more important. But also, because we believed many of these followers were real, they were also potential leads.
[00:06:06] They were also people who might have been visiting a website or purchasing from that person's business if they had a business. A lot of the usefulness of my skillset was in that realm.
[00:06:17] Dr Genevieve Hayes: I don't wanna encourage any of our listeners to do anything unethical, but I do think there's a lot that ethical data scientists can learn from your experiences. Sort of like how white hat hackers can learn from the exploits of black hat hackers.
[00:06:31] And that's what I'd like to explore. So if you've spent enough time on social media, you've probably heard people talking about the algorithm. I post on LinkedIn, and I'm always reading articles about how to make the LinkedIn algorithm work for me. And if you read those articles, you eventually come to the conclusion that most people don't really know how these social media algorithms work under the hood.
[00:06:58] Everyone's just making an educated guess. From your experience exploiting platform vulnerabilities, what fundamental insights did you gain about how social media algorithms actually work?
[00:07:12] Tim O'Hearn: This is an interesting one, because not only did I exploit these platforms from the outside, but much later in my career I worked at a social media startup where I actually built the algorithm. I built the push notification infrastructure, and I built something resembling a very primitive recommendation engine.
[00:07:33] These are the fundamental parts of any persuasive technology system. And they're the parts behind all of this fake expertise that you refer to, all across cyberspace, where people are saying, oh, here's how you beat the algorithm, here's how you do this, and here's how you do that.
[00:07:48] Even as someone who has built these systems, we can't know for sure exactly how Instagram's works, or exactly how Facebook's works. We do have to take educated guesses. But with the benefit of data science, or understanding how data works at scale, it's much easier to make guesses that are educated, meaning we have some statistical significance in what we're testing or in the conclusions that we're drawing from it.
[00:08:16] For me, a lot of it was based around the different signals, and that was one of the biggest learnings for me. So how does an algorithm work? We would think that the traditional ways are, well, somebody gives a thumbs up or they give a thumbs down, and that's how any type of recommendation system might work on
Amazon, where it might say, hey, here's some suggested products. We can understand at some level how that would relate to just generic content. But the real learning is that it's not just about the thumbs up or thumbs down. There's also a lot of implicit feedback being drawn from users in ways that they maybe don't realize.
[00:08:54] And the main thing is what is centered on your phone screen. The longer that stays centered, the better the positive feedback value it generates for the recommendation engines. I can remember personally when this started to come up, at least five years ago. People were saying, hey, I haven't liked anything,
[00:09:12] I haven't tapped anything, I haven't left any comments. How does it know? And it's because they're starting to take into account these implicit feedback systems. And so to kind of get ahead of it, or at least get on the same level as these engineers, there are really good books.
[00:09:28] So I think Kim Falk wrote one of the better ones, which is probably called Practical Recommender Systems. I believe it's an O'Reilly book, and that really helped me connect the dots, going from this blue-collar kind of black hat approach and getting closer to the academic basis of it.
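The implicit-feedback idea Tim describes, where dwell time on centered content counts as positive feedback even with no likes or comments, can be sketched in a few lines. This is a toy illustration only; the weights, the 10-second saturation point, and the function name are assumptions for the example, not any platform's actual model.

```python
# Toy sketch of implicit feedback: the time a post stays centered on
# screen is treated as a positive signal even if the user never likes
# or comments. All weights here are illustrative assumptions.

def implicit_feedback_score(dwell_seconds, liked=False, commented=False):
    """Blend explicit actions with dwell time into a single score."""
    dwell_signal = min(dwell_seconds / 10.0, 1.0)  # saturate at 10 seconds
    score = 0.6 * dwell_signal  # dwell dominates: the "how does it know?" part
    if liked:
        score += 0.3
    if commented:
        score += 0.1
    return score

# A user who lingers 8 seconds but never taps anything still produces a
# strong positive signal for the recommendation engine.
print(implicit_feedback_score(8.0))
print(implicit_feedback_score(2.0, liked=True, commented=True))
```

In this sketch, a long pause over a post outweighs a quick like on another, which is the behavior users noticed when they asked "how does it know?"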
[00:09:46] Dr Genevieve Hayes: So you're basically saying social media algorithms are just recommendation engines. Is that right?
[00:09:51] Tim O'Hearn: To a large extent, they are recommendation engines with quite a few more parameters than we would expect. Yeah.
[00:09:58] Dr Genevieve Hayes: 'Cause they're just recommending posts or whatever to the social media users based on their prior use.
[00:10:07] Tim O'Hearn: Yeah, and, and there's many layers to it. When I built my own feed, I realized that you needed filler content. So a user might not give you enough information about who they are. To start showing them things. So you could think the first time you log into TikTok, if you skip the, tell us about your interests section, TikTok still has to decide what video to show you, and in a lot of cases, that's a separate layer of the recommendation system where it says, okay. Maybe we have some demographic information or maybe we have this general notion of what will be exciting content. And that's what we think about these layers.
[00:10:46] So we think the general case, and then the specific case, and then the highly individualized case, and this kind of comprises the feed. And, yeah, to answer your question more directly, we've come a long way from just chronological feeds filled with what your friends have posted.
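The layered fallback Tim describes, from highly individualized down to generic filler for a cold start, might look something like this in outline. The layer names and data are illustrative assumptions, not TikTok's real system.

```python
# Toy sketch of a layered feed: use the highly individualized layer when
# we know the user, a demographic layer when we only know who they
# resemble, and generic "filler" content on a cold start.

def build_feed(scored_history, demographic_picks, trending_filler):
    """Return a feed from the most specific layer that has anything."""
    if scored_history:
        # Highly individualized case: rank by the user's own signals.
        return sorted(scored_history, key=lambda p: p["score"], reverse=True)
    if demographic_picks:
        # General case: we know an age bracket or region, nothing else.
        return demographic_picks
    # Cold start: the user skipped onboarding, show broadly popular filler.
    return trending_filler

trending = [{"id": "viral-1"}, {"id": "viral-2"}]
print(build_feed([], [], trending))  # brand-new user gets filler content
```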
[00:11:02] Dr Genevieve Hayes: Now I've heard that the social media algorithms change all the time. One of the biggest problems faced by data scientists is that the assumptions underpinning their data changes, and if they don't realize that fast enough, they can be using models that don't reflect the current situation.
[00:11:22] I know you work in quantitative finance, so that's something you'd also experience in that. How did you identify when platforms had changed their algorithms, and what can legitimate data scientists learn from your approach?
[00:11:36] Tim O'Hearn: One concept that I've seen both in the social media evil-doing space and in quantitative finance might be called a canary. And the idea of a canary is that you have some aspect of your system that is continually pinging, continually flying into the mine, specifically to extract
[00:11:58] negative feedback, so to say, oh, hey, now the model is failing, or now an assumption of the model is broken, and why. So rather than saying, oh, our parameter's a little off, let's ask: are our base assumptions still true? And might there be a direct way that we can test this on a regular basis?
[00:12:18] I can't really speak about what I've done in quantitative trading, but I can say that in the Instagram space this was a huge component. I think many people will say, oh, I hate this guy, when they learn that many times my canary was my customer accounts.
[00:12:32] So the customer account might face some type of feedback: either they would get a temporary ban, or they would say, hey, I'm not getting the same amount of engagement. And we would use that as a leading indicator that something had changed. So for us, we weren't thinking about the algorithm, because we were not selling a way to beat the algorithm. What we were beating
[00:12:53] was bot detection. So my models were specifically around what is the universe of behaviors that influence the bot detection system: our accounts getting identified as bots, as disruptive accounts, or as low-trust accounts.
[00:13:09] Dr Genevieve Hayes: So would you quarantine off a certain number of bots and use those for your tests? Is that what you're saying with the canary concept?
[00:13:17] Tim O'Hearn: Effectively, yes. I would say I ranked them in different tiers, and at the tip of the spear, or you would say maybe the smallest tier, I had burner accounts. The burner accounts were also addressed in one of the chapters of the book, called Free Lunch, where Chris Buetti had also spoken about these burners and their purpose in his scraping system.
[00:13:39] The burner accounts were throwaways; they were not that valuable. So if you lost the account, you'd say okay, and you'd move on to the next one. I would greatly prefer to have used them over customer accounts, but due to the way that some of the customer accounts were acting, or some of the
[00:13:56] implicit trust that they had, it was more useful to use customer accounts over burners. And I didn't write this in my book because it's a very unethical practice, but it's the truth. If I create an Instagram account tomorrow and I decide, okay, I'm gonna phone-verify it and I'm gonna use it for testing, it's gonna have zero followers, no content, no post history.
[00:14:19] Instagram will know immediately that it is a less trustworthy account compared to a customer account that might've been around for five years. So sometimes the most useful data points were coming from situations where we pushed customer accounts to the brink. Now, all this being said, of thousands of customers, I never had a single customer account banned outright.
[00:14:40] Plenty of shadow bans, plenty of blocks but never an actual ban.
[00:14:44] Dr Genevieve Hayes: So what an ethical data scientist could take from this is: if they've got data, be it customer data or some other data, quarantine off, say, 1% of that data and use that for testing whether the processes that generate the data are what they expect them to be. Is that right?
[00:15:05] Tim O'Hearn: Definitely, and it doesn't have to be done in a way where you have true statistical significance. So those in a more academic context don't need to worry about it as much. I think it's more a test of: can I even turn the system on today and have the same results as I had yesterday?
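The quarantine-and-check idea from this exchange can be sketched simply. This is a toy illustration under stated assumptions: the 1% fraction, the 20% tolerance, and the function names are all made up for the example, and, as Tim says, no statistical significance is claimed; it just answers "does the system behave like it did yesterday?"

```python
import random

# Toy sketch of the "canary" idea for ethical data science: quarantine a
# small slice of records, track a summary metric on it, and alert when
# it drifts day over day. Fractions and thresholds are illustrative.

def split_canary(records, fraction=0.01, seed=42):
    """Randomly set aside roughly `fraction` of records as a canary slice."""
    rng = random.Random(seed)
    canary, rest = [], []
    for record in records:
        (canary if rng.random() < fraction else rest).append(record)
    return canary, rest

def canary_alert(yesterday_metric, today_metric, tolerance=0.2):
    """True when the canary metric moved more than `tolerance` (relative),
    a leading indicator that an upstream assumption has changed."""
    if yesterday_metric == 0:
        return today_metric != 0
    return abs(today_metric - yesterday_metric) / abs(yesterday_metric) > tolerance

print(canary_alert(100.0, 105.0))  # False: ordinary day-over-day noise
print(canary_alert(100.0, 60.0))   # True: something upstream has changed
```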
[00:15:21] Dr Genevieve Hayes: And this sort of approach would've been very useful when the Covid pandemic showed up because the whole world changed overnight. And a lot of data scientists didn't change their models fast enough.
[00:15:32] Tim O'Hearn: Yeah, that's very true. And I think this adaptability is a very, very nebulous topic in data science, where you want as much data as possible, you want as many dimensions as possible, especially now if we talk about using the models to train something. It's really important to think about when the model is either misbehaving or when the quality just isn't there anymore.
[00:15:56] So we were able to use some of these practices quite successfully. It doesn't mean we were able to solve or actually adapt, but it means that we had a pretty good, at least day-over-day, indication of what might have changed. So every major bot-busting event that I recount in the later parts of my book, we at Shark Social were aware of pretty much to the day, or even to the hour, because we not only had tons of customers, but we had these types of quarantine accounts as well.
[00:16:25] Dr Genevieve Hayes: Are there any other techniques that you made use of while identifying and exploiting social media platforms that could also be of use to ethical data scientists?
[00:16:35] Tim O'Hearn: Yes, and I will credit Chris Buetti with some of this original research. If you're building a scraper, which likely violates the terms of service, you might still be thinking, well, I don't want to scrape low-quality or bot accounts. It then becomes necessary to go on this side quest where you identify what a bot account actually is.
[00:16:59] So, like, what is an automated account? What does one of these fake accounts look like? And that was an approach that I think is very useful for ethical data scientists, or even internal employees at places like Meta, because data cleaning is a huge portion of any data task. And unless you're specifically looking for bots, you wanna clean out these low-quality data points.
[00:17:23] And to do that, you actually have to develop your own notion of what a bot might look like.
[00:17:28] Dr Genevieve Hayes: And that's something we're going to get into in next week's episode. So I won't ask you too many questions about that right now.
[00:17:34] So if you could share just one technique from your experience that data scientists could apply ethically in their work, what would it be?
[00:17:43] Tim O'Hearn: When it comes to social media, I think simply gathering data points over time is one of the most simple but most effective practices there is. We see websites such as Social Blade, which really is, you would say, maybe an aggregator of data points, where it will scrape at regular intervals
[00:18:04] things such as follower count, like count, whatever. I think for most people who are thinking about their first couple of data science projects, maybe on social media platforms, this can provide a lot of the basics of answering some of those questions, such as: are the followers fake, or was this ad campaign effective?
[00:18:25] During the course of writing the book, I explored a lot of both the data points I had gathered as well as those from sites like Social Blade, which is free, to kind of notice different trends and make these calls, which is to say: this business is using social media well, and we can kind of see that reflected in very regular follower growth numbers.
[00:18:45] In other cases, we can say this activity over any number of months is highly suspicious. So I think just being able to plot it is really the foundation for anyone doing this for the first time. But that goes for scraping in general. There are plenty of sites where, if you're able to scrape
[00:19:08] every day or every hour, which in most cases does violate terms of use, and aggregate something over that time series, you probably have a viable business. It might not be a million-dollar business, but there are tons of side projects out there today that are just completely based on that. So I think time-series-type data retrieval is still very viable.
[00:19:28] And it has to be one of the main things in anyone's toolbox.
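The "gather data points over time, then plot them" approach Tim recommends can be sketched as below: snapshot a follower count at regular intervals, then flag day-over-day jumps that look suspicious. The 10% threshold and the sample numbers are illustrative assumptions, not a real detection rule.

```python
# Toy sketch of time-series follower analysis: given one follower-count
# snapshot per day, flag days with implausibly large relative growth
# (e.g. a purchased-follower spike). Threshold is illustrative.

def flag_suspicious_growth(daily_counts, max_daily_ratio=0.10):
    """Indices of days whose relative growth exceeds `max_daily_ratio`."""
    flagged = []
    for day in range(1, len(daily_counts)):
        prev, curr = daily_counts[day - 1], daily_counts[day]
        if prev > 0 and (curr - prev) / prev > max_daily_ratio:
            flagged.append(day)
    return flagged

# Steady organic growth, then a roughly 50% overnight spike on day 4.
counts = [1000, 1010, 1022, 1030, 1545, 1550]
print(flag_suspicious_growth(counts))  # [4]
```

The same daily series answers the other questions Tim mentions, like whether an ad campaign moved the needle, just by plotting it.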
[00:19:31] Dr Genevieve Hayes: Then once you have that time series data, you can start conducting experiments. So A/B-test-type things.
[00:19:37] Tim O'Hearn: For sure.
[00:19:38] Dr Genevieve Hayes: So what final advice would you give data scientists who want to better understand the systems they're working with without crossing ethical lines?
[00:19:46] Tim O'Hearn: This is very difficult, because I find that my introduction to this was during a time when APIs themselves were open. Every social media platform was so proudly proclaiming how open it was and how easy it was to integrate with. And you had hackathons where college kids were able to interact with the data points and build these cool projects.
[00:20:08] As of today, almost every one of those platform APIs has been closed off: either highly restricted, or you need to pay to use it, which to me kind of defeats the purpose of this positive hacker ethos. So for me it's hard to give advice knowing that so much of it is restricted.
[00:20:28] So I think today the valuable thing is to start with a pre-curated data set. Going to a website, or even to GitHub, you can find certain data sets and play around with the data.
[00:20:41] It's less fun. But it's structured in a way and it's likely labeled in a way that does not violate terms of use. Assuming you don't use it for commercial purposes, so for research purposes it's probably fine, and I'm sure we're seeing researchers using it for that purpose.
[00:20:57] Dr Genevieve Hayes: For listeners who wanna get in touch with you, Tim, what can they do?
[00:21:01] Tim O'Hearn: As much as I dislike social media, I still have to use it. I'm most active on LinkedIn, where a couple of times a month I'll post various summaries of things and happenings in my life. I also have a newsletter, which is pretty frequent, about once per week. That's timohearn.beehiiv.com.
[00:21:17] For long-form posts, I've been hosting my website for more than a decade now at tjohearn.com.
[00:21:25] Dr Genevieve Hayes: There you have it. Another value-packed episode to help turn your data skills into serious clout, cash, and career freedom. If you enjoyed this episode, why not make it a double? Next week, catch Tim's Value Boost, a five-minute episode where he shares one powerful tip for getting real results real fast.
[00:21:47] Make sure you're subscribed so you don't miss it. Thanks for joining me today, Tim,
[00:21:51] Tim O'Hearn: Thank you.
[00:21:53] Dr Genevieve Hayes: and for those in the audience, thanks for listening. I'm Dr. Genevieve Hayes, and this has been Value Driven Data Science.