Episode 73: [Value Boost] How to Trust Social Media Data When You Can't Trust Social Media


[00:00:00] Dr Genevieve Hayes: Hello and welcome to Your Value Boost from Value Driven Data Science, the podcast that helps data scientists transform their technical expertise into tangible business value, career autonomy, and financial reward. I'm Dr. Genevieve Hayes, and I'm here again with Tim O'Hearn, software engineer and author of Framed: A Villain's Perspective on Social Media, to turbocharge your data science career in less time than it takes to run a simple query.
[00:00:33] In today's episode, you'll learn how to extract trustworthy insights from social media data, even when that data can't be trusted. Welcome back, Tim.
[00:00:44] Tim O'Hearn: Hi, Genevieve. Thanks for having me back on.
[00:00:46] Dr Genevieve Hayes: In our previous episode, we discussed some of your adventures on the dark side of social media, which you've also chronicled in your book Framed, and the lessons ethical data scientists can learn from your experiences.
[00:01:00] One of the things that came up in that episode was the need to clean data and remove bot activity from social media data. As a data scientist, I'm used to dealing with dirty data, but I don't have the first idea of how to deal with data that people have deliberately taken steps to manipulate, for example, through creating bots.
[00:01:27] That's completely next level to me. So that's what I'd like to explore in today's episode. Based on your experience, what percentage of social media engagement would you estimate is artificial or manipulated?
[00:01:42] Tim O'Hearn: During what I would describe as the golden age of Instagram botting it was probably as high as 40%. So nearly half of the activity on the platform was either directly triggered by bots or the result of bot activity.
[00:01:58] Dr Genevieve Hayes: Are there any patterns in the data that should immediately raise red flags for possible bot or artificial activity?
[00:02:05] Tim O'Hearn: Yes. By 2019, researchers at Meta had collaborated with researchers from UCSD, that's the University of California San Diego, to address some of these patterns of manipulation, on the basis that if one bot behaves a certain way, then maybe thousands or millions of them do. So I think anybody attempting to clean data today, or anyone simply attempting to clean up these platforms, can look to that research to reach the same conclusion, which is to say that even if every one of these bots could be fully autonomous and totally unique,
[00:02:41] there must still be commonalities between them. And that's regardless of what platform we're on.
[00:02:46] Dr Genevieve Hayes: So the same software is used to create the bots?
[00:02:49] Tim O'Hearn: So you might see the same type of metadata or the same type of profile data in these bot accounts. Or you would see the same patterns of behavior. For example, if you have a large enough data scope, you could say, oh, the event stream clearly shows an inhuman or non-human gap of one to two seconds per action.
[00:03:10] So you'd think between a like and a follow, maybe one to two seconds elapses. If I'm coding a bot, that would be my best guess at a realistic delay. We know that humans would never consistently operate within that narrow a time band between activities. So as we get more and more data, and if we're lucky enough to have it available as a time series, we can definitely see those types of behaviors.
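The timing heuristic Tim describes could be sketched as follows. This is purely illustrative: the thresholds and the idea of comparing both the average gap and its variability are assumptions about how one might operationalise "inhumanly regular one-to-two-second gaps", not anything specified in the episode.

```python
from statistics import mean, pstdev

def looks_botlike(action_timestamps, max_mean_gap=2.0, max_gap_stdev=0.5):
    """Flag an account whose actions are spaced with inhumanly
    short AND regular gaps (e.g. ~1-2 seconds between a like and
    a follow). Threshold values are illustrative guesses."""
    if len(action_timestamps) < 3:
        return False  # not enough events to judge
    ts = sorted(action_timestamps)
    gaps = [b - a for a, b in zip(ts, ts[1:])]
    # Bots tend to show short gaps with very low variance;
    # humans browsing naturally produce wildly uneven spacing.
    return mean(gaps) <= max_mean_gap and pstdev(gaps) <= max_gap_stdev

# A bot firing an action every ~1.5 seconds:
bot_events = [0.0, 1.5, 3.0, 4.5, 6.1]
# A human browsing irregularly:
human_events = [0.0, 12.0, 13.5, 95.0, 240.0]
```

In a real pipeline the event stream would come from platform logs or scraped activity data, and the thresholds would be fitted against labelled examples rather than hand-picked.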
[00:03:34] Dr Genevieve Hayes: So when data scientists encounter manipulated data, and from what you're saying, it sounds like you have to assume any social media data is manipulated, what's the first step they should take in salvaging usable insights from it?
[00:03:50] Tim O'Hearn: It depends on the resources at hand. So first of all, how rich the data set is, and how many dimensions there are to it, is very, very important. For example, if someone says, hey, here's a thousand of our potential leads or potential customers, is there much you can do there as a fledgling data scientist with no other reference data?
[00:04:11] The answer is it would be tough to build anything beyond your own intuition. So saying, okay, well, at least for this account, it doesn't seem like English is its primary language, so we don't care about it. That's not to say it's manipulated, just that it's low quality. Beyond checks like that, it's very difficult.
[00:04:29] However, if you have true scale and you have a team and compute resources, it is possible to build models from scratch that would identify spam comments, or, in the higher-resource case, irrelevant comments. So thinking: what if the comment doesn't match the context of the picture?
[00:04:51] Well, to make that determination, we would also have to load the picture and run computer vision on it. That's very expensive to do at scale. But it is an effective practice, and it is something that people have done at the micro level.
[00:05:04] Dr Genevieve Hayes: Okay, so it sounds like what you're saying is, if you have the data and the resources, you could actually build some sort of machine learning classifier to classify accounts that you believe to be bot accounts versus ones that you believe to be human.
[00:05:20] Tim O'Hearn: Absolutely. There are so many dimensions and I think there's enough knowledge. People have seen enough bots to know how to get started, so yeah, that's totally, totally possible.
[00:05:31] Dr Genevieve Hayes: And these sorts of techniques probably underpin bot detection software that you find on the internet.
[00:05:36] Tim O'Hearn: Absolutely. I would say it begins with static rule sets and then moves to this more model- and classifier-based approach. Absolutely.
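That progression from static rules to a scoring model might look something like this. Every feature name, threshold, and weight here is invented for illustration; in practice the weights would be learned from labelled accounts, not hard-coded.

```python
def rule_based_flags(account):
    """Static rule set: each rule is a cheap, hand-written heuristic.
    Field names (followers, following, posts, ...) are illustrative."""
    flags = []
    if account.get("following", 0) > 20 * max(account.get("followers", 0), 1):
        flags.append("extreme follow ratio")
    if account.get("posts", 0) == 0 and account.get("following", 0) > 500:
        flags.append("no content but mass-following")
    if account.get("default_avatar", False):
        flags.append("default profile picture")
    return flags

def bot_score(account, weights=None):
    """Classifier-style successor: combine the same signals as
    weighted features instead of hard pass/fail rules. The weights
    here are arbitrary stand-ins for trained model coefficients."""
    weights = weights or {
        "extreme follow ratio": 0.5,
        "no content but mass-following": 0.3,
        "default profile picture": 0.2,
    }
    return sum(weights.get(f, 0.0) for f in rule_based_flags(account))

suspicious = {"followers": 3, "following": 4000, "posts": 0,
              "default_avatar": True}
```

The design point is that the rules and the classifier share the same feature extraction, so a team can start with rules and swap in a trained model later without rebuilding the pipeline.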
[00:05:44] Dr Genevieve Hayes: How big an impact would that potentially have on their results if they could implement something like that?
[00:05:50] Tim O'Hearn: Anyone in the marketing space, or anyone who has to attribute some value or opportunity cost to each account, would see a pretty massive boost in being able to avoid irrelevant accounts, and also accounts that are either manipulated or will never click. We've explored this a little bit in online advertising communities, where to advertise on a platform like Meta or Google, you just have to take their word for it that every viewer impression is actually a human.
[00:06:21] But we know that in practice that's not true. All we know is that you're getting some sales when you run ads; you're not getting a guarantee that 100% of the traffic is legitimate. So for me to put a number on it, I would say the benefit that someone could see is probably in the 10 to 20% range.
[00:06:41] If you were able to cut out bots or cut out most bots, you would see that ROI on marketing type efforts would probably increase by that much.
[00:06:49] Dr Genevieve Hayes: Okay, so this is definitely worth the effort.
[00:06:52] Tim O'Hearn: It's definitely worth the effort if you have access to the resources and the preexisting data sets to do it. And it also depends on who you're targeting.
[00:07:00] So this would be a case where you're more or less scraping leads from the platform. You can't, of course, implement your own targeting algorithm within Meta's ads platform, but yeah, I am here to say that if you were able to run your own targeting code,
[00:07:16] if you knew your niche really well and you had the data to back it up, a very sophisticated engineering team could absolutely beat out Meta's own targeting.
[00:07:25] Dr Genevieve Hayes: Even if we're not dealing with social media data, if you're dealing with any data set, there could be potential value in building some sort of machine learning algorithm to identify dodgy data records.
[00:07:39] Tim O'Hearn: And there's a lot of freedom there to identify what exactly dodgy means.
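One generic way to operationalise "dodgy" on any numeric data set is statistical outlier flagging. The sketch below uses the common modified-z-score convention based on the median absolute deviation (MAD), which is robust enough that a single inflated record can't mask itself; this is just one of many possible definitions, and the 3.5 cut-off is the conventional default, not something from the episode.

```python
from statistics import median

def flag_outliers(values, threshold=3.5):
    """Return indices of records far from the median, using the
    median absolute deviation (MAD). The 0.6745 constant rescales
    MAD to be comparable to a standard deviation under normality."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return []  # no spread at all: nothing stands out
    return [i for i, v in enumerate(values)
            if 0.6745 * abs(v - med) / mad > threshold]

# Daily engagement counts with one suspiciously inflated record:
engagements = [10, 12, 9, 11, 10, 500]
```

A plain z-score would struggle here, because the 500 inflates the mean and standard deviation enough to partly hide itself; the median-based version flags it cleanly.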
[00:07:44] Dr Genevieve Hayes: So what final advice would you give to data scientists who need to extract business value from social media, despite knowing it may contain deliberate manipulation?
[00:07:55] Tim O'Hearn: I think it's very important to consider the ethics of what one is doing. So hopefully someone who's approaching these problems is not trying to identify opportunities to further exploit platforms, to generate any type of spam. My advice is that there are great data sets out there that someone can use to get their feet wet and to better design their own systems.
[00:08:15] But it's important now to consider the ethics of the research that's being done, and to make sure there aren't human consequences to it, and also to consider what the true benefit is, whether it's business value or whether it's something more sociological for the researchers.
[00:08:30] Dr Genevieve Hayes: That's a wrap for today's Value Boost. But if you want more insights from Tim, you're in luck. We've got a longer episode with Tim where we dive deeper into what legitimate data scientists can learn from understanding the dark side of social media, and it's packed with no-nonsense advice for turning your data skills into serious clout, cash, and career freedom.
[00:08:55] You can find it now wherever you found this episode or on your favorite podcast platform. Thanks for joining me again, Tim.
[00:09:03] Tim O'Hearn: Thanks.
[00:09:04] Dr Genevieve Hayes: And for those in the audience, thanks for listening. I'm Dr. Genevieve Hayes, and this has been Value-Driven Data Science.
